OPTIMAL DESIGNS FOR HIGH THROUGHPUT STREAM PROCESSING USING UNIVERSAL RAM-BASED PERMUTATION NETWORK

by Ren Chen

A Dissertation Presented to the FACULTY OF THE GRADUATE SCHOOL, UNIVERSITY OF SOUTHERN CALIFORNIA, in Partial Fulfillment of the Requirements for the Degree DOCTOR OF PHILOSOPHY (ELECTRICAL ENGINEERING)

May 2017
Copyright 2017 Ren Chen

Acknowledgments

I would like to thank my adviser, Prof. Viktor K. Prasanna, for guiding and encouraging me toward the doctoral degree. He patiently trained me in various research activities, kindly supported my thesis work, and wisely guided me through challenges and obstacles. His efforts in reviewing and providing constructive comments also greatly improved my thesis work. Also, many thanks to Prof. Peter Beerel and Prof. Wyatt Lloyd for serving on my thesis committee and offering me valuable advice. Additionally, I would like to acknowledge my colleagues at USC, especially Shijie Zhou, Shreyas Girish Singapura, Sanmukh Rao Kuppannagari, Chi Zhang, Charith Wickramaarachchi, Yun Qu, Andrea Sanny, and Da Tong. It has been a great pleasure for me to work with them. I would also like to give my special thanks to Diane Demetras and Kathryn Kassar for their extensive support and assistance. Finally, I would like to thank my parents, Dahai Chen and Siyuan Tong, for their constant support and encouragement in my studies. My warmest thanks to my wife Kelly Yin and my daughter Huiyi Chen, who accompanied me and cheered me up to overcome various difficulties, for their love and understanding.

Table of Contents

Acknowledgments
List of Tables
List of Figures
Abstract
Chapter 1: Introduction
  1.1 Stream Processing
  1.2 Data Permutations on Streaming Data
  1.3 Contributions
  1.4 Overview
Chapter 2: Background
  2.1 Classic Data Permutations
    2.1.1 Stride Permutation
    2.1.2 Bit Reversal
    2.1.3 Bit-index Permutation
  2.2 Multi-stage Interconnection Networks
    2.2.1 Clos Network
    2.2.2 Benes Network
  2.3 Applications
    2.3.1 FFT
    2.3.2 Sorting
    2.3.3 Join
  2.4 FPGA Technology
  2.5 Research Hypothesis
Chapter 3: Streaming Permutation Network
  3.1 Universal RAM-based Permutation Network
    3.1.1 Related Work
    3.1.2 Problem Definition and Notations
    3.1.3 Algorithmic Technique and Theory
    3.1.4 Parameterized Architecture
    3.1.5 Interconnection Optimization
    3.1.6 Memory Optimization
    3.1.7 Experimental Evaluation
  3.2 Optimal Designs for Application-Specific Permutations
    3.2.1 Parallel-2 Bit Reversal
    3.2.2 Parallel-2^k Bit Reversal
    3.2.3 Resource Consumption
Chapter 4: RPN-based Optimal Designs for Streaming Applications
  4.1 High Throughput Streaming FFT
    4.1.1 Related Work
    4.1.2 Architecture Framework
    4.1.3 Mathematical Formalization
    4.1.4 Design Templates of Building Blocks
    4.1.5 Design Optimizations
    4.1.6 Implementation of Illustrative Designs
    4.1.7 Experimental Setup
    4.1.8 Design Optimization Evaluation
    4.1.9 Design Space Exploration
    4.1.10 Performance Comparison
  4.2 High Throughput Streaming Sorting
    4.2.1 Related Work
    4.2.2 Parameterized Architecture
    4.2.3 Computation Stage Template
    4.2.4 Communication Stage Template
    4.2.5 High Throughput Design
    4.2.6 Resource Efficient Design
    4.2.7 Experimental Setup
    4.2.8 Asymptotic Analysis
    4.2.9 Performance of HT Design
    4.2.10 Performance Comparison with the State-of-the-Art
  4.3 High Throughput Streaming Equi-Join
    4.3.1 Related Work
    4.3.2 Hybrid Design for Sorting
    4.3.3 Decomposition-based Task Partition Approach
    4.3.4 Streaming Join Algorithms
    4.3.5 CPU-FPGA System Implementation
    4.3.6 Experimental Setup
    4.3.7 FPGA Accelerator Performance
    4.3.8 CPU-FPGA System Performance
    4.3.9 Comparison with Prior Works
Chapter 5: Conclusion
  5.1 Summary of Contributions
  5.2 Future Work
    5.2.1 Optimizing Memory Performance through Dynamic Data Layout
    5.2.2 Design Space Exploration and Pareto Optimality
    5.2.3 Application-specific Composite Metrics
Bibliography

List of Tables

3.1 Resource consumption summary
3.2 Control bit values in different cycles for bit reversal when 2^n = 8
3.3 Control bit values in different cycles for parallel-2^k bit reversal on 2^n data inputs
3.4 Comparison of several bit reversal circuit designs
4.1 Key features of the ZedBoard
4.2 Resource consumption of the PL section on Zynq
4.3 Comparison with prior works

List of Figures

1.1 Memory technology evolution [4]
1.2 Non-streaming 8-point FFT design
1.3 Streaming 8-point FFT design
1.4 Ubiquitous computation kernels in emerging applications
1.5 Data permutations in sorting network
1.6 Data permutations: (a) through wires, (b) on streaming data with a fixed data parallelism
2.1 Bit-index permutation on 16-element data sequence: (a) bit shuffle, (b) bit reversal, (c) vector reversal, (d) 4x4 matrix transpose, (e) perfect shuffle
2.2 (a) A 3-stage N-to-N (N = rs) Clos network, (b) A multi-stage N-to-N Benes network
2.3 Constructing a bitonic merge network: (a) bitonic merge network for N = 8, (b) splitting a bitonic sequence into two bitonic sequences, (c) constructing a bitonic sorting network for N = 8 using bitonic merge networks
2.4 Permutation patterns in 8-input bitonic sorting network (arrows show the sorting order)
2.5 Equi-join operation
2.6 Internal organization of FPGA
2.7 High level abstraction of target FPGA platform
3.1 Folding the Clos network into an SPN
3.2 An N-to-N Benes network can be decomposed into two N/2-input Benes subnetworks; the Benes network in (a) has been recursively decomposed log p times, (b) shows the datapath generated by vertically folding the Benes network in (a)
3.3 An example of generating SPN_{n,2}
3.4 Proposed permutation network
3.5 Configuration bits of switch box in different states
3.6 Permutation in time on 4-key data sequences: (a) using a dual-port memory of size eight, (b) using a single-port memory of size four
3.7 Resource consumption for various n and p
3.8 Performance results for various n and p
3.9 Parallel-2 bit reversal design for input size 2^n (A_0 and A_1 are data buffer addresses)
3.10 Control bit values in different states
3.11 Functional structure of the employed data buffer and its behavior in the read-before-write mode
3.12 Data flows in parallel-2 bit reversal for input size 2^n = 8
3.13 (a) Circuit design for parallel-2^k bit reversal, (b) a block denoted M_a consisting of a data buffer, a 2-to-1 multiplexer, and a 1-to-2 de-multiplexer, (c) a block denoted M_b having a data buffer, (d) fixed wire connection C
3.14 Parallel-4 bit reversal design for input size 2^n
4.1 Architecture framework
4.2 From an 8-input FFT network to its mathematical representation
4.3 (a) Radix-2 block, (b) Radix-4 block, (c) Parallel-to-serial (PS) multiplexer, (d) Serial-to-parallel (SP) multiplexer, (e) 2-input Twiddle Factor Computation (TWC) unit, (f) 4-input Twiddle Factor Computation (TWC) unit
4.4 Memory activation scheduling
4.5 High throughput design for FFT
4.6 Overall architecture of resource efficient design
4.7 Memory and logic resource used by the HT Design and the baseline
4.8 Energy efficiency of the HT Design and the baseline
4.9 Power evaluation for various p and N
4.10 Design space exploration for (a) N = 64 and (b) N = 4096
4.11 Memory efficiency comparison of various designs
4.12 Energy efficiency comparison
4.13 Architecture framework
4.14 CAS units for generating: (a) ascending order, (b) descending order, (c) either order
4.15 Template design of D1 block
4.16 Example: routing the Clos network for D1 to realize P(8,2) and p = 2
4.17 High throughput design for sorting
4.18 Overall architecture of resource efficient design
4.19 Memory efficiency comparison of various designs
4.20 Energy efficiency comparison
4.21 Hybrid sorting design
4.22 (a) Compare-and-switch (CAS) unit, (b) data buffer, (c) connection network, (d) parallel-to-serial/serial-to-parallel MUX (PS/SP)
4.23 Data permutation in the data buffers for 16-key sorting
4.24 A fully pipelined high throughput bitonic sorter
4.25 Decomposition based task partition approach
4.26 Block diagram of the complete system design on Zynq
4.27 Performance comparison to merge-sort design
4.28 Throughput comparison for various input sizes
4.29 Throughput performance of the SBNL-based design and the SSMJ-based design
4.30 Execution time breakdown of the SSMJ-based design
5.1 Pareto frontier for energy efficiency and throughput

Abstract

Streaming architectures are widely employed in hardware design, particularly when high throughput is desired. A streaming architecture takes input and produces output at a fixed rate, with no gap between consecutive data streams. Applications implemented with streaming architectures are usually composed of computation stages separated by data permutations, where a data permutation is a fixed reordering of the data elements. Data permutation is a fundamental problem in various research areas, including signal processing, machine learning, and data analytics.
Well-known permutations include stride permutation (perfect shuffle, corner turn, etc.), bit reversal, and the Hadamard reordering. Data permutation can be realized simply by reordered hardware wire connections if all data inputs are available concurrently. Nevertheless, such an approach is not desirable for large input sizes due to high interconnection area and complexity. Instead, this overhead can be greatly reduced in streaming architectures. However, when the streaming width (the number of input/output data per cycle) is non-trivial (2 or more), designing a streaming permutation structure becomes challenging. In this thesis, we develop a universal RAM-based Permutation Network (RPN) for realizing permutations on streaming data. The RPN is universal in that it can realize an arbitrary fixed permutation on a given data sequence for a given data parallelism. The key idea is to construct the RPN through a divide-and-conquer based mapping algorithm utilizing the classic multi-stage Clos and Benes networks. We further develop an RPN-based framework for generating optimal designs for specific permutations arising in well-known data-intensive algorithms. The designs are optimal in the following sense: they consume the minimum number of memory words and the minimum number of multiplexers necessary for realizing a particular permutation, and they have minimum latency. In particular, we make the following contributions: (1) a universal RAM-based permutation network to perform any given fixed permutation with flexible data parallelism, (2) a divide-and-conquer based mapping algorithm such that the RPN is parameterized to accommodate any input size and data parallelism, (3) algorithmic-level optimizations for reducing the memory and the multiplexers consumed by the RPN, (4) RPN-based optimal designs for well-known permutations arising in data-intensive applications including sorting, Equi-join, and the Fast Fourier Transform (FFT), and (5) highly optimized RPN-based streaming architectures for these applications on Field-Programmable Gate Arrays (FPGAs).

We evaluate our research contributions through post place-and-route experiments on FPGA. We provide detailed experimental results of the RPN-based designs for streaming permutation using various performance metrics including memory efficiency and interconnection complexity. We also present evaluation results which demonstrate that our proposed highly optimized streaming architectures achieve high performance with respect to throughput, energy efficiency and memory efficiency. As future work, we discuss opportunities in applying the proposed RPN to data layout optimization for new memory technologies.

Chapter 1: Introduction

1.1 Stream Processing

Stream-oriented computing is characterized by high data-rate flow, compute-intensive operations on data streams, and little data reuse (due to the streaming nature) [31, 71, 78]. Stream processing is widely used in parallelizing applications that exhibit compute intensity, data parallelism and data locality [29, 31, 78]. Scalable, massively parallel techniques are becoming increasingly important in the era of big data, which is dominated by stream processing to cope with intensive large-scale data traffic. Reconfigurable hardware, such as FPGAs, is becoming increasingly prevalent in achieving scalable performance improvements in the new era of parallel computing [26, 69, 77]. Hardware acceleration through FPGAs has been effective over the years in a variety of domains [16, 25, 34, 33, 58, 80]. Considering its low data reuse rate, stream processing is a good fit for the FPGA's massively parallel reconfigurable logic and limited on-chip memory.

Memory technology is continuously advancing, scaling up the memory bandwidth available to processors and hardware accelerators. Compared to a DDR3-1333 2 GB module, the state-of-the-art Hybrid Memory Cube (HMC) technology provides more than a 10-fold increase in bandwidth (Fig. 1.1). On a system platform integrating HMC and accelerators such as FPGAs, streaming architectures are expected to achieve high throughput by fully utilizing the available memory bandwidth, given the little data reuse in most streaming applications. However, most prior approaches to parallelization and performance tuning of hardware accelerators have targeted computation and I/O bottlenecks, data reuse, and effective utilization of on-chip memory. Because they treat the limited-bandwidth interface to external memory in isolation, the overall performance improvement they can achieve is limited.

Figure 1.1: Memory technology evolution [4]: (a) new memory technology, (b) scaling up memory bandwidth

Usually only a portion of the data is available for computation in stream processing. As the size of the data to be processed is massive, execution is performed on a sequence of data elements made available over time. Fig. 1.2 and Fig. 1.3 show the non-streaming and streaming designs for the Fast Fourier Transform (FFT), respectively. The non-streaming FFT design is realized through a straightforward mapping of FFT networks consisting of computation stages separated by permutation stages. By vertically folding the FFT networks, streaming FFT designs with flexible data parallelism can be obtained. In streaming architectures, efficient parallel processing of data in motion is non-trivial, albeit imperative for achieving performance improvements as ongoing technology trends accelerate the generation of enormous quantities of streaming data.

Figure 1.2: Non-streaming 8-point FFT design
Figure 1.3: Streaming 8-point FFT design

1.2 Data Permutations on Streaming Data

Data permutations are needed in a wide variety of algorithms such as FFT, parallel sorting networks, and the Gray encoding method [36, 41]. A data permutation is a fixed reordering of a given number of data elements. For example, given a stride t, a stride permutation reorders an N-element data vector such that data elements with an index distance of t are moved into adjacent locations. Classic data permutations include stride permutation and bit reversal in FFT networks, bit-index permutation in sorting networks, as well as the Hadamard reordering [47, 60]. An efficient hardware solution for performing data permutations is critical for parallel hardware implementations of various applications [28, 47, 60, 71]. When all data inputs are available concurrently, data permutation can be realized simply by reordered hardware wires. However, for large input sizes, the wire-based approach becomes infeasible due to high routing area and complexity.

Figure 1.4: Ubiquitous computation kernels in emerging applications
As shown in Fig. 1.4, FFT, sorting networks, and matrix manipulations are ubiquitous computation kernels employed by various emerging applications including databases, graph processing, and convolutional neural networks [69, 80]. Data permutations are performed frequently in these kernel algorithms. Fig. 1.2 depicts an 8-point FFT network using the Cooley-Tukey algorithm [41], where the blue stages represent data permutations and the purple stages correspond to the twiddle factor computations. Fig. 1.5 shows the data permutations in an 8-input bitonic sorting network. To map large FFT or sorting networks onto hardware, given the limited I/O bandwidth, a more desirable approach is to feed the data inputs into the hardware over several cycles. Pipelined datapaths are one such method, in which a single pipeline accepts one sample every few clock cycles. Another method of satisfying throughput constraints is to increase the level of parallelism of the datapath: multiple data elements are input to a parallel pipeline such that streaming data enter and leave the pipeline at a fixed rate.

Figure 1.5: Data permutations in a sorting network

To permute streaming data, a group of data elements is fed as input in each cycle. After some delay depending on the permutation to be realized, the input sequence is reordered as specified by the given permutation and output over several consecutive cycles. Computations on streaming data can easily be performed by time-multiplexing the key computation units, such as comparison units in sorting or butterfly units in FFT. However, designing a specialized parallel pipeline for permuting streaming data is challenging, as data elements need to be moved across the temporal boundary; for example, a data element input in the first input cycle may need to be output in the fourth output cycle. This requires register files or memories so that a data element can be output after a specific amount of delay. Fig. 1.6(a) shows an example of a stride permutation on 8 data inputs per cycle. This permutation is implemented simply by a reordering of hardware wires. Fig. 1.6(b) shows an example of realizing the permutation of Fig. 1.6(a) on streaming data. In Fig. 1.6(b), the data inputs are not concurrently available; therefore, they are fed into the design over four cycles. Note that continuous data streams are input to the designs in Fig. 1.6.

Figure 1.6: Data permutations: (a) through wires, (b) on streaming data with a fixed data parallelism
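To make the distinction concrete, the following small Python sketch (illustrative only; the element labels and the particular stride-style reordering are chosen to mirror the spirit of Fig. 1.6, not copied from it) applies one fixed permutation first as a pure wire-style index reordering, and then to a stream arriving four elements per cycle, where an output cycle necessarily mixes elements from different input cycles and therefore requires buffering.

    # A fixed 8-element stride-style permutation:
    # output position i takes input element perm[i].
    perm = [0, 4, 1, 5, 2, 6, 3, 7]

    # (a) All 8 inputs available at once: the permutation is just wire reordering.
    x = ['x0', 'x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7']
    y = [x[perm[i]] for i in range(8)]     # ['x0','x4','x1','x5','x2','x6','x3','x7']

    # (b) Streaming with data parallelism p = 4: inputs arrive over 2 cycles,
    # so the first output cycle needs elements from both input cycles.
    p = 4
    in_cycles  = [x[c*p:(c+1)*p] for c in range(len(x) // p)]
    out_cycles = [y[c*p:(c+1)*p] for c in range(len(y) // p)]
    # out_cycles[0] == ['x0','x4','x1','x5'] mixes in_cycles[0] and in_cycles[1]:
    # the hardware must buffer earlier inputs until later inputs have arrived.
    print(y, out_cycles)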
Previous hardware implementations of data permutations can be classified into memory-based designs and register-based designs [47]. In traditional VLSI implementations of the FFT, register-based delay feedback and delay commutator structures were proposed to perform stride permutation and bit reversal [88]. In these designs, a single input data sequence is broken into several parallel data streams flowing forward with proper delays. High computational performance per unit area is achieved by such a pipelined design; however, the design throughput is limited by the data rate of a single input per cycle [88]. To obtain higher throughput, a more effective approach is to feed and process the input data in a streaming manner to increase the level of parallelism [36]. In [47], memory-based and register-based stride permutation networks for array processors are presented. Using their design approach [47], stride permutation networks for processing streaming data can be realized for any given data parallelism, and high throughput can be achieved by increasing the data parallelism. However, their approach is only applicable to stride permutations. Motivated by the idea of streaming design in [47], a memory-based 3-stage structure is developed in [71] to realize a specific family of permutations called bit-index permutations. An optimization method for reducing the number of switches in that 3-stage structure is further presented in [78]. Many key permutations, including bit reversal, stride permutation, and Gray code reordering, fall into the class of bit-index permutations. All of the above approaches provide design methods only for a particular class of permutations. In [59], the authors develop a memory-based technique that automatically generates a streaming datapath for any fixed permutation. Such a design is capable of processing massively parallel streaming data, however at the expense of memory.

1.3 Contributions

In this thesis, we propose a RAM-based permutation network (RPN) for streaming permutation which is parameterized to accommodate an arbitrary data sequence size and data parallelism. A novel divide-and-conquer based mapping algorithm is developed to build the datapath of the RPN as well as to compute the control bit information for performing a particular permutation. The mapping algorithm takes the classic multi-stage interconnection networks as input. Furthermore, algorithmic-level optimizations are proposed to reduce the logic resources consumed by the RPN. We also show that optimal designs for specific well-known permutations can be derived from our proposed RPN design. To obtain high throughput streaming architectures on FPGA, we introduce an architecture framework which mathematically transforms algorithms including FFT and sorting into data permutation problems and fundamental computations. Benefiting from the design flexibility of the RPN, various FFT and sorting networks can be converted into hardware implementations within our proposed architecture framework. Using the framework, we demonstrate the trade-offs between energy, area, and time on a parameterized architecture. In this dissertation, we make the following contributions:

1. Universal RAM-based Permutation Network: We propose a novel universal RAM-based Permutation Network (RPN) to realize inter-stage communication between adjacent computation stages in streaming applications. A noteworthy feature of the proposed RPN is that it can permute streaming data for any given data permutation pattern. The datapath of the RPN is built by vertically folding the classic N x N permutation networks [40, 20], thus inheriting the minimal connection cost of permutation networks.
We further develop optimizations to improve the resource efficiency of the RPN such that it consumes the minimum number of memory words and multiplexers for realizing a particular permutation. Detailed experimental results show that by optimizing resource efficiency, our proposed RPN outperforms the baselines with respect to both throughput and energy efficiency.

2. Streaming Architecture Framework: We propose a streaming architecture framework capable of converting various streaming applications into data permutation problems with computations. Based on this framework, a complete design for a given streaming application can be constructed for given input design parameter values, including the problem size and the data parallelism. At the algorithmic level, memory optimizations for processing continuous data streams are incorporated into the proposed framework. At the architecture level, hardware binding algorithms are developed to improve design energy efficiency. Experimental results show significant performance improvements using our proposed optimization techniques. We also demonstrate a design space that shows the effect of the design parameters on the Energy x Area x Time (EAT) composite metric and on energy efficiency. Compared with the state-of-the-art designs, the designs obtained by our proposed architecture framework achieve significant improvements in throughput, memory efficiency and energy efficiency.

3. High Throughput Streaming FFT: We revisit the classic Fast Fourier Transform (FFT) for high throughput stream processing on FPGA. Optimal circuits for realizing specific data permutations in FFT algorithms are derived from the proposed RPN. To optimize memory efficiency, we develop an algorithmic-level memory optimization technique which results in a 50% reduction in memory consumption for processing continuous data streams compared with the state-of-the-art. Furthermore, automatic hardware binding is incorporated into our design framework to obtain energy efficient designs on FPGA. We evaluate our designs on a state-of-the-art FPGA using post place-and-route results. From the experimental results, a design space is created to demonstrate the effect of the design parameters on the various performance metrics and the trade-offs between energy, area and time. For N-point FFT (16 <= N <= 4096), our designs achieve up to 47% and 223% improvement in energy efficiency and memory efficiency, respectively, compared with the state-of-the-art designs.

4. High Throughput Streaming Sorting: We propose a systematic methodology for mapping large-scale bitonic sorting networks onto FPGA. By utilizing the proposed RPN for data permutation, we develop highly optimized streaming architectures for parallel sorting on FPGAs. We demonstrate trade-offs among throughput, latency and area using two illustrative sorting designs: a high throughput design and a resource efficient design. With a data parallelism of p (2 <= p <= N/2), the high throughput design sorts an N-key sequence with latency 6N/p + o(N), throughput of p results per cycle, and 6N + o(N) memory. This achieves optimal memory efficiency (defined as the ratio of throughput to the amount of on-chip memory used by the design). Experimental results show that our designs achieve 49% to 112% improvement in energy efficiency and 56% to 430% higher memory efficiency compared with the state-of-the-art.
5. Streaming Database Applications on a CPU-FPGA Platform: Equi-join is one of the key database operations whose performance depends heavily on sorting. We speed up Equi-join using a hybrid CPU-FPGA heterogeneous platform. To alleviate the performance impact of limited on-chip memory, we propose a merge-sort based hybrid design in which the first few sorting stages of the merge sort tree are replaced with "folded" bitonic sorting networks. These "folded" bitonic sorting networks operate in parallel on the FPGA, and the partial results are then merged on the CPU to produce the final sorted result. Based on this hybrid sorting design, we develop two streaming join algorithms by optimizing the classic CPU-based nested-loop join and sort-merge join algorithms. Over a range of data set sizes, our design achieves throughput improvements of 3.1x and 1.9x compared with software-only and FPGA-only implementations, respectively. Our design sustains 21.6% of the peak bandwidth, achieving a 3.9x improvement compared with the state-of-the-art FPGA Equi-join implementation.

1.4 Overview

The rest of the thesis is organized as follows. Chapter 2 covers the background of the data permutation problem and introduces the challenges in realizing data permutations on streaming data; background on algorithms including FFT, sorting and join is also presented. Chapter 3 details the algorithm for constructing a universal RAM-based permutation network and presents experimental results of the design on FPGA. Chapter 4 presents parameterized streaming designs for FFT, sorting and join on FPGA, with performance comparisons against state-of-the-art designs. Chapter 5 concludes the thesis and presents future research directions.

Chapter 2: Background

In this chapter, we formally define the problems we solve in this thesis and introduce the background knowledge and assumptions used throughout.

2.1 Classic Data Permutations

A data permutation, in general, is any one of the possible arrangements of a set of data elements in some order [60]. Data permutations can be represented in several different ways. One method is based on permutation matrices.

Definition. A permutation matrix, denoted P_N, is an N x N matrix with all elements either 0 or 1 and exactly one 1 in each row and each column [60].

As an example, the matrix

          [ 0 1 0 ]
    P_3 = [ 0 0 1 ]
          [ 1 0 0 ]

is a permutation matrix. Let x be the vector x = (1, 2, 3)^T. Then P_3 x is a permuted version of x: P_3 x = (2, 3, 1)^T. The product of permutation matrices is another permutation matrix [60]. Given an N-element data vector x, the data vector y produced by the permutation over x is given as

    y = P_N x    (2.1)

2.1.1 Stride Permutation

Most of the data permutations in FFT algorithms are stride permutations. The matrix transpose is also a stride permutation: a stride-by-t permutation of an N-element vector can be realized by dividing the vector into t sub-vectors, organizing them into a t x (N/t) matrix, transposing the obtained matrix, and rearranging the result back into vector form [60]. This requires N to be divisible by t. One commonly used method to represent a stride permutation is through the addresses of the data elements in the source and permuted vectors. A permutation is then a one-to-one mapping of addresses from the set {0, 1, ..., N-1} to itself.
Therefore, a stride-by-t permutation of an N-element sequence can be represented as

    r_j = s_{σ(j)} ⊕ c_j,   j = 0, 1, ..., n-1    (2.2)

where s_{n-1} s_{n-2} ... s_0 and r_{n-1} r_{n-2} ... r_0 (n = log_2 N) are the binary representations of the source address s and the target address r, respectively, and ⊕ denotes a bitwise XOR operation. For the stride permutation with stride t = 2^s,

    σ(j) = (j + s) mod n,   c_j = 0,   j = 0, 1, ..., n-1    (2.3)

Using the permutation matrix method, given an m-element data vector x and a stride t (1 <= t <= m-1), the data vector y produced by the stride-by-t permutation over x is given as y = P_{m,t} x, where P_{m,t} is a permutation matrix. P_{m,t} is an invertible m x m bit matrix such that

    P_{m,t}[i][j] = 1 if j = (t·i) mod m + ⌊t·i/m⌋, and 0 otherwise    (2.4)

where mod is the modulus operation and ⌊·⌋ is the floor function. For example, P_{4,2} performs x_0, x_1, x_2, x_3 -> x_0, x_2, x_1, x_3 and can be represented as

    [ y_0 ]   [ 1 0 0 0 ] [ x_0 ]
    [ y_1 ] = [ 0 0 1 0 ] [ x_1 ]
    [ y_2 ]   [ 0 1 0 0 ] [ x_2 ]
    [ y_3 ]   [ 0 0 0 1 ] [ x_3 ]

2.1.2 Bit Reversal

Bit reversal is a permutation that reorders a set of indexed data according to a reversal of the bits of the index [88]. As an important permutation pattern, it has been extensively used to sort the output frequencies of FFTs [88], and it has been well studied in previous work on FFT implementations [89, 21]. Bit reversal can be represented using a permutation matrix B_N such that y = B_N x. B_N maps each element x[i] of x to the position given by reversing the log_2 N-bit binary representation of the index i. Assuming y[j] = x[i] and the binary representation of i is b_0 b_1 ... b_{log_2 N - 1}, then

    j = ρ(b_0 b_1 ... b_{log_2 N - 1}) = b_{log_2 N - 1} ... b_1 b_0    (2.5)

where ρ(·) is the bit-reversing operation and N is required to be a power of two.

Figure 2.1: Bit-index permutation on a 16-element data sequence: (a) bit shuffle, (b) bit reversal, (c) vector reversal, (d) 4x4 matrix transpose, (e) perfect shuffle

The bit reversal permutation matrix is the special case t = 2 of the base-t digit reversal permutation. The permutation matrix of the base-t digit reversal permutation is denoted B_{N,t}:

    B_{N,t} = ∏_{i=1}^{log_t N} ( I_{N/t^i} ⊗ P_{t^i, t} )    (2.6)

Bit reversal is most important for radix-2 Cooley-Tukey FFT algorithms, where bit reversal of the inputs or outputs is applied recursively across the butterfly stages of the algorithm [41]. Similarly, base-r digit reversal permutations arise in radix-r Cooley-Tukey FFTs [41].

2.1.3 Bit-index Permutation

Bit-index permutation (BIP) was studied in [71]. In this permutation, given a data set x of size N, the element at index i is swapped with the element at index j = i XOR k, where k is a given parameter; N must be divisible by the smallest power of 2 greater than k. Important permutations including bit reversal and stride permutation are special cases of bit-index permutations.

Equation (2.2), using address mapping, can also be used to describe bit-index permutations. A bit shuffle permutation is an example of a BIP where the number of binary address bits log_2 N is even. The bit shuffle permutation can be represented using

    σ(j) = 2j mod n,   c_j = 0,   j = 0, 1, ..., n-1    (2.7)

Fig. 2.1(a) shows a bit shuffle permutation on a 16-element data sequence. Similarly, bit reversal permutations belong to BIP. The bit reversal of the source addresses can be expressed as

    σ(j) = n - 1 - j,   c_j = 0,   j = 0, 1, ..., n-1    (2.8)

An example of the bit reversal permutation for N = 16 is illustrated in Fig. 2.1(b).
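For illustration, a small Python sketch (not part of the thesis; the helper names are ours) generates the stride permutation of Equation (2.4) and the bit reversal of Equation (2.5) directly from their index mappings; the first call reproduces the P_{4,2} example above.

    def stride_perm(x, t):
        """Stride-by-t permutation of an m-element list (m divisible by t):
        output position i takes x[(t*i) % m + (t*i) // m], as in Eq. (2.4)."""
        m = len(x)
        return [x[(t * i) % m + (t * i) // m] for i in range(m)]

    def bit_reversal(x):
        """Bit reversal of a list whose length is a power of two:
        y[j] = x[i], where j is i with its log2(N)-bit representation reversed (Eq. 2.5)."""
        n_bits = len(x).bit_length() - 1
        y = [None] * len(x)
        for i, v in enumerate(x):
            j = int(format(i, f'0{n_bits}b')[::-1], 2)   # reverse the bit string of i
            y[j] = v
        return y

    print(stride_perm(['x0', 'x1', 'x2', 'x3'], 2))   # ['x0', 'x2', 'x1', 'x3'], matching P_{4,2}
    print(bit_reversal(list(range(8))))               # [0, 4, 2, 6, 1, 5, 3, 7]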
Vector reversal permutations are another class of permutations that are also a subset of BIP. In a vector reversal permutation, the source address i is mapped to the target address N - 1 - i, i = 0, 1, ..., N-1, which can be expressed as

    σ(j) = j,   c_j = 1,   j = 0, 1, ..., n-1    (2.9)

Fig. 2.1(c) gives an example of the vector reversal permutation for N = 16.

A matrix transpose can also be considered a BIP. Given a matrix of size a x b, the transpose can be expressed on an a·b-element vector as follows: first read all the data elements of the matrix column-wise, then reorder the elements according to the permutation

    σ(j) = (j + log_2 b) mod (log_2 a + log_2 b),   c_j = 0,   j = 0, 1, ..., log_2 a + log_2 b - 1    (2.10)

The permuted data elements are finally written back into an a x b matrix column-wise. Fig. 2.1(d) shows an example of a 4x4 matrix transpose.

Perfect shuffle permutations also fall into the class of BIP. A perfect shuffle permutation can be defined as

    σ(j) = (j + 1) mod n,   c_j = 0,   j = 0, 1, ..., n-1    (2.11)

An example of the perfect shuffle permutation for N = 16 is depicted in Fig. 2.1(e).

The above classic permutation patterns have been widely used in data and signal processing algorithms, and hardware architectures for realizing them in various applications have been extensively studied [36, 47, 71].
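The address-mapping view of Equation (2.2) gives a uniform way to generate all of these patterns in software. The sketch below (illustrative only; the function and variable names are ours) instantiates it for bit reversal (2.8), vector reversal (2.9), and the perfect shuffle (2.11) on N = 16.

    def bip(x, sigma, c):
        """Apply Eq. (2.2): the target address r has bit r_j = s_{sigma(j)} XOR c_j."""
        n = len(x).bit_length() - 1            # n = log2(N)
        y = [None] * len(x)
        for s, v in enumerate(x):
            s_bits = [(s >> j) & 1 for j in range(n)]
            r = sum(((s_bits[sigma(j)] ^ c[j]) << j) for j in range(n))
            y[r] = v
        return y

    N, n = 16, 4
    x = list(range(N))
    bit_rev   = bip(x, lambda j: n - 1 - j,   [0] * n)   # Eq. (2.8)
    vec_rev   = bip(x, lambda j: j,           [1] * n)   # Eq. (2.9)
    perf_shuf = bip(x, lambda j: (j + 1) % n, [0] * n)   # Eq. (2.11)
    print(bit_rev)   # [0, 8, 4, 12, 2, 10, 6, 14, 1, 9, 5, 13, 3, 11, 7, 15]
    print(vec_rev)   # [15, 14, ..., 0]
    print(perf_shuf)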
2.2 Multi-stage Interconnection Networks

Multi-stage interconnection networks have been studied extensively in the networking area for decades [92]. Classic interconnection networks such as the Clos network, the Benes network and the Banyan network have been widely employed in shared-memory multiprocessor systems, telecommunication systems, and switch fabrics [92]. There remains a wide variety of applications in which realizations of data interconnection are needed. A typical interconnection network consists of a number of switching elements and interconnecting links; interconnection functions are realized by properly setting the control of the switching elements. In general, interconnection networks capable of passing all N! permutations on N elements in one pass through the network are known as rearrangeable networks [40]. Rearrangeable networks are also called permutation networks [47].

Figure 2.2: (a) A 3-stage N-to-N (N = rs) Clos network, (b) A multi-stage N-to-N Benes network

The crossbar network is an example of a permutation network performing arbitrary permutations between its inputs and outputs. To connect the N inputs to the N outputs, the paths in the network are realized with N^2 crosspoint switches, which makes the network infeasible for large systems [40].

2.2.1 Clos Network

The Clos network is a multi-stage interconnection network first developed in the 1950s for telephone switching systems [40]. Clos networks are still widely used in the design of switching systems such as IP routers, data center networks, and VLSI interconnection networks [40]. Fig. 2.2a shows the basic structure of an N-to-N Clos network having three stages: the ingress stage, the middle stage, and the egress stage. Here r denotes the number of crossbars in the ingress or egress stage, and s denotes the number of inputs feeding each of the r ingress stage crossbar switches; N = sr is the network input size. Each ingress stage crossbar switch has d outputs. Each of the d middle stage crossbar switches has exactly one connection with each of the r ingress stage switches. The egress stage consists of r d x s crossbar switches, and each middle stage switch is connected exactly once to each egress stage switch. The key advantage of the Clos network is that its logic consumption in a hardware implementation is far smaller than if the entire network were implemented as one large crossbar switch.

2.2.2 Benes Network

An N-to-N Benes network can be derived from the Clos network by setting s = d = 2 and r = N/2, and recursively decomposing the two middle-stage (N/2) x (N/2) sub-networks [20]. Fig. 2.2b illustrates the structure of an N-to-N Benes network. Thus, the Benes network contains 2 log N - 1 stages of 2-to-2 switch elements. A Benes network can be routed using the looping algorithm [68], which decomposes a given permutation into two sub-permutations that can be routed independently in the sub-networks. Both the Clos network and the Benes network are capable of handling all possible permutations on N inputs [92]; therefore they are also called permutation networks. The Benes network achieves the theoretical lower bound of 2 log_2 N - 1 stages of 2x2 switches [87], and it is also well known for its advantages in timing philosophy, switching methodology, and control strategy [20].
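As a rough quantitative illustration of these cost differences (a back-of-the-envelope sketch; the parameter values below are examples chosen here, not figures used in this thesis), the crosspoint and switch counts can be compared directly:

    import math

    def clos_crosspoints(N, s, d=None):
        """Crosspoint count of a 3-stage Clos network with r = N/s ingress/egress
        switches of size s x d and d middle switches of size r x r.
        Choosing d = s is sufficient for a rearrangeable (permutation) network."""
        d = s if d is None else d
        r = N // s
        return r * s * d + d * r * r + r * d * s   # ingress + middle + egress

    def benes_switches(N):
        """Number of 2x2 switches in an N-to-N Benes network:
        N/2 switches per stage, 2*log2(N) - 1 stages."""
        return (N // 2) * (2 * int(math.log2(N)) - 1)

    N = 64
    print(N * N)                      # 4096 crosspoints for a single N x N crossbar
    print(clos_crosspoints(N, s=8))   # 1536 crosspoints for a 3-stage Clos network (s = d = 8)
    print(benes_switches(N))          # 352 two-by-two switches for the Benes network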
20 Algorithm 1 Radix-4 FFT Algorithm 1: q =N=4;d =N=4; 2: forp := 0 tolog 4 N do 3: fork := 0 to4 p 1 do 4: l = 4kq=4 p ;r =l+q=(4 p 1); 5: tw 1 =w[k];tw 2 =w[2k];tw 3 =w[3k]; 6: fori :=l tor do 7: t 0 =i;t 1 =i+d=4 p ;t 2 =i+2d=4 p ;t 3 =i+3d=4 p ; 8: do parallel 9: f p+1 [t 0 ] =f p [t 0 ]+f p [t 1 ]+f p [t 2 ]+f p [t 3 ]; 10: f p+1 [t 1 ] =f p [t 0 ]jf p [t 1 ]f p [t 2 ]+jf p [t 3 ]; 11: f p+1 [t 2 ] =f p [t 0 ]f p [t 1 ]+f p [t 2 ]+jf p [t 3 ]; 12: f p+1 [t 3 ] =f p [t 0 ]+jf p [t 1 ]f p [t 2 ]jf p [t 3 ]; 13: end parallel 14: do parallel 15: f p+1 [t 0 ] =f p+1 [t 0 ]; 16: f p+1 [t 1 ] =tw 1 f p+1 [t 1 ]; 17: f p+1 [t 2 ] =tw 2 f p+1 [t 2 ]; 18: f p+1 [t 3 ] =tw 3 x p+1 [t 3 ]; 19: end parallel 20: end for 21: end for 22: end for Existing work has mainly focused on optimizing the performance, power and area of the design at the circuit level. An energy-efficient 1024-point FFT processor was developed in [18]. Cache-based FFT algorithm was proposed to achieve low power and high performance. Energy-time performance metric was evaluated at various processor operation points. In [88], a high-speed and low-power FFT architecture was presented. They presented a delay balanced pipeline architecture based on split-radix algorithm. Algorithms for reducing computation complexity were explored and the architecture was evaluated in area, power and timing performance. Based on Radix-x FFT, various pipeline FFT architectures have been proposed, such as Radix-2 single-path delay feed- back FFT [89], Radix-4 single-path delay commutator FFT [21], Radix-2 multi-path delay commutator FFT [73], Radix-2 2 single-path delay feedback FFT [46], and Radix- 2 k multi-path delay forward FFT [45]. These architectures can achieve high throughput per unit area with single-path or multi-path pipelines, but design flexibility is limited 21 and energy efficiency has not been explored in these works. In [66], a parameterized soft core generator for high throughput DFT was developed. This generator can auto- matically produce an optimized design with user inputs for performance and resource constraints. However, energy efficiency is not considered in this work. In [39], the author presented a parameterized energy efficient FFT architecture. Their design is opti- mized to achieve high energy efficiency by varying the architecture parameters. Some energy efficient design techniques, such as clock gating and memory binding, are also employed in their work. The author in [28] extends the work in [39] to identify the effect of both the algorithmic mapping parameters and the architecture binding parameters on energy efficiency through design space exploration. However, in both [39] and [28], only radix-4 FFT algorithm has been considered. Other than FPGA, there are also some techniques for energy efficient FFT presented based on other different platforms [82, 51]. However, it is not clear how to apply these techniques on FPGAs. In this work, we develop a hardware generation framework for FFT which converts decomposition based FFT algorithms into the mathematical rep- resentation of data permutation matrices. Algorithmic level memory and energy opti- mizations are incorporated such that high throughput and energy efficient designs can be obtained on FPGA. 2.3.2 Sorting The sorting problem is, given a large set of (out-of-order) data sequence x 0 ;x 1 ;:::;x N1 , sort the data into a monotonic sequence. Sorting is a key kernel in numerous big data application including database operations, graphs and text analytics. 
Due to low control overhead, parallel bitonic sorting networks are usually employed for hardware implementations to accelerate sorting. Although a typical implementation of 22 Monotonic Sorted Stage 1 of BM network Bitonic Sequence log 2 N stages, each having N/2 comparators x 0 x 7 Stage 1 of BM network (N = 8) N Bitonic N/2 N/2 2 bitonic sequences (a) (b) (c) BM (2) BM (4) BM (2) BM (2) BM (4) BM (2) BM (8) Figure 2.3: Constructing a bitonic merge network: (a) Bitonic merge network forN = 8, (b) Splitting a bitonic sequence into two bitonic sequences, (c) Constructing bitonic sorting network forN = 8 using bitonic merge networks merge sort network can lead to low latency and small memory usage, it suffers from low throughput due to the lack of parallelism in the final stage. Parallel sorting networks such as bitonic sorting networks are widely employed in hardware implementations for sorting due to their high data parallelism and low control overhead. The key building blocks of a bitonic sorting network are the bitonic merge (BM) networks which rearrange bitonic sequences to be ordered. A bitonic sequence is a sequence such that n 0 ::: n k ::: n N1 for some k (0 k N 1), or if it can be circularly shifted to monotonically increase and then monotonically decrease [19]. Throughout this paper, we useN to denote the size of a data sequence to be sorted. Without losing generality, we assume N is a power of two. Given a bitonic sequence of N keys, we can use a column of N=2 comparators to split it into two bitonic sequences of N=2 keys each [19]. Fig. 2.3b shows the first stage of BM network for splitting an 8-key bitonic sequence. By recursively splitting the sequence to be sorted, a BM network can rearrange anN-key bitonic sequence into sorted order 23 P 4,2 P 4,2 P 4,2 P 4,2 P 8,4 P 8,2 Input Output P 4,2 P 4,2 P 4,2 P 4,2 Q 8 Figure 2.4: Permutation patterns in 8-input bitonic sorting network (arrows show the sorting order) in logN stages 1 , where each stage consists of N=2 comparators. Fig. 2.3a shows the BM network for sorting an 8-key bitonic sequence. A bitonic sorting network for N-key sequences can be built using two bitonic sorting networks for N 2 -key sequences and a BM network forN-key bitonic sequences. After recursively applying this rule, a bitonic sorting network can be built with (logN)(logN + 1)=2 stages of comparators; each stage consists of N=2 parallel comparators. Fig. 2.3c presents a simple example forN = 8. Fig. 2.3 shows that in the first stage of the BM network, to sortN-key sequences, P N; N 2 and P N;2 are performed at the input and the output respectively. At the output, two bitonic sequences are generated and are then permuted using PN 2 ; N 4 . Therefore, the permutation pattern at the output can be represented as (I 2 PN 2 ; N 4 )P N;2 = Q N , whereI 2 is the identity matrix and is the tensor (or Kronecker) product [24]. By using the divide-and-conquer method for constructing a bitonic sorting network discussed in Section 2.3.2, we therefore obtain a total of (logN)(logN + 1)=2 permutation patterns having 2 logN unique patterns. All the permutation patterns can be realized using Clos 1 In this paper, all logarithms are to the base 2. 24 network. However, it is infeasible to map Clos network onto hardware directly for a large problem size. Fig. 2.4 shows the permutation patterns in an 8-input bitonic sorting network. Sorting on hardware accelerators has received a lot of attention recently [19, 54, 84, 94, 36]. 
Several sorting algorithms, including merge sort and bitonic sort, have been implemented on parallel architectures [25, 36]. Merge sort can sort a list of size N using a merge tree of depth logN; at the root of the tree no parallelism can be exploited due to serial nature of the algorithm. Bitonic sort exploits high parallelism [54, 84, 94] to sortN elements usingO(log 2 N) stages inO(log 2 N) time usingO(N log 2 N) com- parisons. FPGAs offer a desirable platform for implementation of sorting architectures due to effective trade-off between energy and performance [10, 36]. However, when the problem sizeN is large it is not possible to realize the entire network on hardware due to limited amount of resources on FPGA. In our preliminary work [36], we have demonstrated an FPGA implementation of bitonic sort which achieves a high through- put of p for data parallelism p and problem size N, while incurring 6N memory and 6N=p latency. Our key innovation is to “fold” the bitonic merge network so that it does not requiren data items to be streamed in each cycle; StreamingN data items is unre- alistic even for small datasets. Hardware implementation of bitonic sorting has been extensively studied in the literature, especially in VLSI [38, 64, 84]. 2.3.3 Join Join is one of the most fundamental database operator for relational database manage- ment system [23]. The execution of join operation is potentially very expensive, and yet it is almost required in all practical queries. Join techniques are usually also applicable to other primitive database operations including union, intersection and difference. In join operation, cross product of payload values needs to be generated when there are 25 duplicate matching keys. For example, given two data columns having two entires of Level VR ID d 10 e 20 f 10 10 Jim 20 Ryan VR Level d 10 f 10 e 20 VS Jim Jim Ryan R S VS R equi-join VR=VS S ID 10 10 20 Figure 2.5: Equi-join operation keym and four entries of keym, then the output sequence will have eight entires of key m. In this paper, we are focused on Equi-Join, which is a specific type of comparator- based join using only equality comparisons in the join-predicate. Figure 2.5 depicts the essence of Equi-Join operation. Tuples in R and S are joined to form a new tuple if the attribute value VR in R is equivalent to the attribute value VS in S. The well known algorithms for join include sort-merge join, nested-loop join and hash join [23]. The sort-merge join algorithm can be realized by the sequential execution of sorting, merge- join, and selection operations, described as below: 1) Sorting: given an unsorted data sequence, rearrange the data elements so that the output sequence is in either increasing or decreasing order. 2) Merge-join: given two sorted sequences of fixed-width keys with associated payload values, obtain an output sequence including all the keys that the two sequences have in common, with the payload values. 3) Selection: given a column of data elements stored as an array of equal data width and bit masks of selected elements, the data output are selected elements based on the bit masks. Without performing the sorting, the nested-loop join algorithm joins two data columns by using two nested loops for scanning and merge-join. Block nested-loop join is an improved version of the nested-loop algorithm reducing the memory access cost [23]. Hash join is similar with nested-loop but uses join attributes as hash keys in both R and S. 
26 2.4 FPGA Technology State-of-the-art FPGA devices provide dense logic units, large on-chip memory, and high-bandwidth interfaces for various external memory technologies. As shown in Fig 2.6, most modern FPGA devices are composed of Lookup Tables (LUTs) and on- chip block memory (BRAM) based on the Static Random Access Memory (SRAM) technology. The LUTs can be used to implement any combinational logic; each LUT is paired with one or more flip-flops. The BRAM provides high-bandwidth memory storage with configurable word width to the LUTs. In addition, an SRAM-controlled routing fabric directs signals to appropriate paths to produce the desired architecture. FPGA technology has become an attractive option for prototyping / implement- ing real-time stream processing engines [33, 80, 93]. For instance, Xilinx Virtex- 7 XC7VX1140T FPGA provides above 1 M logic cells, 68 Mb on-chip BRAM and 1100 user I/O pins [13]. FPGA technologies are widely used as commercial platforms (e.g.Xilinx Virtex-6 [13]) or research platforms (e.g. NetFPGA [65]) FPGAs started out as prototyping device, allowing convenient and cost-effective development of glue logic connecting discrete ASIC components. As the gate density of FPGA increased, applications of FPGA shifted from glue logic to a wide variety of high-performance and data-intensive problems, where FPGA devices are deployed in the field as final but still flexible solutions. Because the functionality of an FPGA device is configured by the on-chip SRAM, it can be altered simply by changing the state of the memory bits. This can be useful in cases where the application requires software-like data-dependent processing with ASIC-level line-rate performance. FPGA is a promising implementation technology for computationally intensive applications such as signal, image, and network processing tasks [33, 79, 27]. Increas- ing density and integration of various hardware components, such as low power DSP 27 Figure 2.6: Internal organization of FPGA IP cores and memory blocks, have made FPGAs an a attractive option for implemen- tation of those applications [13]. Compared with the general purpose processors, the state-of-the-art FPGAs have advantages in much lower cost and power with excellent performance, especially for data intensive applications. As the FPGA device offers a vast amount of routing, logic and memory resources, various design choices can be made when mapping a specific application onto FPGA. This remarkable flexibility cre- ates a huge design space, and results in a big challenge in obtaining the energy optimal design when implementing a given application. 28 Figure 2.7: High level abstraction of target FPGA platform 2.5 Research Hypothesis The main focus of this research is to develop innovative algorithmic optimizations for high throughput streaming applications in the emerging landscape of system platforms integrating new memory module and accelerators such as FPGA. We target at a sys- tematic mapping approach to build streaming permutation structures for parallelizing various applications. We formulate the problem as a mapping problem from classic spa- tial permutation networks to streaming permutation networks which reorder streaming data in both spatial and time. The goal of this thesis is to demonstrate that: Classic multi-stage interconnection networks can be utilized to build highly resource efficient streaming permutation net- works to achieve very high throughput for parallelizing various streaming applications. 
The streaming applications that we target include the classic Fast Fourier Trans- form (FFT), sorting kernels and Equi-join, since they are prevalent for various emerging system-level applications. The most important performance metrics we study include overall throughput and processing latency; other metrics such as energy-efficiency, and resource consumption are also investigated. While each of our application accelerators is unique in its algorithmic optimization and architectural mapping, the common RPN-based design methodologies are utilized 29 throughout this thesis. These design methodologies, as well as the techniques with which they are applied, can be useful to other research problems, enabling a broader class of performance improvement. We further explain the following terms in this hypothesis: FPGA: The high level abstraction of the target Field-Programmable Gate Array (FPGA) platform is shown in Fig. 2.7. FPGAs have LUT slices which can be configured as logic or distributed RAMs (Dist. RAM) as well as larger memory blocks called Block RAMs (BRAM). Performance: With respect to scalability, latency, throughput and power. Latency: The time interval (in clock cycles) between when the first input data point is fed in and when the first transformed data point is output. Throughput: The number of data outputs produced per second while processing streaming data (in Gbits/s). Power: The average energy consumption (in Watt or Joule/s) per unit of time. Operating Frequency: The given frequency to FPGA device during run-time. Maximum Frequency: The maximum achievable frequency for an FPGA design to ensure correct functionality. Energy Efficiency: (or power efficiency): is defined as the number of bits of the outputs per unit energy dissipated (Gbits/Joule) by the design and is calculated as the throughput divided by the average power consumed by the design. Memory Efficiency: measured as the throughput achieved divided by the amount of on-chip memory used by the design (in bits). 30 Chapter 3 Streaming Permutation Network In this chapter, we present our proposed universal RAM-based permutation network to the problem of data permutation on streaming data. We present the divide-and-conquer based mapping algorithm for constructing the proposed RPN. We also develop several high-level optimizations to improve the resource efficiency of RPN. 3.1 Universal RAM-based Permutation Network 3.1.1 Related Work Existing work on hardware implementation of data permutation in the literature are mostly focused on a specific family of permutations such as stride permutations, bit- reversal, and bit-index permutations [29, 48, 59, 61]. In [88], delay commutators are developed to perform stride permutation in the pipelined design for split-radix FFT. The folded FFT architecture achieves high computational performance per unit area. In [29], streaming permutation structures for a subset of stride permutations are pro- posed to obtain high performance hardware implementations for the fast Fourier trans- form. In [47], a design approach to achieve streaming implementations of stride per- mutations is developed. Memory-based permutation networks are considered to real- ize stride permutations for large problem sizes. Their proposed design achieves high resource efficiency, however, requiring a large amount of independent dual-port mem- ory blocks. 
By expanding this work, a streaming structure is proposed to perform any fixed permutation which should be a linear mapping on the bit representation of the 31 data locations [71]. Automatic generation tools employing their proposed permutation networks have been published and available on [9]; their tool is capable of generating hardware designs described using high level hardware description language for applica- tions including FFT, sorting, and LPDC. Their design supports problem size and data parallelism that are powers of two. An extension of this approach is presented in [59]. This design is capable of performing arbitrary fixed permutation for a given data par- allelism. However, hardware implementation of their design is expensive with respect to memory. To optimize the memory efficiency, a memory optimized design is devel- oped in [31] for arbitrary fixed permutation, however, interconnection complexity has not been considered. Compared with the design in [59], our proposed design supports arbitrary fixed permutation with half memory consumption, furthermore, we propose a design technique such that the interconnection logic can be highly reduced [34]. To the best of our knowledge, our work is the first one to build the connection between the classic multi-stage interconnection network and the architecture for per- mutation on streaming data. Specifically, our main contributions are the following: Systematic Mapping Approach: We propose an systematic mapping approach to obtain a streaming architecture for arbitrary fixed permutation using multi-stage interconnection networks. The constructed streaming architecture is parameteriz- able with respect to data parallelism, problem size, and data width. Theoretical Results: We theoretically prove that our approach is able to construct an RPN for arbitrary fixed permutation. Memory Optimization: We develop a memory optimization algorithm for RPN. This algorithm enables the use of single-port memory rather than dual-port mem- ory and reduces the memory consumption by 50%. 32 Interconnection Optimization: We develop a heuristic algorithm in our mapping approach, such that the number of switches can be minimized in RPN. We conduct detailed experiments on a state-of-the-art FPGA device. Post place-and- route results show that our architecture demonstrates 1.3x1.6x improvement in energy efficiency and 1.5x5.3x better memory efficiency compared with the state-of-the-art designs. 3.1.2 Problem Definition and Notations To formulate this problem, we define thestreamingpermutation with data parallelismp (1pN) as follows: an input sequence is partitioned intoN=p blocks, each contains p consecutive elements. In each cycle, a block ofp elements are fed as input. After some delayt depending on the givenN andp, the input sequence is reordered as specified by the given fixed permutation and output overN=p consecutive cycles where in each cycle, consecutive p elements of the permuted input are output. Computations on streaming data can be easily performed by time multiplexing the key computation units such as comparison units in sorting or butterfly computation units in FFT. However, designing a specialized hardware block for permuting streaming data is challenging as data elements need to moved across the temporal bound, i.e., a data element previously fed in input cyclei needs to be popped out in output cyclej andi6=j. 
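The following Python sketch is a purely behavioral reference model of this streaming permutation, not the hardware datapath: it buffers an entire N-element sequence and then emits the permuted sequence p elements per cycle, which makes the block-wise timing, and the movement of an element across the temporal boundary, easy to see.

# Behavioral reference model of a streaming permutation: p elements enter per
# cycle and, after some delay, the permuted sequence leaves p elements per
# cycle. This model simply buffers a whole sequence; the hardware achieves the
# same input/output behavior with far less storage.

def stream_permute(sequence, perm, p):
    """sequence: list of N elements; perm[i] is the input index routed to
    output position i (y[i] = x[perm[i]]); p divides N."""
    n = len(sequence)
    assert n % p == 0 and len(perm) == n
    buffered = []
    for cycle in range(n // p):                       # one block per input cycle
        buffered.extend(sequence[cycle * p:(cycle + 1) * p])
    permuted = [buffered[perm[i]] for i in range(n)]
    return [permuted[c * p:(c + 1) * p] for c in range(n // p)]  # output blocks

# Example: stride-by-2 permutation of 8 elements with data parallelism p = 2.
out_blocks = stream_permute(list(range(8)), [0, 2, 4, 6, 1, 3, 5, 7], 2)
# Element 1 enters in input cycle 0 but leaves in output cycle 2, i.e., it
# crosses the temporal boundary mentioned above.
assert out_blocks == [[0, 2], [4, 6], [1, 3], [5, 7]]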
This feature requires register files or memories to be employed such that a data element can be buffered for a specific amount of delay. 3.1.3 Algorithmic Technique and Theory Datapath Generation: Our design is obtained by utilizing the classic Clos network and Benes network introduced in Section 2.2 to generate its datapath logic. We continue 33 … … … … … Switches Switches Switches … … … … … … S 2 S 1 S 2 S 1 S 1 N N Output stage Switch S 1 x S 1 Switch S 1 x S 1 Switch S 2 x S 2 Switch S 2 x S 2 S 1 x S 1 Switch Switch S 1 x S 1 Middle stage Input stage SPN ... … S 1xS 1 Connection Network S 1xS 1 Connection Network Clos Network ... ... Figure 3.1: Folding the Clos network into an SPN to employp andN to denote the data parallelism and problem size, respectively. N is supposed to be divisible by p. The solutions given by [71], [78] and [59] require p to be a power of two. Compared to their approaches, our technique supportsp to be an arbitrary non-negative integer. p 6= 2 m ;N 6= 2 n : We first introduce our datapath generation approach considering both p and N are not powers of two. As shown in Fig. 3.1, we propose an SPN by vertically “folding” the Clos network to perform the data permutations on streaming data. Data permutation on streaming input has been recently studied [71, 78]. Com- pared with previous works [71, 78, 59, 48], our mapping approach build the connection between the classic permutation network and the streaming permutation network, and demonstrate higher memory and resource efficiency for realizing streaming permuta- tion(Section 2.2). Fig. 3.1 shows the key idea of our datapath generation approach using the Clos network whered =s =S 1 ; r =S 2 . d;s andr are design parameters of 34 theN-toN Clos network introduced in Section 2.2. Then we haveN = S 1 S 2 , and a Clos network can be built to be rearrangeably non-blocking ifS 2 S 1 [40]. In this case, the Clos network can realize any given permutation between its input and output. In Fig. 3.1, the Clos network is employed to construct an SPN with a data parallelism ofp (p = S 1 ;N=p = S 2 ). The SPN is composed of twoS 1 S 1 connection networks at its input and output stages, andS 1 memory blocks at the middle stage. We use per- mutation in space to represent data permutation performed by switch elements, while permutation in time is defined as permuting temporal order of data elements in a given data sequence. As shown in Fig. 3.1, in the SPN, permutation in space is performed at the input and output stage by twoS 1 S 1 connection networks, and permutation in time is executed in the middle stage byS 1 independent memory blocks, each having a memory size ofS 2 . In the SPN, a memory conflict is said to occur if concurrent read or write access to more than one word in a memory block is performed in a clock cycle. Considering the constraint for Clos network to be rearrangebly non-blocking, we have theorem below: Theorem 3.1.1. With a data parallelism ofS 1 , the proposed SPN can realize any fixed permutation on streaming input of anN-key data sequence without any memory conflicts usingS 1 memory blocks, each of sizeS 2 , whereS 2 S 1 andN =S 1 S 2 . Proof. In the Clos network shown in Fig. 3.1, whereN =S 1 S 2 , eachS 1 S 1 crossbar switch at the ingress or egress stage has exactly one connection to each of theS 1 S 2 S 2 crossbars switch at the center stage. Similarly, there is exactly one connection between each center stage crossbar switch and each ingress or egress stage crossbar. 
This network is rearrangeably non-blocking whenS 1 S 2 and thus can realize arbitrary permutation between its input and its output [40]. Theorem 4.2.1 is justified if assuming any fixed connections on this network can also be realized by the proposed SPN shown in Fig. 3.1. To verify this assumption, we first introduce the definition of streaming data array. To 35 formalize this definition, we borrow the definition of reshape function used in Matlab for data array transformation [5].Y =reshape(X; [a;b]) reshapes data vectorX into a a-by-b matrix. Definition Given a one dimension data vectorA of sizeN, a streaming data arrayA p can be defined by A p = reshape(A; [N=p;p]) where p is streaming width (i.e., data parallelism), such thatX p (i; :) (ith row ofX p , 1iN=p) is streamed in/out in cycle i We assume the input and the output of the Clos network to beX andY , respectively. We use streaming data arrays X p and Y p to denote the input and the output of SPN, respectively. Therefore, we can justify Theorem 4.2.1 using Lemma below: Lemma 3.1.2. For any give permutation matrixP N , when Clos network is configured to realizeY =P N X and assumeX p =reshape(X; [N=p;p]), SPN can be dynamically configured such thatY p =reshape(Y; [N=p;p]) To justify this lemma, we usel i andr i (1iN=q) to denote theithS 1 S 1 switch at the ingress and egress stage of the Clos network, respectively. Similarly,m j (1j q) represents the jth S 2 S 2 switch at the center stage of the Clos network. We can dynamically configure the connection of the input stageS 1 S 1 connection network in SPN to be the same of thel i in the Clos network in input cyclei. Based on the definition of streaming data array, the permutation of the ingress stage in the Clos network can be realized equivalently on streaming data by the input stage of SPN. Likewise, theS 1 S 1 connection network at the output stage of SPN can also be dynamically configured to implement the permutation of the egress stage of the Clos network. Given a sequence of memory addressesA and another sequenceB obtained by reorderingA, permutation in time can be accomplished by a memory block by writing a data vector using sequence A and then read out the data vector using sequenceB. For the permutation performed by jth center stage switch in the Clos network, we can pre-compute A and B such 36 that it is equivalently realized by jth memory block at the middle stage of SPN. As introduced in Section 2.2 about the feature of the inter-stage connection in the Clos network, each ingress/egress stage switch has exactly one connection with one of the center stage switch. Therefore, the input of the jth memory block in SPN is exactly the input of thejth center stage switch in the Clos network. In sum, any permutation performed by the Clos network can be realized by the SPN using the above approach. Therefore, Lemma 3.1.2 is proved. Thus Theorem 4.2.1 is justified using Lemma 3.1.2. With a data parallelismp =S 1 , we say an SPN having the control information to per- form all the required interconnection patterns in an application is programmable. Here programming refers to switching the context of control information to realize different permutation patterns during run-time. Theorem 4.2.1 states that SPN is programmable ifS 2 S 1 . AsS 2 = N=S 1 , this constraint can be easily satisfied for a largeN and a smallS 1 , which is expected to be the most cases in reality. The control information for the SPN can be easily obtained using a routing algorithm for Clos network [40]. 
This feature of SPN is non-trivial as any of the previous optimizations on routing algorithms for Clos network can be reused for realizing SPN. In this paper, we adopt a well known routing algorithm for Clos network to obtain all the control information for SPN [20]. In the second stage of SPN, each memory block can be implemented with single-port memory to permute a single data sequence. However, when processing continuous data streams, dual-port memory is required as concurrent read and write access to differ- ent memory locations need to be performed. An algorithm which enables the use of single-port memory for processing continuous data streams is introduced next. p = 2 m ;N = 2 n : Now we assume bothp andN are powers of two. To illustrate our datapath generation idea, we have the following notations used in Figure 3.2: 37 … l 00 l 01 l 0(N/2-2) l 0(N/2-1) r 00 r 01 r 0(N/2-2) r 0(N/2-1) … … … … … … … … … … … … … … … … … m 0 m p-1 … … … … … … … … m 1 … … … … l 10 ~l 1(N/p-1) r 10 ~r 1(N/p-1) (a) AnN-to-N Benes NetworkB N;p decomposed forlogp times L 00 L 01 L 0d … L s0 L s1 L sd … W 0 W s … M 0 M 1 … M p-1 R s0 R s1 R sd … R 00 R 01 R 0d … W ’ s W ’ 0 … … … LB RB *s = log p - 1, d = p/2 - 1 (b) Generated datapath of our design denoted as streaming permutation network SPN N;p Figure 3.2: AnN-to-N Benes Network can be decomposed into twoN=2-input Benes subnetworks, the Benes network in (a) has been recursively decomposed for logp times, (b) shows the generated datapath by vertically folding the Benes network in (a). B N;p : AnN-to-N Benes network by recursively decomposing the Clos network for logp times (see Fig. 2.2b).B N;p consists of 2logp-2 stages of 2-to-2 switches andp middle stage sub networks. l ij : A 2-to-2 switch at theith input stage ofB N;p (see Fig. 3.2a), i = 0; 1;:::;s, s = logp 1,j = 0; 1;:::;N=2 1. 38 r ij : A 2-to-2 switch at the stage lyingi stages before the last stage ofB N;p ,i = 0; 1;:::;s,j = 0; 1;:::;N=2 1. m i : TheN=p-input sub-network at the middle stage ofB N;p ,i = 0; 1;:::;p 1. LB: Thep-input switch box consisting of 2-input switchesL ij at the input stage of our design denoted asSPN N;p (see Fig. 3.2b),i = 0; 1;:::;s,j = 0; 1;:::;p1. W i /W 0 i : Thep-to-p interconnection through wires,i = 0; 1;:::;s. RB: Thep-input switch box consisting of 2-input switchesR ij at the output stage ofSPN N;p ,i = 0; 1;:::;s,j = 0; 1;:::;p 1. M i : TheN=p-entry memory block at the middle stage of our design. A:P : A permutation matrix representing the permutation performed by a switch or a switch boxA. X;Y :X andY are the input vector and the output vector ofB N;p , respectively. X 0 ;Y 0 : X 0 andY 0 are the input vector and the output vector ofSPN N;p , respec- tively. Let a one dimensional data vector flows into the SPN N;p in N=p cycles. As a result, the one dimensional matrix can be denoted using a 2-dimensional matrix whose dimen- sions are composed of space and time. The resulted output vector is also produced by SPN N;p in N=p cycles. We use function reshape to accomplish this transform. For example, assuming X = (1; 2; 3; 4; 5; 6; 7; 8) T , p = 2, and X ` =reshape(X;p), then X ` = (1; 2; 3; 4; 5; 6; 7; 8) T . AssumingY =B N;p :PX, then we have Theorem 3.1.3. For a given fixedp, a permutation patternB N;p :P performed byB N;p and X ` =reshape(X;p), SPN N;p can be configured such that Y 0 = B N;p :PX 0 and Y ` =reshape(Y;p) whenW i :P = I 2 i P p=2 i ;2 andW 0 i :P = I 2 i P p=2 i ;p=2 i+1 (0 i logp 1). 39 Proof. 
Whenp = 2,SPN N;2 is composed ofL 00 ,R 00 ,M 0 andM 1 ; W 0 = W 0 0 = I 2 . B N;2 has two sub-networks includingm 0 andm 1 . LetXa =PIX, and PI = 0 B B B B B B B @ l 00 :P l 01 :P ::: l 0(N=21):P 1 C C C C C C C A (3.1) Let Xa = fXa 0 ;Xa 1 ;::;Xa N=21 g T , where Xa i = (Xa[2i], Xa[2i + 1]). When feedingX intoSPN N;2 inN=2 cycles, configureL 00 such thatL 00 :P =l 0j :P in cycle j (0 j N=2 1). Then the output of LB at cycle j will be Xa i . Let Xb = (Xa[0];Xa[2];:::;;Xa[N=2 2]) andXc = (Xa[1];Xa[3];:::;;Xa[N=2 1]) be the input vector ofm 0 andm 1 , respectively. To verify our theorem, we have the definition below: Definition LetY =PX, after writing anN-entry memoryM such thatX[i] is stored inith location (0in 1) ofM, if then readingM with dataY k (0kn 1) in cyclek, we say the permutationP is performed byM onX temporally. AsW 0 :P = I 2 , in cyclej, the data written intoM 0 will beXb[j], and the data written intoM 1 will beXc[j]. LetYb = m 0 :PXb andYc = m 1 :PXc. Based on Defini- tion 3.1.3, after writingM 0 withXb, if readingM 0 with dataYb[k] (0kN=2 1) in cycle k, the permutation m 0 :P can be performed by M 0 on Xb temporally. Simi- larly, the permutationm 1 :P can be performed byM 1 onXc temporally. Thus, injth (0 k N=2 1) cycle afterRB starts to receive input data, the input vector ofRB 40 would be (Yb[j],Yc[j]) denoted asYa j which is also the input vector of the switchr 0j . LetYa =fYa 0 ;Ya 1 ;:::;Ya N=21 g and PO = 0 B B B B B B B @ r 00 :P r 01 :P ::: r 0(N=21):P 1 C C C C C C C A (3.2) then Y = POYa. Let Y = fY 0 ;Y 1 ;::;Y N=21 g, where Y i = (Y [2i], Y [2i + 1]) (0kN=2 1). WhenRB output data results inN=2 cycles, we can configureR 00 such thatR 00 :P = r 0j :P in cyclej (0 k N=2 1), thus the output data vector of RB in cyclej isY j . As a result,Y 0 =fY 0 ;Y 1 ;::;Y N=21 g =Y . Thus, Theorem 3.1.3 is proved forp = 2. To prove Theorem 3.1.3 for anyp, we have the lemma below: Lemma 3.1.4. For a given fixedp, ifSPN N=2;p=2 can be configured to perform arbitrary B N=2;p=2 :P , thenSPN N;p can be configured to perform arbitraryB N;p :P . Proof. Letm 0 andm 1 be the upper subnetwork inB N;2 . Then any permutationm 0 :P or m 1 :P can be performed byB N=2;p=2 . LetSPN0 N=2;p=2 realizesm 0 :P andSPN1 N=2;p=2 realizesm 1 :P . Again, we can add switches includingL 0k andR 0k (0 k p=2 1) to perform the permutation patterns of l 0j and r 0j in B N;2 . We still use PI/PO to denote the permutation performed by l 0j /r 0j , Xa = PIX, and Y = POYa. We can add interconnections WI = P p;2 and WO = P p;p=2 such that the output vector ofL 0k is the input ofWI, and the output vector ofWO is the input ofR 0k . Connect WI with SPN0 N=2;p=2 and SPN1 N=2;p=2 such that the first half output vector of WI is the input of SPN0 N=2;p=2 and the second half output vector of WI is the input of SPN1 N=2;p=2 . Connect WO with SPN0 N=2;p=2 and SPN1 N=2;p=2 such that the first half input vector ofWO is the output ofSPN0 N=2;p=2 and the second half input vector 41 0 1 2 3 4 5 0 2 4 6 1 3 6 7 5 7 M 0 M 1 6 7 4 5 2 3 0 1 5 7 1 3 4 6 0 2 Time Multiplexing Cycle 0,3,4,7→ 0,4,3,7 1,2,5,6→ 2,6,1,5 m 0 r 0 r 1 r 2 r 3 m 1 l 0 l 1 l 2 l 3 L R Time Multiplexing Li 0 Li 1 3 1 0 2 Lo 0 Lo 1 Cycle 7 5 4 6 Figure 3.3: An example of generatingSPN n;2 of WO is the output of SPN1 N=2;p=2 . As a result, in each cycle, p=2 data elements are fed intoSPN0 N=2;p=2 andSPN1 N=2;p=2 , respectively. Such a design consisting of L 0k ,R 0k ,WI,WO,SPN0 N=2;p=2 andSPN1 N=2;p=2 is able to perform any permutation B N;p :P . 
Based on the description above, such a design can be found to be SPN n;p described in Theorem 3.1.3. Based on the proof for Theorem 3.1.3 forp = 2 and Lemma 3.1.4, Theorem 3.1.3 holds for anyN when given a fixedp. Without losing generality, we assumeN andp to be a power of 2. Lemma 3.1.5. An N-to-N Benes network B N;p can rearrangebly perform arbitrary fixed permutation [20]. 42 Theorem 3.1.6. For a given fixed p (a divisor of N), SPN N;p can be configured to realize any given permutation on streaming input of anN-input data vector without any memory conflicts usingp memory blocks, each of sizeN=p. Proof. Based on Lemma 3.1.5,B N;p is able to perform any fixed permutation between its input and output. Based on Theorem 3.1.3, any permutationB N;p :P performed by B N;p can be realized by configuring SPN N;p . Based on the definition of SPN N;p , in each cycle, only one data is written into or read from the memory block M i . Thus, Theorem 3.1.6 is proved. Fig. 3.3 shows how a permutation network withp = 2 is generated using a 8-to-8 Benes network as the input. The Benes network is routed to perform stride permutation P 8;2 . To realizeP 8;2 , the permutation network needs two 2-to-2 switches and two 4-entry memory blocks. 3.1.4 Parameterized Architecture Fig. 4.15 shows the overall architecture of the streaming design. Design parameters include problem size N and data parallelism p. The datapath consists of two p-to-p connection networks, one memory array includingp independent memory blocks, each of size N=p. The control unit has two configuration tables to reconfigure the connec- tion networks dynamically, and one address generation unit (AGU) for memory access. Each connection network has logp stages of 22 switches. Each stage hasp=2 switches. Thus, a connection network utilizes (p=2) logp 2 2 switches, which is asymptotically optimal compared with state-of-the-art [71, 29, 59]. We consider using two different memory architectures: each memory has one read port and one write port, or each memory has a concurrent read-before-write port (two data ports sharing one address port). The connection network is run time configured each clock cycle based on the 43 … p-to-p connection network p-to-p connection network Memory Block Control Unit … … 2p · log (n/p) p/2 · log p p/2 · log p Data Path Stage 0 Stage 1 1 Stage 2 1 p dual-port memory blocks, each of size 2n/p Configuration table Configuration table Address update unit … … Figure 3.4: Proposed permutation network configuration table. Each table has N=p configurations, each having (p logp)=2 bits. The configurations are statically generated for a given fixed permutation. The latency of connection network is determined by the pipeline depth chosen for implementation. As the data flows through the streaming design shown in Fig. 3.4, they are first permuted by the connection network at the input. Then the permuted data flows are written into the p memory blocks in N=p clock cycles, and read out of the memory blocks in the subsequentN=p clock cycles. All the memory addresses are generated dynamically by the AGU. Finally, data flows are permuted by the connection network at the output. 44 3.1.5 Interconnection Optimization Fig. 3.4 shows the our complete design having a three-stage structure, including two stages ofp-to-p connection networks (LB andRB) and one memory stage ofp inde- pendent memory blocks (M i ,p 1i 0). Each connection network is composed of logp stages, each stage consists ofp=2 parallel 2-to-2 switches. 
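As a quick sanity check of these structural counts, the small Python sketch below evaluates them for a given (N, p); the function name is illustrative, and the doubling of each memory block to 2N/p for continuous data streams is explained just below.

# Structural parameters of the SPN datapath just described, as a function of
# problem size N and data parallelism p (both powers of two here).
import math

def spn_structure(n, p, continuous_streams=True):
    stages_per_network   = int(math.log2(p))
    switches_per_network = (p // 2) * stages_per_network   # 2-to-2 switches
    control_bits         = switches_per_network            # one bit per switch
    mem_blocks           = p
    words_per_block      = 2 * n // p if continuous_streams else n // p
    return switches_per_network, control_bits, mem_blocks, words_per_block

# Example: N = 1024, p = 8 -> 12 switches and 12 control bits per connection
# network, and 8 memory blocks of 256 words (2N/p for continuous streams).
assert spn_structure(1024, 8) == (12, 12, 8, 256)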
Therefore, (p logp)=2 control bits are required to determine the connection pattern of each connection net- work. These control bits are updated using a configuration table storing all the control information generated during design time. An address update unit is employed to update the memory addresses of thep memory blocks. Each memory block has one read port and one write port. To support processing continuous data inputs, each memory block needs to double the capacity (n=p) to be 2n=p to enable simultaneous read and write. Thus, 2p memory addresses (each of width logn=p) are needed each clock cycle. The interconnection complexity is defined as the interconnection area per through- put. The interconnection area of the proposed permutation network is mainly determined by the area of the twop-to-p connection networks. The key observation is that in ap- to-p connection networkLB orRB, it is possible that a 2-to-2 switch realize the same permutation in each cycle. Therefore, assuming the permutationL ij :P orR ij :P is fixed in every cycle, hardware wires can be used instead forL ij orR ij . In this way, the logic consumption of a 2-to-2 switch can be saved. Such a switch is called as a null switch. As introduced in Section 3.1.3, the proposedSPN is generated using a routed Benes network. Thus, we propose a heuristic routing algorithm for the Benes network such that in theSPN for realizing a fixedn-to-n permutationP , a maximum number of null switches can be obtained. The configuration bits of the routed Benes network are then taken as input to generate the configuration tables and the address update unit in the control unit ofSPN shown in Fig. 3.4. 45 Input switch box Output switch box a 0 =0 a 1 =1 Input x 1 x 0 Output y 1 y 0 b 0 =0 b 1 =1 a 3 =0 a 2 =1 x 3 x 2 y 3 y 2 b 3 =0 b 2 =1 pass pass cross cross Figure 3.5: Configuration bits of switch box in different states A 2-to-2 switch in a Benes network will be either in pass state or cross state, which can be represented by a Boolean variable. Letx andy be the set of inputs and outputs of ann-to-n Benes network respectively. Fig. 3.5 shows the values of the configuration bits of a switch in different states, wherea i is the configuration bit ofx i , andb i is the configuration bit ofy i . As shown in Fig. 3.5, ifx i is forwarded to the first output (upper side) of the switch, thena i = 0, otherwisea i = 1. Similarly, ify i is forwarded from the first input of the switch, thenb i = 0, otherwiseb i = 1. Let: x!y be an one-to-one input-output mapping specifying the given fixed permutation, such thaty (i) = x i . For example, for the stride permutation shown in Fig. 1.6,(0) = 0,(1) = 2,(2) = 4;:::, etc. Algorithm 3 shows the heuristic algorithm for routing the Benes network. This algo- rithm takes as input the data parallelismp, the input and output vectorx andy (both of sizen). It first calls the procedure EER shown in Algorithm 2, which routes the first and last stages of ann-to-n Benes network for realizing the connection pattern specifying : x! y. The EER procedure returns four vectors including a;b;x 0 ;y 0 . a and b are data vectors storing the configuration bits of 2-to-2 switch boxes in the first and last stages of Benes network, respectively. x 0 is the resulting data vector after x bypasses the first stage of the Benes network using a for configuration. 
y 0 is the resulting data vector by forwarding y reversely in the last stage of the Benes network using b for 46 Algorithm 2 End-to-End Routing Algorithm (EER) 1: procedure EER(x;y;i) 2: a i 0 3: while !(routing complete) do 4: b (i) a i 5: j 0 6: if i %2 = 0 then 7: b (i)+1 a i 8: selectj such that(j) =(i)+1 9: a j b (i)+1 10: else 11: b (i)1 a i 12: selectj such that(j) =(i)1 13: a j b (i)1 14: end if 15: ifx i andx j belong to the same switch box then 16: randomly select a newi and initializea i ifa i = NULL 17: else 18: i j 19: end if 20: end while 21: x 0 forwardx usinga for configuration 22: y 0 forwardy reversely usingb for configuration 23: returna;b;x 0 ;y 0 24: end procedure configuration. x 0 andy 0 determine the connection patterns to be realized by the subnet- works in the middle stage. The RT procedure will be repeatedly called to decompose the problem into smaller subproblems. After running algorithm 3, we obtain A k and B k (0 k 2p 3) including all the required configuration bits which determine the connection patterns to be realized by the twop-to-p connection networks in theSPN. 2p data vectors, includingX k andY k for some values ofk, are obtained in the recursive calls wherep = 1 (not the original given data parallelism). These vectors are then used to generate the memory addresses of thep memory blocks. Assuming for someX k and Y k ,X k = 0; 1; 2;:::;n=p, let k :X k !Y k represents one-to-one input-output mapping indicating connection request. A memory block to be written with a word using address X k [j] in cycle j(0 j n=p 1), will then be read using address v k [j] in cycle j +n=p. v k is called as the memory address vector includingn=p memory addresses, each having a bit width of log(n=p). For 0in=p 1,(v k [i]) =X k [i]. If assum- ing we store all the memory addresses using on-chip memory, then the total memory 47 Algorithm 3 Heuristic Routing Algorithm (HR) 1: Global variables:k 0,A i ;B i ;X i ;Y i Null,i = 0;1;:::;2p3 2: procedure RT(x;y;p;i) 3: (a;b;x;y) EER(x;y;i); 4: X k x;Y k y,A k a;B k b 5: k k+1; 6: ifp> 1 then 7: s size ofX k 8: x (X k [0];X k [1];:::;X k [s=21]) 9: y (Y k [0];Y k [1];:::;Y k [s=21]) 10: i rand()%s 11: RT(x;y;p=2;i); 12: x (X k [s=2];X k [s=2+1];:::;X k [s1]) 13: y (Y k [s=2];Y k [s=2+1];:::;Y k [s1]) 14: i rand()%s 15: RT(x;y;p=2;i); 16: else 17: return 18: end if 19: end procedure 20: procedure HR(x;y;p) 21: r 0;u 0;umax 0;result Null 22: whiler<runtimes do 23: Obtain configuration bits forl ij andr ij through RT(x;y;p;r++%n); 24: fori = 0 top=21 do 25: forj = 0 tologp1 do 26: Check ifL ij is anull switch 27: Check ifR ij is anull switch 28: Update the number ofnull switchesu 29: end for 30: end for 31: ifumax ==0 then 32: umax u 33: else ifumax<u then 34: umax u 35: result Configuration bits forL ij andR ij and memory address vectorsv k for memoryM k ,k =0;1;:::;p1 36: end if 37: end while 38: returnresult 39: end procedure consumption of a memory address vector is n=p log(n=p) bits. The memory address vectors are stored in the address update unit. For stride permutation or bit reversal, instead of storing the memory addresses using on-chip memory, the address update unit will dynamically update the memory address based on some initial memory address to save memory resource consumption. RT procedure will be called byruntimes times to search for the optimal solution of RT(x;y;p;i) such that the maximum number of null switches recorded byumax can be obtained using the mapping approach introduced in Section 3.1.3. 
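Two bookkeeping steps of this procedure are easy to illustrate in software: deciding from its per-cycle configuration bits whether a switch is a null switch, and deriving the read-address vector v_k from the write addresses X_k and the mapping σ. The Python sketch below shows both with made-up example values; it is a simplified illustration, not the generator's implementation.

# (1) A switch whose configuration bit is identical in every cycle is a "null
#     switch" and can be replaced by fixed wires.
# (2) A word written to a memory block with address X_k[j] in cycle j is read
#     back with address v_k[j], where sigma(v_k[i]) = X_k[i].

def is_null_switch(config_bits_per_cycle):
    """config_bits_per_cycle: the switch's control bit in each cycle."""
    return len(set(config_bits_per_cycle)) == 1

def read_address_vector(x_k, sigma):
    """sigma: dict mapping input index -> output index of the fixed
    permutation. Returns v_k with sigma[v_k[i]] == x_k[i] for every i."""
    inverse = {dst: src for src, dst in sigma.items()}
    return [inverse[a] for a in x_k]

# Toy values (illustrative only): sigma is the stride mapping with
# sigma(0)=0, sigma(1)=2, sigma(2)=4, ..., as in the example above.
sigma = {0: 0, 1: 2, 2: 4, 3: 6, 4: 1, 5: 3, 6: 5, 7: 7}
x_k = [0, 1, 2, 3]
v_k = read_address_vector(x_k, sigma)
# (In the real design x_k and v_k index a single n/p-entry block; the values
#  here only illustrate the relation sigma(v_k[i]) = X_k[i].)
assert all(sigma[v] == a for v, a in zip(v_k, x_k))

assert is_null_switch([1, 1, 1, 1])        # constant state -> wires suffice
assert not is_null_switch([0, 1, 0, 1])    # state changes -> keep the switch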
48 3.1.6 Memory Optimization Reducing memory consumption for data permutation is crucially important for improv- ing performance. To illustrate our proposed memory optimization technique, we define permutation in space as data permutation performed by wires or switches in one clock cycle;permutation in time is defined as permuting a data sequence temporally through a memory block. For a permutation in spaceP m ,y =P m x andx(y) is the input(output) vector. Given a memory block of sizem, we can writex into the memory serially, and then read the stored data out likewise, such thaty[i] is output atith output cycle. Thus, we say the memory block realizes the permutationP m temporally. Each memory block in the RPN can be implemented with single-port memory to permute a single data sequence temporally. However, when processing continuous data sequences, write operation on a new data sequence is performed simultaneously with read operation on the current data sequence. Therefore, dual-port memory with double size is required as concurrent read and write access to different memory locations need to be performed. To reduce the memory consumption, we develop an in-place permu- tation in time algorithm, which enables the use of single-port memory for processing continuous streaming inputs consisting of multiple data sequences. Fig. 3.6 illustrates how permutation in time on continuous data streams is performed. Continuous input data sequences includingx 0 ,x 1 ,...,x i ;::: (each of length four) are suc- cessively permuted temporally. For each data sequence x j (j 0), a permutation of [x j0 ;x j1 ;x j2 ;x j3 ] ! [x j0 ;x j3 ;x j2 ;x j1 ] is performed. Fig. 3.6a shows the permuta- tion in time through a dual-port memory, which has been divided into two separate memory partitions. The two memory partitions are alternately read and written dur- ing consecutive time periods. For each data element x jk , k represents the temporal order of x jk in x j when it is written into a memory partition. We can see that the data elements in each data sequence are reordered in time. In each cycle, read and 49 x 00 x 01 x 02 x 03 x 10 x 11 x 12 x 13 Time x 00 x 03 x 02 x 01 … x i0 x i1 x i2 x i3 x (i-1)0 x (i-1)3 x (i-1)2 x (i-1)1 x 00 x 01 x 02 x 03 x 10 x 11 x 12 x 13 x 00 x 03 x 02 x 01 … x i0 x i1 x i2 x i3 x (i-1)0 x (i-1)3 x (i-1)2 x (i-1)1 x (i+1)0 x (i+1)1 x (i+1)2 x (i+1)3 x i0 x i3 x i2 x i1 (a) (b) Time … x (i-1)0 x (i-1)1 x (i-1)2 x (i-1)3 x (i-2)0 x (i-2)3 x (i-2)2 x (i-2)1 … … … Figure 3.6: Permutation in time on 4-key data sequences: a) using a dual-port memory of size eight b) using a single-port memory of size four write operations are executed concurrently using different memory addresses. Fig. 3.6b shows permutation in time on a single port memory using our proposed in-place algo- rithm. Read and write operations are performed simultaneously using one address in each cycle. When permuting a data sequence, if address sequencef0; 1; 2; 3g is used for memory access, then for the next data sequence, address sequencef0; 3; 2; 1g will be used. address sequencef0; 1; 2; 3g andf0; 3; 2; 1g are alternatively employed for mem- ory access. These two address sequences are used alternately for permuting continuous data streams. To dynamically generate the two address sequences, each of length 4, we can employ a 2-bit up counter and a 2-bit down counter. Note that the single-port memory should support concurrent read-write operation (read first) in a single memory access cycle [14]. 
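The behavior of Fig. 3.6(b) can be modeled in a few lines of Python. The sketch below assumes a single-port, read-before-write memory of size four and alternates the address sequences {0,1,2,3} and {0,3,2,1} exactly as described above; it is a behavioral sketch of this particular 4-key permutation, not a general implementation.

# In-place "permutation in time" on a single-port memory with read-before-write
# access. Each 4-element sequence x_j is reordered to [x_j0, x_j3, x_j2, x_j1];
# one address per cycle serves both the read and the write, as in Fig. 3.6(b).

ADDR_SEQS = [[0, 1, 2, 3], [0, 3, 2, 1]]

def permute_in_time(sequences):
    memory = [None] * 4
    outputs = []
    for j, seq in enumerate(sequences):
        addrs = ADDR_SEQS[j % 2]
        out = []
        for k in range(4):
            a = addrs[k]
            out.append(memory[a])   # read-before-write: the old word appears first
            memory[a] = seq[k]      # the word of the next sequence is stored in place
        outputs.append(out)
    # One extra pass (with no new data) drains the last sequence.
    addrs = ADDR_SEQS[len(sequences) % 2]
    outputs.append([memory[a] for a in addrs])
    return outputs[1:]              # the very first pass only fills the memory

x0 = ['x00', 'x01', 'x02', 'x03']
x1 = ['x10', 'x11', 'x12', 'x13']
result = permute_in_time([x0, x1])
assert result[0] == ['x00', 'x03', 'x02', 'x01']
assert result[1] == ['x10', 'x13', 'x12', 'x11']

Two address sequences suffice here because applying this permutation twice returns the addresses to their original order; permutations of higher order need more address sequences, which is quantified in the data remapping discussion below.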
Data remapping overhead: To implement the proposed in-place algorithm, address sequences need to be calculated for data remapping purpose. Algorithm 3 requires addr i;j to be computed in advance. In hardware implementation, we can either use 50 LUTs on FPGA to storeaddr i;j or dynamically update the memory address using cus- tomized logic unit. To estimate the data remapping cost, we need to evaluate the value range ofn 1 . We still useP i to represent the permutation to be performed on the memory blocki. Assumingaddr i;0 = [0; 1; 2;:::;S 2 1] T , to implement the in-place algorithm, we need to iteratively computeaddr i;j (0jn 1 1) using the following equation: addr i;j+1 =P i addr i;j ; 0jn 1 1 (3.3) such thataddr i;0 =P i addr i;n 1 . Using this equation, we conclude thatn 1 is a constant determined byP i and the theorem below can be obtained: Theorem 3.1.7. The proposed in-place algorithm for arbitrary permutation in time only requires a constant number of address sequences for data remapping. Proof. Note that some power (a constant) of a permutation matrix is the identity matrix [24]. As addr i;0 = P i addr i;n 1 = P 2 i addr i;n 1 1 = ::: = P n 1 +1 i addr i;0 , we getP n 1 +1 i =I. Thus we conclude thatn 1 is a constant. The value ofn 1 depends on the permutation matrix P i . The address sequences addr i;0 ;addr i;1 ;:::;addr i;n 1 are the required address sequences for accessing memory blocki. As each memory block has a size ofS 2 , the memory address width is logS 2 . Accord- ing to [24], stride permutation is periodic, hence when P i is a stride permutation, the number of address sequencesn 1 for each memory block is logS 2 . For sorting problems, our routing results show thatP i is usually a stride permutation or a cyclic shift, or a com- bination of both. Hence, we conclude thatn 1 is (logS 2 ). Based on the analysis above, whenN = 1024,S 1 = 8, andS 2 = 128 for the SPN, the number of bits required for stor- ing all the address sequences is logS 2 logS 2 S 2 S 1 = 49kbits. In actual imple- mentation, it is not required to store the entire address sequence as we can update the addresses dynamically using some initial addresses. In these cases, the memory needed 51 for storing all the address sequences in a SPN is logS 2 logS 2 S 1 = 0:38kbits. Note that, in the state-of-the-art device each block RAM (BRAM) has a memory capac- ity of 18kbits [14]. In our implementations, we decide whether to apply the in-place algorithm or not based on the value ofS 2 of a SPN. 3.1.7 Experimental Evaluation Experimental setup: By varying the parametersn andp, we performed detailed exper- iments for our proposedSPN design. We implemented all our design on Virtex-7 FPGA (XC7VX980T) using Xilinx Tool Set Vivado 15.2. We choose 32-bit fixed point data vectors as input. In the experiments, for power evaluation, the input test vectors were randomly generated with an average toggle rate of 25% (pessimistic estimation) We used the VCD (value change dump)file as input to Vivado Power Analyzer to obtain accurate power dissipation estimation [14]. Implementations developed in [71] and [59] are employed as baseline. The Verilog implementations of the baselines were available through [8] and [7]. In the experiments, we randomly choose permutation patterns used in the signal and data processing algorithms for variousn andp. The same permutations are also employed for evaluating our baselines. 
Table 3.1: Resource consumption summary Designs Amount of memory Memory type Size of each memory # of mux 1 Supported permutation This work p Dual-port or Single-port N=p or2N=p 02plogp Any fixed [59] 2p Dual-port 2N=p 2plogp 2p+2 Any fixed [71] p Dual-port 2N=p 2plogp BIP 2 [29] p Single-port N=p 2plogp Stride permutation 1 2-to-1 mux, 2 Bit-index permutation [71] 52 Number of BRAMs,n = 1024 4 8 16 32 64 0 10 20 30 40 50 60 70 4 8 16 32 64 4 8 16 32 64 4 6 0 0 0 p Our Design P¨ uschel [71] Milder [59] (a) Number of BRAMs,n = 8192 4 8 16 32 64 10 20 30 40 50 60 32 32 32 32 64 32 32 32 32 64 12 12 18 20 38 p Our Design P¨ uschel [71] Milder [59] (b) Number of BRAMs,p = 4 256 512 1024204840968192 0 5 10 15 20 25 30 35 4 4 4 8 16 32 4 4 4 8 16 32 0 2 4 6 8 12 n Our Design P¨ uschel [71] Milder [59] (c) Number of BRAMs,p = 16 256 512 1024204840968192 0 20 40 60 80 16 16 16 16 16 32 16 16 16 16 16 32 0 0 0 10 10 18 n Our Design P¨ uschel [71] Milder [59] (d) Number of LUTs,n = 1024 4 8 16 32 64 0 0:5 1 1:5 10 4 p Our Design P¨ uschel [71] Milder [59] (e) Number of LUTs,n = 8192 4 8 16 32 64 0 1 2 3 4 10 4 p Our Design P¨ uschel [71] Milder [59] (f) Number of LUTs,p = 4 256 512 1024204840968192 0 0:5 1 1:5 2 2:5 3 10 4 n Our Design P¨ uschel [71] Milder [59] (g) Number of LUTs,p = 16 256 512 1024204840968192 0 0:5 1 1:5 2 2:5 3 10 4 n Our Design P¨ uschel [71] Milder [59] (h) Figure 3.7: Resource consumption for variousn andp Resource consumption evaluation: Table 3.4 summarizes the resource consumption of various designs for realizing permutations with a fixed data parallelism. Memory size refers to the size of each independent memory block storing 32-bit words. The design 53 in [59] can realize any fixed permutation, however, it requires 2x total memory com- pared with our design. A single-port memory based technique is proposed in [29] for improving memory efficiency, however, their design only supports stride permutation. Matrix manipulations are employed to reach a solution of a hardware structure for data permutation in [71], however, the interconnection complexity is not considered. Multi- plexers are the logic resource consumed by the connection networks. In our proposed permutation network, the number of required multiplexers can be reduced to zero the- oretically. In this way, an improvement on throughput can be obtained due to higher maximum achievable design operating frequency. Furthermore, the energy efficiency is also improved as a result of less dynamic power consumption. Fig. 3.7a and Fig. 3.7b show the BRAM consumption for various p forn = 1024 and n = 8192. Each BRAM denotes a simple dual-port 18kb BRAM (BRAM18E). The design in [59] consumes less BRAMs than our design as small size memory blocks in [59] are implemented using distributed RAMs (LUTs). Fig. 3.7b shows that the num- ber of BRAMs is not affected byp whenp 32 for our designs. Note that regardless of how small the required memory capacity is, a BRAM has to be assigned. Therefore, in our design, 16 BRAMs are able to meet the memory requirement forn = 8192 when p 32. Then the number of BRAMs doubles when p = 64. Similarly, in Fig. 3.7c, the number of BRAMs is not affected byn until the amount of BRAMs currently used fails to meet the memory resource requirements of the design. Fig. 3.7d shows that the number of BRAMs is not affected byn for our design whenn< 8192. This is due to the fact thatp independent memory blocks are always required for a specificp. 
For exam- ple, 16 memory blocks are needed in our design forp = 16, a single BRAM can meet the memory requirement of each memory block for 256 n < 8192. Although the experimental results for other values ofp andn are not presented here, similar BRAM 54 consumption increasing rate is expected. Note that the maximum problem sizen sup- ported by our design is determined by the available on-chip memory resource on FPGA and can be more than 8192. Fig. 3.7e and Fig. 3.7f present the LUT consumption, including LUTs consumed for memory and logic, for variousp withn = 1024 andn = 8192, respectively. A signifi- cant amount of LUTs are consumed in [59] as distributed RAMs have been employed. In our design and [71], LUTs are mainly consumed by the connection networks. Therefore, the figures actually demonstrate hown andp affect the LUT consumption of the inter- connection logic. The figures show that the LUT consumption increases significantly with p. This matches with the theoretical result in Table 1; the number of required multiplexers is O(2p logp) (Table. 3.4). For various p, our design reduces the LUT consumption by 22.1%65.7% and 59.0%96.4%, compared with the design in [71] and the design in [59], respectively. Fig. 3.7g and Fig. 3.7h show the LUT consump- tion for variousn withp = 4 andp = 16, respectively. Our design reduces the LUT consumption by 27.3%75.8% and 42.2%92.3%, compared with the design in [71] and the design in [59], respectively. The above results show the optimized interconnec- tion complexity in our design than the technique used in [71]. However, it is difficult to quantitatively compare the interconnection complexity between our design and the design in [59] which employs different memory implementation approach. Therefore, we further present experimental results of throughput and energy efficiency for perfor- mance comparison. Performance evaluation: To evaluate the throughput of our designs on FPGA, we assume data memory for storing original data input can be either BRAM or DRAM depending on the data set size. We assume the data memory bandwidth can be fully utilized as the data memory access behavior is known to be sequential. The throughput 55 0 0:2 0:4 0:6 0:8 1 10 4 4 6 8 10 12 14 DDR3 peak bandwidth Problem sizen Throughput (GBytes/s) Our Design P¨ uschel [71] Milder [59] (a) 0 20 40 60 80 0 20 40 60 DDR3 peak bandwidth Data parallelismp Throughput (GBytes/s) Our Design P¨ uschel [71] Milder [59] (b) 0 0:2 0:4 0:6 0:8 1 10 4 0 500 1;000 1;500 Problem sizen Energy efficiency (Gbits/Joule) Our Design P¨ uschel [71] Milder [59] (c) 0 20 40 60 0 50 100 150 200 250 Data parallelismp Energy efficiency (Gbits/Joule) Our Design P¨ uschel [71] Milder [59] (d) Figure 3.8: Performance results for variousn andp of our design is determined byp and the maximum operating frequency reported by the post place-and-route results. Fig. 3.8a demonstrates the design throughput for various n with p = 4. The throughput decreases due to the lower achievable operating frequency of the design as n increases. Our design is able to sustain 71%80% of the DDR3 peak bandwidth, (approximately 10 GBytes/s [57]) for various n with p = 4. Fig. 3.8a shows that our design improves the throughput by 5.3%73.2% compared with [71] and 27.4% 129% compared with [59], with more throughput improvement for large problem sizes. 56 Such performance improvement can be expected considering less memory consumption (than [59]) and reduced interconnection complexity in our design. Fig. 
3.8b demon- strates the design throughput for variousp withn = 8192. The throughput of our design increases almost linearly withp whenp < 64. Such an observation is consistent with the lower growing rate of LUT consumption in our design shown in Fig. 3.7f. Fig. 3.8b shows that our design improves the throughput by 5.2%31.7% and 4.6%100%, com- pared with the design in [71] and [59], respectively. Fig. 3.8b also shows that the DDR3 peak bandwidth can be easily saturated whenp> 16. In such cases, we can only employ the on-chip BRAMs with extremely high bandwidth as the data memory. We also evaluate the energy efficiency of our design focusing on the dynamic power consumption. We employed a balanced pipelining approach for the purpose of trade off between energy efficiency and throughput. Fig. 3.8c demonstrates the energy efficiency for variousn (1288192) withp = 4. It can be observed that the energy efficiency drops significantly forn = 2048. This is due to the fact that the number of BRAMs used starts to increase withn whenn 2048. The result also shows that asn is varied, our design achieves 2.4x3.5x (1.2x1.5x) improvement in energy efficiency compared with the design in [59] ([71]). Fig. 3.8d demonstrates the energy efficiency when varyingp for n = 8192. The energy efficiency decreases significantly with p. For various p, our design improves energy efficiency by 2.1x3.3x compared with [59] and 1.4x1.5x compared with [71]. 3.2 Optimal Designs for Application-Specific Permuta- tions In this section, we present RPN-based optimal designs for application-specific permuta- tions. Particularly, we develop optimal designs for parallel-2 bit reversal and parallel-2 k 57 2 n-2 -entry buffer 2 n-2 -entry buffer c 0 c 1 s 0 s 1 A 0 A 1 Figure 3.9: Parallel-2 bit reversal design for input size 2 n (A 0 and A 1 are data buffer addresses) 0 1 Bypass Cross (a) 2-to-2 switch 0 1 (b) 1-to-2 de-multiplexer 0 1 (c) 2-to-1 multiplexer Figure 3.10: Control bit values in different states bit reversal. The proposed optimal designs are optimal in terms of the number of mem- ory words necessary for realizing bit reversal. 3.2.1 Parallel-2 Bit Reversal To illustrate the proposed parallel-2 k bit reversal design, we first present the circuit design for parallel-2 bit reversal and then extend the approach top = 2 k . Given a data vector of size 2 n , the inputs can be can be divided into several 2 n1 -element sub-vectors. These sub-vectors are fed into the parallel-2 bit reversal design over 2 n1 consecutive cycles. After a specific amount of latency, during the output cycles, 2 outputs are pro- duced per cycle. Fig. 3.9 shows the proposed parallel-2 bit reversal design, which is parameterizable with respect to input size 2 n . It only requires one 1-to-2 de-multiplexer, one 2-to-1 multiplexers, two 2-to-2 switches and two data buffers, each of size 2 n2 . Fig. 3.11 shows the data buffer structure which has one address port, one read data port and one write data port. To realize the bit reversal, our design requires the data 58 Address Write data Read data Clock m-entry data buffer 4 6 5 7 In cycle 0 In cycle 1 Address Write data Read data 1 6 5 0 1 2 3 Buffer[1] 1 8 6 5 5 ... Figure 3.11: Functional structure of the employed data buffer and its behavior in the read-before-write mode buffer to be accessed in the read-before-write mode, i.e., given a buffer address, the old data previously stored at the write address appears first on the output latches, while the input data is being stored in the buffer. 
The access behavior of the buffer in the read-before-write mode is shown in Fig. 3.11. Fig. 3.10 shows the circuit behaviors of the design components in different states specified by the value of the control bit. As shown in Fig. 3.10 (a), when the control bit value is zero, the 2-to-2 switch bypass the inputs, otherwise exchange the inputs and route them to the output. Such a switch can be implemented using two 2-to-1 multiplexers. Fig. 3.10 (b) and (c) show the control bit values of the multiplexer or the de-multiplexer in different states. Table 3.2 presents the values of the control bits includingc 0 ,s 0 ,A 0 ,A 1 ,c 1 ,s 1 shown in Fig. 3.9 for computing the bit reversal. Input data are fed into the circuits starting from cycle 0. For the first two cycles, the inputs bypass the 2-to-2 switch asc 0 = 0. For the next two cycles, the two inputs are first swapped and then routed to the output by the 2-to-2 switch asc 0 = 1 . During cycle 0 and 1, both the data buffers. In the subsequent two cycles, data stored in the upper buffer are read out and written with new data words. Simultaneously, data from the lower output of the input switch directly bypass the lower data buffer and enter the output switch ass 1 = 1 during cycle 2 and 3. In cycle 4 and 5, ass 1 = 0, data words read from the two buffers are routed to the input of the output 59 0 Parallel-2 Bit Reversal 2 n = 8 Input Cycle 1 0 1 2 3 2 3 4 5 6 7 0 Output Cycle 4 0 1 2 3 2 6 1 5 3 7 Figure 3.12: Data flows in parallel-2 bit reversal for input size 2 n = 8 switch. In cycle 2 and 3 the output switch is in the bypass state and it then switch to the cross state in cycle 4 and 5. The complete data flow for realizing the bit reversal in the parallel-2 design could be obtained using Table 3.2. For simplicity, we show the data flow during the input cycles and the output cycles in Fig. 3.12. We assume write or read operation to the data buffer is enabled when its address has a valid value instead of x. Table 3.2: Control bit values in different cycles for bit reversal when 2 n = 8 Time (cycles) Values of control bits c 0 s 0 A 0 A 1 s 1 c 1 0 0 0 0 0 x 1 x 1 0 0 1 1 x x 2 1 1 0 x 1 0 3 1 1 1 x 1 0 4 x x 0 0 0 1 5 x x 1 1 0 1 6 x x x x x x 7 x x x x x x 1 x represents “do not care” 3.2.2 Parallel-2 k Bit Reversal Similarly, the 2 n inputs can be can be divided into several 2 k -element sub-vectors, which are fed into the parallel-2 k bit reversal design over 2 nk consecutive cycles. Therefore, data parallelism p = 2 k . As the proposed designs can be fully pipelined, during the 60 output cycles, 2 k outputs are produced per cycle. Fig. 3.13 (a) illustrates our proposed designs for realizing parallel-2 k bit reversal. The notations used in Fig. 3.13 are illus- trated as below: L i : a 2-to-2 switch at the input stage of the circuit design for parallel-2 k bit rever- sal,i = 0; 1;:::;d. The 2-to-2 switch has been introduced in Fig. 3.10. R i : a 2-to-2 switch at the output stage of the parallel bit reversal design, i = 0; 1;:::;d. W i /W 0 i : 2 k -to-2 k fixed wire interconnection,i = 0; 1;:::;s. M i : hardware blocks at the middle stage of the circuit design. For i = 0; 1;:::;p=2 1, M i employs the design block M a shown in Fig. 3.13 (c). For i =p=2;p=2 + 1;:::;p 1, the design blockM b shown in Fig. 3.13 (b) is used for implementingM i . C: Fixed wire connection shown in Fig. 3.13 (d) for swapping the two inputs. The proposed parallel-2 k bit reversal design is parameterizable with respect to input size 2 n and data parallelism 2 k . 
It only requires multiplexer, de-multiplexer and data buffers. Each data buffer has a design structure shown in Fig. 3.11. Fig. 3.14 shows the parallel-4 bit reversal design. The control bits for the design in Fig. 3.13 are denoted using the notations below: c 0 : one bit control determining the state of the switches includingL 0 , L 1 ,...,L d . Its value is either 0 or 1. Note that thed + 1 switches sharec 0 . s 0 : one bit control determining the state of the 1-to-2 de-multiplexer in the blocks including M p=2 , M p=2+1 , ..., M p1 . Its value is either 0 or 1. Note that the p=2 blocks shares 0 . A 0 : nk 1 bits used as the data buffer address, which is shared by the blocks includingM 0 ,M 1 , ...,M p=21 . 61 L 0 L 1 L d C C C W 0 W s Input stage Output stage *s = log 2 p - 1, d = p – p = 2 k C C C W 1 M 0 M 1 M 2 M 3 M p-2 M p-1 C C C R 0 R 1 R d W' 0 C C C W' 1 2 n-3 -entry buffer 2 n-3 -entry buffer (b) (a) (c) (d) W' s Figure 3.13: (a) Circuit design for parallel-2 k bit reversal, (b) A block denoted asM a consisting of a data buffer, a 2-to-1 multiplexer, and a 1-to-2 de-multiplexer, (c) A block denoted asM b having a data buffer, (d) Fixed wire connectionC A 1 : nk 1 bits used as the data buffer address, which is shared by the blocks includingM p=2 ,M p=2+1 ,...,M p1 . s 1 : one bit control determining the state of the 2-to-1 multiplexer in the blocks including M p=2 , M p=2+1 , ..., M p1 . Its value is either 0 or 1. Note that the p=2 blocks share the one bit controls 1 . c 1 : one bit control determining the state of the switches includingR 0 ,R 1 ,...,R d . Its value is either 0 or 1. Thed + 1 switches sharec 1 . Note that the control bits are not shown in Fig. 3.13 for simplicity, and the blocks including C, W i and W 0 i having only fixed wire connections require no control bit. Table 3.3 shows the values of the control bits for calculating the parallel-2 k bit reversal. 62 Table 3.3: Control bit values in different cycles for parallel-2 k bit reversal on 2 n data inputs Time (cycles) Values of control bits c 0 s 0 A 0 A 1 s 1 c 1 0 0 0 0 0 x 1 x 1 0 0 1 1 x x ... 0 0 ... ... x x 2 nk1 -1 0 0 2 nk1 -1 2 nk1 -1 x x 2 nk1 1 1 0 x 1 0 2 nk1 +1 1 1 1 x 1 0 ... 1 1 ... ... ... ... 2 nk -1 1 1 2 nk1 -1 x 1 0 2 nk x x 0 0 0 1 2 nk +1 x x 1 1 0 1 ... x x ... ... ... ... 32 nk1 -1 x x 2 nk1 -1 2 nk1 -1 0 1 1 x represents “do not care” 2 n-3 -entry buffer 2 n-3 -entry buffer 2 n-3 -entry buffer 2 n-3 -entry buffer Figure 3.14: Parallel-4 bit reversal design for input size 2 n As shown in the table, the value of c 0 or s 0 is constantly 0 during cycle 0 to 2 nk1 - 1, and 1 during during cycle 2 nk1 to 2 nk -1. The value ofs 1 is constantly 1 during cycle 2 nk1 to 2 nk -1, and 0 during cycle 2 nk to 3 2 nk1 -1. The value of c 1 is constantly 0 during cycle 2 nk1 to 2 nk -1, and 1 during cycle 2 nk to 3 2 nk1 - 1. The value of A 0 is i during cycle i (0i2 nk1 -1), and i-2 nk1 during cycle i 63 (2 nk1 i2 nk -1). The value ofA 1 isi during cyclei (0i 2 nk1 -1), andi-2 nk during cyclei (2 nk i3 2 nk1 -1). The generation of the above control bit values only require very simple logic circuits. In the actual implementation, values ofA 0 and A 1 are generated using an increment adder with a reset control bit; values ofc 0 ands 0 are updated using a one-bit input signal; values ofs 1 andc 1 are updated using another one-bit input signal. 
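Since the control schedule above is fully regular, it can be generated with very little logic. The Python sketch below is a behavioral model of that schedule (None stands for a "do not care" value); it reproduces rows of Tables 3.2 and 3.3 but is not the RTL implementation.

# Control values (c0, s0, A0, A1, s1, c1) of the parallel-2^k bit reversal on
# 2^n inputs as a function of the cycle index, following Table 3.3.

def control_bits(cycle, n, k):
    half = 1 << (n - k - 1)                    # 2^(n-k-1)
    c0 = s0 = a0 = a1 = s1 = c1 = None
    if cycle < half:                           # first half of the input cycles
        c0, s0, a0, a1 = 0, 0, cycle, cycle
    elif cycle < 2 * half:                     # second half of the input cycles
        c0, s0, a0 = 1, 1, cycle - half
        s1, c1 = 1, 0
    elif cycle < 3 * half:                     # remaining output cycles
        a0 = a1 = cycle - 2 * half
        s1, c1 = 0, 1
    return c0, s0, a0, a1, s1, c1

# Rows of Table 3.2 (2^n = 8, p = 2, i.e. n = 3, k = 1):
assert control_bits(0, 3, 1) == (0, 0, 0, 0, None, None)
assert control_bits(2, 3, 1) == (1, 1, 0, None, 1, 0)
assert control_bits(4, 3, 1) == (None, None, 0, 0, 0, 1)
assert control_bits(6, 3, 1) == (None, None, None, None, None, None)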
The total number of delays of the circuitT (n;k) is: T (n;k) = 2 nk1 (3.4) Furthermore, the throughput of the designTh(n;k) and the total number of control bits Ctrl(n;k) are: Th(n;k) = 2 k ;Ctrl(n;k) =nk + 3 (3.5) Although regular control bits such as buffer read or write enable bits are not considered in this calculation, the result still covers the major portion of the control logics and indicates the optimum of the control mechanism used in our designs. With regarding to resource consumption, the number of multiplexersMu(n;k) and the number of data buffersBu(n;k) are: Mu(n;k) = 3 2 k ;Bu(n;k) = 2 k (3.6) 3.2.3 Resource Consumption The comparisons with several parallel bit-reversal circuits are shown in Table 3.4.N and p represents the input size and the data parallelism, respectively. Designs in [17, 43] only support serial input data. No less thanN memory words are required in [44, 90, 91]. The approach in [37] realizes the bit reversal using only a total memory of N words. Our approach reduces the total number of memory words fromN toN=2. The latency 64 of their circuit design is N=pY . This latency is reduced to N=2p in our proposed optimal design. Besides, the required number of 2-to-1 multiplexers in [37] and [56] are 2p 2 p and 4p 2 4p + 2p log 2 (N=p 2 )=2, respectively. Our proposed design only uses 3p multiplexer, which results in a significant reduction in resource consumption. 65 Table 3.4: Comparison of several bit reversal circuit designs Supported FFT architecture Supported data parallelism Input data pattern Size (words) /port number # of data buffers # of mux 1 Throu- ghput Latency [43] SDF Serial data Bit-reversed ( p N1) 2 /two-port 1 Not shown 1 ( p N1) 2 [17] Real-valued FFT Serial data Bit-reversed (5N=8)3 /two-port 1 Not shown 1 (5N=8)3 [44] MDC Bit-reversed (2/4/8/16) Bit-reversed N /two-port Not shown Not shown p Not shown [90] MDC Parallel (8) Specific pattern (9=8)N +192 /two-port 8+3+3+64 Not shown p Not shown [91] MDC Parallel (8) Bit-reversed pN /two-port 8 Not shown p Not shown [37] MDC or MDF Parallel (2/4/8) Bit-reversed N /single-port 2p 2p 2 p p (N=p)Y 2 [56] MDC Power of two Bit-reversed N p 2N p N=2+p /two-port 2p X 3 p N=p p 2N=p p N=2=p+1 This work MDC Power of two Bit-reversed N=2 /single-port p 3p p N=2p 1 2-to-1 mux, 2 Y =2 a 1,a = (log 2 (N=2p 2 ))=2 or(log 2 (N=2p 2 ))=2+1 3 X =4p 2 4p+2plog 2 (N=p 2 )=2, 66 Chapter 4 RPN-based Optimal Designs for Streaming Applications In this chapter, we present our proposed RPN-based designs for several streaming appli- cations including FFT, bitonic sorting and equi-Join. Performance metrics including, throughput, power and resource consumption have been employed for evaluation. Target platforms include state-of-the-art FPGAs and emerging CPU-FPGA hybrid platform. 4.1 High Throughput Streaming FFT 4.1.1 Related Work Existing work has mainly focused on optimizing the performance, power and area of the design at the circuit level. An energy-efficient 1024-point FFT processor was developed in [18]. Cache-based FFT algorithm was proposed to achieve low power and high perfor- mance. Energy-time performance metric was evaluated at various processor operation points. In [88], a high-speed and low-power FFT architecture was presented. They pre- sented a delay balanced pipeline architecture based on split-radix algorithm. Algorithms for reducing computation complexity were explored and the architecture was evaluated in area, power and timing performance. 
Based on Radix-x FFT, various pipeline FFT architectures have been proposed, such as Radix-2 single-path delay feedback FFT [89], Radix-4 single-path delay commutator FFT [21], Radix-2 multi-path delay commuta- tor FFT [73], Radix-2 2 single-path delay feedback FFT [46], and Radix-2 k multi-path 67 delay forward FFT [45]. These architectures can achieve high throughput per unit area with single-path or multi-path pipelines, but design flexibility is limited and energy effi- ciency has not been explored in these works. In [66], a parameterized soft core generator for high throughput DFT was developed. This generator can automatically produce an optimized design with user inputs for performance and resource constraints. However, energy efficiency is not considered in this work. In [39], the authors present a parameter- ized energy efficient FFT architecture. Their design is optimized to achieve high energy efficiency by varying the architecture parameters. Some energy efficient design tech- niques, such as clock gating and memory binding, are also employed in their work. The authors in [28] extended the work in [39] to identify the effect of both the algorithmic mapping parameters and the hardware binding parameters on energy efficiency through design space exploration. However, in both [39] and [28], only radix-4 FFT algorithm was considered. In this work, we develop a hardware generation framework for FFT which converts FFT algorithms into mathematical representations of data permutation matrices and butterfly computations. Algorithmic level optimizations are incorporated such that high throughput and energy efficient designs can be obtained on FPGA. 4.1.2 Architecture Framework Fig. 4.13 shows the architecture framework within which we develop a highly data par- allel parameterized FFT architecture. To obtain a specific FFT design, the required user input parameters are: Problem size N: The FFT algorithm is proposed to efficiently solve the DFT problem consists of a linear transform onN points specified by a typically dense NN matrix. 68 Main memory FPGA Input Output Memory interface Communication stage Computation stage … C1 I1 I1 I2 C1 I1 I1 I2 C1 I1 I1 I2 … D1 C2 … … … … … … … … … … … … Figure 4.1: Architecture framework Data parallelism p: With an available data parallelism of p (2 p N=2), p points are fed into the design in each clock cycle. Data widthw: Each input data point has a width ofw bits. Design reuse option: Either a fully pipelined design or a iteratively reused design would be generated based on the user choice. Data precision: Each data element could be represented using fixed point format or floating point format. Transform direction: Determines either forward FFT or inverse FFT to be real- ized. The architecture framework is developed based on the Radix-x algorithm. However, the classic Peace algorithm [70] is also incorporated into the proposed framework. The pro- posed architecture consists of main memory, memory interface, architectural building 69 blocks, and control unit. The architecture framework in Fig. 4.13 is illustrated using the definitions below: (1) Main memory: We assume the input consists of several data sequences to be trans- formed, each of lengthN. External main memory is employed to store the inputs. (2) Stream input/output: The input data sequences can be fed into the FPGA contin- uously in a streaming manner. The input data sequences enter the on-chip design at a fixed rate. 
After a specific delay, the transformed data sequences are output at the same rate. (3) Computation stage: Each stage consists of aC1 block and aC2 block.C1 is com- posed ofp=x Radix-x (R x ) blocks, andC2 includesp 1 Twiddle Factor Computation (TWC) units. BothC1 andC2 can be implemented using LUTs or IP cores and can be pipelined using flip-flops on FPGA. (4) Communication stage (D1): A communication stage denoted as aD1 block exists between adjacent computation stages. In this stage, data permutation is performed with a fixed rate ofp inputs/outputs per cycle. (5) Building blocks: I1, I2, C1 and C2 are building blocks whose design structures are mainly determined by the user input parameters. I1 is ap-to-p connection network composed of 22 switches. I2 is a single-port or dual-port memory block with size of N=p or 2N=p. Each communication stage needspI2blocks. When building an iteratively reused design, only one stage ofC1 block andD1 block would be generated and time-multiplexed for multiple times. Such design consumes least amount of hardware resource at the expense of more latency and lower throughput. To generate a fully pipelined FFT design, log x N stages of C1 blocks and D1 blocks are concatenated to maximize the throughput. Note that such a design has no feedback in the datapath, thus can be deeply pipelined to achieve high operating frequency. More implementation details are presented in Section 4.1.6. 70 4.1.3 Mathematic Formalization For given user input parameters, we first derive mathematical formula to represent the specific FFT network to be realized. An FFT network of sizeN generally consists of O(logN) stages of data permutations, butterfly computations, and twiddle factor com- putations. All the permutations and computations are able to be represented using oper- ations on matrices. We first introduce the matrix representations for data permutations in FFT. Then we summarize the formula derivation for several candidate FFT algorithms. A permutation matrix denoted asP N is anNN matrix with all elements either 0 or 1, with exactly one at each row and column [76]. Given anN-element data vectorx, the data vector y produced by the permutation over x can be given as y =P N x (4.1) Stride permutation: Most of the data permutations in FFT algorithms are stride per- mutations. Given anm-element data vectorx and a stridet (1 t m 1), the data vectory produced by the stride-by-t permutation overx is given asy = P m;t x, where P m;t is a permutation matrix.P m;t is an invertiblemm bit matrix such that P m;t [i][j] = 8 < : 1 ifj = (ti)modm + (bti=mc) 0 otherwise (4.2) where mod is the modulus operation andbc is the floor function. For example, P 4;2 performsx 0 ;x 1 ;x 2 ;x 3 !x 0 ;x 2 ;x 1 ;x 3 and can be represented as y 0 y 1 y 2 y 3 T = 0 B B B B B B B @ 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 C C C C C C C A x 0 x 1 x 2 x 3 T 71 X[0] X[1] X[2] X[3] X[4] X[5] X[6] X[7] Y[0] Y[1] Y[2] Y[3] Y[4] Y[5] Y[6] Y[7] B 8,2 P 4,2 I 4 F 2 I 2 P 4,2 I 4 F 2 P 4,2 I 2 I 4 F 2 P 8,2 T 8 T 8 P 4,2 (2) (1) Figure 4.2: From an 8-input FFT network to its mathematical representation Bit reversal: Another important permutation pattern being extensively used in FFTs is called as bit reversal [88]. Bit reversal has been well studied in previous works on FFT implementations [89, 21]. Bit reversal can be represented usingB N such thaty =B N x. B N maps each elementx[i] inx to the position given by reversing the log 2 N-bit binary representation of the indexi. 
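Since stride permutations drive most of the communication stages, Equation (4.2) is worth checking concretely. The short NumPy sketch below builds P_{m,t} exactly as defined and reproduces the P_{4,2} example above; it is an offline model for inspecting permutation patterns, and the function name is ours rather than part of the generator.

import numpy as np

def stride_permutation_matrix(m, t):
    """Build the m x m stride-by-t permutation matrix P_{m,t} of Eq. (4.2):
    P[i][j] = 1 iff j = (t*i) mod m + floor(t*i / m)."""
    P = np.zeros((m, m), dtype=int)
    for i in range(m):
        j = (t * i) % m + (t * i) // m
        P[i, j] = 1
    return P

# Reproduce the P_{4,2} example: (x0, x1, x2, x3) -> (x0, x2, x1, x3)
P = stride_permutation_matrix(4, 2)
x = np.arange(4)          # numeric stand-ins for x0..x3
y = P @ x                 # y = P_{4,2} x
print(P)                  # the 4x4 permutation matrix shown above
print(y)                  # [0 2 1 3], i.e. x0, x2, x1, x3

The same routine generates the larger stride patterns that appear between FFT stages, which is convenient when checking the control information of a communication stage offline.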
Assumingy[j] =x[i] and the binary representation ofi is b 0 b 1 :::b logN1 , then: j =(b 0 b 1 :::b logN1 ) =b logN1 :::b 1 b 0 (4.3) where() is the bit reversing operation. Bit reversal permutation matrix is also known as the base-r digital reversal permutation when t = 2.The permutation matrix of the base-t digital reversal permutation is denoted asB N;t : B N;t = log t N Y i=1 (I N=log i t N P log i t N;t ) (4.4) 72 FFT networks: Before the derivation of the mathematical formula for FFT algorithms, we first introduce the well-known matrix operation tensor product (Kronecker prod- uct) [76]. If A2 C pq and B2 C mn , then the tensor product A B is the p-by-q block matrix: A B = 0 B B B B B B B @ A 1;1 B A 1;2 B ::: A 1;q B A 2;1 B A 2;2 B ::: A 2;q B : : : : : : : : : : : : A p;1 B A p;2 B ::: A p;q B 1 C C C C C C C A (4.5) For example, ifI n is thenn identity matrix [76], then I n B = 0 B B B B B B B @ B 0 ::: 0 0 B ::: 0 : : : : : : : : : : : : 0 0 ::: B 1 C C C C C C C A The traditional Fourier transform can be represented byy =F N x, where F N = [! pq ](0p;qN) and! =e 2i=n : (4.6) F N is called as the finite Fourier matrix. According to [86], the FFT is equivalent to a factorization ofF N into logN sparse matrices. For example, F 8 = (F 2 I 4 )T (1) 8 (I 2 F 2 I 2 )T (2) 8 (I 4 F 2 )B 8 whereT (1) 8 andT (2) 8 are diagonal matrices. The dataflow in this algorithm is hidden in the addressing required to compute various tensor factors: (F 2 I 4 ),I 2 F 2 I 2 , and (I 4 F 2 ). This can be made explicit by conjugating the factors by data permutations corresponding to the addressing required: F 8 =P 8;2 (I 4 F 2 )T (1) 8 (P 4;2 I 2 )(I 4 F 2 )T (2) 8 (I 2 P 4;2 )(I 4 F 2 )B 8 73 This factorization, corresponding to the 8-point FFT network using Cooley-Tukey algo- rithm [41], is depicted in Fig. 4.2, where the blue stages represent data permutations, and the purple stages correspond to the diagonal matrices T i (i = 1; 2). There have been various FFT algorithms proposed to factorize the Fourier matrix into logN sparse matrices [70, 41]. The key difference between the alternate factorizations are the differ- ent permutations that occur in the matrix factorizations. Using the approach in [86], we compute the factorization results of F N having explicit data permutations for three classic FFT algorithms: F N = log t N Y i=1 P N;t (I N=t F t )T (i) N B N;t (4.7) F N =P N;t log t N Y i=1 (IN t F t )(I t i1 G (i) N t i1 H (i) N t i1 ) B N;t (4.8) whereG is a twiddle factor related diagonal matrix,H denotes data permutation matri- ces, and (i) is a parameter employed to differentiate different matrices. F N =P N;t m(I u n F t m)P N;u nW N;u n(I t m F u n)P N;t m (4.9) where N = t m u n . Equation 4.7 computes the FFT based on the Pease FFT algo- rithm [70]. Equation 4.8 is derived based on the iterative variant of the Cooley-Tukey FFT algorithm [41]. The mixed-radix FFT [41] derived from Cooley-Tukey algorithm has the factorization shown in Equation 4.9. Other FFT algorithms can also be trans- formed into mathematical representations in a similar way. 74 + - . . x 0 x 1 (a) y 1 + + . . x 2 x 3 y 2 y 3 + + x 0 x 1 . . y 0 y 1 . . . . j -1 -1 -j -1 -j -1 j X X . . x 0 x 1 y 0 y 1 X X . . x 2 x 3 y 2 y 3 X X x 0 x 1 . . y 0 y 1 . . . . Look Up Table . Look Up Table (b) (e) (f) ... y 0 (c) (d) ... 
Figure 4.3: (a) Radix-2 block, (b) Radix-4 block, (c) Parallel-to-serial (PS) multiplexer (d) Serial-to-parallel (SP) multiplexer, (e) 2-input Twiddle Factor Computation (TWC) unit (e) 4-input Twiddle Factor Computation (TWC) unit 4.1.4 Design Templates of Building Blocks Computation stage template: As introduced in Section 4.1.2, the computation stage consists of C1 and C2 blocks. C1 is composed of p=x R x blocks. C2 consists of p multipliers for twiddle factor computations. Figure 4.3 shows several sample build- ing blocks developed for design templates of computation stage: Radix-x block (R x ), Parallel-to-serial (PS) multiplexer, Serial-to-parallel (SP) multiplexer, and twiddle fac- tor computation (TWC) unit. A complete design forN-point FFT can be obtained by a combination of the basic blocks. A. Radix-x block This block is composed of signed adder/subtractors for complex number. In the factorization equation ofF N , anF t orI N=t F t (t =x) could be implemented using a R x block. To realize the computation performed inI N=t F t , theR x block needs to be 75 iteratively used forN=t times through time multiplexing. In our proposed designs, radix blocks ofx = 2; 4; 8; 16 have been employed as design templates to construct the FFT design on FPGA. Fig. 4.3 (a) and (b) presents the design structure of the radix-2 block and the radix-4 block, respectively. B. PS/SP block PS/SP multiplexer is used to multiplex serial/parallel input data to output in paral- lel/serial respectively. As shown in Fig. 4.3 (d), the number of input is limited to one, but the subsequent block needs to operate on four data inputs in parallel, thus the SP block is employed to match the data rate between the input and the output. For example, when the data parallelism and the data width are respectively specified as 32 and 32-bit, such design usually exhausts the available I/O resource on FPGA, in this case, SP/PS block are needed to reduce the required amount of I/O resource. C. TWC unit The computations of multiplyingT (i) N in Equation 4.7 orG (i) N t i 1 in Equation 4.8 can be realized using TWC units. Fig. 4.3 (c) and (d) show the design of TWC units. TWC unit consists of two blocks: the twiddle factor look up table and the complex number multiplier. The twiddle factor look up table is employed to store twiddle factor coef- ficients, where the data read addresses will be updated with the control signals. The size of the lookup table increases with the problem size N. The fixed point complex number multiplier consists of three real number multipliers and three adder/subtractors. For floating point, the number of multipliers and adder/subtractors are four and two, respectively. 76 4.1.5 Design Optimizations We develop several high level optimizations to improve the memory and energy effi- ciency of the parameterized FFT designs. Memory optimization for communication stage: Most memory resources are con- sumed for data permutations in the logN communication stages in FFT. Thus, reducing memory consumption for data permutation is crucially important for improving FFT performance. To illustrate our proposed memory optimization technique, we introduce the definition below: Permutation in space: defined as data permutation performed by wires or switches in one clock cycle. Permutation in time: defined as permuting a data sequence temporally through a memory block. For a permutation in space P m , y = P m x and x(y) is the input (output) vector. 
Given a memory block of sizem, we can writex into the memory serially, and then read the stored data out likewise, such thaty[i] is output atith output cycle. Thus, we say the memory block realizes the permutationP m temporally. Each memory block in D1 can be implemented with single-port memory to per- mute a single data sequence temporally. However, when processing continuous data sequences, write operation on a new data sequence is performed simultaneously with read operation on the current data sequence. Therefore, dual-port memory with double size is required as concurrent read and write access to different memory locations need to be performed. To reduce the memory consumption, we develop an in-place permu- tation in time algorithm, which enables the use of single-port memory for processing continuous streaming inputs consisting of multiple data sequences. 77 Algorithm 4 In-place permutation in time Input: data vectorsx 1 ,x 2 , ...,x l ; permutation vector Output: data vectorsy 1 ,y 2 , ...,y l Constants:m (size of data vectorsx,y or);l (number of input/output data vectors) Notation: anm-entry single-port memory denoted asM 1: forj = 0 tom1 do 2: Initialize[j] such thatx i [[j]] =y i [j] for anyifInitializationg 3: end for 4: addrVec GA V()fInitialization of address vectorsg 5: s size ofaddrVec 6:fMemory access procedureg 7: i 1,k 0 8: forj = 0 tom1 do 9: fEach iteration takes one cycleg 10: writeM withx i [j] using addressaddrVec[k][j] 11: end for 12: k k+1 13: whileil do 14: forj = 0 tom1 do 15: fEach iteration takes one cycleg 16: a addrVec[k][j] 17: readM using addressa, output value equals toy i [j] 18: writeM withx i+1 [j] using addressa 19: end for 20: i i+1 21: ifk =s1 then 22: k 0 23: else 24: fUpdatek after permuting a data vectorg 25: k k+1 26: end if 27: end while 28: procedure GAV() 29: e [0;1;2;:::;m1];d 30: i 0,addrVec [[]] 31: addrVec[i] d 32: Initialize permutation matrix such that T =Pe T 33: Updated such thatd T Pd T 34: while vector equality ofd and != True do 35: addrVec[i] d,i i+1 36: updated such thatd T =Pd T 37: end while 38: returnaddrVec 39: end procedure 78 Algorithm 4 shows the key idea of our proposed in-place algorithm for permutation in time. The total number of data sequences to be permuted isl. Each data sequence containsm elements. Data elements of input data vectors are fed intoM continuously in a streaming manner. The mapping between each input vectorx i (1il) and output vector y i is statically known. Such a mapping can be represented using permutation vector . The procedure GA V() is employed to compute a two dimensional array addrVec by taking as the input. According to [24], some power (a constant) of a permutation matrix is the identity matrix. Therefore, we can always find a constanta such thatP a =I. Thus, the size ofaddrVec is also a constant. The memory blockM is accessed usingaddrVec which is pre-computed. Using the proposed algorithm, each memory location inM is written with a new value once the old value is read out, thus the proposed permutation in time algorithm is an in-place algorithm. Architecture binding: At the architecture binding level, different binding approaches affect the throughput and energy efficiency significantly. In our experiments, we con- sidered three architecture binding parameters that significantly affect the performance of our design: Type of memory resource: BRAM or distributed RAM (dist. RAM) to implement memory blocks in the communication stages. 
Pipeline depth: Number of pipeline stages to be inserted in the R x blocks and D1 blocks for increased performance. However, as the number of pipeline stages increases, the dynamic power also increases. Memory port type: A dual-port memory is employed as default. When using the proposed in-place algorithm for permutation in time, a single-port memory is used instead. 79 Algorithm 5 Architecture binding algorithm Input: Set of user input parametersI Output: Set of design parametersO Notation: architectural building blocksC1 i ,C2 i ,D1 i (1ilogN) 1: procedure MT(I) 2: fth is a fixed constantg 3: th memory size threshold for binding a memory to BRAM 4: fori = 1 tologN do 5: forj = 1 top do 6: s size of a memory blockM j in blockD1 i 7: ifsth then 8: memory type ofM j BRAM 9: else 10: memory type ofM j Dist. RAM 11: end if 12: Update set of design parametersO 13: end for 14: end for 15: end procedure 16: procedure MP(I) 17: fthp is a fixed constantg 18: thp threshold for binding a memory to single-port RAMs 19: fori = 1 tologN do 20: forj = 1 top do 21: permutation vector forM j in blockD1 i 22: fProcedure GA V is detailed in Algorithm 4g 23: addrVec GA V() 24: s size ofaddrVec 25: ifsthp then 26: memory port type ofM j dual-port RAM 27: else 28: memory port type ofM j single-port RAM 29: end if 30: Update set of design parametersO 31: end for 32: end for 33: end procedure According to the FPGA vendors’ data sheets [14], BRAM is more power efficient than dist. RAM when used for large size memories. However, for small memory, it is more power efficient to implement the memory with dist. RAM. This characteristic can be utilized to trade-off power and performance for various memory components. As introduced in Section 4.1.2, the size of a memory block used inD1 varies with the ordi- nal number of a particular computation stage. Hence in our implementation, we bind 80 BRAM Activation Logic … Read Address Write Address Input Data Block RAM Block RAM Block RAM Reg Reg Reg … Mux Clock . . Activation control bit Figure 4.4: Memory activation scheduling the memory blocks inD1 with different memory resources including BRAMs and dist. RAMs based on memory capacity. Procedure MT(I) in Algorithm 5 details our memory binding approach for determining type of memory resource.th is a constant determined based on the evaluation of the power of BRAMs and Dist.RAMs on target FPGA plat- form [28]. Besides, using the proposed in-place permutation in time algorithm, a mem- ory block can be implemented using a single-port RAM which is configured to be in the read-before-write mode [14]; Memory activation scheduling: A Block RAM (BRAM) on FPGA supports to be deactivated to save power while it is not in access [1]. A memory block with size larger than 36 kbits in D1 may consists of several Block RAMs on FPGA. However, only one of which needs to be activated at every read operation according to the address. In order to further reduce the energy consumption of memory, we develop and incorpo- rate a memory activation scheduling technique into our designs as shown in Fig. 4.4. Using this technique, only one BRAM block is activated in each clock cycle. As a result, the energy consumption per read/write is significantly reduced. This technique is non-trivial as consecutively reading/writing the memory block with a large stride could result in significant energy overhead due to frequent activation and de-activation. Such overhead may offset the benefit brought by de-activating BRAMs not in access. 
To real- ize data permutations, it is highly possible that the memory blocksM 1 ,M 2 , ...,M p in 81 … Input Output Stage 1 Stage 2 Stage (log x N) Control Unit Control Unit Control Unit … p R x blocks … … … Connection network Connection network TWC TWC TWC … R x blocks … … … Connection network Connection network TWC TWC TWC … R x blocks … … … Connection network Connection network TWC TWC TWC … Figure 4.5: High throughput design for FFT D1 perform permutation in time with large stride values. Memory access patterns are pre-computed to detect memory access patterns with large strides. The proposed BRAM activation logic would not be employed for large stride values. In our experiments (Sec- tion 4.1.8), we observe that this technique can significantly improve the memory energy performance especially when large memories are employed in theD1s. 4.1.6 Implementation of Illustrative Designs In this section, we present details on architecture implementation of the automatically generated illustrative FFT designs, including a high throughput design and a resource efficient design. The high throughput design is fully pipelined to achieve high perfor- mance. With a data parallelism ofp, it supports processing continuous streaming inputs with a throughput ofp results per cycle. The resource efficient design consumes mini- mal hardware resources by iteratively reusing the architectural components. It achieves high resource efficiency by iteratively reusing a singleD1 block to perform all the data permutations. High Throughput Design: As shown in Fig. 4.17, the high throughput design is composed of several concatenated FFT stages, each havingp=x parallelR x units. For given fixed problem sizeN and data parallelismp, the high throughput design consists 82 of (log x N) FFT stages using radix-x FFT algorithm. Each stage has at most oneD1 block. This design supports processing continuous data streams. In Equation 4.8, the factorI t i1 G (i) N t i1 H (i) N t i1 can be realized by time multiplexingD1 block. Therefore, let m denotes the size of the input sequence to be permuted by D1 in the ith FFT stage, then m = N t i1 . If m p at a particular FFT stage, D1 is replaced with an m-to-m switch. Therefore, in the high throughput design, there are P log x N1 i=log x p (1) stages ofD1 blocks. EachD1 denoted asD1(p;m;t;v) has its own parameter values form,t andv. T DM (N;p;x) = log x N1 X i=log x p ( x i+1 p ) + + 2N p (4.10) which is x(Np) p(x1) + 2N p . The termsx i+1 =p indicate the size of a memory block inD1 in ith FFT stage. The total latency introduced by the input and output connection networks in the log x N stages of D1s is O((logp) logN). As the total latency introduced by allR x blocks isO(logN), the overall latency of the high throughput design is (3N N x1 )=p +o(N) (2pN=2). The size of the entire memory blocks used by all theD1s is the the product ofp andT DM (N;p;x) which is 2N +o(N). Note that we use little-O notation here (a constant or log x N belongs too(N) butN = 2o(N)) [85]. Furthermore, the high throughput design is pipelined to process continuous data streams, resulting in a throughput of p. To achieve high throughput, a pipeline stage can be inserted in D1 after each of the log x N FFT stages. Thus the total area consumption of the log x N computation stages is O(p logN). When using external memory as data memory, the required number of I/O pins isO(p). No control bits are needed forR x blocks. while each TWC unit requires w bits for updating twiddle factor coefficients. 
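As a quick check of Equation (4.10), the sketch below evaluates both the summation and the closed form x(N-p)/(p(x-1)) + 2N/p for a few parameter choices, assuming, as in the derivation, that p and N are powers of the radix x. It is a back-of-the-envelope model only; the names are ours and nothing here is produced by the design framework itself.

from math import log

def t_dm(N, p, x):
    """Evaluate Eq. (4.10): the per-block D1 memory summed over the FFT stages
    that contain a D1 block (i = log_x p .. log_x N - 1), plus the 2N/p term.
    Assumes N and p are powers of the radix x."""
    lo, hi = round(log(p, x)), round(log(N, x))
    total = sum(x ** (i + 1) / p for i in range(lo, hi)) + 2 * N / p
    closed = x * (N - p) / (p * (x - 1)) + 2 * N / p   # closed form quoted in the text
    assert abs(total - closed) < 1e-6                  # the two expressions agree
    return total

for N, p, x in [(1024, 4, 2), (4096, 4, 4), (4096, 16, 4)]:
    words = t_dm(N, p, x)
    print(f"N={N:5d} p={p:2d} radix={x}: T_DM = {words:8.1f}, "
          f"total D1 memory = {p * words:9.1f} words")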
Therefore, the total number of control bits for all the computation stages isO(pw logN). In each D1(p;m;t;v) block, the twop-to-p connection networks requireO(p logp) control bits. Thus, the total number of required control bits isO(p logN(logp +w)). 83 R x blocks Memory Block Memory Block … Input connection network Output connection network Memory Block Control Bits Generation Control Unit Input Data … … Output Data Look Up Tables D1 (p,m,t,v) p· log 2 (N/p) p· log 2 p p· log 2 p p· w TWC unit TWC unit TWC unit … Figure 4.6: Overall architecture of resource efficient design Resource Efficient Design: Fig. 4.18 shows the architecture of the resource effi- cient design. The resource efficient design cannot support processing continuous data streams, thus the proposed in-place permutation in time algorithm is not applicable. For a given data parallelismp, the resource efficient design consists of oneD1(p;m;t;v), p=x R x blocks, and p TWC units. During the computation of the linear transform, D1(p;m;t;v) is reused by configuringm, t andv to perform any permutation arising in the FFT networks. In this way, this design achieves the highest resource efficiency at the expense of throughput. As noted in Section 4.1.3, only log x N different permutation patterns exist between the log x N FFT stages in the FFT network. Thus log x N dis- tinct contexts of control information are needed, each realizes a particular permutation pattern. To complete execution of one FFT stage in the FFT network, p=x R x blocks are reusedN=p times. Since the total number of FFT stages is log x N, the resource efficient 84 design has a latency of O(logN(N=p)) for FFT. Note that the next FFT stage cannot start before completing the execution of the current FFT stage. As the intermediate results of the current stage needs to be stored before the execution of the next stage, the required on-chip memory size is exactlyN. To complete twiddle factor computa- tion in each FFT stage, thep twiddle factor coefficients need to be updated each cycle. Benefiting from the recursive structure of FFT network, only one state machine with log 2 N states for updating pw bits is required to update all the twiddle factor coeffi- cients. Therefore, the amount of logic resource with respect to configuration bits for TWC units is O((Nw) logN). Similarly, D1(p;m;t;v) needs to be reused (log 2 N) times to perform all the required data permutations. For accessing each memory block in theD1(p;m;t;v), a log(N=p)-bit counter is required for read, andp state machines (each having logN states) for updatingp log 2 (N=p) control bits are required for write. Similarly, to configure the connection networks, two state machines (each having log 2 N states) for updating p log 2 p control bits are required. As a result, the total amount of logic resource consumed with respect to the number of control bits of theD1(p;m;t;v) isO((p logN) log(N=p)). 4.1.7 Experimental Setup In this section, we present a detailed analysis of several implementation experiments by varying the parameters. All the designs were implemented in Verilog on Virtex-7 FPGA (XC7VX690T, speed grade -2L) using Xilinx Vivado 15.2. Inputs are 16-bit fixed point complex numbers. The input test vectors for simulation were randomly generated and had an average toggle rate of 25%. We used the VCD file (value change dump file) as input to Xilinx Vivado Power Estimator to produce accurate power dissipation estimation [13]. 
In this section, we use HT Design to denote the high throughput design and RE design to denote the resource efficient design, respectively.

Performance Metrics: Four metrics for performance evaluation are considered in this paper:

Throughput: defined as the number of bits of transformed output per second (Gbits/s). The throughput is computed as the product of the number of sample points transformed per second and the data width per sample. In this section, each sample is a 16-bit complex number, thus the data width is 32 bits.

Energy efficiency (or power efficiency): defined as the number of bits of transformed output per unit energy dissipated (Gbits/Joule) by the design, calculated as the throughput divided by the average power consumed by the design.

Memory efficiency: measured as the throughput achieved divided by the amount of on-chip memory used by the design (in bits).

Energy-Area-Time (EAT): the product of three key metrics: energy, area, and time. For the same problem size, we use the EAT ratio for performance comparison between different designs. Energy is the average energy consumed per transform. Area is the slice usage of the design, i.e., the number of slices occupied by the entire design on the FPGA. Time is the latency of an FFT design.

4.1.8 Design Optimization Evaluation

We employ a baseline architecture implemented using the radix-2 based HT Design (see Section 4.2.5) without applying the proposed optimization techniques discussed in Section 3.1.6. For both designs, we evaluate the amount of BRAM and LUT consumed for problem sizes N = 16, 128, 512, 2048 and 4096. The results are shown in Fig. 4.7. In this plot, the available amount of BRAMs or LUTs is normalized to one on the y-axis.

Figure 4.7: Memory and logic resource used by the HT Design and the baseline (BRAM and LUT consumption for p = 2 and p = 8)

The green bars and blue bars show the resource consumption of the HT Design and the baseline, respectively. Fig. 4.7(a) shows that the consumption of BRAM nearly doubles for the baseline for all the problem sizes when p = 2. The number of BRAMs reduced by the optimizations declines when p = 8 as more memory blocks are implemented using distributed RAM for small values of m/p. The reduction in memory usage is especially significant for N = 4096, p = 2.

Figure 4.8: Energy efficiency of the HT Design and the baseline (radix-2, radix-4, radix-8 and radix-16)

Moreover, Fig. 4.7 shows that the utilization of LUT is also reduced in the HT Design.
This shows that as dual-port memory is eliminated and the total memory size is halved, the LUTs needed for implementing memories are also reduced. This implies that the logic overhead for implementing the proposed in-place algorithm is almost negligible, and it also demonstrates the superiority of our other proposed optimization techniques. Fig. 4.7(d) shows that the LUT consumption of the HT Design is more than that of the baseline for N = 16, 128 and p = 8. The reason is that a large portion of the memory blocks are implemented using distributed RAM in LUTs for the HT Design when N = 16, 128, while memory blocks are implemented using BRAMs by default in the baseline.

Figure 4.9: Power evaluation for various p and N (breakdown into static power, clock, slice, BRAM, DSP, I/O and signal power)

We further evaluate the energy efficiency of the HT Design and the baseline for N = 16, 128, 512, 2048 and 4096 while varying p. The operating frequency is fixed at 250 MHz for the sake of power evaluation. All our designs were pipelined to achieve this clock rate.

Figure 4.10: Design space exploration for (a) N = 64 and (b) N = 4096 (axes T, A and E)

4.1.9 Design Space Exploration

In this section, in order to identify energy hot spots and Pareto optimal designs, we explore the design space by varying the design parameters of the HT Design. The effects of the design parameters on energy efficiency are demonstrated using the proposed performance metrics.

Identify the energy hot spots: In this experiment, we evaluate the power consumption of various key components while varying the data parallelism p and the problem size N. The key components consist of BRAM, clock, I/O, DSP, signal, slice, as well as static
This indicates that memory binding algorithm introduced in Section 4.1.5 can be utilized to improve energy efficiency for various designs. As signal power is highly affected by the pipelining depth, pipeline registers can be balanced to obtain trade offs between energy and performance. Note that other major power components such as I/O power and static power can not be optimized using high level optimization techniques. Pareto optimal design: In this section, we explore the design space of FFT implemen- tations to identify the Pareto optimal designs. Performance metrics including energy, 91 area and time are considered. Fig. 4.10a and Fig. 4.10b show the design space evalua- tion results forN = 64 andN = 4096, respectively. The label next to each design point indicates the pair of design parameters values (x,p), where x denotes the radix value andp is the data parallelism.T indicates the latency in number of cycles.A denotes the number of thousands of slices used.E is the average energy consumption per transform. In Fig. 4.10a, the design points labeled with (2,4), (4,4), (4,8) and (8,8) are connected using red line and achieve the smallestEAT product among all the design points. These design could also be identified as Pareto optimal designs in terms ofEAT assuming the user specified energy constraint is 0:2 W. In Fig. 4.10b, the design points labeled with (4,8), (4,16), (8,8) and (8,16) achieve the smallestEAT product among all the design points. This indicates that the design parameter values of the Pareto optimal designs are highly sensitive to the specified problem size. Increasing data parallelism could significantly improve the performance, thus reducing the average energy consumed per transform especially for large problem sizeN. 4.1.10 Performance Comparison Comparison with state-of-the-arts: Fig. 4.19 presents a scatter plot comparing the design points generated by our tool with several prior work. Both designs labeled [8] and [7] are developed by CMU SPIRAL project for FFT. The designs in [8] are implemented using a patent based technique, and the designs in [7] are patent free. The designs in [11] are generated through the Xilinx IP Core Library. Thex-axis represents the on- chip memory consumption (in Mbits) of a design point, and the y-axis represents the throughput achieved by the design. The size of marks indicate the FFT problem size. Design points closer to the upper left corner of the plot achieve higher throughput with less on-chip memory. In Fig. 4.19, all our designs are dominating designs: for every design in the literature considered in this evaluation, one of our designs offers superior 92 0 1 2 3 4 20 40 60 On-chip memory consumption (Mbits) Throughput (Gbits/s) [8] [7] HT Design [11] Figure 4.11: Memory efficiency comparison of various designs throughput or memory efficiency or both. As shown in the figure, our designs achieve 57% to 223% and 51% to 76% higher memory efficiency compared with [8] and [7], respectively. Our best design provides 21% to 113% improvement in memory efficiency compared with Xilinx IP core [11]. Energy efficiency comparison: We also employ SPIRAL FFT IP Cores from Carnegie Melon University [8] for energy efficiency comparison. The SPIRAL FFT IP Cores are highly optimized FFT architectures. Their FFT soft IP Cores are automatically gener- ated in synthesizable RTL Verilog code with user inputs. The SPIRAL FFT IP cores are high performance FFT designs based on streaming architecture. 
By using their provided tools, customized FFT soft IP cores can be automatically generated in synthesizable RTL Verilog with user inputs [8]. The available parameters of the DFT core generator include transform size, data precision, and streaming width. In our experiments, the number of channels (each channel represents an complex number input) are selected to be 2, 4, and 8. The energy efficiency of the SPIRAL FFT IP Core and our designs are 93 Energy efficiency (Gbits/Joule) 2 4 2 5 2 6 2 7 2 8 2 9 2 10 2 11 2 12 0 10 20 30 40 N HT Design p = 2 RE Design p = 4 [8] p = 8 Figure 4.12: Energy efficiency comparison shown in Fig. 4.20. The experimental results show that our HT Design achieves 11% to 47% energy efficiency improvement for 16, 128, 512, 2048 and 4096 point FFTs, respectively. When the problem size is large, more energy efficiency improvement can be achieved using our designs. 94 4.2 High Throughput Streaming Sorting Accelerating sorting using dedicated hardware to fully utilize the memory bandwidth for Big Data applications has gained much interest in the research community. Recently, parallel sorting networks have been widely employed in hardware implementations due to their high data parallelism and low control overhead. In this chapter, we present a sys- tematic methodology for mapping large-scale bitonic sorting networks onto FPGA. To realize data permutations in the sorting network, we develop a novel RAM-based design by vertically “folding” the classic Clos network. By utilizing the proposed design for data permutation, we develop highly optimized parallel designs for bitonic sorting on FPGAs. The proposed sorting architecture is parameterizable with respect to the given input size, data width and data parallelism. We demonstrate trade-offs among through- put, latency and area using two illustrative sorting designs including a high throughput design and a resource efficient design. With a data parallelism ofp (2pN=2), the high throughput design sorts anN-key sequence with latency 6N=p +o(N), throughput p results per cycle and uses 6N +o(N) memory. This achieves optimal memory effi- ciency (defined as the ratio of throughput to the amount of on-chip memory used by the design) and outperforms the state-of-the-art. 4.2.1 Related Work A hardware algorithm for sortingN elements with a fixed number of I/O ports is pre- sented in [67]. They extend the column sort algorithm to sort data elements in row-major order. They also develop a multi-way merge algorithm to obtain the sorted sequence. It sortsN elements in ( NlogN plogp ) time using a sorting network of fixed I/O sizep and depth O(log 2 p). In [55], algorithms to reduce communication cost for bitonic sort- ing on SIMD and MIMD processors are introduced. They reduce the communication 95 between processors and shared-memory by almost one half when compared with the straight-forward bitonic sorting algorithms. The area time 2 performance of various designs for VLSI sorters is investigated in [84]. Three bitonic sort based VLSI designs are discussed. These designs use logN, log 2 N, and p N logN processors, and achieve time performance of O(N log 2 N), O(N logN) and O(N= logN), respectively. However, throughput performance is not considered. In [42], the authors present a modular design technique to obtain a high throughput and low latency sorting unit using 65-nm TSMC technology. In [94], the SPIRAL project developed DSL (Domain Specific Language) to enable mapping sorting algorithms with flexible design choices. 
Their design supports processing continuous data streams, while energy and memory efficiency are not considered. Several existing sorting architectures on FPGAs are implemented and evaluated in [52]. FIFO or tree based merge sorter as well as bucket sorter are selected as target designs for implemen- tation. Throughput of their designs is low due to limited data parallelism. They also discuss how to exploit partial run-time reconfiguration to reduce resource consumption. In [54], a parameterized sorting architecture using bitonic merge network is presented. Their key idea is to build a recurrent architecture of bitonic sorting network to achieve throughput area trade-offs. Hardware designs to perform primitive database operations including selection, merge join and sorting are presented in [25]. High memory band- width utilization is achieved by implementing their proposed design on an FPGA-based system. Performance comparison between our design and some of the related work is detailed in Section 4.2.9. Some other high performance sorting architectures are devel- oped for platforms other than FPGA [16, 75]. However, it is not clear how to apply their techniques on FPGAs. In this work, we propose an energy and memory efficient map- ping approach for mapping “folded” bitonic sorting network on FPGA. In our approach, we vertically “fold” the classic Clos network into a RAM-based design to perform the 96 data permutations arising in the bitonic sorting network. Detailed experiments are con- ducted to show the performance improvement over the state-of-the-art in terms of energy and memory efficiency. 4.2.2 Parameterized Architecture Fig. 4.13 shows the architecture framework within which we develop parallel designs for sorting on FPGA. A highly data parallel parameterized sorting architecture is pro- posed in the framework. To generate a specific sorting design, the required user input parameters include: Problem sizeN: The sorting problem consists of reordering an arbitraryN-key sequence. Data parallelismp: With an available data parallelism ofp (2pN=2),p keys are fed into the design in each clock cycle. Data widthw: Each input data element has a width ofw bits. Design reuse option: Either a fully pipelined design or a iteratively reused design would be generated based on the user choice. Data precision: Each data element could be represented using fixed point format or floating point format. The architecture framework is developed based on the bitonic sorting algorithm. The proposed architecture consists of main memory, memory interface, architectural building blocks, and control unit. The architecture framework is illustrated using the definitions below: (1) Main memory: We assume the input consists of several data sequences to be sorted, each of lengthN. External (main) memory is employed to store the inputs. 97 Main memory FPGA Input Output Memory interface … Communication stage Computation stage … I1 I1 I2 … C1 I1 I1 I2 … C1 I1 I1 I2 … C1 Figure 4.13: Architecture Framework (2) Stream input/output: The input data sequences are fed into the FPGA continuously in a streaming manner. The input data sequences enter the on-chip design at a fixed rate. After a specific delay, the sorted data sequences are output at the same rate. (3) Computation stage (C1): Each stage consists of p=2 Compare-and-swap (CAS) units. A Compare-and-swap (CAS) unit takes two inputs and may swap the inputs depending on the result of the comparison. 
Each CAS unit can be implemented using LUTs and can be pipelined using flip-flops. (4) Communication stage (D1): A communication stageD1 exists between adjacent computation stages. In this stage, data permutation is performed with a fixed rate ofp inputs/outputs per cycle. (5) Building blocks: I1, I2 and C1 are building blocks whose design structures are mainly determined by the user input parameters. I1 is a p-to-p connection network composed of 22 switches. I2 is a single-port or dual-port memory block with size of N=p. The port type is automatically determined by the generator. Each communication stage needspI2 blocks. 98 x y y' . . > x' . 1 0 1 0 x y y' . . < x' . 1 0 1 0 x y y' . . > < x' . c or 1 0 1 0 (a) (b) (c) Figure 4.14: CAS units: for generating a) ascending order, b) for descending order, c) either order When the user chooses to generate a iteratively reused design, only one stage of C1 block and D1 block is generated and would be time-multiplexed. Such a design consumes least amount of hardware resources at the expense of more latency and lower throughput. To generate a fully pipelined sorting design, logN(logN + 1)=2 stages of C1 blocks andD1 blocks are concatenated to maximize the throughput. Note that such a design has no feedback in the datapath, thus can be highly pipelined to achieve high operating frequency. More implementation details are presented in Section 4.1.6. 4.2.3 Computation Stage Template The computation stageC1 consists ofp=2 CAS units. A CAS unit is used to compare two input values and swap the values either in ascending or descending order. Three different designs of CAS unit are shown in Fig. 4.14. Fig. 4.14a shows the CAS unit used to swap the input values to output in ascending order. Similarly, the design in Fig. 4.14b is used to produce output values in descending order. Besides, we also design a configurable CAS unit shown in Fig. 4.14c to generate the output either in ascending order or descending order. Based on the given user inputs, we instantiateC1 with the three different CAS units. The CAS unit to be employed depends on the computation stage index in the fully pipelined sorting design. 99 4.2.4 Communication Stage Template To realize the data permutations, we employ the proposed RPN to generate the template design of communication stage. Using our proposed approach, theD1 block is able to process the streaming input defined in Section 4.2.2. Streaming inputs are fed into the D1 block continuously without stalling the pipeline. After a certain amount of delay, the inputs are permuted as specified by the required permutation pattern. Vertically “folding” the Clos network: We propose to vertically “fold” the Clos net- work to obtain the datapath for realizing data permutations. Using our proposed RPN, the generated datapath ofD1 consists of: Input stage: AS 1 S 1 connection network denoted usingICN is employed in this stage.ICN could be further implemented as aS 1 S 1 Clos network. To permute data inputs, it is time-multiplexed byS 2 times to perform the interconnection of eachCin. Middle stage: This stage is composed of S 1 memory blocks denoted as Mb i (1iS 1 ), each of sizeN=S 1 . Each memory block can be implemented as either a single-port or a dual-port memory. To permute data inputs, Mb i realizes the permutation byCr i temporally. Output stage: It employs aS 1 S 1 connection network denoted asOCN. OCN could also be further implemented as aS 1 S 1 Clos network. 
To permute data inputs, it is time-multiplexed byS 2 times to perform the interconnection of each Co. While permuting data inputs, a memory conflict is said to occur if concurrent read or write access to more than one word in a memory block is performed in a clock cycle. An important feature of the generated datapath is that it is capable of realizing arbitrary 100 permutation without any memory conflicts. Such feature is formalized as a theorem below: Theorem 4.2.1. For any given data parallelism ofS 1 , the generated datapath can real- ize arbitrary permutation on an N-key input data sequence without any memory con- flicts usingS 1 independent memory blocks, each of sizeS 2 (S 2 S 1 ;S 2 =N=S 1 ). Proof. As introduced in Section 2.2, in the Clos network, each ingress/egress stage S 1 S 1 switch has exactly one connection to each of the middle stageS 2 S 2 switches. To realize a given permutation inC(N;S 1 ;S 2 ), the ingress switches are configured to implement some fixed permutation patterns denoted usingP j (1jS 2 ). By time multi- plexing the connection networkICN shown in Fig. 3.1 forS 2 times, all the permutation patternsP j can be realized. Similarly, the connection networkOCN can also be time multiplexed to implement the permutation patterns of the egress stage switches in the Clos network. In the middle stage of the generated datapath, each memory blockMb i (1 iS 2 ) has a memory size of S 2 and can be written by an output of the ICN in each cycle. In S 2 cycles, Mb i stores S 2 data elements routed from ICN, which are exactly the data inputs ofCr i . By performing permutation in time inMb i , the permu- tation pattern ofCr i is realized temporally. Thus, theS 1 memory blocks realize all the permutation patterns of the middle stage of the Clos network. Therefore, the generated datapath realizes the same permutation as the Clos network in Fig. 3.1. Theorem 4.2.1 indicates that the generated datapath is able to realize all the permutations in the bitonic sorting network. Parameterized design template: Fig. 4.15 shows the template design ofD1 block.D1 block is parameterizable with respect top,m,t, andv defined as below: p: Data parallelism of D1 block, which is the data parallelism of the sorting design. 101 Mb 2 Mb p … ICN OCN Mb 1 Control Unit … … 2p log (m/p) (p/2) log p (p/2) log p Data Path Configuration Look Up Table Control logic Address update logic Control logic Configuration Look Up Table Address Look Up Table Figure 4.15: Template Design ofD1 block m: Length of data sequence to be permuted. In a fully pipelined sorting design, m varies with the communication stage index. t: Stride value for a specific stride permutation to be performed. v:v is used to differentiateP m;t andQ m introduced in Section 2.3.2.D1 performs P m;t ifv = 0, otherwiseQ m . As shown in Fig. 4.15, the datapath includes twop-to-p connection networksICN and OCN, one memory array consisting of p independent memory blocks Mb, each of sizem=p. The control unit consists of two configuration bit look up tables and one address look up table. Each connection network is run time configured during each clock cycle using a configuration bit look up table. Each look up table contains m=p 102 configuration sets; each set includes (p logp)=2 bits. The configuration sets are stat- ically generated for a given fixed permutation. The address look up table stores the configuration bits used by the address update logic to generate memory addresses for accessing thep memory blocks. 
Each memory address has a width of log(m=p). The total number of read addresses and write addresses generated in each clock cycle isp. We denote the parameterizedD1 block asD1(p;m;t;v). Based on Theorem 4.2.1, it can be configured to perform arbitrary permutation which is either P m;t (v = 0) or Q m (v = 1) arising in the bitonic sorting network. Theorem 4.2.1 constraintsS 2 S 1 , i.e., size of each Mb is no less than m=p for D1(p;m;t;v). Furthermore, using l to denote the memory size ofMb, we have: Lemma 4.2.2. D1(p;m;t;v) is dynamically configurable with respect tom,t andv if lm=p. An important feature of the datapath of D1 block is that its datapath structure remains the same when varying values of all its parameters exceptp. Only size of mem- ory blocks will change. Therefore, based on Lemma 4.2.2, with a fixedp and sufficient memory resource, theD1 block is able to perform all the permutation patterns in bitonic sort if having all corresponding control information in the control unit. This property requires the size of each memory blockMb to beN=p for sortingN keys. The control information contains the configuration bits of switches and the addresses for memory access for all the 2 logN unique permutation patterns in bitonic sorting. In Section 4.2.6, we will show how we achieve a resource efficient design for sorting, by run-time configuring the D1(p;m;t;v) to iteratively reuse it. Control bit generation: We adopt the well-known looping algorithm for routing the Clos network to compute the control bits forD1 [40]. Therefore, the time complexity for computing the control bits ofD1 isO(N). Routing algorithms on Clos network have 103 been well studies priorly [40, 38]. These previous optimizations on Clos network can also be reused for realizing data permutations usingD1 block. Fig. 4.16 shows an example where the generated D1 block using Clos network is dynamically configured to perform stride permutation P 8;2 . The configuration bits of D1 are computed from the configuration bits of Clos network. In this example,m = 8 andp = 2. The design needs two 2 2 switches and two 4-entry memories. As the data flows through theD1 block, they are first permuted through the input switch. Then the permuted data flows are written into the 2 memory blocks inm=p = 4 clock cycles, and read out of the memory blocks in the subsequent 4 clock cycles. Finally, intermediate data from memory blocks are permuted through the output switch to obtain the results. Each memory block inD1 can be implemented with single-port memory to permute a single data sequence. However, when processing continuous data streams, dual-port memory with double size is required as concurrent read and write access to different memory locations need to be performed. An algorithm which enables the use of single- port memory for processing continuous data streams is introduced next. In this section, we present details on architecture implementation of the automati- cally generated sorting designs, including a high throughput design and a resource effi- cient design. The high throughput design is fully pipelined to achieve high performance. With a data parallelism ofp, it supports processing continuous streaming inputs with a throughput ofp results per cycle. The resource efficient design consumes minimal hard- ware resources by iteratively reusing the architectural components. It achieves high resource efficiency by iteratively reusing a singleD1 block to perform all the data per- mutations. 
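The Fig. 4.16 example that follows can be reproduced with a few lines of behavioral modeling. The sketch below imitates the folded datapath for p = 2 and m = 8: a 2x2 input switch time-multiplexed over four cycles, two 4-entry memory blocks performing permutation in time, and a 2x2 output switch. The switch settings and address sequences are hand-derived for this one permutation (the design computes them with the looping algorithm); the function and variable names are ours, and this is a software model rather than the RTL.

def run_d1(x, in_sw, write_addr, read_addr, out_sw):
    """Cycle-level behavioral model of the folded D1 datapath for p = 2:
    a 2x2 input switch, two 4-entry memory blocks, and a 2x2 output switch.
    A switch setting of 0 passes straight through, 1 crosses the two lanes."""
    mem = [[None] * 4, [None] * 4]
    # Write phase: two elements enter per cycle and are steered into the memories.
    for c in range(4):
        a, b = x[2 * c], x[2 * c + 1]
        if in_sw[c]:
            a, b = b, a
        mem[0][write_addr[0][c]] = a
        mem[1][write_addr[1][c]] = b
    # Read phase: two elements leave per cycle through the output switch.
    y = []
    for c in range(4):
        a, b = mem[0][read_addr[0][c]], mem[1][read_addr[1][c]]
        if out_sw[c]:
            a, b = b, a
        y += [a, b]
    return y

# Hand-derived control for the P_{8,2} example of Fig. 4.16 (p = 2, m = 8).
x = list(range(8))
y = run_d1(x,
           in_sw=[0, 1, 0, 1],
           write_addr=[[0, 1, 2, 3], [0, 1, 2, 3]],
           read_addr=[[0, 2, 1, 3], [1, 3, 0, 2]],
           out_sw=[0, 0, 1, 1])
print(y)   # [0, 2, 4, 6, 1, 3, 5, 7]: the stride-by-2 permutation of 0..7

After the write phase the two memory blocks hold [0, 3, 4, 7] and [1, 2, 5, 6], and the read orders reproduce the sequences annotated in Fig. 4.16.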
Figure 4.16: Example: routing the Clos network for D1 to realize P_{8,2} and p = 2

Figure 4.17: High throughput design for sorting

4.2.5 High Throughput Design

As shown in Fig. 4.17, the high throughput design is composed of several concatenated sorting stages, each having p/2 parallel CAS units. For a given fixed problem size N and data parallelism p, the high throughput design consists of (log N)(log N + 1)/2 sorting stages. Each sorting stage has at most one D1 block. This design supports sorting continuous data streams. If m <= p at a particular sorting stage, D1 is replaced with an m-to-m switch. In the high throughput design, there are \sum_{i=\log p}^{\log N - 1} ( \sum_{j=\log p}^{i} 1 + 1 ) stages of D1 blocks. Each D1, denoted as D1(p,m,t,v), has its own parameter values for m, t and v. Summing the per-block memory sizes over all the D1 stages gives

T_{DM}(N,p) = \sum_{i=\log p}^{\log N - 1} \left( \sum_{j=\log p}^{i} \frac{2^{j+1}}{p} + \frac{2^{i+1}}{p} \right)    (4.11)

which is (6(N - p) - 2p log(N/p))/p. The terms 2^{j+1}/p and 2^{i+1}/p indicate the size of a memory block in a D1(p,m,t,v). The total latency introduced by the input and output connection networks of the D1 blocks is O((log p) log^2 N). As the total latency introduced by the CAS units is O(log^2 N), the overall latency of the high throughput design is 6N/p + o(N) (2 <= p <= N/2). The size of the entire set of memory blocks used by all the D1s is the product of p and T_{DM}(N,p), which is 6N + o(N). Note that we use little-o notation here (a constant or log N belongs to o(N), but N does not belong to o(N)) [85]. Furthermore, the high throughput design is pipelined to process continuous data streams, resulting in a throughput of p. To achieve high throughput, a pipeline stage can be inserted after each of the (log N)(log N + 1)/2 sorting stages. Thus the total area consumption of the CAS units is O(p log^2 N).

4.2.6 Resource Efficient Design

Fig. 4.18 shows the architecture of the resource efficient design. The resource efficient design cannot support processing continuous data streams, thus the proposed in-place permutation in time algorithm is not applicable. For a given data parallelism p, the resource efficient design requires one D1(p,m,t,v) and p/2 CAS units. During sorting, D1(p,m,t,v) is reused by configuring m, t and v to perform any P_{m,t} and Q_m arising in the bitonic sorting. In this way, this design achieves the highest resource efficiency at the expense of throughput.

Figure 4.18: Overall architecture of resource efficient design

As noted in Section 2.3.2, only 2 log N different permutation patterns exist between the log^2 N comparison stages in the bitonic sorting network. Thus 2 log N distinct contexts of control information are needed, each realizing a particular permutation pattern. To complete execution of one sorting stage in the bitonic sorting network, the p/2 CAS units are reused N/p times.
Since the total number of sorting stages is (log N)(log N + 1)/2, the resource efficient design has a latency of O((log² N) · N/p) for sorting. Note that the next sorting stage cannot start before the execution of the current sorting stage completes. As the intermediate results of the current stage need to be stored before the execution of the next stage, the required on-chip memory size is exactly N. Each of the p/2 CAS units employs the design shown in Fig. 4.14. To complete the comparisons in each sorting stage, the p/2 CAS units need to be dynamically configured. Benefiting from the recursive structure of the bitonic sorting network, only one state machine with log N states updating p/2 bits is required to configure all the CAS units. Therefore, the configuration overhead with respect to configuration bits for the CAS units is O((p/2) log² N). Similarly, D1(p, m, t, v) needs to be reused (log N)(log N + 1)/2 times to perform all the required data permutations. For accessing each memory block in the D1(p, m, t, v), a log(N/p)-bit counter is required for reads, and p state machines (each having 2 log N states) updating p log(N/p) control bits are required for writes. Similarly, to configure the connection networks, two state machines (each having 2 log N states) updating p log p control bits are required. As a result, the total programming overhead with respect to the number of control bits of the D1(p, m, t, v) is O((p log N) log(N/p)).

4.2.7 Experimental Setup

Both the high throughput design and the resource efficient design were implemented on a Virtex-7 FPGA (XC7VX690T, speed grade -2L). This device has 2940 BRAMs (each 18 Kbits) and 108300 slices. The designs were synthesized and place-and-routed with Vivado 2015.2 [10]. Post place-and-route simulations were conducted for behavioral and timing verification. We created input test vectors with an average toggle rate of 50% for simulation. We used SAIF (Switching Activity Interchange Format) files as inputs to the Vivado power analysis tool to produce accurate power dissipation estimates [14]. In this section, we use HT Design to denote the high throughput design.

Design automation tool: We have built a design automation tool based on the techniques and algorithms discussed in this thesis and published it online on GitHub [3]. The algorithm generation, permutation realization and control bit generation have all been incorporated into the tool. The tool takes as input user parameters including the problem size N, the data parallelism p, etc. (see Section 4.2.2). After the parameter values are specified, it outputs a register-transfer level Verilog description of the design. The process of converting a specific design to a Verilog file is handled by a Python module. The generated Verilog designs are then synthesized with the FPGA tools.

Figure 4.19: Memory efficiency comparison of various designs

Performance Metrics:

Throughput: defined as the number of bits sorted per second (Gbits/s), computed as the product of the number of keys sorted per second and the data width per key.

Energy efficiency (or power efficiency): defined as the number of bits sorted per unit energy dissipated (Gbits/Joule) by the design, calculated as the throughput divided by the average power consumed by the design.

Memory efficiency: measured as the throughput achieved divided by the amount of on-chip memory used by the design (in bits).
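As an illustration of how these three metrics are computed, the sketch below evaluates them for a hypothetical design point; the 4 keys per cycle, 250 MHz, 32-bit, 5 W and 2 Mbit figures are made-up inputs, not measured results.

```python
def throughput_gbps(keys_per_sec, key_width_bits):
    """Throughput in Gbits/s: keys sorted per second times data width per key."""
    return keys_per_sec * key_width_bits / 1e9

def energy_efficiency(throughput_gbps_val, avg_power_watts):
    """Gbits sorted per Joule: throughput divided by average power."""
    return throughput_gbps_val / avg_power_watts

def memory_efficiency(throughput_gbps_val, onchip_memory_bits):
    """Throughput per bit of on-chip memory used (bits/s per bit)."""
    return throughput_gbps_val * 1e9 / onchip_memory_bits

# Hypothetical design point: 4 keys/cycle at 250 MHz, 32-bit keys, 5 W, 2 Mbit of BRAM.
tp = throughput_gbps(4 * 250e6, 32)
print(tp, energy_efficiency(tp, 5.0), memory_efficiency(tp, 2 * 2 ** 20))
```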
4.2.8 Asymptotic Analysis

Table 3.4 presents an asymptotic analysis of the performance of various sorting architectures. The details of some of the prior designs are introduced in Section 4.2.1. The proposed HT Design is one of the designs achieving a linear time complexity (latency), which decreases with the available data parallelism p. We also show the constants with little-o notation in the asymptotic expressions for the sake of comparison [85]. The table shows that the memory-throughput ratio (the reciprocal of memory efficiency) of the HT Design is 6N/p + o(N). When p ≥ 4 and N ≥ 128, the HT Design outperforms all the other designs with respect to memory efficiency. Moreover, benefiting from the proposed in-place permutation in time algorithm, the HT Design supports processing continuous data streams using single-port memory blocks.

4.2.9 Performance of HT Design

We employ a baseline architecture implemented using the HT Design without applying the proposed optimization techniques discussed in Section 3.1.6 and Section 3.1.5. For both designs, we evaluate the amount of BRAM and LUT consumed for problem sizes N = 1024, 2048, 4096 and 16384. The results are shown in Fig. 4.7. In this plot, the available amount of BRAMs or LUTs is normalized to one on the y-axis. The green bars and blue bars show the resource consumption of the HT Design and the baseline, respectively. Fig. 4.7 shows that the consumption of both BRAM and LUT nearly doubles for the baseline for all the problem sizes when p = 4. The number of BRAMs saved by the optimizations declines when p = 16, as more memory blocks are implemented using distributed RAM for small values of m/p. The reduction in memory usage is especially significant for N = 16K. Moreover, the figure shows that the LUT utilization is also reduced in the HT Design. This shows that as dual-port memory is eliminated and the total memory size is halved, the LUTs needed for implementing memories are also reduced. It implies that the logic overhead for implementing the proposed in-place algorithm is almost negligible, and it also demonstrates the benefit of our other proposed optimization techniques. Fig. 4.7b shows that the LUT consumption for N = 4096 is higher than that for N = 16384. The reason is that a large amount of distributed RAM in LUTs is employed when N = 4096, while more memory blocks are implemented using BRAMs when N = 16384.

We further evaluate the energy efficiency of the HT Design and the baseline for N = 1024, 2048, 4096 and 16384 while varying p. The operating frequency is fixed at 250 MHz for the sake of power evaluation. All our designs were pipelined to achieve this clock rate. The data width is varied from 8-bit to 64-bit. The experimental results are presented to demonstrate the benefit, from a power point of view, of the optimization techniques incorporated in our design framework for sorting. Fig. 4.8 shows that the energy efficiency of both designs is sensitive to the data width. This is because we create testbenches with a pessimistic estimate of a 50% toggle rate for each design; when the data width increases, the switching activity of the designs increases significantly. The results also show that, as the data width and problem size are varied, the HT Design achieves 27% to 138% improvement in energy efficiency.

4.2.10 Performance Comparison with the State-of-the-Art

Fig. 4.19 presents a scatter plot comparing the design points of our work with several prior works.
The design points labeled [52] are developed for sorting 43K-key or 21.5K-key data sequences. The design points labeled HT Design and SPIRAL [94] are realized for sorting 16K-element data sequences. The design in [63], an embedded-system based sorting solution, can process data sets of up to 250 MB consisting of 8-key data sequences. The x-axis represents the on-chip memory consumption (in Mbits) of a design point, and the y-axis represents the throughput achieved by the design. Design points closer to the upper left corner of the plot achieve higher throughput with less on-chip memory. In Fig. 4.19, all our designs are dominating designs: for every design in the literature considered in this evaluation, one of our designs offers superior throughput or memory efficiency or both. Our designs achieve 135% to 430% higher memory efficiency compared with [52]. Our best design provides 160% and 56% improvement in memory efficiency compared with SPIRAL and [63], respectively.

Figure 4.20: Energy efficiency comparison

We also compare both the HT Design and the resource efficient design with the designs developed by the SPIRAL project [94] with respect to energy efficiency. For the sake of illustration, the operating frequency is set to 250 MHz for power evaluation. The energy efficiency of our designs is compared against that of the SPIRAL IP cores. The problem size is chosen to be 1024, 2048, 4096 and 16384, and we vary the data parallelism from 4 to 16. RE Design represents the Resource Efficient Design in Fig. 4.20. As shown in Fig. 4.20, for various N, the HT Design improves energy efficiency by 49% to 112%. The results also show that the resource efficient architecture consumes the most energy per unit performance. The reason is that a considerable amount of energy is consumed by the path connecting the design and the I/O ports. We believe the resource efficient design could achieve a much higher energy efficiency if implemented in VLSI as an ASIC design.

4.3 High Throughput Streaming Equi-Join

Accelerating database applications using FPGAs has recently been an area of growing interest in both academia and industry. Equi-join is one of the key database operations, and its performance highly depends on the performance of sorting. However, as data sets grow in scale, the sorting primitive exhibits high memory usage on FPGA. For sorting large data sets, external memory has to be employed to perform data buffering between the sorting stages once the FPGA memory resources are exhausted. This introduces pipeline stalls as well as several data communication iterations between the FPGA and external memory, causing significant performance decline. In this section, we present a high throughput design for accelerating Equi-Join using a hybrid CPU-FPGA heterogeneous platform.

4.3.1 Related Work

Using dedicated logic to accelerate database operations, especially on FPGA platforms, has become popular recently in both academia and industry [83, 72, 25, 36]. Researchers at Microsoft developed a reconfigurable fabric to accelerate large-scale data center services [72]; a portion of the ranking tasks of Microsoft Bing Search is accelerated using this prototype system. In [25], hardware designs to perform primitive database operations including selection, merge-join and sort are presented.
Figure 4.21: Hybrid Sorting Design

High throughput performance is achieved by implementing their proposed design on an FPGA-based system. However, the memory bandwidth utilization of their design is relatively low. A system called Glacier, which compiles queries directly to a high-level hardware description, is proposed in [62]. The same team developed a streaming median operator utilizing sorting networks in [63]; however, their design targets much smaller data sets and they do not discuss any throughput optimizations. There is also some work focused on accelerating sorting on FPGAs for database-related applications. Several existing sorting architectures on FPGAs are implemented and evaluated in [52]: FIFO-based and tree-based merge sorters as well as a bucket sorter are selected as target designs for implementation, and the authors also discuss how to use partial run-time reconfiguration to reduce resource consumption. In [36], a parameterized sorting architecture using a bitonic merge network is presented. The key idea is to build a recurrent architecture of the bitonic sorting network to achieve throughput-area trade-offs; however, the presented results are limited to small data set sizes. Other than FPGAs, there are also techniques for high performance join operations on general purpose platforms [50, 22], but it is not clear how to apply these techniques on a heterogeneous CPU-FPGA platform.

Figure 4.22: (a) Compare-and-switch (CAS) unit, (b) Data buffer, (c) Connection network, (d) Parallel-to-serial/serial-to-parallel MUX (PS/SP)

Figure 4.23: Data permutation in the data buffers for 16-key sorting

4.3.2 Hybrid Design for Sorting

Fig. 4.21 shows an overview of our proposed merge sort based hybrid sorting design, where the first few sorting stages in the merge sort tree are replaced with "folded" bitonic sorting networks, each implemented as an FPGA accelerator named a bitonic sorter. k such bitonic sorters work in parallel on the FPGA; the partial results from the FPGA are then merged on the CPU using a merge sort tree based implementation.

High Throughput Bitonic Sorter on FPGA: The bitonic sorter consists of four building blocks (Fig. 4.22): compare-and-switch (CAS) unit, data buffer, connection network, and parallel-to-serial/serial-to-parallel (PS/SP) multiplexer. A complete design is obtained by a combination of these basic blocks.

CAS unit: This module compares two input values and switches the values into either ascending or descending order depending on the control bit value. Each CAS unit is pipelined using flip-flops. To implement an n-input "folded" bitonic sorting network, (log n)(log n + 1)/2 cascaded stages of CAS units are required, each stage consisting of m/2 CAS units, where m is the data parallelism, i.e., the number of parallel inputs/outputs per cycle. The data permutation between subsequent stages of CAS units is performed by the connection network and data buffer modules; a simple software model of the CAS unit and of the resulting sorting network is sketched below.
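The sketch below is only a functional software model of the CAS unit and of a (non-folded) bitonic sorting network built from it, useful for checking the comparison pattern; it does not model the pipelining, the folding to m parallel inputs, or the data buffers, and the function names are ours.

```python
def cas(a, b, ascending=True):
    """Compare-and-switch: emit the pair in ascending or descending order
    depending on the control bit."""
    return (min(a, b), max(a, b)) if ascending else (max(a, b), min(a, b))

def bitonic_sort(keys, ascending=True):
    """Recursive bitonic sorting network; len(keys) must be a power of two."""
    if len(keys) == 1:
        return keys
    half = len(keys) // 2
    first = bitonic_sort(keys[:half], True)       # ascending half
    second = bitonic_sort(keys[half:], False)     # descending half
    return bitonic_merge(first + second, ascending)

def bitonic_merge(keys, ascending):
    """Merge a bitonic sequence using one stage of CAS units per recursion level."""
    if len(keys) == 1:
        return keys
    half = len(keys) // 2
    out = list(keys)
    for i in range(half):
        out[i], out[i + half] = cas(keys[i], keys[i + half], ascending)
    return bitonic_merge(out[:half], ascending) + bitonic_merge(out[half:], ascending)

print(bitonic_sort([7, 3, 0, 6, 5, 1, 4, 2]))     # [0, 1, 2, 3, 4, 5, 6, 7]
```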
Data buffer: Each data buffer consists of a dual-port RAM having n/m entries. Data is written into one port and read from the other port simultaneously. Fig. 4.23 shows the data buffering process for sorting 16 keys. In four cycles, 16 permuted data inputs are fed into the data buffers. In each cycle, four data outputs are read in parallel from alternating locations. For different values of n, the read and write addresses are generated with different strides. In Fig. 4.23, X0, X4, X8 and X12 are written in input cycles 0, 1, 2 and 3, respectively; they are then output simultaneously in output cycle 0.

Connection network: Parallel input data need to be permuted before being processed by the subsequent modules. The connection network is implemented based on our prior work on data permutation [34, 36]. As shown in Fig. 4.23, in input cycle 0, (X0, X1, X2, X3) are fed into the first entry of each data buffer without permutation. In the next cycle, another four data inputs are written into the second entry of each data buffer with one location permuted. After four cycles, the parallel output data (X_i, X_{i+4}, X_{i+8}, X_{(i+12) mod 16}), i = 0, 1, 2, 3, are stored in different RAMs.

PS/SP module: This module multiplexes serial/parallel input data to parallel/serial output, respectively. For example, when the number of I/Os is limited to one but the CAS units operate on four data inputs in parallel, the PS/SP module is employed to match the data rate both before and after the CAS units.

Figure 4.24: A fully pipelined high throughput bitonic sorter

Fig. 4.24 shows a fully pipelined high throughput sorting architecture built using the architectural building blocks introduced above. In the figure, n is the input size and m determines the number of parallel inputs/outputs. The input data sequences can be fed into the sorter continuously in a streaming manner at a fixed rate; after a specific delay, the sorted data sequences are output at the same rate. As a bitonic sorter has (log n)(log n + 1)/2 stages of data buffers, the latency introduced by all the data buffers can be calculated as

T(n, m) = \sum_{i=\log m}^{\log n - 1} \Big( \sum_{j=\log m}^{i} \frac{2^{j+1}}{m} + \frac{2^{i+1}}{m} \Big)    (4.12)

which is (6(n - m) - 2m log(n/m))/m. The factor 2^{j+1} or 2^{i+1} indicates the size of a data buffer. As the total latency introduced by all the CAS units and connection networks is O((log m) log² n), the overall latency of the bitonic sorter is O(n/m) (1 ≤ m ≤ n).

4.3.3 Decomposition-based Task Partition Approach

In this section, we present a decomposition-based approach for task partitioning in our hybrid design for sorting. Assuming k fully pipelined bitonic sorters are implemented on the FPGA, our task partition approach is described as follows:

Figure 4.25: Decomposition based task partition approach

Decompose: Partition the N-key data set (the "work") into N/(nk) groups of subtasks, each group having k subtasks. Let l denote N/(nk). Each subtask is to sort n keys using a bitonic sorter and is denoted w_ij (1 ≤ i ≤ l, 1 ≤ j ≤ k).

Accelerate: Distribute the subtasks w_ij to the bitonic sorters denoted A_1, ..., A_k.
A bitonic sorter A_p (1 ≤ p ≤ k) handles the l subtasks w_ip (1 ≤ i ≤ l). All the bitonic sorters work on the subtasks in parallel. As each bitonic sorter is fully pipelined, all of its assigned l subtasks are processed continuously without any stalls.

Merge: For bitonic sorter A_i, its results are represented as r_i1, ..., r_il, which are produced sequentially in a streaming manner. These sorted data sequences are then transferred from the FPGA to external memory. The remaining work to obtain a completely sorted data sequence is handled by the CPU based on the merge sort tree algorithm.

Figure 4.25 shows the basic idea of the proposed task partitioning approach. To sort N (divisible by n) keys using our hybrid sorting design, each bitonic sorter sorts l = N/(nk) n-key data sequences in a streaming manner. Theoretically, the throughput of each bitonic sorter can be calculated as

Th = \frac{nl}{2nl/m + 6n/m} = \frac{m}{2 + 6/l}    (4.13)

where 2nl/m is the number of input and output cycles and 6n/m is the number of cycles to fill the pipeline, obtained by approximating Equation 4.12. We fix n in our hybrid sorting design; n is chosen based on the amount of available memory on the FPGA. As a result, theoretically, as N increases with fixed values of n and k, Th approaches m/2. This indicates that a high throughput can always be sustained by the parallel bitonic sorters with increasing data set size. After FPGA acceleration, the rest of the computation is shifted to the CPU, which performs O(N log(N/n)) operations using the merge sort tree algorithm. In an FPGA-only approach, to complete the final log(N/n) sorting stages, the FPGA accelerator has to visit external memory for O(log(N/n)) iterations, each iteration loading 2n keys and offloading 2n merged keys. Without the high performance memory hierarchy of the CPU platform, these repeated iterations significantly lower the throughput of the FPGA-only approach. However, as N/n increases, the CPU execution time may become the performance bottleneck in our hybrid design, especially considering the lower memory bandwidth utilization of the CPU. To resolve this issue, we propose two streaming join algorithms in Section 4.3.4 by optimizing the classic CPU join algorithms to overlap the CPU and FPGA computation. Experimental results of our join design in Section 4.3.8 show that, on average, about 40% of the FPGA execution time is overlapped with the CPU execution time.

4.3.4 Streaming Join Algorithms

We develop two streaming join algorithms: the streaming sort-merge join (SSMJ) algorithm and the streaming block nested loop join (SBNL) algorithm. Both algorithms are valuable in practice: SSMJ is applicable if the client query requires two data columns to be joined and sorted, while SBNL has less computation workload if the client query is a join-only request.

Algorithm 6 Streaming Sort-Merge Join Algorithm
1: procedure SSMJ
2:   input: L.r_ij, R.r_ij (1 ≤ i ≤ k, 1 ≤ j ≤ l_L (l_R)); keysel
3:   output: L.r_ij ⋈ R.r_ij
4:   Initialize: s_L ← size of L, s_R ← size of R, l_L = s_L/(kn), l_R = s_R/(kn), j = 0
5:   while j < l_L do                                   ⊲ sorting phase
6:     if receive L.r_1j, L.r_2j, ..., L.r_kj from FPGA then
7:       merge sort L.r_1j, L.r_2j, ..., L.r_kj
8:       L.r(:, j) ← L.r_1j, L.r_2j, ..., L.r_kj and j++
9:     end if
10:  end while
11:  j = 0
12:  while j < l_R do
13:    if receive R.r_1j, R.r_2j, ..., R.r_kj from FPGA then
14:      merge sort R.r_1j, R.r_2j, ..., R.r_kj
15:      R.r(:, j) ← R.r_1j, R.r_2j, ..., R.r_kj and j++
16:    end if
17:  end while
18:  merge sort L.r(:, 1), L.r(:, 2), ..., L.r(:, l_L)
19:  merge sort R.r(:, 1), R.r(:, 2), ..., R.r(:, l_R)
20:  for i = 1 to s_L/T do                              ⊲ merge-join and select
21:    for j = 1 to s_R/T do
22:      call MJS(L.r(:, i), R.r(:, j), keysel)
23:    end for
24:  end for
25: end procedure

Algorithm 7 Merge-Join and Selection (MJS)
1: procedure MJS
2:   input: x, y, keysel
3:   output: x ⋈ y
4:   if (x.min > y.max) || (x.max < y.min) then
5:     return
6:   end if
7:   for each item u in x do
8:     for each item v in y do
9:       if u.key == v.key then
10:        output u ⋈ v if u.key ∈ keysel
11:      end if
12:    end for
13:  end for
14: end procedure

Streaming Sort-Merge Join (SSMJ): We assume two input table columns of equal size need to be joined. The data values of the table columns are represented using vectors L and R. Sub-vectors of L and R are first sorted by the bitonic sorters sequentially; the partial results produced by the FPGA are then merged by the CPU, which also performs the merge-join and selection operations on the sorted L and R. Algorithm 6 shows our proposed algorithm. Notations from Section 4.3.3 are reused to illustrate the algorithm. The sorted data sequences from the FPGA are denoted as L.r_ij and R.r_ij. The size of L (R) is denoted as s_L (s_R). Assume that the size of each L.r_ij or R.r_ij is n. The k bitonic sorters produce k sorted data sequences in parallel, each of size n, after a specific delay. Once k data sequences have been sorted, the bitonic sorters notify the CPU so that it can start merging the k sorted data sequences immediately. This merge sort process is first performed on L.r_ij and then on R.r_ij, as L and R are sorted by the bitonic sorters sequentially. After that, the processor needs to further merge the sorted sub-vectors L.r(:, 1), L.r(:, 2), ..., L.r(:, l_L) and R.r(:, 1), R.r(:, 2), ..., R.r(:, l_R). At this point, both L and R have been sorted based on the key values. The merge-join and selection operations shown in Algorithm 7 are then performed on the CPU. We still use L and R to represent the sorted inputs. L (R) is divided into s_L/T (s_R/T) sub-vectors denoted as L(:, i) (R(:, j)). T is selected empirically in our experiments depending on the cache size. In each loop iteration, two sub-vectors are fetched, each having T data elements. If the two sub-vectors have no key value overlap, the next iteration is executed; otherwise, the key values of the two sub-vectors are compared and the join result is output if a key is selected.

Streaming Blocked-Nested-Loop (SBNL): In this algorithm, instead of completing the sorting phase on the CPU after receiving intermediate results from the FPGA as in the SSMJ algorithm, we perform the merge-join and selection operations immediately. We use the same notations as in Algorithm 6. We evenly distribute the sorting tasks to the k bitonic sorters; L and R are sorted in parallel, each handled by k/2 bitonic sorters. We assume L and R have equal size. Algorithm 8 shows our proposed SBNL algorithm.
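Before the SBNL listing, the merge-join-and-selection step of Algorithm 7, which both SSMJ and SBNL invoke, can be sketched in software. The sketch below is illustrative only: it represents each sub-vector as a Python list of (key, value) tuples sorted by key and keysel as a set of keys, both simplifications of the pseudocode above.

```python
def mjs(x, y, keysel):
    """Merge-join and selection (cf. Algorithm 7) on two key-sorted sub-vectors.
    x, y: lists of (key, value) tuples sorted by key; keysel: set of selected keys."""
    if not x or not y:
        return []
    # Skip the pair entirely if the key ranges do not overlap (x.min > y.max, etc.).
    if x[0][0] > y[-1][0] or x[-1][0] < y[0][0]:
        return []
    joined = []
    for ku, vu in x:
        for kv, vv in y:
            if ku == kv and ku in keysel:
                joined.append((ku, vu, vv))
    return joined

print(mjs([(1, 'a'), (3, 'b')], [(3, 'c'), (5, 'd')], keysel={3}))
# [(3, 'b', 'c')]
```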
Algorithm 8 Streaming Blocked-Nested-Loop Algorithm
1: procedure SBNL
2:   input: L.r_ij, R.r_ij (1 ≤ i ≤ k, 1 ≤ j ≤ l_L (l_R)); keysel
3:   output: L.r_ij ⋈ R.r_ij
4:   constants: s_L ← size of L, s_R ← size of R
5:   while !(finished all L.r_ij ⋈ all R.r_ij) do
6:     if interrupt received then
7:       for j = 1 to l_L do
8:         for i = 1 to k do
9:           if L.r_ij not received then
10:            continue
11:          else if finished L.r_ij ⋈ all R.r_ij then
12:            continue
13:          else
14:            for j' = 1 to l_R do
15:              for i' = 1 to k do
16:                if done L.r_ij ⋈ R.r_i'j' then
17:                  continue
18:                else
19:                  x ← L.r_ij; y ← R.r_i'j'
20:                  call MJS(x, y, keysel)
21:                end if
22:              end for
23:            end for
24:          end if
25:        end for
26:      end for
27:    end if
28:  end while
29: end procedure

The inputs of the CPU are L.r_ij, R.r_ij (1 ≤ i ≤ k, 1 ≤ j ≤ l_L (l_R)), which are produced in a streaming manner by the FPGA accelerators. Once k n-key data sequences have been sorted in parallel, the bitonic sorters send an interrupt signal to the processor. After the k sorted data sequences have been transferred to memory, the software checks whether each L.r_ij has been received using a table of size k·l_L. For specific values of i and j, if L.r_ij has been received, the software further checks whether the join operation has been performed between L.r_ij and all R.r_i'j' using a flag, for a total of k·l_L flag bits over all L.r_ij. If the flag bit is zero, the MJS procedure introduced in Algorithm 7 is called with L.r_ij and R.r_i'j' as inputs, provided the procedure has not already been performed on the two. The benefit of this algorithm is that, as L.r_ij and R.r_i'j' have been sorted, we can check whether the key value ranges of the two input data vectors overlap; if they do not, the merge-join phase can be avoided, saving time. Furthermore, the computation for sorting on the bitonic sorters and the computation for merge-join on the CPU can be overlapped, reducing the overall computation latency.

Table 4.1: Key features of the ZedBoard
CPU core: Dual ARM Cortex-A9, 666 MHz
CPU cache: 32 KB L1D + L1I, 512 KB L2
DRAM and bandwidth: 512 MB DDR3, 3.2 GB/s (assuming 75% memory controller efficiency [12])
FPGA logic resource: 85000 logic cells, 53200 slice LUTs
FPGA on-chip RAM: 560 KB (BRAM)

Figure 4.26: Block diagram of the complete system design on Zynq

4.3.5 CPU-FPGA System Implementation

We target the ZedBoard platform with a Xilinx Zynq Z7020 as our experimental platform to implement the proposed join designs. The Xilinx Zynq processor is a high performance, low power SoC architecture integrating a general purpose CPU and an FPGA [12]. The key features of the ZedBoard are shown in Table 4.1. The Zynq processor consists of two components: the programmable logic (PL) and the processing system (PS), the latter integrating the ARM CPU, on-chip interconnection and various peripherals [12]. A set of Advanced eXtensible Interface (AXI) interfaces are available for the communication between PS and PL; each AXI interface supports 32/64-bit full-duplex transactions. Fig. 4.26 shows the block diagram of our Equi-Join designs on Zynq.

Parallelization and throughput-balancing: We consider the data rate for throughput-balancing purposes.
The data rate is defined as the number of input elements per cycle. The data rate of a bitonic sorter is determined by the clock frequency, the data width and the data parallelism. The data rate S of each bitonic sorter can be calculated as

S = m \cdot w \cdot F_{clock}    (4.14)

where m is the data parallelism defined in Section 4.3.2, w is the data width per data element, and F_clock is the operating frequency of the FPGA design. Assuming F_clock is 100 MHz and one bitonic sorter is attached to each of the four AXI HP ports as shown in Fig. 4.26, the resulting data rate is 3.125 GB/s. Note that a total of 3.2 GB/s peak bandwidth can be achieved by the DDR3 DRAM on Zynq when running at 1066 MHz [12]. This ensures throughput balancing between the DRAM and the bitonic sorters. The current DDR3 device has a 32-bit data bus; we can expect a higher bandwidth when it is replaced with a 64-bit DDR device [6], and more data parallelism on the FPGA can then be explored.

System control and data flow: The system starts from the input phase by feeding data inputs to the bitonic sorters continuously in a streaming manner. The input data set is evenly partitioned to ensure that the workloads of the four bitonic sorters are balanced. The entire input data set is read from DRAM by all the bitonic sorters during the input phase through the AXI HP ports. Let N denote the data set size, and let each bitonic sorter be capable of sorting n inputs. Each of the four bitonic sorters then handles N/(4n) sorting subtasks in a streaming manner, each subtask being to sort n inputs, as shown in Fig. 4.25. An AXI GP interface is enabled and configured so that the processor can send control information to, or receive updates from, the FPGA accelerators. To track the current status of the N/(4n) sorting tasks assigned to a bitonic sorter, we use a status bit vector of size N/(4n). Each bitonic sorter updates its status bit vector through the AXI GP interface after finishing the current sorting subtask and completing the corresponding data transfer. The software engine on the CPU checks the values of the status bit vectors stored in DRAM and initiates either the merge sort tree operation of the SSMJ algorithm or the merge-join operation of the SBNL algorithm whenever two new sorting subtasks have been completed. Some other AXI related control interfaces, such as the central interconnect and memory switch, are also employed to ensure the correct system dataflow [12].

System implementation: We used the firmware generated by the Xilinx Vivado toolset for our target board. To avoid a feedback loop between the CPU and the FPGA, we implement the merge-join and selection operations in software, especially considering that the most time consuming part of join is sorting. We implemented both the SSMJ and SBNL algorithms introduced in Section 4.3.4 on the platform. Detailed experimental results of the two algorithms are presented in Section 3.1.7.

4.3.6 Experimental Setup

All our designs were implemented on the ZedBoard with a Xilinx Zynq SoC XC7Z020-CLG484-1 using Xilinx Vivado 14.4 [12]. To illustrate the benefit of our proposed design approach, we report the performance of the hybrid sorting design as well as the performance of the Equi-Join system design on the ZedBoard.

Figure 4.27: Performance comparison to merge-sort design

Each bitonic sorter can be optimized to run at a maximum frequency of 180 MHz.
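The throughput-balancing argument behind the 100 MHz operating point reported next follows directly from Equation (4.14); the short sketch below just evaluates it for the configuration described above (one 64-bit-wide bitonic sorter per AXI HP port, four ports) and is illustrative only.

```python
def data_rate_bytes_per_s(m, w_bits, f_clock_hz):
    """Equation (4.14): S = m * w * F_clock, converted from bits/s to bytes/s."""
    return m * w_bits * f_clock_hz / 8

# One 64-bit-wide bitonic sorter per AXI HP port, four ports, clocked at 100 MHz.
per_port = data_rate_bytes_per_s(m=1, w_bits=64, f_clock_hz=100e6)
aggregate = 4 * per_port
print(per_port, aggregate)   # 0.8e9 and 3.2e9 bytes/s
# The aggregate rate is on the order of the board's 3.2 GB/s peak DDR3 bandwidth,
# so the sorters and the DRAM stay throughput balanced at 100 MHz.
```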
We clocked the bitonic sorters on the FPGA fabric at 100 MHz for the sake of throughput balancing in our system implementation. We used the Logic Analyzer in the Xilinx Vivado tool set to measure the throughput and latency of our design. To illustrate the advantage of using both the CPU and the FPGA for a single query, we present experimental results comparing our hybrid Equi-Join design with CPU-only and FPGA-only baselines. We also provide a performance comparison with the state-of-the-art.

4.3.7 FPGA Accelerator Performance

To illustrate the benefit of our proposed bitonic sorter, we compare the performance of our design with the state-of-the-art merge-sort based design. We separately implemented the bitonic sorter on a Xilinx Virtex-7 FPGA (XC7VX690T) [14] to ensure a fair comparison with prior work. Figure 4.27 shows the throughput comparison for sorting a 16K-key, 32-bit data sequence. The top right triangle indicates the throughput of our fully pipelined bitonic sorter; the other triangles in red represent compact designs obtained by folding the fully pipelined bitonic sorter horizontally to save logic and memory at the expense of throughput. As shown in Figure 4.27, compared with the merge-sort based design, all our designs are dominating designs: one of our designs offers superior throughput, or uses less on-chip memory, or achieves both. This indicates that our proposed bitonic sorter achieves a higher memory efficiency than the merge-sort design, i.e., the fully pipelined bitonic sorter always delivers higher throughput using the same amount of on-chip memory. The fully pipelined bitonic sorter can handle four 64-bit values per clock cycle (250 MHz), providing a throughput of up to 7.9 GB/s, which almost fully utilizes the peak memory bandwidth (around 10 GB/s) of a 64-bit DDR3 DRAM [6]. As the proposed bitonic sorter is configurable with regard to data parallelism, a high DRAM bandwidth utilization can easily be achieved. There are two reasons why the proposed bitonic sorter outperforms: first, the throughput of the merge sort design depends on the input values; second, the inherent control complexity of merge sort limits its data parallelism.

4.3.8 CPU-FPGA System Performance

In our system implementation, we employ four bitonic sorters running in parallel with a total data parallelism of four. All the bitonic sorters are fully pipelined. Each bitonic sorter handles a 64 KByte data set and produces a 64-bit output result per clock cycle. The supported problem size is chosen to be 64 KBytes based on the available on-chip memory resources of the Zynq PL. Each output is a combination of a 32-bit key and two 16-bit values. We measure the processing throughput, i.e., the number of data values in bytes produced per second when performing Equi-Join.

Figure 4.28: Throughput comparison for various input sizes

Table 4.2: Resource consumption of the PL section on Zynq
Module | LUTs | LUT utilization | BRAMs | BRAM utilization | Registers | Register utilization
Bitonic sorters | 34686 | 65% | 72 | 51.4% | 18693 | 60%
AXI Interconnects | 4048 | 7.6% | 0 | 0% | 4960 | 4.7%
AXI BRAM Controller | 1756 | 3.3% | 0 | 0% | 1712 | 1.6%
Other AXI Interfaces | 2119 | 3.9% | 0 | 0% | 2473 | 2.3%
BRAM for I/O buffering | 0 | 0% | 32 | 22.8% | 0 | 0%

Resource consumption: Table 4.2 summarizes the resource consumption of all the logic modules on the Zynq PL.
The on-chip communication interfaces, including the AXI interconnects, the AXI BRAM controller and other AXI related control interfaces, consume 14.8% of the LUTs of the programmable logic. The four bitonic sorters consume 65% of the LUTs and 51.4% of the BRAM blocks. An additional 32 BRAM blocks are employed for input/output data buffering. All the communication interfaces were implemented using Xilinx provided IP cores [12]. As these IP cores can be memory mapped into the PS address space, the data communication between the FPGA and the processor was easily handled at the software level.

Comparing Software, Hardware and Hybrid Designs: In this section, we compare the performance of the SSMJ algorithm based accelerator with sort-merge join algorithm based CPU-only and FPGA-only implementations. The SSMJ based CPU+FPGA approach uses the same experimental setup introduced in Section 4.3.8. The CPU-only design runs on a single Cortex-A9 core inside the Zynq system with caches enabled.

Figure 4.29: Throughput performance of the SBNL-based design and the SSMJ-based design (panels: 10%, 30%, 50% and 70% of tuples with matches)

The FPGA-only design was implemented on the PL section of the Zynq system. The percentage of matching tuples for all input sizes was varied from 10% to 70%, and the average throughput is reported. In the FPGA-only design, for input sizes greater than 64 KBytes, the four fully pipelined bitonic sorters for sorting 64 KByte data sets are first employed to rearrange the entire input data set into sorted 64 KByte sub data sets; a merge sorter with a single merge stage is then used to merge the sorted 64 KByte sub data sets into a single sorted data set. For input sizes smaller than 64 KBytes, the merge sorter is not required. The accelerators for the merge-join and selection operations in the FPGA-only design are implemented based on prior work [25]. The modules in the FPGA-only design have a total data parallelism of four, which is the same as the total data parallelism of the bitonic sorters in the system implementation of the SSMJ algorithm. We vary the data set (a data column) size from 2 KBytes to 64 MBytes for performance evaluation.

The overall throughput (GBytes/s) for the three design approaches is shown in Figure 4.28. Our proposed hybrid design achieves an average of 3.1x throughput improvement compared with the CPU-only approach. This is because the CPU-only approach usually achieves a low memory bandwidth utilization [49, 50]. The proposed hybrid design is 1.9x as fast on average as the FPGA-only approach. We can see that for input sizes greater than 64 KBytes the throughput of the FPGA-only design declines significantly, and switching to the CPU for the merge-join and selection phases in the hybrid approach gives faster execution than the FPGA-only approach. This implies that for the FPGA-only implementation the benefit of the massive data parallelism on the FPGA is offset by the cost of the increasing number of data loading and offloading iterations between the FPGA accelerator and the DRAM.
We also observe that a large portion of the CPU computation is overlapped with the FPGA computation in our hybrid design, which in turn justifies the efficiency of our proposed hybrid design approach. More detailed results on the execution time breakdown are presented in Section 4.3.8.

Comparing SBNL and SSMJ: In this section, we present experimental results of performing Equi-Join using the SBNL and SSMJ algorithms. We vary the input size from 2 KBytes to 64 MBytes to evaluate the scalability of the two hybrid designs. As the bitonic sorters on the FPGA are fixed to sort 64 KByte data sets, we only need to modify the software implementations to process data sets of various sizes. As introduced in Section 4.3.4, in the SBNL algorithm the merge-join operation is performed only if two sorted data sequences have overlapping key values, so the execution time of the merge-join also depends on the number of matching tuples. Figure 4.29 shows the effect of the percentage of matching tuples on the performance of our proposed streaming join algorithms. We vary the percentage from 10% to 70% for all input sizes. We notice that when there are fewer tuples with matches, the SBNL algorithm improves the overall join performance compared with the SSMJ algorithm, especially for large data sets. This is because fewer merge-join operations need to be performed in SBNL when the percentage of tuples with matches decreases. As shown in Figure 4.29, compared with the SSMJ based approach, the SBNL based approach improves the throughput by up to 38% when 10% of the tuples match. For both the SSMJ and SBNL algorithms, the throughput decreases with the input size. The reason is that, as the maximum problem size supported by the FPGA accelerator is fixed, more computation needs to be handled by the CPU as the data set size grows, and thus the overall impact of FPGA acceleration becomes smaller. A larger FPGA device providing more on-chip memory resources could further improve the performance.

Figure 4.30: Execution time breakdown of the SSMJ-based design

Hybrid execution time breakdown: Figure 4.30 provides a breakdown of the execution time of the SSMJ algorithm for various input sizes. The FPGA→CPU time indicates the latency overhead for switching from the FPGA accelerators to the CPU. We observe that this latency is easily hidden once the CPU execution time and the FPGA execution time overlap, for input sizes beyond 64 KBytes. Data transfer overhead is included in both the CPU computation time and the FPGA computation time. On average, 40% of the FPGA execution time is overlapped with the CPU execution time for input sizes greater than 64 KBytes. We observe that the FPGA execution time increases almost linearly with the input size. This observation matches well with our theoretical analysis of the throughput of the bitonic sorters in Equation 4.13. The CPU execution time increases significantly as the input size grows beyond 64 KBytes, which is consistent with the throughput decline in Figure 4.28. When the input size is smaller than 64 KBytes, the CPU execution time accounts for 10% of the total on average. For input sizes beyond 64 KBytes, the CPU computation time, shown in blue, accounts for 37% of the total on average, and eventually becomes a performance bottleneck.
More performance improvement could be achieved by using a faster CPU with more cache resources.

4.3.9 Comparison with Prior Works

As the ZedBoard offers much less DRAM bandwidth than the platforms used in prior work, our design throughput turns out to be comparatively low. However, as indicated by our results in Sections 4.3.7 and 4.3.8, the performance of our hybrid design scales well as more bandwidth becomes available; low throughput is not an inherent problem of our hybrid solution. We believe more powerful CPU-FPGA platforms, such as the Xilinx UltraScale+ Zynq and Altera Stratix 10 SoCs, will obtain improved performance when implementing our proposed algorithms. To make a fairer comparison with prior works, we use throughput per unit bandwidth as a metric, considering the memory-bandwidth-bound nature of the Equi-Join operation. Table 4.3 shows a detailed comparison with several state-of-the-art works on reported join performance. In [50], the authors achieve a throughput of 128 million 64-bit tuples per second (1 GB/s) with 25.6 GB/s of available bandwidth on a CPU platform. In [49], 4.6 GB/s of aggregate throughput is achieved using a GPU, while the peak memory bandwidth is 192.4 GB/s. In the most recent work [25], the authors propose a hardware implementation of join on a multi-FPGA platform achieving a throughput of 6.45 GB/s; their platform has a total of 115.2 GB/s of peak memory bandwidth, so the design utilizes only 5.6% of the peak memory bandwidth. Our implementation provides a 3.9x increase on average over the reported bandwidth utilization of the state-of-the-art design [25].

Table 4.3: Comparison with prior works
Work | Platform | Clock freq | Throughput (GB/s) | BW (GB/s) | Throughput/BW (%)
[25] | Multiple Xilinx Virtex-6 FPGAs | 200 MHz | 6.45 | 115.2 | 5.6%
[50] | Intel Core i7 965 system | 3.2 GHz | 1 | 25.6 | 3.8%
[49] | Nvidia GTX 580 GPU | 1.5 GHz | 4.6 | 192.4 | 2.3%
This work | ZedBoard | 100 MHz | 0.69 | 3.2 | 21.6%

Chapter 5

Conclusion

5.1 Summary of Contributions

Parallel designs for realizing data permutations are critical for high performance implementations of data and signal processing applications: intermediate results have to be frequently permuted between consecutive computation stages in ubiquitous FFT and sorting algorithms. In this thesis, we proposed novel algorithms for realizing data permutations on streaming data, developed optimal designs for specific data permutation patterns in data and signal processing algorithms, and built complete parameterized designs for streaming applications. We applied our techniques to several candidate applications, including FFT, sorting and Equi-Join, and evaluated the post place-and-route performance of the streaming designs on FPGA. With respect to realizing data permutations on streaming data, our contributions include:

- We proposed a universal RAM-based Permutation Network which is capable of realizing an arbitrary permutation for a given data parallelism.

- We developed a divide-and-conquer based mapping algorithm that vertically folds the classic interconnection networks, including the Benes network and the Clos network, thus maintaining the minimal connection cost of these networks.

- We developed a heuristic routing algorithm for the Benes network to optimize the interconnection complexity of our proposed streaming permutation network.
- Based on the proposed RAM-based permutation network, we developed a design automation framework targeting high performance designs for data and signal processing kernels such as sorting, FFT and Equi-Join on FPGA.

- Detailed experimental results showed that, by reducing the interconnection complexity, our design outperforms the baselines with respect to both throughput and energy efficiency.

From the perspective of high throughput stream processing on FPGAs, our contributions include:

- We developed a design framework in which an FFT or sorting problem is first transformed into a data permutation problem, which is then realized by our proposed RAM-based permutation network.

- In the proposed design framework, we incorporated memory optimizations for processing continuous data streams at the algorithmic level and developed architecture binding algorithms to improve the energy efficiency of the designs at the architecture level.

- Our experimental results show that the proposed high-level optimizations improve the memory and energy efficiency of our designs significantly.

- Our experimental results show that the proposed design framework can be used to obtain high performance designs with respect to latency, throughput and energy efficiency.

- We presented a high throughput, energy efficient implementation of the Radix-x Cooley-Tukey FFT algorithm achieving significant performance improvement compared with state-of-the-art designs.

- We developed high throughput bitonic sorting on FPGAs demonstrating superior memory efficiency compared with the state-of-the-art designs.

- We developed streaming Equi-Join algorithms customized for a CPU-FPGA platform by optimizing the classic CPU-based nested-loop join and sort-merge join algorithms.

- We developed a hybrid sorting design to alleviate the burden of memory usage on the FPGA. Our designs improve the average sustained throughput of the Equi-Join implementation compared with FPGA-only and CPU-only designs, especially for large data sets.

- We reported the performance and identified the effect of the percentage of matching tuples on the throughput of the two streaming join algorithms.

- Our implementation on the ZedBoard achieves significant improvement in DRAM bandwidth utilization compared with the state-of-the-art designs.

5.2 Future Work

Our proposed RAM-based permutation networks, which perform the data communication between subsequent computation stages, can be applied to a wide variety of domains. Our preliminary work has shown that the proposed RAM-based permutation networks can be utilized to solve data layout issues in external memory [30]. Design automation tools using our proposed techniques can also be built for design space exploration and for finding Pareto optimal designs for a given application.

5.2.1 Optimizing Memory Performance through Dynamic Data Layout

In external memory such as DRAM, memory row activations consume significant energy [2]. By using appropriate data layouts [30, 35, 32] leading to optimal data locality, the number of memory row activations can be minimized. For large data sets, intermediate results need to be communicated between the FPGA and the general purpose processors through main memory. Our preliminary work indicates that the data layout generated by writing FPGA output to main memory is usually not the optimal data layout required for the general purpose processor to read memory [80].
For example, when a processor needs to merge several n-key data sequences and n is large, successive read operations on different memory rows need to be performed by the processor. This results in poor data locality and frequent memory row activations. The total number of memory row activations can be minimized by providing the optimal data layout required by the processor when it executes computation tasks with the external memory [2]. The proposed RPN could be employed as a data layout remapping engine between the FPGA and main memory, so that the required data layout can be obtained dynamically at run time. Such an approach could be utilized to reduce the number of memory row activations significantly. Data layout optimization is a complex problem that depends on the memory technology and the application parallelization.

5.2.2 Design Space Exploration and Pareto Optimality

To perform design space exploration, a data driven performance model can be built for a hybrid algorithm on a heterogeneous platform. Such a performance model, tailored to the properties of the platform, can reduce the number of parameters that need to be explored.

Figure 5.1: Pareto frontier for energy efficiency and throughput

The multi-objective integrated optimization using the performance model over energy efficiency and throughput leads to a Pareto frontier of designs (the designs shown by red points in Figure 5.1). A design is Pareto optimal if it is impossible to improve one objective without deteriorating the other [74]. Finding the set of Pareto optimal designs is particularly useful: by obtaining all potentially optimal solutions, a designer can make trade-offs by exploring the designs on the Pareto frontier within this constrained set of parameters (i.e., the space to the right of and below the curve defined by the red points in Figure 5.1), rather than considering the full range of all parameters. Therefore, exploiting the proposed permutation-based design framework to derive Pareto optimal designs would be promising for efficient design space exploration.

5.2.3 Application-specific Composite Metrics

Using application-specific composite metrics to evaluate the performance of our designs would also be advantageous for our proposed design framework. Specifically, we could measure energy efficiency improvement with respect to Giga Operations per Second per Watt (GOPS/Watt) and Giga UPdates per Second per Watt (GUPS/Watt) [15] for graph database problems. From the system point of view, GUPS measures memory architecture throughput in terms of the number of memory locations, in billions, that can be randomly updated per second. Note that for large graph databases, memory energy is the main contributor to the overall energy consumption [30]. To avoid optimizing GOPS/Watt or GUPS/Watt by simply extending the processing time beyond what is reasonable for an application, we can also explore the use of other metrics such as the energy-delay product, which rewards designs that reduce runtime and latency or increase throughput. As a result, we can eliminate unnecessary arithmetic operations and efficiently choose which operations to perform. This, however, conflicts with the goal of minimizing data movement and communication, which may require additional computations.

Bibliography

[1] 7 Series FPGAs Memory Resources User Guide. http://www.xilinx.com/support/documentation/user_guides/ug473_7Series_Memory_Resources.pdf.

[2] DDR4 SDRAM. http://www.micron.com/products/dram/ddr4-sdram.
Abstract
Streaming architectures are widely employed in hardware design, particularly when high throughput is desired. A streaming architecture takes input and produces output at a fixed rate, with no gap between consecutive data streams. Applications implemented with streaming architectures are usually composed of computation stages separated by data permutations, each of which is a fixed reordering of the data elements.

Data permutation is a fundamental problem in various research areas, including signal processing, machine learning, and data analytics. Well-known permutations include stride permutation (perfect shuffle, corner turn, etc.), bit reversal, and the Hadamard reordering. A data permutation can be realized simply by reordering hardware wire connections if all data inputs are available concurrently. However, such an approach is not desirable for large input sizes due to its high interconnection area and complexity. This overhead can be greatly reduced in streaming architectures. However, when the streaming width (the number of input/output data elements per cycle) is non-trivial (2 or more), designing a streaming permutation structure becomes challenging.

In this thesis, we develop a universal RAM-based Permutation Network (RPN) for realizing permutations on streaming data. The RPN is universal in that it can realize any fixed permutation on a given data sequence for a given data parallelism. The key idea is to construct the RPN through a divide-and-conquer based mapping algorithm utilizing the classic multi-stage Clos and Benes networks. We further develop an RPN-based framework for generating optimal designs for specific permutations arising in well-known data-intensive algorithms. The designs are optimal in the following sense: they consume the minimum number of memory words and the minimum number of multiplexers necessary for realizing a particular permutation, and they achieve minimum latency. In particular, we make the following contributions: (1) a universal RAM-based permutation network that performs any given fixed permutation with a flexible data parallelism, (2) a divide-and-conquer based mapping algorithm such that the RPN is parameterized to accommodate any input size and data parallelism, (3) algorithmic-level optimizations for reducing the memory and the multiplexers consumed by the RPN, (4) RPN-based optimal designs for well-known permutations arising in data-intensive applications including sorting, equi-join, and the Fast Fourier Transform (FFT), and (5) highly optimized RPN-based streaming architectures for these applications on Field-Programmable Gate Array (FPGA).

We evaluate our research contributions through post place-and-route experiments on FPGA. We provide detailed experimental results of the RPN-based designs for streaming permutation using various performance metrics, including memory efficiency and interconnection complexity. We also present evaluation results which demonstrate that our proposed highly optimized streaming architectures achieve high performance with respect to throughput, energy efficiency, and memory efficiency. As future work, we discuss opportunities for applying the proposed RPN in data layout optimization for new memory technologies.
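To make the permutations named above concrete, the following Python sketch is an illustrative software reference model (added here for exposition; it is not the RPN hardware design itself). It builds the index mappings for a stride permutation and for bit reversal, then applies a fixed permutation to a sequence that is consumed and produced in chunks of a chosen streaming width. The function names and the particular stride convention are our own; stride permutation conventions in the literature differ by a transpose or an inverse.

# Illustrative software model of fixed permutations on a streamed sequence
# (a reference sketch only; the hardware RPN described in the thesis is not shown here).

def stride_permutation_map(n, stride):
    # Gather map: output position i takes the input element at
    # (i mod stride) * (n / stride) + floor(i / stride).
    # For n = 8, stride = 2 this interleaves the two halves of the input
    # (the perfect shuffle); other conventions differ by a transpose/inverse.
    assert n % stride == 0
    return [(i % stride) * (n // stride) + (i // stride) for i in range(n)]

def bit_reversal_map(n):
    # Gather map for bit reversal; n must be a power of two.
    bits = n.bit_length() - 1
    return [int(format(i, '0{}b'.format(bits))[::-1], 2) for i in range(n)]

def permute_stream(data, index_map, width):
    # Apply a fixed permutation, then regroup the output into per-cycle
    # chunks of 'width' elements (the streaming width).
    out = [data[src] for src in index_map]
    return [out[i:i + width] for i in range(0, len(out), width)]

if __name__ == '__main__':
    n, width = 8, 2
    data = list(range(n))
    print(permute_stream(data, stride_permutation_map(n, 2), width))
    # [[0, 4], [1, 5], [2, 6], [3, 7]]  -- perfect shuffle, 2 elements per cycle
    print(permute_stream(data, bit_reversal_map(n), width))
    # [[0, 4], [2, 6], [1, 5], [3, 7]]  -- bit-reversed order

In the streaming setting only 'width' elements are visible per cycle, so the reordering cannot be realized by wires alone; it must be implemented with memories and multiplexers across cycles, which is precisely the problem the RPN addresses.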
Asset Metadata
Creator: Chen, Ren (author)
Core Title: Optimal designs for high throughput stream processing using universal RAM-based permutation network
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 02/13/2017
Defense Date: 12/07/2016
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: bit-reversal, data layout, data processing, database, equi-join, FFT, FPGA, hardware acceleration, machine learning, OAI-PMH Harvest, permutation network, signal processing, sorting, sorting networks, streaming architecture
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Prasanna, Viktor K. (committee chair)
Creator Email: renchen@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c40-336895
Unique Identifier: UC11258035
Identifier: etd-ChenRen-5054.pdf (filename), usctheses-c40-336895 (legacy record id)
Legacy Identifier: etd-ChenRen-5054.pdf
Dmrecord: 336895
Document Type: Dissertation
Rights: Chen, Ren
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA