Compiler and Runtime Support for Hybrid Arithmetic and Logic Processing of Neural Networks
by
Arash Fayyazi
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2023
Copyright 2024 Arash Fayyazi
Dedication
This thesis is dedicated to the brave and resilient Iranian people who have fearlessly fought for their
freedom, endured imprisonment, and sacrificed their lives during the Woman, Life, Freedom movement.
Their unwavering commitment to achieving equality and justice serves as an enduring testament to the
strength of the human spirit.
To those whose voices have been silenced, whose dreams have been cut short, and whose lives have
been forever altered, I offer my deepest admiration and respect. Your unwavering determination and
courage in the face of adversity have inspired me throughout my academic journey.
This dedication is a heartfelt tribute to the countless individuals who have faced persecution and imprisonment, simply for demanding their inherent rights. Their sacrifices will never be forgotten, and their
memory will continue to fuel our collective pursuit of a more just and inclusive society.
The Woman, Life, Freedom movement in Iran has had a profound impact on shaping the discourse around
human rights and social justice. Through their struggle, Iranian women have challenged societal norms,
shattered barriers, and opened doors for future generations. Their relentless pursuit of equality has ignited
a flame of hope that burns brightly, illuminating the path toward a more equitable world.
It is with great humility that I dedicate this thesis to the Iranian people, for it is their courageous fight
for freedom that has provided me with the motivation and inspiration to undertake this research. May our
shared commitment to justice and liberation serve as a reminder that the pursuit of knowledge and the
struggle for human rights are intertwined.
Acknowledgements
I would like to express my heartfelt gratitude to Professor Massoud Pedram for his unwavering support,
invaluable guidance, and unwavering commitment to my academic journey. His expertise, insightful feedback, and encouragement have been instrumental in shaping the trajectory of this thesis. I am truly grateful
for the privilege of working under his mentorship.
I extend my sincere appreciation to Professor Pierluigi Nuzzo and Professor Aiichiro Nakano for their
valuable contributions as members of my thesis committee. Their expertise, critical insights, and constructive feedback have enriched the quality of my research and broadened my intellectual horizons.
To my dear wife, Haleh Akrami, whose unwavering belief in me has been a constant source of strength
and motivation, I am forever grateful. Your love, patience, and understanding have sustained me through
the challenges of this journey, and I am deeply appreciative of your unwavering support.
I would like to express my deep gratitude to my family, particularly my parents, for their unwavering
love, constant support, and unyielding belief in my abilities. Their steadfast encouragement and selfless
sacrifices have provided the solid foundation upon which I have forged my academic path.
Additionally, I would like to express my gratitude to all of my collaborators and lab mates who have
contributed to this work. Their expertise, collaboration, and shared enthusiasm have been invaluable in
shaping and refining my research. I am grateful for their camaraderie, stimulating discussions, and the
sense of community we have fostered together.
In conclusion, I would like to express my deep appreciation to all those who have contributed to the
completion of this thesis. Your support, guidance, and encouragement have been essential in making this
endeavor a reality. This journey would not have been possible without each and every one of you.
Thank you.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Chapter 2: Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 ANN, CNN, and DNN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.2 NullaNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 CNN Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Compiler Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Architectural Techniques for Exploiting Data Reuse . . . . . . . . . . . . . . . . . . . . . . 17
2.6 Dataflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.7 Accelerator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.8 SDAccel Environment and Host Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
Chapter 3: An FPGA-Friendly Framework for Designing Ultra-Low-Latency Nulla-Mapped
Neural Network Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1 F2N3 Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Deep Neural Network Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Logic Minimization and Dataset Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Hardware Realization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Accelerator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.2 Memory Layout and Data Placement . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.3 Nulla Streaming Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 F2N3 Back-end Compilation Module: Optimizer and Scheduler . . . . . . . . . . . . . . . . 35
3.5.1 Replication of Fixed-Function Combinational Logic Blocks . . . . . . . . . . . . . . 36
3.5.2 Cost Function and Replication Factor Determination . . . . . . . . . . . . . . . . . 37
3.6 SDAccel Code Generation Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.6.1 SW Code Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.6.2 HW Code Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.7.2 Tasks with extreme-throughput requirements . . . . . . . . . . . . . . . . . . . . . 41
3.7.3 Tasks with high-accuracy requirements . . . . . . . . . . . . . . . . . . . . . . . . 44
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
Chapter 4: Efficient Compilation and Mapping of Fixed Function Combinational Logic on Digital
Signal Processors Utilizing High-level Synthesis . . . . . . . . . . . . . . . . . . . . . . 47
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Terminology and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Hardware Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.4.1 Memory Layout and Data Placement . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.2 Hardware and Software Optimization . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4.2.1 Burst Data Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2.2 Double Buffering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2.3 Task pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.2.4 Multiple Parallel Accelerators . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.5.1 Mapping to Logic Gates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.5.2 Compiler Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5.3 Illustrating Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.6 Application of Proposed Method to NN Inference . . . . . . . . . . . . . . . . . . . . . . . 70
4.6.1 Optimization Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.7 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.7.1 Efficacy of Model Used in Compiler Optimization . . . . . . . . . . . . . . . . . . . 72
4.7.2 Analytical Comparison: Memory communications vs Computations . . . . . . . . 73
4.7.3 Comparison Between MAC-based, XNOR-based and Nullanet-based Implementation on CNNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Chapter 5: Algorithms and Hardware for Efficient Processing of Logic-based Neural Networks . . 84
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.2 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.3 Terminology and Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4 Proposed Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5 Boolean Processor Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.5.1 Programmable LPE unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5.2 Programmable non-blocking multicast switch network . . . . . . . . . . . . . . . . 97
5.5.3 Parallelization of multiple LPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
5.6 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6.1 Addressing the width issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.6.2 Addressing the depth issue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
5.6.3 Utilizing multiple LPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.7.1 Ablation study with the gate fanout limitation . . . . . . . . . . . . . . . . . . . . . 109
5.7.2 LPE utilization calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.7.3 Blocking vs nonblocking switch network . . . . . . . . . . . . . . . . . . . . . . . . 112
5.7.4 Comparison with SoA ASIC BNN Accelerators . . . . . . . . . . . . . . . . . . . . 113
5.7.5 Comparison Between MAC-based, XNOR-based and Nullanet-based FPGA
Implementation of NNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.7.6 Ablation study with LPV count . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.7.7 Multiple LPUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
Chapter 6: Sparse Periodic Systolic Dataflow for Lowering Latency and Power Dissipation of
Convolutional Neural Network Accelerators . . . . . . . . . . . . . . . . . . . . . . . . 118
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
6.2 Preliminaries and Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2.1 DNN processing and Compilers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.2.2 Periodic Pattern-based Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
6.3 Overall Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.4 Compiler Tailored to Periodic Pattern-based Sparsity . . . . . . . . . . . . . . . . . . . . . 124
6.4.1 Kernel and Filter Ordering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.4.2 Systolic Padding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.5 SPS Dataflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6.6 Proposed Periodic Sparsity Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.7 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.7.1 Experimental Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.7.2 Storage Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
6.7.3 Hardware Utilization and Energy Efficiency Comparisons . . . . . . . . . . . . . . 134
6.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
Chapter 7: Conclusions & Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
List of Tables
3.1 Comparison of the hardware realization metrics of F2N3 with those of LogicNets
[101] on JSC and NID tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.1 Symbols used in Chapter 4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.2 The contents of the input data buffer, opcode buffer, and address memory buffer for
realizing function g1 (c.f. fig. 4.3). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 The contents of the input data buffer, opcode buffer, and address memory buffer for
realizing function g2 (c.f. fig. 4.4). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 A taxonomy overview of existing neural network accelerators . . . . . . . . . . . . . . . . . . . 89
5.2 Instruction set specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.3 Resource utilization of design of LPV count = 16. . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4 Summary of design specs for MLPMixers used in this chapter. The "S" and "B" (small and base)
model scales follow Tolstikhin et al. [99]. The notation "B/4" means the model of base scale
with patches of resolution 4 × 4. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.5 Normalized throughput (%) with respect to a design using strictly nonblocking single-stage
crossbar to demonstrate the switch network bottleneck. . . . . . . . . . . . . . . . . . . . . . . 110
5.6 Comparison with SoA ASIC BNN accelerators. . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.7 FPS Comparison between different implementations of models with high accuracy requirements.
LPV count in LPU is 16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.8 FPS Comparison between different implementations of models with high throughput requirements.
LPV count is 16. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
6.1 Parameters used in the proposed compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
6.2 Hardware Utilization for VGG16 on CIFAR-10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
List of Figures
1.1 Block diagram of a general neural network accelerator architecture. . . . . . . . . . . . . . 2
1.2 Block diagram of an exemplary design of a neural network processing circuit utilizing
XNOR and pop count operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 An exemplary convolutional layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 F2N3 flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 N consecutive layers realized using the proposed accelerator design and the details of the
second layer’s realization. The replication factor is two. Yellow and red cuboids show the
first and the second computation iterations of the Nulla layer, respectively. Each cuboid
is a patch of input feature maps (IFMs) passed through the computational engines to
produce a channel of output feature maps (OFMs). . . . . . . . . . . . . . . . . . . . . . . 31
3.3 The per-layer computations for each architectural clock, for a three-layer neural network
implemented using the inter-layer concurrent streaming architecture. A kernel is a
fixed-function, combinational logic block realized in the hardware. . . . . . . . . . . . . . 34
3.4 Layer-by-layer latency improvements achieved by using the F2N3 flow and fixed-function
combinational logic functions for VGG-16. On average, we achieve around 1384x latency
improvement using F2N3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.1 A high-level view of a local area of the Xilinx FPGA layout. . . . . . . . . . . . . . . . . . . . . 77
4.2 Hardware architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.3 Illustration example 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.4 Illustration example 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5 A comparison between our proposed model used in the compiler and actual hardware
implementation in terms of the achieved performance for layer 7 of VGG16 network. . . . 81
4.6 The percentage of latency spent in the memory communication and computation phases. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.7 The overall dataflow of the proposed implementation. . . . . . . . . . . . . . . . . . . . . . 82
4.8 Comparison between MAC-, XNOR-, and NULLANET-based implementation. The main
resource for the computations in all three type of the implementations are DSP blocks. . . 82
4.9 Comparison between MAC-, XNOR-, and NULLANET-based implementation. The main
resource for the computations in all three type of the implementations are DSP blocks. . . 83
5.1 An example of neuron realization with FFCL. Weights are shown on edges; without loss of
generality, a step-function nonlinearity is assumed with a threshold value of 1. . . . . . . . . . . . 88
5.2 An overview of the proposed framework. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 The LPU architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4 Architecture of 64 × 64 5-stage interconnection network. . . . . . . . . . . . . . . . . . . . . . 99
5.5 An example of parallelizing 4 LPUs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.6 A running example for partitioning, scheduling and instruction queue configuration.
a) partitioning where Lmax = 7, b) scheduling of the MFGs, and c) instruction queue
configuration corresponding to the scheduled MFGs. Note: Ci denotes clock cycle i. . . . . 104
5.7 A running example for partitioning, scheduling and instruction queue configuration with a depth
issue handler. a) partitioning where Lmax = 10, b) scheduling of the MFGs with a circulation
mechanism, and c) instruction queue configuration corresponding to the scheduled MFGs with the
instruction relocation. Note: Ci denotes clock cycle i. . . . . . . . . . . . . . . . . . . . . . . . . 106
5.8 An example of workload balancing in 2 LPUs implementation. . . . . . . . . . . . . . . . . . . . 109
5.9 Ablation study with different gate fanout allowances. . . . . . . . . . . . . . . . . . . . . . . . . 110
5.10 LPE utilization rate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.11 Inference time of VGG16 and LENET5. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.12 The computation cycle count of different models for different numbers of LPUs within the
underlying Boolean processor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.13 FPS Comparison between models mapped to a Boolean processor with 1 LPU and 10 LPUs using different
scheduling algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
6.1 Illustration of periodic pattern-based sparsity. KSS=4 and P=2 . . . . . . . . . . . . . . . . 121
6.2 Overall flow of the SPS Acceleration Framework . . . . . . . . . . . . . . . . . . . . . . . . 122
6.3 Overview of the PPW storage format. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
6.4 High-level overview of systolic array accelerator design. . . . . . . . . . . . . . . . . . . . 129
6.5 Total storage comparison for unique VGG16 CONV layers. . . . . . . . . . . . . . . . . . . 131
6.6 Indexing storage comparison for unique VGG16 CONV layers. . . . . . . . . . . . . . . . . 131
6.7 Total storage comparison for unique ResNet18 CONV layers. . . . . . . . . . . . . . . . . . 132
6.8 Indexing storage comparison for unique ResNet18 CONV layers. . . . . . . . . . . . . . . . 132
6.9 Percent Storage for unique VGG16 CONV layers. . . . . . . . . . . . . . . . . . . . . . . . 133
6.10 Benchmarking total storage against baseline for different sparsity. . . . . . . . . . . . . . . 133
6.11 Benchmarking total storage against PPW for different sparsity. . . . . . . . . . . . . . . . . 133
6.12 Energy Savings over dense baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
Abstract
Deep neural networks (DNNs) are deployed widely to solve challenging data analytics, classification, and
forecasting problems. To improve their output quality, DNNs are growing in size and complexity, demanding ever more compute cycles, memory footprint, and I/O bandwidth during their training and inference.
Given their performance, flexibility, and energy efficiency, field-programmable gate array (FPGA)-based
DNN accelerators are gaining traction as serious contenders to replace graphics processing unit- and
central processing unit-based platforms. This dissertation aims to provide compiler and runtime support for hybrid arithmetic and logic processing of neural networks. The key idea of the logic processing
part is to replace expensive multiply-and-accumulate operations that are required to compute various filter/neuron functions in a DNN with Boolean logic expressions, which are subsequently mapped to native
look-up tables (LUTs) of the FPGA device, resulting in low hardware cost and ultra-low latency. In this
dissertation, we present F2N3, an across-the-stack design and optimization framework for the construction
of resource-constrained and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
Our experimental evaluations across several datasets and DNN architectures demonstrate the superior
performance of F2N3 in terms of inference latency, energy efficiency, and output accuracy compared to
prior-art FPGA-based DNN accelerators.
We also present a framework for efficient compilation and mapping of fixed-function combinational logic (FFCL) onto digital signal processors (DSPs) on FPGAs utilizing high-level synthesis. Mapping large Boolean functions with many input variables and product terms to DSP blocks requires a new framework that accounts for the structure and reconfigurability of the DSP blocks during this process. The proposed methodology in this dissertation maps the fixed-function combinational logic blocks to a set of Boolean functions in which the Boolean operations inside the functions are mapped onto DSP blocks rather than the look-up tables (LUTs) of the FPGA, thereby taking advantage of the high performance, low latency, and parallelism of DSP blocks.
We also introduce a novel reconfigurable Boolean processor consisting of multiple logic processing
units for processing logic-based NN models comprising large Boolean functions with many input variables
and product terms. The Boolean processor is accompanied by a mapping framework for the compilation
and mapping of NNs utilizing FFCL into this Boolean processor. We also present a scheduling approach that includes iterative modulo scheduling of the maximal feasible sub-graphs (MFGs) and a place-and-route of the scheduled MFGs onto the Boolean processor's architectural resources. A circulation strategy is presented to handle FFCL blocks that cannot straightforwardly fit the Boolean processor.
Finally, we also improve the performance of the arithmetic processing of NNs. In particular, this dissertation introduces the Sparse Periodic Systolic (SPS) dataflow, which advances the state of the art in hardware accelerators supporting lightweight neural networks. Specifically, the SPS dataflow enables a
novel hardware design approach unlocked by periodic pattern-based pruning, resulting in neural network
weights with characteristically higher regularity and thus exhibiting higher degrees of parallelism. We
achieve this by addressing the central challenge of reducing the overhead incurred by the irregularity of
weights. Our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in
hardware to create matches between the weights and corresponding activations.
Chapter 1
Introduction
Deep neural networks (DNNs) have surpassed the accuracy of conventional machine learning models in
many challenging domains including computer vision [77, 71, 95, 55, 116, 44] and natural language processing [46, 41, 43, 28]. Advances in building both general-purpose and custom hardware have been among
the key enablers for transforming DNNs from rather theoretical concepts to practical solutions for a wide
range of problems [94, 20].
A neural network inference task may be run on a variety of platforms ranging from CPUs and GPUs
to FPGA devices and custom ASICs. A common feature of most of these platforms is that they provide
processing elements that are capable of doing an arithmetic multiply-and-accumulate operation on weights
and input activations to produce intermediate results that are then acted upon by other processing elements
capable of applying a nonlinear transformation to produce the output activations. These platforms are
commonly referred to as neural network inference accelerators, machine learning accelerators, or deep
learning accelerators.
Fig. 1.1 shows a general neural network accelerator architecture doing neural network inference.

Figure 1.1: Block diagram of a general neural network accelerator architecture.

Existing neural network inference accelerators incur a high latency cost and/or use enormous hardware resources which, in turn, prevent their deployment in latency-critical applications, especially on resource-constrained platforms. The high latency and large hardware cost emanate from the fact that practical, high-quality deep learning models entail billions of arithmetic operations and millions of parameters, which
exert considerable pressure on both the processing and memory subsystems. To sustain the ubiquitous
deployment of deep learning models and cope with their high computational and memory complexities,
many methods operating at different levels of the design hierarchy have been developed.
At the algorithmic level, methods such as model quantization [62, 19, 33, 78, 68, 18], model pruning [106, 117, 69, 107, 56, 63, 40, 29], and knowledge distillation [35, 100, 32, 42, 67, 76, 98] have gained
popularity.
Model quantization methods refer to methods for quantizing weights and/or activations during training and inference of neural network models. The data representation formats for the input and output activations vary and can range from full-precision floating point (32-bit operands) to half-precision floating point (16 bits) to fixed-point representations (widths between 16 and 8 bits) to 8- or 4-bit integers to binary. In the case of binary representation for weights and activations, the multiply-and-accumulate (MAC) operations are implemented with XNOR and pop count (counting the number of 1’s). The arrangement in Fig. 1.2 shows the block diagram of an exemplary design of a neural network processing circuit utilizing XNOR and pop count operations for the neural network inference task with binary weights and activations, in addition to the off-chip memory.
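To make the XNOR/pop-count idea concrete, the following minimal sketch (an illustration under the +1/−1 encoding described above, not code from this dissertation) emulates a binarized dot product in software; the helper names and the 8-element vectors are arbitrary choices for the example.

```python
import numpy as np

def pack_bits(values):
    """Map a +1/-1 vector to a bit mask (1 encodes +1, 0 encodes -1)."""
    bits = (np.asarray(values) > 0).astype(np.uint8)
    mask = 0
    for b in bits:
        mask = (mask << 1) | int(b)
    return mask

def xnor_popcount_dot(w_mask, x_mask, n):
    """Binary dot product: matches = popcount(XNOR); dot = 2 * matches - n."""
    xnor = ~(w_mask ^ x_mask) & ((1 << n) - 1)  # keep only the n valid bits
    matches = bin(xnor).count("1")
    return 2 * matches - n

# Check against the ordinary arithmetic dot product.
w = np.array([+1, -1, -1, +1, +1, -1, +1, +1])
x = np.array([+1, +1, -1, -1, +1, -1, -1, +1])
assert xnor_popcount_dot(pack_bits(w), pack_bits(x), len(w)) == int(np.dot(w, x))
```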
Model pruning is another approach for reducing the memory footprint and computational cost of neural networks, in which filters or subsets of filters with small sensitivity are removed from the network model, resulting in a sparse computational graph. Here, filters or subsets of filters with small sensitivity are those whose removal minimally affects the model or layer output accuracy.
To optimize and map a trained neural network model to hardware, a compiler is needed. The compiler is a software program which optimizes and transforms application code describing a neural network
inference task into low-level code (hardware-specific instructions) that are executed on a neural network
inference accelerator. The compiler typically performs a variety of operations, for example, pre-processing,
lexical analysis, parsing, semantic analysis, conversion of input programs to an intermediate representation, initialization of a neural network, low-level code generation, instruction scheduling, data movement
management, or combinations thereof. Indeed, many data path and memory optimizations and device-specific code generation techniques targeting machine learning applications have been proposed [9, 60, 8, 4, 87, 104].

Figure 1.2: Block diagram of an exemplary design of a neural network processing circuit utilizing XNOR and pop count operations.

At the architecture level, different dataflow architectures (e.g., output stationary and weight stationary dataflows) that support various data reuse schemes have been developed in order to reduce the
stationary dataflows) that support various data reuse schemes have been developed in order to reduce the
data movement cost and improve the hardware efficiency of the required neural network computations for
a network layer [11, 113, 24, 90, 38, 114]. At the circuit and device levels, various energy-efficient, digital
and analog processing components for vector-matrix multiplications have been designed [34, 12, 17, 51,
36, 88, 16].
Generally speaking, the CNN accelerator designs on the target device may be divided into two categories [105, 86, 5, 11]: single computation engine architectures and streaming architectures. As its name implies, the first approach utilizes a generic accelerator architecture comprising a single computation engine (e.g., a systolic array of MAC units) that is used for the computation of all neural network layers. This approach, which executes the computation of the CNN layer by layer sequentially, sacrifices customization for flexibility. This approach, which has been used in many prior works [91, 93], is also called a homogeneous design. The streaming architecture, on the other hand, uses one distinct hardware component for each layer, where each component is optimized separately to exploit the parallelism in its corresponding layer, constrained by the available resources of the target device [104, 108]. The streaming architecture (a.k.a. heterogeneous design) tends to use more hardware resources, but results in DNN inference with higher throughput compared to the single computation engine architecture.
While there is a large body of work on efficient processing of DNNs [97], energy-efficient, low-latency
realization of these models for real-time and latency-critical applications continues to be a complex and
challenging problem. A major source of inefficiency in such conventional platforms and dataflows is the need to look up the weights from a weight memory (which may be on- or off-chip) and perform a MAC operation between each weight and the corresponding input activation (which is also read from an on-chip or off-chip input buffer). Both the memory accesses (the buffers involved are typically large and located outside the processing element arrays) and the full MAC operations are costly. Even in the case of
binary representation for weights and activations, where expensive MAC operations are implemented
with low-cost XNOR and pop count operations, the overhead of memory accesses for weight look-ups is
still significant.
What is needed is compiler and runtime support for energy-efficient, low-latency processing of neural networks during the inference phase. We propose a hybrid framework for efficient neural network processing that optimizes a target neural network for a given dataset and maps key parts of the required neural network computations to ultra-low-latency, low-cost, fixed-function logic processing elements which are added to the arithmetic processing elements (e.g., tensor and vector computation units) that are typically found in conventional neural network accelerator designs. Examples of such neural network computations are those performed in individual filters one at a time, all filters within one layer, and even all filters within groups of consecutive layers in the DNN. The remaining computations (i.e., those that are not mapped to fixed-function, combinational logic blocks) will be scheduled to run on arithmetic processing elements.
While the idea of converting layers of DNNs to fixed-function, combinational logic (FFCL) followed by
the mapping of those blocks to look-up tables (LUTs) has been previously discussed in NullaNet [74] and
LogicNets [101], its application has been limited to multilayer perceptrons (MLPs) designed for relatively
easy classification tasks. For example, NullaNet applies this idea to MLPs with a few hundred neurons
while LogicNets designs MLPs with tens of neurons such that the number of inputs to each neuron is
small enough to enable full enumeration of all its input combinations (e.g., fewer than 12 inputs). The
arrangement in Fig. 3.2 shows the block diagram of an exemplary design of an inference accelerator for
doing a layer of a neural network using FFCLs. Input feature maps are stored in input buffers and are
fetched into input registers before FFCLs are applied to them to calculate the output feature map. Next
these results are moved to output registers and stored in output buffers, which function as the input buffers
for the next layer.
LogicNets cannot be applied to neural networks where filters in a layer receive hundreds or even thousands of inputs and therefore full enumeration of all input combinations is an impossibility. Moreover, both of these techniques rely on Boolean logic functions only, whereas in many cases multi-valued (say 4- or 8-valued) logic is the right approach. In addition, these prior-art techniques tend to result in large output accuracy loss in many neural network applications, a challenge that is successfully addressed by this work through modifications made to the neural network model itself. Furthermore, both techniques are only capable of optimizing MLPs while CNNs play an important role in many real-world problems. Creating truth tables for CNNs may lead to logic functions with hundreds of thousands to millions of minterms, which cannot be optimized with existing methods. Finally, these prior works make use of only fixed-function, combinational logic blocks whereas many real-world applications can benefit from heterogeneous computational fabrics comprising MAC-based compute units, XOR/popcount compute units, and custom FFCL compute units. This idea of hybridization can also be extended to other technologies such as memristors, mainly targeting the parallelism offered by memristive crossbar arrays in Boolean function implementation, which can be considered as a future direction of this thesis.
In summary, my Ph.D. research aims to provide compiler and runtime support for hybrid arithmetic and logic processing of neural networks. The present work extends the original NullaNet idea in many novel ways, including, but not limited to, applying it to convolutional neural networks (CNNs), optimizing the number of times each filter in a CNN should be replicated to achieve extremely low latency while maintaining an acceptable resource utilization level, balancing the utilization of different resources on field-programmable gate array (FPGA) devices by mapping certain layers to digital signal processors (DSPs) within these devices, and developing a compilation flow and compiler tool to automate different network architecture and hardware mapping optimizations for a given DNN model and a target FPGA device. In recent years, we have presented F2N3, an across-the-stack design and optimization framework for the construction of resource-constrained and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
We also present a novel design and optimization methodology for the compilation and mapping of NNs utilizing fixed-function combinational logic to DSPs on FPGAs employing a high-level synthesis flow. This flow can be integrated into F2N3 to map some of the Boolean logic expressions to DSPs if the total number of LUTs on the target FPGA is limited. In another work, we designed a Boolean processor, which is essentially a set of vectors of Boolean logic units and interconnects, wherein each Boolean logic unit performs the Boolean operations of a Boolean function associated with an FFCL block extracted from a BNN, as introduced in F2N3; such a processor is critical to doing inference with different BNNs. Therefore, designing efficient Boolean processors as logic-based NN inference engines, which can be used in a variety of applications, is highly desirable.
Compilation and scheduling of an arbitrary Boolean logic graph associated with a Boolean function to
be mapped onto a Boolean processor is a challenging task from dual viewpoints of the Boolean processor
design and the compiler design. The compiler needs to group the operations of all gates that can be
executed simultaneously considering hardware resource limitations (i.e., the number of Boolean logic units
per Boolean processor). The next challenge is that each node in logic level i of a given logic graph can be
connected to any node in logic level i − 1 of the graph so that a naive interconnection scheme among the
Boolean logic units may (and will likely) result in significant routing congestion and delay. To address this
problem, we present an innovative optimization methodology for compiling and mapping BNNs utilizing
FFCL into this Boolean processor. The proposed compiler generates customized instructions for static
scheduling of all operations of the logic graph during inference.
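To make the scheduling challenge concrete, here is a simplified sketch (an illustrative example, not the compiler presented later in this thesis) that levelizes a small logic DAG and packs each level into cycles of at most `num_units` gate evaluations; the graph encoding and the unit count are assumptions made for the example.

```python
from collections import defaultdict

def levelize(gates):
    """gates: dict gate -> list of fanin gates (primary inputs have no entry).
    Returns dict level -> list of gates, where a gate's level is 1 + max fanin level."""
    level = {}
    def depth(g):
        if g not in gates:          # primary input
            return 0
        if g not in level:
            level[g] = 1 + max(depth(f) for f in gates[g])
        return level[g]
    for g in gates:
        depth(g)
    by_level = defaultdict(list)
    for g, l in level.items():
        by_level[l].append(g)
    return dict(by_level)

def schedule(gates, num_units):
    """Greedy resource-constrained schedule: at most num_units gate operations per
    cycle, and a gate is issued only after all gates of earlier levels."""
    by_level = levelize(gates)
    cycles = []
    for l in sorted(by_level):
        batch = by_level[l]
        for i in range(0, len(batch), num_units):
            cycles.append(batch[i:i + num_units])
    return cycles

# Example: a small logic graph with 4 primary inputs and 3 logic units per cycle.
g = {"n1": ["a", "b"], "n2": ["b", "c"], "n3": ["c", "d"],
     "n4": ["n1", "n2"], "n5": ["n4", "n3"]}
print(schedule(g, num_units=3))   # e.g. [['n1', 'n2', 'n3'], ['n4'], ['n5']]
```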
This thesis is not limited to run-time support and optimization for logic-based inference of neural networks; we also improve the performance of arithmetic-based inference. In this respect, we target weight pruning, which is an approach for reducing the memory footprint and computational cost of neural networks. By removing redundant weights whose removal does not harm the model accuracy, the model is compressed from a dense to a sparse computational graph. With the progress in weight pruning methods, pattern-based pruning has emerged as a promising avenue that seeks to find a sweet spot between the two conventional pruning schemes: 1) structured pruning, which has high regularity and is hardware-friendly, but is susceptible to accuracy degradation; and 2) unstructured pruning, which retains high accuracy, but suffers from large hardware overhead to manage irregular weight indices. Pattern-based pruning compromises between these two pruning schemes by enforcing a semi-structured level of regularity through pre-defined patterns. This ameliorates the hardware overhead compared to unstructured pruning, but it still necessitates a series of auxiliary buffers to manage a unique set of indexing scenarios with the pattern-based approach. At its core, the hardware overhead caused by indexing sparse weights manifests a fundamental design limitation that prevents the accelerator from further optimizing latency, power, and memory requirements. Finally, we advance the state of the art in sparse neural network accelerator design by exploiting the concept of periodicity in pattern-based pruning for the first time in hardware.
Chapter 2
Preliminaries
This chapter includes background on neural networks, CNN processing, compiler optimizations, and a
description and taxonomy of hardware architectural approaches for designing CNN accelerators. The
chapter also describes the SDAccel environment, which is used for developing CNN accelerator hardware
and host software. This chapter also briefly describes the idea behind NullaNet, which is required for
understanding different steps of mapping a filter/neuron to fixed-function, combinational logic blocks.
2.1 ANN, CNN, and DNN
Artificial neural networks (ANNs) constitute a class of machine learning models which are inspired by
biological neural networks. An ANN is comprised of artificial neurons and synaptic connections. Each
artificial neuron (neuron, for short) receives information from its input synaptic connections, processes
the information, and produces an output which is consumed by neurons connected to its output synaptic
connections. On the other hand, each synaptic connection (called an edge) determines the strength of the
connection between its producer and consumer neurons using a weight value.
The first mathematical model of an artificial neuron was presented by Warren S. McCulloch and Walter Pitts in 1943 [66]. A McCulloch-Pitts neuron (a.k.a. the threshold logic unit) takes a number of binary
excitatory inputs and a binary inhibitory input, compares the sum of excitatory inputs with a threshold,
and produces a binary output of one if the sum exceeds the threshold and the inhibitory input is not set.
More formally,
$$
y =
\begin{cases}
1 & \text{if } \sum_{i=1}^{n-1} x_i \ge b \text{ and } x_0 = 0 \\
0 & \text{otherwise,}
\end{cases}
$$
where each x_i represents one of the n binary inputs (x_0 is the inhibitory input while the remaining inputs are excitatory), b is the threshold (a.k.a. bias), and y is the binary output of the neuron. It is evident that
a McCulloch-Pitts neuron can easily implement various logical operations such as the logical conjunction
(AND), the logical disjunction (OR), and the logical negation (NOT) by setting appropriate thresholds and
inhibitory inputs. As a result, any arbitrary Boolean function can be mapped to an ANN that is comprised
of McCulloch-Pitts neurons. One of the main shortcomings of McCulloch-Pitts neurons is the absence of
weights which determine the strength of synaptic connections between neurons.
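As a concrete illustration of the logic-gate claim above, a minimal sketch of a McCulloch-Pitts unit realizing AND, OR, and NOT might look as follows (an illustrative example based on the formulation above, not code from the cited work).

```python
def mcp_neuron(excitatory, threshold, inhibitory=0):
    """McCulloch-Pitts unit: output 1 iff the sum of excitatory inputs reaches the
    threshold and the inhibitory input is not set."""
    return int(sum(excitatory) >= threshold and inhibitory == 0)

AND = lambda a, b: mcp_neuron([a, b], threshold=2)
OR  = lambda a, b: mcp_neuron([a, b], threshold=1)
NOT = lambda a:    mcp_neuron([],     threshold=0, inhibitory=a)

assert AND(1, 1) == 1 and AND(1, 0) == 0
assert OR(0, 1) == 1 and OR(0, 0) == 0
assert NOT(0) == 1 and NOT(1) == 0
```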
A perceptron, which was first proposed by Frank Rosenblatt in 1958 [85], addresses some of the shortcomings of McCulloch-Pitts neurons by introducing tunable weights and allowing real-valued inputs. The
output of a perceptron is found by
$$
y =
\begin{cases}
0 & \text{if } \sum_{i=0}^{n-1} w_i x_i < b \\
1 & \text{if } \sum_{i=0}^{n-1} w_i x_i \ge b
\end{cases}
= H(w \cdot x - b), \qquad (2.1)
$$
where each w_i determines the strength of its corresponding input x_i, w · x is the dot product of weights and inputs∗, and H(·) is the Heaviside step function.
A learning algorithm adjusts values of weights such that they form a decision boundary that perfectly
segregates linearly-separable data. To allow the direct use of gradient descent and other optimization
methods for tuning weights, the Heaviside step function can be replaced with a differentiable nonlinear
∗Please note that $\sum_{i=0}^{n-1} w_i x_i$ is replaced with the dot product of w and x for conciseness.
function such as the logistic function, hyperbolic tangent function, and rectifier†
[39]. As a result, the
output of a neuron can be written as
y = ϕ(w · x − b), (2.2)
where ϕ(·) represents the nonlinear function (a.k.a. the activation function). In this new equation, outputs
can assume any real value defined in the range of the activation function. The outputs are usually referred
to as activations.
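A minimal NumPy sketch of (2.2), included only to make the notation concrete (the weight values, bias, and the choice of ReLU as the activation are arbitrary assumptions for this example):

```python
import numpy as np

def neuron_output(w, x, b, phi=lambda z: np.maximum(z, 0.0)):
    """y = phi(w . x - b), with ReLU as the default activation function."""
    return phi(np.dot(w, x) - b)

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.5, 0.25])
print(neuron_output(w, x, b=0.1))   # 0.4 for these example values
```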
To enable effective segregation of nonlinear data, perceptrons are organized into multiple layers where
each layer includes several neurons and each neuron in a layer is connected to all neurons in the previous
layer (except for neurons in the first layer which are directly connected to inputs). Such an ANN is referred
to as a multilayer perceptron (MLP) and each layer is referred to as a linear (a.k.a. fully-connected) layer.
MLPs are typically trained through the backpropagation algorithm. Backpropagation efficiently computes the gradient of a loss function, which measures the deviation of predicted output from the ground
truth, with respect to the weights of the network. This is achieved by applying the chain rule to compute
the gradients, iterating backward from the last layer to avoid redundant calculations of intermediate terms
in the chain rule. The aforesaid efficient calculation of gradients makes it feasible to use gradient descent
optimization for updating the weights to minimize loss.
While MLPs have proven successful in a variety of applications, other classes of ANNs may be better
suited for many other application domains. For example, convolutional neural networks (CNNs) have
become the de facto standard for solving various computer vision tasks such as image classification, object
detection, and semantic segmentation.
Each layer in a CNN is comprised of multiple three-dimensional (3D) trainable filters which are applied
to different patches of a three-dimensional input. A layer is typically described by its kernel width and
height (wk and hk), number of input channels (cin), number of filters (cout), stride (s), and padding (p). Each
†A unit employing the rectifier is referred to as a rectified linear unit (ReLU).
3D filter raster scans an input volume‡
(a.k.a. input feature maps) along its width and height dimensions
with a stride s, applies (2.2) to each visited input volume of wk × hk × cin to generate different output
pixels, and produces a two-dimensional (2D) output channel (a.k.a. output feature map) comprised of the
said pixels. Note that the input volume is typically zero-padded by p along its width and height dimensions.
The output volume, which is the input volume to the next layer, is found by stacking the 2D output channels
of all cout 3D filters along a third dimension.
Assuming that the input width is represented with win, the output width wout can be calculated by
$$
w_{out} = \frac{w_{in} - w_k + 2p}{s} + 1. \qquad (2.3)
$$
Similarly, the output height hout can be found given hin. Fig. 2.1 illustrates a convolutional layer where
the input volume is 5 × 5 × 3, the padding is zero, the kernel size is 3 × 3, the stride is one, the number of
filters is four; therefore, the output volume is 3 × 3 × 4. Notice that a linear layer is in fact a convolutional
layer with wk = hk = 1, cout filters each of which corresponds to an output neuron, s = 1, p = 0, and an
input volume of 1 × 1 × cin, where each one of cin input channels corresponds to an input neuron.
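As a quick check of (2.3) against the example above, a small helper (an illustrative sketch, not code from this dissertation) reproduces the 3 × 3 × 4 output volume:

```python
def conv_out_size(w_in, w_k, p, s):
    """Output width (or height) per Eq. (2.3)."""
    return (w_in - w_k + 2 * p) // s + 1

# 5x5x3 input, 3x3 kernel, no padding, stride 1, 4 filters -> 3x3x4 output volume
w_out = conv_out_size(w_in=5, w_k=3, p=0, s=1)
print(w_out, w_out, 4)   # 3 3 4
```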
CNNs may include other types of layers such as max pooling or average pooling layers. Pooling
layers implement non-linear down-sampling of individual feature maps by partitioning them into non-overlapping regions of size wp × hp§ and calculating the max or average functions over each region. A
by-product of pooling is the progressive reduction in the size of feature maps.
Another type of layer which is commonly used in MLPs and CNNs is the batch normalization layer
[47]. A batch normalization layer performs centering and scaling on its inputs, which in turn improve the
speed, performance, and stability of training and doing inference with DNNs.
‡The input volume may be zero-padded by p along its width and height dimensions.
§Typically, win = hin, wout = hout, wk = hk, wk ≤ win, hk ≤ hin, and wp = hp.
Figure 2.1: An exemplary convolutional layer.
Deep neural networks comprise a number of layers i = 1, · · · , N; each layer i contains a number of filters, n_i. The layers are connected such that layer i may feed into any layers in front of it, including
layers i + 1 to N (e.g., in dense feed-forward networks) although typically the fanout range of a layer
is limited to a small value such as 1 (for simple feed-forward networks) or 2 (for feed-forward networks
with skip connections). Each layer receives the input data as input feature maps (input activations) and
produces output feature maps (output activations). The first and last layers are special layers where the
first layer processes raw input data corresponding to the training or inference data points and the last layer
(typically) applies the softmax function to its input activations to produce the classification or forecasting
results of the DNN. The other layers may be any of a number of common types such as fully-connected or
convolutional. Each of these layers is typically decomposable into a collection of sub-layers, such as tensor
computation sub-layer for doing multiply-and-accumulate operations, nonlinear transformation sub-layer
for applying activation functions to outputs of the tensor computation sub-layer, batch normalization sub-layer, max pooling sub-layer, etc., as explained above.
2.2 NullaNet
A summary of the NullaNet [74] flow is as follows. NullaNet first discretizes input and output activations of
artificial neurons to binary values while training a DNN. Next, it forms Boolean specifications for the said
neurons either by enumerating all their possible input combinations and calculating their corresponding
outputs (a.k.a. realization based on input enumeration) or through applying all training data points to the
neural network and for each neuron, recording values of the binary inputs and outputs encountered when
processing each data point (a.k.a. realization based on incompletely specified functions (ISFs)). Realization
based on input enumeration implements the exact same function as the one realized using MAC operations.
However, it is only applicable to neurons with a small number of inputs (e.g., 14 inputs or fewer).
Realization based on ISFs, on the other hand, samples the algebraic function that represents each neuron and transforms that algebraic function to a Boolean function that approximates it. This approach
is suitable for implementing neurons designed for state-of-the-art neural networks which include tens to
hundreds of inputs. In such neurons, the input space is huge and the samples only represent a tiny fraction
of the input space that matters to the DNN, hence the approximation. After finding the Boolean specification of each neuron, NullaNet employs logic synthesis to find a near-optimal realization of each neuron
by optimizing its corresponding Boolean function. During inference, the output of each neuron, which
is normally calculated by applying dot product, batch normalization, and the application of an activation
function, is simply calculated by a number of logic operations that were found during the logic synthesis
step. This paradigm shift not only enables significant savings in computations but also eliminates the need to access a neuron’s parameters during inference, which leads to substantial savings in energy and latency.
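The ISF-based realization can be pictured with the simplified sketch below; it is an illustration of the sampling idea only, not NullaNet's actual implementation, and the binarized neuron, function names, and dataset are hypothetical.

```python
import numpy as np

def sample_isf(neuron_fn, binary_inputs):
    """Record (input pattern -> binary output) pairs seen on the training data.
    Patterns never observed remain don't-cares of the incompletely specified function."""
    care_set = {}
    for x in binary_inputs:
        care_set[tuple(int(v) for v in x)] = int(neuron_fn(x))
    return care_set

# Hypothetical binarized neuron: sign of a weighted sum, thresholded at 0.
w = np.array([1.0, -2.0, 0.5, 1.5])
neuron = lambda x: float(np.dot(w, 2 * np.asarray(x) - 1) >= 0.0)  # {0,1} inputs mapped to {-1,+1}

train = np.random.default_rng(0).integers(0, 2, size=(20, 4))      # 20 samples, 4 binary inputs
isf = sample_isf(neuron, train)
print(f"{len(isf)} of {2**4} input patterns observed; the rest are don't-cares")
```

A logic synthesis step would then minimize this incompletely specified function, using the unobserved patterns as don't-cares.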
2.3 CNN Processing
The computational flow for a convolutional layer in a CNN can be represented by a six-level nested loop (a seven-level nested loop when considering the iteration over images in a mini-batch) known as a computational block. See Algorithm 1. Indeed, a convolutional layer receives input feature maps (IFMs) of size
win × hin × cin, and convolves them with cout different filters, each filter of size wk × hk × cin to generate
output feature maps (OFMs) of size wout ×hout ×cout. The convolution stride for each filter is represented
by s. The set of OFMs of the current convolutional layer constitute the IFMs for the next convolutional
layer.
Algorithm 1 MAC computations of a convolutional layer
1: for m in 0 .. cout − 1 do
2: for y in 0 .. hout − 1 do
3: for x in 0 .. wout − 1 do
4: Y [m][x][y] = 0
5: for n in 0 .. cin − 1 do
6: for ky in 0 .. hk − 1 do
7: for kx in 0 .. wk − 1 do
8: Y [m][x][y] += X[n][x + kx][y + ky] · W[n][m][kx][ky]
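For reference, a direct (unoptimized) NumPy transcription of Algorithm 1 is sketched below; it assumes unit stride and a pre-padded input, matching the indexing shown in the listing, and is an illustration rather than accelerator code.

```python
import numpy as np

def conv_layer_mac(X, W):
    """X: IFMs of shape (c_in, w_in, h_in); W: weights of shape (c_in, c_out, w_k, h_k).
    Returns OFMs Y of shape (c_out, w_out, h_out) for stride 1 and a pre-padded input."""
    c_in, w_in, h_in = X.shape
    _, c_out, w_k, h_k = W.shape
    w_out, h_out = w_in - w_k + 1, h_in - h_k + 1
    Y = np.zeros((c_out, w_out, h_out))
    for m in range(c_out):
        for y in range(h_out):
            for x in range(w_out):
                for n in range(c_in):
                    for ky in range(h_k):
                        for kx in range(w_k):
                            Y[m, x, y] += X[n, x + kx, y + ky] * W[n, m, kx, ky]
    return Y

# Sanity check against a 1x1 kernel, which reduces to a per-pixel weighted sum of channels.
X = np.random.rand(3, 5, 5)
W = np.random.rand(3, 4, 1, 1)
assert np.allclose(conv_layer_mac(X, W), np.einsum('nxy,nm->mxy', X, W[:, :, 0, 0]))
```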
2.4 Compiler Optimizations
Compilers are responsible for performing a variety of optimizations to efficiently schedule and map operations defined in neural networks onto general-purpose or custom computing processors. Because the accelerator for each layer should perform the same computation, i.e., implement the aforesaid six-level nested loop, when mapping the convolutional layers of a CNN to a systolic array of MAC units, the search space for the accelerator may be formally specified by how it transforms (i.e., tiles, reorders, and parallelizes) the
nested loop structure for that layer. Although we reuse the same systolic array for computations of all convolutional layers, each layer has its own unique set of loop transformations. Notice that fully connected
layers perform similar computations, but with only a two-level nested loop.
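As a simple illustration of such a loop transformation, the sketch below tiles and reorders the output-channel and output-row loops of the convolutional nest; the tile sizes are arbitrary example values, not ones chosen by the compiler described in later chapters.

```python
def tiled_loop_nest(c_out, h_out, T_m=8, T_y=4):
    """Yield (m, y) pairs in the order produced after tiling the filter and
    output-row loops. The innermost loops sweep one T_m x T_y tile at a time,
    improving reuse of the weights and partial sums held in on-chip buffers."""
    for m0 in range(0, c_out, T_m):          # tile over output channels
        for y0 in range(0, h_out, T_y):      # tile over output rows
            for m in range(m0, min(m0 + T_m, c_out)):
                for y in range(y0, min(y0 + T_y, h_out)):
                    yield m, y

# The tiled order visits exactly the same (m, y) pairs as the original nest.
assert sorted(tiled_loop_nest(64, 32)) == [(m, y) for m in range(64) for y in range(32)]
```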
2.5 Architectural Techniques for Exploiting Data Reuse
Processing neural networks involves a large number of MAC operations for calculating outputs of filters/neurons according to the computational block. The MAC operations can be easily parallelized by using
spatial architectures, which include an array of ALUs and a memory hierarchy that is comprised of centralized, large memory blocks in addition to distributed, small memory sub-blocks. While accesses to the
large memory blocks incur rather large latency and come with high energy consumption cost, accesses to
small memories are fast and energy-efficient. For large and complex CNN models, it is unlikely that the
complete model (including weights) can be mapped onto the chip. Due to the limited off-chip bandwidth,
it is critically important to increase the on-chip data reuse and reduce the off-chip data accesses to improve
the computing efficiency.
At the high level, a CNN accelerator design on a target FPGA device typically comprises several components, namely, the core compute fabric, the memory hierarchy, and on-/off-chip interconnect. Data to
be processed by the accelerator is typically stored in an off-chip (external) memory. To utilize burst access
to the off-chip memory, data is first cached in on-chip buffers before being fed to the computation engine.
The on-chip interconnect is used for data communication between the computation engine and on-chip
buffer banks. By employing different types of computational engines and different designs for the memory
hierarchy, we can realize different accelerator designs as is explained below.
2.6 Dataflow
The dataflow (or data reuse pattern) of a CNN inference is in the form of a directed acyclic graph (DAG),
which can be accelerated in hardware without exerting excessive pressure on the memory resources. More
precisely, to avoid frequent data transfers between large and small memory blocks and to reuse data in
each level of hierarchy as much as possible, the inference dataflow is optimized to determine what data
gets read into which level of the memory hierarchy and when each piece of data is processed. Based on
how different data components (e.g., weights, activations, etc.) are reused, various dataflows have been
proposed including weight stationary [38], output stationary [30], and row stationary [13] data flows.
The weight stationary (WS) dataflow reads weights from the dynamic random-access memory (DRAM)
into register files and reuses those weights for processing different patches of input feature maps or for
processing input feature maps corresponding to different samples. In other words, the WS dataflow keeps
filter weights stationary in each PE’s register file and forces all MACs that use the same filter weight to be
mapped onto the same PE for serial processing. Dnn-X [38] is among accelerators that employ the weight
stationary dataflow.
The output stationary (OS) dataflow aims to process the output of a filter/neuron in a single PE by
keeping the partial sums of accumulators in the register file of the PE. In other words, the OS dataflow
keeps partial sums stationary by accumulating them locally in the register file of the same PE by forcing
all MACs that generate partial sums for the same pixel of an output feature map be mapped onto the same
PE serially. ShiDianNao [30] is one of the notable accelerators that employ an output stationary dataflow.
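To make the output-stationary idea concrete, the simplified sketch below (names and interfaces are hypothetical) shows how a single PE keeps the partial sum for one output pixel in a local register while the streamed activation/weight pairs are reduced into it.

    // Simplified output-stationary PE: the partial sum of one output pixel stays
    // ("stationary") in a local register while activations and weights stream through.
    float output_stationary_pe(const float* act_stream, const float* wgt_stream, int num_macs) {
      float psum = 0.0f;                    // partial sum held locally in the PE's register file
      for (int t = 0; t < num_macs; ++t)    // one MAC per streamed (activation, weight) pair
        psum += act_stream[t] * wgt_stream[t];
      return psum;                          // written out only once, after the reduction completes
    }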
The row stationary (RS) dataflow, which was first introduced in Eyeriss [13], aims to maximize the
reuse of weights, activations, and partial sums. The RS dataflow requires that the MACs for applying a
row of filter weights on a row of input feature map pixels, which generate a row of partial sums, be mapped
onto the same PE. The ordering of these MACs enables the use of a sliding window for input feature maps.
Because of the pattern of connectivity among adjacent PEs (i.e., horizontal, vertical, and diagonal), the row
stationary dataflow is more suitable for ASICs.
2.7 Accelerator Architecture
Generally, the CNN accelerator designs on FPGA may be divided into two categories [105]: single computation engine and streaming architectures. The first class of accelerator designs employ a single computation
engine that is used for the computation of all neural network layers. This approach takes one input image
at a time and executes the computations of each neural network layer sequentially. This approach has been used in many prior works, including [91, 93]. The streaming architecture, on the other hand, typically comprises one distinct hardware resource for each neural network layer, where each resource is optimized separately to exploit the parallelism that exists in its assigned layer. See [104, 108]. The tradeoff
is that one can use a complete set of FPGA hardware resources to process each neural network layer one
at a time or partition these hardware resources into (non-overlapping) parts and assign each hardware
resource part to exactly one layer of the network.
2.8 SDAccel Environment and Host Code
SDAccel is a development environment for OpenCL applications targeting Xilinx FPGA-based accelerator
cards. The SDAccel environment provides a framework for developing and delivering FPGA accelerated
applications using standard programming languages. In the SDAccel framework, an application program is
split between a host application and hardware accelerated kernels with a communication channel between
them. The host application, which is written in C/C++ and uses Application Programming Interface (API)
abstractions such as OpenCL, runs on a CPU while the kernels run on the FPGA device(s). Communication
between the host CPU and the FPGA accelerator board takes place via the PCIe bus. The host memory is
only accessible by the host application whereas the global memory, which is used to transfer data between
the host application and the accelerated kernels, is accessible by both the host processor and hardware
accelerators. Host code, which provides an interface to allow data transfer from the host machine to
accelerated kernels, follows the OpenCL programming paradigm and is structured into three code sections
for (a) setting the environment, (b) enqueuing kernels for their executions, and (c) post-processing and
releasing the resources.
The flow of the host code is as follows: (1) The host application writes the data needed by a kernel into
the global memory of the attached device through the PCIe interface. (2) The host application sets up the
kernel with its input parameters. (3) The host application triggers the execution of the kernel function on
the FPGA. (4) The kernel performs the desired computations while reading data from global memory. (5)
The kernel writes data back to global memory and notifies the host. (6) The host application reads data
back from global memory into the host memory and continues processing as needed.
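A minimal host-code sketch following this flow is given below. It uses the standard OpenCL C API only; the kernel name, buffer sizes, and the already-created context, command queue, and kernel objects are placeholders, and the environment-setup section (platform/device discovery and loading the FPGA binary) is omitted.

    #include <CL/cl.h>
    #include <vector>

    // Sketch of the enqueue and post-processing sections of an SDAccel host program.
    // `ctx`, `queue`, and `krnl` are assumed to have been created during environment setup.
    void run_inference(cl_context ctx, cl_command_queue queue, cl_kernel krnl,
                       const std::vector<float>& input, std::vector<float>& output) {
      cl_int err;
      // (1) Write the input data to the device's global (off-chip DDR) memory over PCIe.
      cl_mem in_buf  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                      input.size() * sizeof(float), nullptr, &err);
      cl_mem out_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                      output.size() * sizeof(float), nullptr, &err);
      clEnqueueWriteBuffer(queue, in_buf, CL_TRUE, 0,
                           input.size() * sizeof(float), input.data(), 0, nullptr, nullptr);
      // (2) Set up the kernel with its input parameters.
      clSetKernelArg(krnl, 0, sizeof(cl_mem), &in_buf);
      clSetKernelArg(krnl, 1, sizeof(cl_mem), &out_buf);
      // (3) Trigger kernel execution on the FPGA; steps (4)-(5) happen on the device.
      clEnqueueTask(queue, krnl, 0, nullptr, nullptr);
      // (6) Read the results back into host memory, then release resources.
      clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0,
                          output.size() * sizeof(float), output.data(), 0, nullptr, nullptr);
      clFinish(queue);
      clReleaseMemObject(in_buf);
      clReleaseMemObject(out_buf);
    }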
Chapter 3
An FPGA-Friendly Framework for Designing Ultra-Low-Latency
Nulla-Mapped Neural Network Accelerators
Deep neural networks (DNNs) have surpassed the accuracy of conventional machine learning models in
many challenging domains including computer vision [55, 116, 44] and natural language processing [43,
28]. Advances in building both general-purpose and custom hardware have been among the key enablers
for transforming DNNs from rather theoretical concepts to practical solutions for a wide range of problems [94, 20]. Unfortunately, existing DNN-based inference engines have a high latency cost and/or use
enormous hardware resources which, in turn, prevent their deployment in latency-critical applications,
especially on resource-constrained platforms. The high latency and large hardware cost emanate from the
fact that practical, high-quality deep learning models entail billions of arithmetic operations and millions
of parameters, which exert considerable pressure on both the processing and memory subsystems.
To sustain the ubiquitous deployment of deep learning models and cope with their high computational and memory complexities, many methods operating at different levels of the design hierarchy have
been developed. At the algorithmic level, methods such as model quantization [78, 68, 18], model pruning [40, 29], and knowledge distillation [42, 67, 76, 98] have gained popularity. At the compiler level,
domain-specific optimizations, memory-related optimizations (e.g. instruction scheduling, static memory
allocation, and copy elimination), and device-specific code generation have been employed [87, 104]. At
the architecture level, different dataflow architectures that support various data reuse schemes have been
developed in order to reduce data movements [38, 114]. Finally, at the circuit and device levels, various
energy-efficient, digital and analog processing components for vector-matrix multiplications have been
designed [51, 36, 88, 16]. While there is a large body of research on efficient processing of DNNs [97],
energy-efficient, low-latency realization of these models for real-time applications continues to be a complex and challenging problem.
This work presents an across-the-stack approach for energy-efficient, low-latency processing of DNNs
during the inference phase. This solution, which is referred to as F2N3 (FPGA-Friendly Framework for
Nulla-Mapped Neural Networks), optimizes a target DNN for a given dataset and maps major parts of the
DNN computations to ultra-low-latency, low-cost, fixed-function, combinational logic blocks. Examples
of such computations are those performed in individual filters/neurons one at a time, all filters/neurons
within one layer, and even all filters/neurons within groups of consecutive layers in the DNN. The remaining computations (i.e., those that are not mapped to fixed-function, combinational logic blocks) will be
scheduled to run on general-purpose processors or custom accelerators.
While the idea of converting certain layers of DNNs to fixed-function, combinational logic blocks followed by the mapping of those blocks to look-up tables (LUTs) has been previously discussed in NullaNet
[74] and LogicNets [101], its application has been limited to multilayer perceptrons (MLPs) designed for
relatively easy classification tasks. For example, NullaNet applies this idea to MLPs with a few hundred
neurons while LogicNets designs MLPs with tens of neurons such that the number of inputs to each neuron
is small enough to enable full enumeration of all its input combinations (e.g. fewer than 14 inputs). The
present work extends the original NullaNet idea in many novel ways including, but not limited to, applying
it to convolutional neural networks (CNNs), optimizing the number of times each filter in a CNN should
be replicated to achieve extremely low latency while maintaining an acceptable resource utilization level,
balancing the utilization of different resources on field-programmable gate array (FPGA) devices by mapping certain layers to digital signal processors (DSPs) within these devices, and developing a compilation
flow and compiler tool to automate different network architecture and hardware mapping optimizations
for a given DNN model and a target FPGA device. More specifically, the main contributions of this chapter
are as follows:
• We present an end-to-end framework called F2N3, which generates high-performance Nulla-mapped
DNN accelerators suited to a target FPGA device from a high-level description of the network in PyTorch, while making effective use of the available FPGA resources and the on-chip/off-chip memory
bandwidths.
• We introduce a quantization-aware training approach that utilizes a combination of different activation
functions for different layers to improve the output accuracy while pruning any redundant weights in
the DNN in order to make neurons (filters) amenable to efficient realization as fixed-function, combinational logic blocks.
• We present image sampling approaches specifically tailored to neurons/filters that are to be mapped
to fixed-function, combinational logic blocks, making the NullaNet-based implementation approach
highly scalable (that is, we can deal with very large DNNs and very large training datasets).
• To utilize available resources on an FPGA device, we introduce a hybrid implementation of a target
DNN where some DNN layers are implemented as custom fixed functions mapped to LUTs by using the
NullaNet Accelerator (Nulla layers) whereas other layers are mapped to the multiply-and-accumulate
(MAC) Array Accelerator (MAC layers). Consequently, we present accelerator architectures featuring
both fixed-function logic blocks realized on LUTs and MAC units running on DSPs as the inference
operators.
• We present a powerful and flexible compiler (called F2N3 Compiler) for mapping a given DNN/CNN inference engine running on any dataset onto our optimized accelerator designs. This is accomplished by
converting the network model to a computational graph, scheduling the graph’s execution, and trading
resource usage for latency by determining the replication factor of fixed-function, combinational logic
blocks of a DNN layer (while preserving a target output accuracy level).
• We develop a mix of register-transfer level (RTL)/C++ descriptions of the DNN accelerator where we
use the RTL black-boxing feature of high-level synthesis (HLS) tools. This gives us the freedom to wrap customized, high-performance Verilog descriptions of fixed-function, combinational logic blocks in optimized synthesizable C++ templates that interface with external modules such as on-board double data rate (DDR) memories, yielding low-latency accelerator designs on FPGAs.
• We achieve higher accuracy at lower resource utilization compared to published prior work while presenting results on DNNs and datasets that were previously impossible to optimize by using the original
NullaNet approach [74] and its derivative approach, LogicNets [101].
The remainder of this chapter is organized as follows. Section 3.1 outlines the proposed F2N3 solution
and its components. Next, Sections 3.2 and 3.3 explain the employed training module and logic minimization module, respectively. After that, Section 3.4 discusses the proposed accelerator design. Sections 3.5
and 3.6 detail the compilation and SW/HW code generation modules, respectively. Finally, Section 3.7
presents the experimental results, whereas Section 3.8 concludes the chapter.
3.1 F2N3 Flow
A brief description of the four main components of F2N3, which follows shortly, demonstrates how upstream components take account of downstream components while performing various optimizations (as
shown in Fig. 3.1). Therefore, while F2N3 divides the optimization process into logically separate components, it has a holistic approach to efficient processing of DNNs. Such an end-to-end solution enables
unprecedented levels of energy-efficiency and low latency while maintaining acceptable levels of classification accuracy.
The training module performs quantization-aware training on the provided model and dataset.∗
Quantization-aware training is optionally followed by the application of a fanin-constrained DNN pruning
approach, which is either based on the alternating direction method of multipliers (ADMM) [7] or gradual pruning [118]. Applying fanin constraints on filters/neurons significantly reduces the computational
complexity of the required two-level logic minimization problem and very often also reduces the hardware
cost.
The logic minimization module comprises two main modules, the two-level logic minimization
and the multi-level logic minimization. The two-level logic minimization module creates truth tables that
represent the (approximate) functions of different filters/neurons either by enumerating all their input
combinations or through examining their inputs and outputs when (a subset of) the training data is applied
to the trained model.† After this step, the module passes the truth tables to a suite of exact or approximate
two-level logic minimization algorithms that harden the functions of filters/neurons into fixed-function,
combinational logic blocks. The optimized combinational logic no longer requires access to the parameters
of filters/neurons. The multi-level logic minimization optimizes filters/neurons in a group of consecutive
layers or a subset of filters/neurons within a specific layer by applying logic restructuring techniques such
as common sub-expression extraction and elimination. This step is optionally followed by a target-specific
technology mapping step.
∗ In this work, activations are typically quantized to (scaled) binary (0/1), bipolar (-1/+1), or multiple-valued (e.g., 0, 1, 2, and 3) values while model parameters are left in floating-point representation.
†The truth tables may be created for a group of consecutive DNN layers or portions of computations performed in filters/neurons of a given DNN layer.
The back-end compilation module performs optimizations tailored to the employed accelerator
design realizing fixed-function, combinational logic blocks for the implementation of the inference graph.
The proposed compiler takes a neural network model, converts it to a computational graph, schedules its
operations, and more importantly, optimizes its nodes by leveraging intrinsic fusion of different required
operations such as convolution or fully-connected layer computations with batch normalization. Then,
the compiler decides on a different set of loop optimizations (including tiling, reordering, and parallelizing the nested loops of the computational block) for each layer that is mapped to the MAC array. In the case of a Nulla-mapped layer, it also determines the number of fixed-function combinational logic block replications employed for that layer. After these optimizations, it compiles the information from the optimizer and
extracts the required parameters for the accelerator design and generates a static kernel execution schedule.
The SDAccel code generation module comprises software (SW) and hardware (HW) code generation modules. In the SDAccel framework, an application program is split between a host application and
hardware accelerated kernels with a communication channel between them. The host application, which
is written in C/C++ and uses API abstractions like OpenCL, runs on a CPU while the kernels run on the
FPGA device(s). Host code, which provides an interface to allow data transfer from the host machine to
kernels, follows the OpenCL programming paradigm and is structured into three code sections for (a) setting the environment, (b) enqueuing kernels for their executions, and (c) post-processing and releasing
the resources. The SW generator takes the kernel schedule and generates C++/OpenCL codes for the host,
which is in charge of the kernel execution scheduling, model initialization, data buffer management, and
so on. Finally, the HW generator wraps the RTL kernels generated at the end of the logic minimization
module in HLS templates and generates synthesizable hardware code that is used for generating the FPGA
bit stream.
[Figure: F2N3 flow diagram. Inputs: model architecture, training data, target accuracy, pre-trained weights, FPGA characteristics, and the logic-minimization configurations. Stage 1 (Training): quantization-aware training and fan-in-constrained pruning, with the pruning rate reduced until the target accuracy is met. Stage 2 (Logic Minimization): filter input/output sampling, two-level logic minimization, multi-level logic minimization (layer optimization, filter optimization, LUT-mapping granularity), and retiming. Stage 3 (Back-end Compilation): estimate the LUT count for each Nulla node, optimize the replication count of filters, extract Nulla node hardware parameters, and generate and schedule off-chip memory access addresses and strides. Stage 4 (SDAccel Code Generation): HW and SW code generation from HLS templates and accelerator configuration files.]
Figure 3.1: F2N3 flow.
3.2 Deep Neural Network Training
The training module is responsible for both quantization-aware training and fanin-constrained pruning.
Quantization-aware training refers to the quantization of activations to binary, bipolar, or multi-bit values during the training of a neural network.‡ Fanin-constrained pruning, as its name suggests, limits the
number of inputs to a filter/neuron to prevent the logic minimization step from running into scalability
issues. Notice that fanin-constrained pruning does not have to be done after the quantization-aware training. In fact, quantization-aware training and fanin-constrained pruning can be combined into a single step
to speed up the training process.
One of the main differences between this work and NullaNet [74] or LogicNets [101] is that it can
employ different activation functions for different layers to yield higher accuracy. For example, if the
inputs to a DNN assume both negative and positive values, we employ an activation function such as
the sign function or a parameterized hard tanh (PHT) function to better capture the range of inputs. On
the other hand, if a set of values can only assume non-negative numbers, we rely on the parameterized
clipping activation (PACT) [18] function to quantize activations. The same consideration is taken into
account when quantizing the outputs of the last layer which are fed to a softmax function. Another major
difference between this work and [74] is that while that reference only deals with binary inputs and outputs
that are found by applying the sign function, the present work allows multi-bit quantization of activations,
which tends to yield higher classification accuracy.
Recall that each layer which is to be mapped to fixed-function, combinational logic blocks takes non-negative integers for both its inputs and outputs and constructs a Boolean function based on the provided data. Because the realized function is Boolean (rather than arithmetic), the non-negative integers for inputs and outputs can be produced by any of the activation functions described in this section. If an activation function produces negative numbers in addition to positive ones, the quantized values are simply mapped to a non-negative range without modifying the rest of the flow. Judicious choice of the right activation function(s) can appreciably improve the classification accuracy.
‡ Such quantization is only applied to layers that are to be mapped to fixed-function, combinational logic blocks.
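As a concrete (and simplified) illustration of how non-negative integer codes can be produced for the logic blocks, the sketch below shows a PACT-style quantizer and a sign/PHT-style bipolar quantizer; the clipping level alpha and the bit width k are assumed to be provided by training and are placeholders here.

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // PACT-style quantizer: clip to [0, alpha], then map to an integer code in [0, 2^k - 1].
    uint32_t quantize_pact(float x, float alpha, int k) {
      float clipped = std::min(std::max(x, 0.0f), alpha);
      float scale = static_cast<float>((1u << k) - 1) / alpha;
      return static_cast<uint32_t>(std::lround(clipped * scale));
    }

    // Sign/PHT-style bipolar activation mapped to a non-negative 1-bit code:
    // -1 -> 0 and +1 -> 1, so the downstream Boolean function only sees {0, 1}.
    uint32_t quantize_sign(float x) {
      return (x >= 0.0f) ? 1u : 0u;
    }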
This work employs fanin-constrained pruning based on either the alternating direction method of
multipliers (ADMM) [7] or gradual pruning [118] to reduce hardware cost and make the mapping to fixed-function, combinational logic blocks scalable. Because each filter/neuron that is mapped to such logic
blocks has to be optimized using two-level logic minimization and because the computational complexity of
heuristic two-level logic minimization is super-linear in its input variable count, constraining the number
of inputs to that filter/neuron is of utmost importance. The details and mathematical formulation of our
newly-introduced fanin-constrained pruning are beyond the scope of this chapter.
3.3 Logic Minimization and Dataset Sampling
As briefly described in Section 2.2, logic realization based on input enumeration is suitable for filters/neurons with a small number of inputs while logic realization based on ISFs is more suitable for neurons with
a large number of inputs. While NullaNet applies all training data points to a DNN to form ISFs for different filters/neurons, we claim that such an approach is neither necessary nor scalable. First, assume a
layer of a CNN designed for the CIFAR-10 dataset where the size of the input to that layer is 16 × 16 and
the layer consists of 3 × 3 filters. The ISF corresponding to each filter of the said layer can have up to
16 × 16 × 50,000 = 12,800,000 minterms, where 50,000 is the size of the training set. Optimizing such
ISFs with existing two-level logic minimization tools is impossible as they can optimize functions with at
most 50,000 or so minterms. Additionally, not all training points are informative from a logic minimization perspective and choosing a subset of the training points (training dataset sampling) tends to result
in defining much simpler ISFs without sacrificing the classification accuracy. The focus of the rest of this
section is on our proposed sampling strategies.
This work presents three sampling approaches which rely on the trained model to find representative samples from training data. These approaches are similar in that they first apply the training data to
the DNN and extract the output of one of the intermediate layers, e.g., the last feature extraction layer,
for each sample in the training data and then use that intermediate representation to rank training samples. However, the way the intermediate representation is used to rank samples is different among these
approaches.
The first approach, which we refer to as support vector machine (SVM)-based sampling and is related
to [61], uses the intermediate representation of training data in addition to class/label information to train
a one-vs-rest SVM for each class. Next, for each trained SVM corresponding to a class, it picks support
vectors that belong to that class as representative samples. By aggregating support vectors found from the
trained SVMs, a sample of the training data is generated. If the total number of support vectors exceeds the
desired number of samples, a subset of support vectors is chosen by applying uniform random sampling.
The SVM-based sampling approach finds samples of the training data that determine the boundaries of
each class when a specific set of neural network layers are used to extract an intermediate representation.
The second approach, which we refer to as near-mean sampling and is related to [79], first finds a
representative vector for each class by averaging the intermediate representation of samples which belong
to that class. Next, for each class, it picks a training sample such that the difference between the average
of picked samples so far and the representative vector of the class is minimized. This step is repeated until
a desired number of samples for each class is selected. Near-mean sampling, as its name suggests, picks
samples close to the mean of intermediate representation of all samples which belong to a class.
By combining SVM-based sampling with near-mean sampling, we devise a third sampling approach,
which finds samples that not only represent the boundaries of each class but also its mean. This is a
superior sampling strategy and is the one that we have adopted in this chapter.
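A minimal sketch of the near-mean selection step for a single class is given below; the intermediate representations, their dimensionality, and the per-class budget are placeholders, and the combined strategy would simply take the union of these picks with the support vectors found by the SVM-based step.

    #include <limits>
    #include <vector>

    // Greedy near-mean sampling for one class: repeatedly pick the sample that keeps the
    // mean of the picked set closest to the class-representative (mean) vector.
    // `feats` holds the intermediate representation of every training sample of this class.
    std::vector<int> near_mean_sample(const std::vector<std::vector<float>>& feats, int budget) {
      const size_t dim = feats.empty() ? 0 : feats[0].size();
      std::vector<float> class_mean(dim, 0.0f), picked_sum(dim, 0.0f);
      for (const auto& f : feats)
        for (size_t d = 0; d < dim; ++d) class_mean[d] += f[d] / feats.size();

      std::vector<int> picked;
      std::vector<bool> used(feats.size(), false);
      while (picked.size() < static_cast<size_t>(budget) && picked.size() < feats.size()) {
        int best = -1;
        float best_dist = std::numeric_limits<float>::max();
        for (size_t i = 0; i < feats.size(); ++i) {
          if (used[i]) continue;
          float dist = 0.0f;  // distance between mean-of-picked (including i) and the class mean
          for (size_t d = 0; d < dim; ++d) {
            float m = (picked_sum[d] + feats[i][d]) / (picked.size() + 1);
            dist += (m - class_mean[d]) * (m - class_mean[d]);
          }
          if (dist < best_dist) { best_dist = dist; best = static_cast<int>(i); }
        }
        used[best] = true;
        picked.push_back(best);
        for (size_t d = 0; d < dim; ++d) picked_sum[d] += feats[best][d];
      }
      return picked;
    }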
[Figure: the N Nulla layers are chained, each consisting of BRAMs feeding input registers, the layer's CLB-mapped combinational logic, and output registers/BRAMs; the expanded view of layer 2 shows two replicated computational engines, each processing input patches over two rounds.]
Figure 3.2: N consecutive layers realized using the proposed accelerator design and the details of the second layer's realization. The replication factor is two. Yellow and red cuboids show the first and the second computation iteration of the Nulla layer, respectively. Each cuboid is a patch of input feature maps (IFMs) passed through the computational engines and produces a channel of output feature maps (OFMs).
3.4 Hardware Realization
3.4.1 Accelerator Design
The computation engine of the NullaNet accelerator is a custom fixed-function, combinational logic fabric.
This custom function differs for each layer. Therefore, we have different instances of the accelerator for
each Nulla layer (streaming architecture), i.e., we cannot reuse the computational logic for one layer for
other layers. Fig. 3.2 shows N consecutive layers realized using the NullaNet accelerator design. If the
first Nulla-mapped layer is the first layer of the network, input feature maps for the first layer should be
transferred from dynamic random access memory (DRAM) to its input block RAM (BRAM). Afterwards,
for each layer, the data is read from the FPGA BRAMs and moved to the register files, which serve as
the input for the custom combinational logic. The output of the computation is written to the output
register files and then the output BRAMs for the layer. For each layer, by iterating over the input BRAM
and bringing different input data patches to registers, performing the required Boolean operations, and
storing the results from output registers to output BRAMs, the required computation for each Nulla layer
is completed. The number of iterations for each layer depends on (i) dimensions of the input feature
map, (ii) size of the patches, and (iii) number of times the custom combinational logic is replicated in the
computation engine of the layer. The replicas enable parallel processing of more than one patch of the
input in each iteration, as will be discussed in Section 3.5. Note that, in fully-connected layers (e.g., last
two layers in Fig. 3.2), there is only one copy of each combinational function and one iteration, so feature
maps are not stored in the BRAM and can be accessed directly from registers.
3.4.2 Memory Layout and Data Placement
Required computations for each Nulla layer are done by iteratively reading and transferring data from the
input BRAM to registers, executing custom combinational logic functions on the data, and storing results
from output registers to BRAMs.
In each iteration, each replica of the custom combinational logic function computes the pixel values
for a (x, y) position alongside all the output channels in that layer (see Algorithm 4). Let us denote the number of output channels in each layer l by c^l_out and the number of input channels in that layer by c^l_in, respectively (c^l_out = c^{l+1}_in). Each custom combinational logic function for layer l is in fact comprised of c^l_out logic functions. The width of the data stored in output registers and BRAMs for layer l is c^l_out, whereas the width of the data stored in input registers and BRAMs for that layer is c^l_in. As shown in Fig. 3.2, the same set of BRAMs is used for storing both the output BRAMs of layer l and the input BRAMs of layer l + 1.
3.4.3 Nulla Streaming Architecture
Recall that one cannot reuse the custom combinational logic of one layer to perform the Boolean computations of another layer. Hence, the single computation engine accelerator design is not suitable for
Nulla layers, and instead we must employ the streaming architecture (using a distinct hardware block for
each layer). All heterogeneous blocks are chained to form a pipeline as depicted in Fig. 3.2. The data proceeds through different parts (layers) of the neural network as they are streamed through the architecture.
As a result, this design approach exploits the parallelism that exists between pairs of consecutive layers
by means of pipelining and enables concurrent execution of these layers (in this case, each layer will be
operating on different inference data).
In this chapter, we employ an inter-layer concurrent streaming architecture where the level of concurrency among the computations corresponding to the custom NullaNet logic functions associated with different layers of the DNN is only at the level of different input samples (images) in the inference data
batch. For example, computations corresponding to layer l of image i, and layer l + 1 of image i−1 can be
done in parallel (assuming the accelerators corresponding to layer l and layer l + 1 are both implemented
using the NullaNet approach). In this scheme, for a certain image i, all of the NullaNet computations corresponding to a layer must be completed before computations corresponding to the next layer can begin.
Fig. 3.3 shows the sequence of operations for three consecutive Nulla layers implemented by using this
scheme.
The mapping and optimizations performed by the compiler for Nulla layers are also adjusted based on
this architecture. Thus, we exploit task-level parallelism in our accelerator design. The dataflow scheme is
enabled across Nulla layers. The dataflow enables pipelining at the task level, allowing the workloads of different layers to overlap, increasing the parallelism of the RTL implementations and improving overall design performance. In other words, functions in this region (e.g., across Nulla layers) operate concurrently and continuously, so the initiation interval for inference data (and hence the throughput) is improved. The architecture also uses a double-buffering scheme where channel buffers are filled with new data while Boolean
operations are being performed on the present data, achieving the “ping-pong behavior” that reduces the
latency of a layer’s computations.
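A minimal HLS-style sketch of this scheme is shown below. The layer functions, bus width, and FIFO depth are placeholders; with the DATAFLOW pragma the per-layer functions run concurrently, with hls::stream channels the tool infers FIFOs between them, and for array channels it would instead infer the ping-pong (double) buffers mentioned above.

    #include <ap_int.h>
    #include <hls_stream.h>

    typedef ap_uint<64> bus_t;  // illustrative width for packed activations

    // Placeholder per-layer Nulla kernels; each consumes and produces a stream of packed pixels.
    void nulla_layer1(hls::stream<bus_t>& in, hls::stream<bus_t>& out) {
      for (int i = 0; i < 256; ++i) out.write(in.read());  // stand-in for the layer's Boolean logic
    }
    void nulla_layer2(hls::stream<bus_t>& in, hls::stream<bus_t>& out) {
      for (int i = 0; i < 256; ++i) out.write(in.read());  // stand-in for the layer's Boolean logic
    }

    void nulla_top(hls::stream<bus_t>& in, hls::stream<bus_t>& out) {
    #pragma HLS DATAFLOW
      // The channel between the layers lets layer 1 of image i+1 overlap with layer 2 of image i.
      hls::stream<bus_t> inter("layer1_to_layer2");
    #pragma HLS STREAM variable=inter depth=64
      nulla_layer1(in, inter);
      nulla_layer2(inter, out);
    }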
[Figure: three kernels (one per layer) pipelined across architectural clocks; at clock 1, kernel 1 processes image 1/layer 1; at clock 2, kernel 1 processes image 2/layer 1 while kernel 2 processes image 1/layer 2; at clock 3, kernel 1 processes image 3/layer 1, kernel 2 processes image 2/layer 2, and kernel 3 processes image 1/layer 3.]
Figure 3.3: The per-layer computations for each architectural clock, for a three-layer neural network implemented using the inter-layer concurrent streaming architecture. A kernel is a fixed-function, combinational logic block realized in the hardware.
3.5 F2N3 Back-end Compilation Module: Optimizer and Scheduler
The decision about which subset of layers will be mapped to the MAC array, and which subset to the
NullaNet accelerator is made based on our output accuracy and performance targets and is determined
prior to the compilation process.
For layers to be implemented using the systolic array of MACs, because the accelerator for each convolutional layer should perform the same computation, i.e., implement the six-level nested loop shown in
Algorithm 2, the search space for the accelerator may be formally specified by how it transforms (i.e., tiles,
reorders, and parallelizes) the nested loop structure for that layer.
Algorithm 2 MAC-based computations for a convolutional layer
1: for m in 0 .. cout − 1 do
2: for y in 0 .. hout − 1 do
3: for x in 0 .. wout − 1 do
4: O[m][x][y] = 0
5: for n in 0 .. cin − 1 do
6: for ky in 0 .. hk − 1 do
7: for kx in 0 .. wk − 1 do
8: O[m][x][y] += I[n][x + kx][y + ky] × W[n][m][kx][ky]
For layers to be mapped to the NullaNet accelerator, which is the focus of the present chapter, the
F2N3 compiler explores optimizations such as setting the replication degree of the fixed-function, combinational logic as will be discussed in Section 3.5.2, resulting in trade-offs between resource utilization and
performance.
The F2N3 compiler, in addition to searching the design optimization space for each layer and across
different layers, generates a static schedule for the data transfers between different levels of the memory
hierarchy, e.g., between external memories and on-chip global buffers, and between these global buffers
and registers used for implementing each layer’s computations, for both MAC and Nulla layers. This static
scheduling step mitigates the need for complex handshaking and improves scalability and performance.
In the case of the NullaNet realization of a convolutional layer l, the first and last three nested loops shown in Algorithm 2 are collapsed, i.e., they are implemented as a fixed-function, combinational logic function ψ^l, which performs all required computations for that layer in one step. Therefore, we only need
to optimize the two remaining loops as shown in Algorithm 3. In the following subsections, replications
of fixed-function logic computations and the method for determining the replication factor for each layer
are described.
Algorithm 3 NullaNet computations for a convolutional layer
1: for y in 0 .. hout − 1 do
2: for x in 0 .. wout − 1 do
3: O[0 : cout − 1][x][y] = ψ^l(I[0 : cin − 1][x : x + wk][y : y + hk])
3.5.1 Replication of Fixed-Function Combinational Logic Blocks
The fixed-function combinational logic associated with the custom Boolean computations in a layer of a target NN can be replicated to increase NullaNet's processing performance for that layer. This, of course, comes at the cost of an area increase. Algorithm 3 shows the iterative application of the multi-input, multi-output, layer-l-specific fixed logic function (ψ^l) to the corresponding input bits in the input feature map (I) to generate the corresponding output bits
in the output feature map (O). Each loop in Algorithm 3, either the one along the Y dimension (height)
or the one along the X dimension (width), can be split into a loop with a smaller loop bound where the
bound is determined by the number of replications in that dimension i.e., hr replications in the Y dimension and wr replications in the X dimension. Next, these smaller loops can be unrolled to increase the
parallelization.
The number of replications in each dimension is determined by the optimization algorithm employed
by the F2N3 and its associated cost function as will be explained in Section 3.5.2. Algorithm 4 shows
the added loops for the optimization associated with the replications and the corresponding unrolling
(shown with #pragma unroll). Two added loops represent the replications in the horizontal and vertical
dimensions. The unrolling pragma instructs the HLS compiler to unroll a loop by some number of iterations
(e.g., factor). This can substantially increase the available parallelism, and thus enables an architecture that
runs much faster at the cost of consuming more resources.
Algorithm 4 NullaNet computations with logic block replications
1: for y in 0 .. ⌈hout/hr⌉ − 1 do
2: for x in 0 .. ⌈wout/wr⌉ − 1 do
3: for yr in 0 .. hr − 1 do
4: #pragma unroll(yr)
5: for xr in 0 .. wr − 1 do
6: #pragma unroll(xr)
7: O[0 : cout − 1][x × wr + xr][y × hr + yr] = ψ^l_{xr,yr}(I[0 : cin − 1][x × wr + xr : x × wr + xr + wk][y × hr + yr : y × hr + yr + hk])
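An HLS-style rendering of Algorithm 4 is sketched below. Here psi_l is only a stand-in for the layer's fixed-function combinational block (realized as an RTL black box in our flow), and the dimensions and replication factors HR and WR are hypothetical values a compiler might select.

    // Illustrative HLS rendering of Algorithm 4 for one Nulla layer.
    constexpr int HOUT = 16, WOUT = 16;  // output feature map size (illustrative)
    constexpr int HR = 2, WR = 2;        // replication factors chosen by the compiler

    // Stand-in for the layer's fixed-function logic block; a real design passes the input
    // patch and receives all output channels for the given (x, y) position.
    void psi_l(int x, int y) { (void)x; (void)y; }

    void nulla_layer_replicated() {
      for (int y = 0; y < (HOUT + HR - 1) / HR; ++y) {
        for (int x = 0; x < (WOUT + WR - 1) / WR; ++x) {
          for (int yr = 0; yr < HR; ++yr) {
    #pragma HLS UNROLL
            for (int xr = 0; xr < WR; ++xr) {
    #pragma HLS UNROLL
              // Each unrolled iteration instantiates one replica of the logic block,
              // so HR x WR output pixels are produced per (x, y) iteration.
              psi_l(x * WR + xr, y * HR + yr);
            }
          }
        }
      }
    }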
3.5.2 Cost Function and Replication Factor Determination
As stated above, the number of replications employed for each Nulla layer is determined by an optimizer
based on some cost function while accounting for the available resources on the target FPGA platform
(number of LUTs, Registers, BRAMs, etc.). Employing the inter-layer concurrent streaming architecture
discussed in Section 3.4.3, the delay associated with Nulla layer l may be calculated as:
T_l(hout, wout, hr, wr) = ⌈hout/hr⌉ × ⌈wout/wr⌉ × t_ψl + t_c    (3.1)
where ⌈·⌉ denotes the ceiling function, t_ψl is the worst-case delay of the fixed-function combinational logic block associated with the custom computation in layer l, and t_c denotes a fixed delay value.
The goal is to minimize the layer latency given a set of resources. Considering LUTs as the main
limiting resource for Nulla layers [74], the lowest computational latency for layer l can be written as:
T*_l(lutshare_l) = min_{ (hr, wr) : LUT_CNT(hr, wr, LUT_ψl) ≤ lutshare_l } T_l(hout, wout, hr, wr),    (3.2)
where LUT_ψl is the number of LUTs used by each replica of the fixed-function logic ψ^l and LUT_CNT denotes the total LUT count used for layer l when using hr × wr replicas of the fixed-function logic blocks for that layer, that is:
LUT_CNT(hr, wr, LUT_ψl) = hr × wr × LUT_ψl.    (3.3)
lutshare_l in Eq. 3.2 represents the allocated number of (share of) LUTs for each Nulla layer l, and is obtained by the Dual Annealing algorithm of [111] with the goal of minimizing the latency of N consecutive Nulla layers constrained by the available resources in the target FPGA device:
min_{ (lutshare_l) : Σ_{l=1..N} lutshare_l ≤ LUT_TOT } Σ_{l=1..N} T*_l(lutshare_l),    (3.4)
where LUT_TOT denotes the total number of LUTs on the target FPGA device.
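Because the per-layer search of Eq. 3.2 ranges over a small set of (hr, wr) pairs, it can be enumerated directly. The sketch below evaluates Eqs. 3.1-3.3 and returns the fastest feasible pair under a given LUT share; the outer allocation of lutshare_l across layers (Eq. 3.4) is handled by dual annealing and is not shown, and t_psi, t_c, and the per-replica LUT count are assumed to be measured elsewhere in the flow.

    #include <cmath>
    #include <limits>

    struct ReplicationChoice { int hr, wr; double latency; };

    // Eq. 3.1: latency of Nulla layer l for a given replication pair (hr, wr).
    double layer_latency(int h_out, int w_out, int hr, int wr, double t_psi, double t_c) {
      return std::ceil(double(h_out) / hr) * std::ceil(double(w_out) / wr) * t_psi + t_c;
    }

    // Eq. 3.3: total LUTs used by hr x wr replicas of the layer's fixed-function block.
    long lut_count(int hr, int wr, long lut_psi) { return long(hr) * wr * lut_psi; }

    // Eq. 3.2: enumerate the (hr, wr) pairs that fit within the layer's LUT share
    // and keep the one with the lowest latency.
    ReplicationChoice best_replication(int h_out, int w_out, long lut_psi, long lut_share,
                                       double t_psi, double t_c) {
      ReplicationChoice best{1, 1, std::numeric_limits<double>::max()};
      for (int hr = 1; hr <= h_out; ++hr)
        for (int wr = 1; wr <= w_out; ++wr) {
          if (lut_count(hr, wr, lut_psi) > lut_share) continue;  // violates the LUT budget
          double t = layer_latency(h_out, w_out, hr, wr, t_psi, t_c);
          if (t < best.latency) best = {hr, wr, t};
        }
      return best;
    }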
The LUT_ψl metric is extracted by running the ABC synthesis tool [10] and mapping the fixed functions generated by Espresso to six-input LUTs using the following commands:
    $ resyn; resyn2; resyn2rs; compress2rs; st;
    $ if -K 6; st; dch; if -K 6; st; dch; if -K 6
The first four commands aim to reduce the size of the AND-Inverter Graph (AIG) representing the input
logic networks and are heuristic methods provided by ABC to optimize the AIG network. The if command is a priority-cut-based mapper, where 6 specifies the cut size. dch performs AIG-based synthesis
with a repeated sequence of technology-independent optimizations on different structural choices (which
are functionally equivalent networks obtained by running AIG rewriting scripts on the current network),
and st transforms the network back to the AIG form.
3.6 SDAccel Code Generation Module
To realize our design on FPGAs, we use Xilinx SDAccel and Vivado HLS, which provide a tool chain for
programming and optimizing applications on Xilinx FPGAs using a high-level programming language (C,
C++ or OpenCL) and/or hardware description languages (VHDL, Verilog and SystemVerilog) and a runtime
tool based on the OpenCL APIs, which is used by the host-side software to interact with the accelerator.
3.6.1 SW Code Generator
SW code generator, which takes a kernel execution schedule as input and generates the C++/OpenCL code
for the host, instantiates a host code template with key parameters extracted in the back-end compilation
module such as the number of physical buffers, kernel execution order, and so forth.
3.6.2 HW Code Generator
We use RTL-HLS hybrid templates instead of pure RTL or pure HLS templates for realizing the hardware.
Compared with HLS, RTL designs utilize resources more efficiently, but it is well-known that RTL design
is quite time consuming. HLS tools receive designs programmed in high-level programming languages
(C, C++, OpenCL, etc.) and compile them into FPGA programming files. HLS design has a better abstraction for external modules and interfaces, making it easier and faster to implement complex control
logic. However, current HLS designs cannot achieve as much fine-grained optimization as those in RTL
designs, particularly for mapping Boolean functions onto LUTs. Therefore, to fully realize advantages of
both design approaches, we take a hybrid RTL-HLS approach for our accelerator design i.e., we use RTL
for designing a high-performance computation engine for the custom Nulla functions and employ C++
based HLS templates to implement the control logic for the RTL part and max-pooling layers, orchestrate
the data movement along the memory hierarchy, implement the (PCIe) interface between the host and the
FPGA device and the infrastructure IP to access the double data rate (DDR) memories on the board.
With the accelerator configuration extracted by the back-end compilation module (i.e., the replication factor, the number and size of required BRAMs, etc.), our HW generator instantiates RTL modules/HLS templates to generate the hardware code for the FPGA. The library of HLS templates can be extended when new
types of layers that cannot be fused into the Nulla functions (e.g., an average pooling layer) are encountered.
The RTL part of the computational kernels are written in Verilog and generated after a retiming step
during logic synthesis, whereas the HLS part is written in C++ based HLS. Furthermore, to achieve this
hybrid design, we utilize the RTL Black-boxing feature of HLS tools where custom RTL Verilog code can
replace a C++ function within an HLS/SDAccel project. The RTL is then woven into the rest of the C++
code through a JSON file by using the ap_ctrl_chain protocol [112] to manage data transactions between
the RTL and the C++ code. This gives freedom to use customized high-performance HDL code in our
design.
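To illustrate the mechanism, the sketch below shows what the C++ side of such a wrapper might look like: a function whose body serves only as a behavioral stand-in for C simulation, and whose synthesized implementation is supplied by the retimed Verilog through the RTL black-box description. The function names, bit widths, and loop bounds are all hypothetical.

    #include <ap_int.h>

    // C++ stand-in for the fixed-function Nulla block of one layer. During synthesis this
    // function is replaced by the hand-written, retimed Verilog module (bound via the RTL
    // black-box mechanism); the body below is only used for C simulation.
    void nulla_psi_bb(ap_uint<256> patch_in, ap_uint<128>* channels_out) {
    #pragma HLS INLINE off
      *channels_out = patch_in.range(127, 0);  // behavioral placeholder, not the real function
    }

    // HLS/C++ wrapper that iterates over input patches and invokes the black-boxed block.
    void nulla_layer_wrapper(const ap_uint<256> ifm[256], ap_uint<128> ofm[256]) {
      for (int i = 0; i < 256; ++i) {
    #pragma HLS PIPELINE
        nulla_psi_bb(ifm[i], &ofm[i]);
      }
    }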
3.7 Experimental results
3.7.1 Experimental setup
For evaluation purposes, F2N3 targeted a Xilinx VU9P FPGA in the cloud (available on the AWS EC2 F1
instance). This FPGA platform includes 64 GiB DDR4 ECC protected memory, with a dedicated PCIe x16
connection. There are four DDR banks. This FPGA contains approximately 2.5 million logic elements and
approximately 6,800 DSP units§
. Input images are sent using PCIe from the host CPU to the on-board
DDR4, accessible by the accelerator, and the output results are sent back to the host CPU.
First, we evaluate F2N3 against extreme-throughput tasks in physics and cybersecurity such as jet
substructure classification and network intrusion detection. To have a fair comparison, we used similar settings and the same FPGA board as those used in LogicNets [101]. We use Xilinx Vivado 2019.1 in the out-of-context mode with Flow_PerfOptimized_high for synthesis and Performance_Explore for place and route
without any manual placement constraints. We constrained the clock cycle time to 1 ns to achieve the
highest possible frequency.
§ https://aws.amazon.com/education/F1-instances-for-educators/
We also evaluated the F2N3 framework on a well-known CNN, i.e., VGG16 and a commonly used
computer-vision dataset for object recognition, i.e., the CIFAR-10 dataset. As a baseline for the state-of-the-art generic MAC array-based accelerator for the layers realized using conventional MAC calculations, we
used the open-source implementation of [93] with some modifications including transferring all weights
required for the computation of the layer from the external memory into BRAMs, where these weights
get reused for calculations corresponding to different patches of input feature maps. Furthermore, partial
sums of accumulation for processing the output of a filter/neuron are also stored in the register file of the
same processing element. Considering these modifications, we reduce the latency of VGG-16 inference
employing the generic MAC array-based accelerator.
We use the Xilinx Power Analyzer (XPA) tool integrated into Vivado with default settings, which is
commonly used for early power estimation [26], to assess the power consumption of each design.
3.7.2 Tasks with extreme-throughput requirements
Jet Substructure Classification (JSC): Collisions in hadron colliders result in color-neutral hadrons formed by a combination of quarks and gluons. These are observed as collimated sprays of hadrons which are referred to as jets. Jet substructure classification is the task of finding interesting jets from large jet substructures. We use the 16-input, 5-output classification formulation of Duarte et al. [31] for JSC.
Processing such collisions requires architectures that operate at or above a 40 MHz clock frequency and
have a sub-microsecond latency.
Similar to what was done in LogicNets [101], we trained three different architectures. Table 3.1 highlights these architectures. β denotes the number of bits used to represent quantized numbers and γ refers to the fanin of the network. All networks were trained for up to 200 epochs with the Adam optimizer [52]. To
Table 3.1: Comparison of the hardware realization metrics of F2N3 with those of LogicNets [101] on the JSC and NID tasks.

| Architecture | Neurons per layer | β | γ | Accuracy (% Inc.) | Synth LUTs (Dec. ratio) | FF (Dec. ratio) | fmax (Inc. ratio) |
|---|---|---|---|---|---|---|---|
| JSC-S | 64, 32, 32, 32 | 2 | 3 | 69.65% (+1.85%) | 39 (5.50x) | 75 (3.30x) | 2079 MHz (1.30x) |
| JSC-M | 64, 32, 32, 32 | 3 | 4 | 72.33% (+1.73%) | 1,553 (9.30x) | 151 (2.90x) | 841 MHz (1.40x) |
| JSC-L | 32, 64, 192, 192, 16 | 3∗ | 4∗ | 73.35% (+1.55%) | 11,752 (3.20x) | 565 (1.40x) | 436 MHz (1.02x) |
| NID-S | 593, 100 | 2 | 7 | 93.14% (+9.26%) | 95 (37.75x) | 153 (8.63x) | 1560 MHz (1.92x) |
| NID-M | 593, 256, 128, 128 | 2 | 7 | 93.43% (+2.13%) | 671 (23.77x) | 480 (2.65x) | 1099 MHz (2.33x) |
| NID-L | 593, 100, 100, 100 | 3† | 5† | 93.28% (+4.60%) | 205 (122.20x) | 373 (3.81x) | 1319 MHz (3.16x) |

∗ First and last layers' bit widths are 4 and 7, respectively, while the last layer's fanin is 5 in this architecture.
† First layer's bit width and fanin are 2 and 7, respectively, in this architecture.
constrain the fanin count of neurons (γ), we used gradual pruning for its usability and effectiveness. For activation functions, we use either PACT or PHT. At times, we applied batch normalization before the activation function since it gets implemented as part of a logic block for free.
Table 3.1 shows the performance differences between F2N3 and LogicNets for the JSC task. As is seen, the F2N3 implementation achieves higher accuracy (i.e., closer to the accuracy of the floating-point MAC-based implementation) along with 3x to 9x improvements in LUT utilization and up to a 3x decrease in flip-flop (FF) usage. More specifically, our medium design achieves about 0.5% higher accuracy compared to LogicNets' large design while it has a 2.36x lower latency and 24.42x lower LUT utilization. Similarly, compared to the
optimized design of [49], our medium design achieves the same level of accuracy while it has a 9.25x lower
latency.
Network Intrusion Detection (NID): Identifying suspicious packets is an important classification task
in cybersecurity. Neural networks used for identifying malicious attacks need extreme throughput so as
to not cause any bottlenecks in the network because the number of packets sent to a machine is in the
order of millions per second. Therefore, these types of datasets are good benchmarks for F2N3 as they
need specialized hardware for seamless intrusion detection.
We used the UNSW-NB15 dataset to compare F2N3 with other methods. We employed the same preprocessed training and testing data as that of Murovic et al. [72], which has 593 binary features corresponding to 49 original features and two output classes. Each of the original features is transformed by either assigning a number to each unique string and converting it to binary, representing integers with enough bits for the maximum value, or transforming floating-point numbers to fixed bit-width numbers. We benchmarked
F2N3 on the same architectures as LogicNets [101] using the training method described above.
We observe that the difference between F2N3's NID-S accuracy and that of the other two architectures is quite small while it provides a faster implementation. Increasing the number of layers or the number of bits used for quantization of activations does not improve accuracy much since our optimized network's accuracy is already close to the trained network accuracy. Overall, we get 37x to 122x improvements in LUT utilization along with a 2x to 8x decrease in FF usage. The increase in the maximum operating frequency results in up to a 16x improvement in the latency of these architectures.
3.7.3 Tasks with high-accuracy requirements
We use VGG-16 with CIFAR-10 dataset as a case study for tasks with high-accuracy requirements. We
implement intermediate convolutional layers 8-13 in VGG-16 using the proposed F2N3 flow and fixed-function combinational logic functions. Fig. 3.4 shows the achieved layer-by-layer latency improvements
compared to when implementing the said convolutional layers using the MAC array accelerator design.
As illustrated in the figure, we achieve significant savings in terms of layer-wise computational latency
for intermediate convolutional layers 8-13 of VGG-16, which have large memory footprints (i.e., weights).
Using F2N3, the total latency for layers 8-13 is reduced by around 760x compared to employing the MAC
array accelerator design. Furthermore, the accuracies obtained using both of these approaches are relatively close. The model accuracy when layers 8-13 are mapped using the MAC array accelerator design is 93.04%, while it is 92.26% when layers 8-13 are mapped using F2N3.
The computational latency of layers, when implemented with the MAC array accelerator design (shown
with yellow bars), is mostly influenced by the corresponding number of weights rather than the intensity
of on-chip computations (i.e., FLOPs). Layers 9-13 in VGG-16 have equal numbers of weights, each twice the number of weights of layer 8. The same trend is observed in the yellow
bars. Furthermore, when we implement the layers using F2N3, the computational latency of layers (shown
with red bars) is mostly correlated with the width and height of their corresponding IFMs. The width and
height of the IFMs for layers 11-13 are half the width and height of the IFMs for layers 8-10, respectively. The same
trend is also observed in the red bars.
[Figure: bar chart titled "Computational latency improvement using F2N3 for VGG-16"; y-axis: computational latency (us), x-axis: layers 8-13. The MAC-array latencies for layers 8-13 are 384.0, 772.0, 769.0, 753.0, 760.0, and 756.0 us, whereas the corresponding F2N3 latencies are 1.5, 1.5, 1.5, 0.24, 0.39, and 0.39 us.]
Figure 3.4: Layer-by-layer latency improvements achieved by using the F2N3 flow and fixed-function combinational logic functions for VGG-16. On average, we achieve around 1384x latency improvement using F2N3.
Furthermore, the power consumption of the F2N3 accelerator is 8.6 W compared to 10.1 W for the MAC array accelerator design. Considering energy consumption, employing F2N3 leads to around 893x energy savings.
3.8 Conclusion
This work introduced F2N3, an across-the-stack design and optimization framework for the construction
of resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators. With F2N3,
we argue for a new approach where FPGA resources are used more efficiently by unlocking the full potential of the LUTs. With this approach, F2N3 achieved 760x higher performance compared to the state-of-the-art generic MAC array-based accelerator when targeting a VGG-like DNN on the
same FPGA. We also achieved higher accuracy (up to 9%), lower latency (up to 16x), and higher resource
efficiency (up to 122x) compared to the latest LUT-based DNN inference accelerator from industry (Xilinx
LogicNets).
Chapter 4
Efficient Compilation and Mapping of Fixed Function Combinational
Logic on Digital Signal Processors Utilizing High-level Synthesis
4.1 Introduction
The embedded digital signal processing (DSP) blocks in modern Field Programmable Gate Arrays (FPGAs) support fast and efficient implementation of a wide variety of logic operations. They support simple
Boolean operations as well as complicated arithmetic operations such as multiplication in a single instruction, multiple data (SIMD) scheme. DSPs have evolved to support a wide range of applications requiring
significant amounts of Boolean operations that may not even necessarily fit on the available lookup tables
(LUTs) on an FPGA. In addition to the vast computation capabilities, DSP blocks support dynamic runtime programmability, which allows a single DSP block to be used as a different computational block in
each clock cycle. Vendor synthesis tools provide capabilities to utilize the available resources on FPGAs;
however, existing tool flows such as high-level synthesis tools fail to fully exploit the existing capabilities,
especially the dynamic programmability of DSPs.
Bajaj et al. [84, 80, 81, 82, 83] explore how DSP blocks can be deployed to produce high-throughput
computational kernels and how their dynamic programmability can be exploited to create efficient implementations of arithmetic expressions. However, their solution suffers from inefficient mapping when it
comes to implementing combinational Boolean functions using DSP blocks. In particular, high-level synthesis (HLS) tools do not support time-shared mapping of operations on available resources and usually
rely on backend synthesis tools to efficiently map and schedule the operations on the target architecture.
New applications have arisen that produce large sparse Boolean functions with many input variables
and product terms. Examples of such applications are found in [74, 73, 101] where the problem of efficient
processing of neural networks is formulated as a Boolean logic minimization problem where ultimately,
logic expressions compute output of various filters/neurons. In fact, [74] optimizes a target DNN for a
given dataset and maps essential parts of the computation in the DNN to ultra-low-latency, low-cost,
fixed-function, combinational logic blocks, which can be implemented using LUTs.
However, because neurons designed for state-of-the-art neural networks include tens to hundreds of
inputs, the generated Boolean logic expression is huge. Consider the eighth convolutional layer of the
VGG16 neural network [92] trained on the CIFAR-10 dataset [54]. This layer consists of 512 3 × 3 filters
that are applied to an input volume of 4 × 4 × 256. Therefore, the number of inputs to each filter is
3 × 3 × 256 = 2,304 while the number of input patches is 4 × 4 = 16. Therefore, to realize this layer using the approach presented in [74], hundreds of thousands of LUTs are needed. Our experiments show that the generated Boolean logic expressions for state-of-the-art networks such as VGG16 cannot be fit into one FPGA if only LUTs are utilized to process the logic expressions.
In this chapter, we propose a novel framework to map fixed function combinational blocks to DSPs on
the FPGAs. The proposed methodology starts by transforming a neural network specification to a set of
optimized fixed function combinational logic (FFCL) blocks using the NullaNet framework [74, 73]. Next,
we map each FFCL block to a set of Boolean operations, supported by the DSPs. The Boolean operations
are then scheduled to be executed on DSPs, and the compiler orchestrates the data movement from/to a host device to/from the FPGA to enable loading/storing the inputs/outputs of each DSP block in each computational
cycle. During processing, the input values are transferred to the FPGA, stored in the BRAMs, and loaded
to the registers associated with each DSP block. DSP blocks then carry out the Boolean operations in parallel and store the output values in registers. The output values of each FFCL block are transferred to pre-determined BRAM blocks and later loaded to the DRAM modules interfacing the FPGA. Utilizing the
proposed methodology, the operations associated with any FFCL module, irrespective of the number of
Boolean operations and the number of inputs/outputs, can be mapped to and executed by the DSP blocks
on an FPGA. Hence, the shortcomings associated with mapping a FFCL module to LUTs on FPGAs due to
resource limitations are completely addressed.
4.2 Terminology and Notation
Hereafter, we define the terminology and notation used throughout this chapter.
FFCL Module is defined as the netlist of a combinational circuit written in a hardware description language, such as Verilog.
Compute kernel (CK) is part of an application in the SDAccel framework and is associated with a given
FFCL module running on the FPGA.
Computational fabric (CF) refers to a fabric that carries out the execution of the compute kernel. In this
chapter, we use CF and FPGA interchangeably.
Computational unit (CU) refers to a block on the CF that performs the execution of a logic operation
such as AND, OR, or XOR. In this chapter, we use CU and DSP interchangeably.
Compute Cycle is the number of clock cycles it takes a compute kernel to execute a task.
Logic Level (depth) of a gate in a digital circuit is the maximum number of gates on any path from primary
inputs of the circuit to the gate.
Some of the parameters used in this chapter are summarized in Table 4.1.
Table 4.1: Symbols used in this chapter
Term Definition (bit-width used in this implementation)
λ Ratio of AXI data width to Address data width (36)
δ Ratio of AXI data width to Input data width (10)
ζ Ratio of AXI data width to Operation data width (85)
nsubk Number of subkernels
nregister_count (nsubk_addresses) Count of the registers holding input and output vectors for each CU
nDSP Number of DSPs (CUs)
4.3 Proposed Method
The overall flow of the proposed algorithm is as illustrated in Fig. 5.2. The input to the flow is a description
of a FFCL module in Verilog format. We first parse the Verilog netlist, synthesize the circuit using standard
logic optimization techniques, primarily aimed at reducing the total gate count and depth of the circuit,
and map the circuit to a customized cell library. The Boolean operations supported by the logic gates in the
cell library, such as two-input AND, OR, and XOR operations, must be supported by the computational unit
(CU), i.e., DSPs on the FPGA. Next, the mapped circuit is levelized. Starting from the primary inputs of the
module, each gate is assigned a logic level value that is one above the maximum logic level of its fanins. In
other words, the logic level of each gate is determined as the maximum number of gates on any path from
a primary input of the FFCL module to any of the inputs of the gate, in addition to one, accounting for the
gate itself. As a gate with a specific logic level does not have any connections to any other gate with the
same logic level, their operations can be executed simultaneously.
Subsequently, the set of operations carried out by the gates in each logic level in the mapped netlist is
decomposed into a set of sub-kernels; The number of sub-kernels is a function of the number of Boolean
operations carried out at each level and the maximum number of available DSPs in each CK. For instance,
if each CK is comprised of 1,000 DSPs and one level of an FFCL module contains 2,600 Boolean operations, the set of operations in this level is broken down into three sub-kernels. Note that since the FFCL
module is levelized, in case there are enough DSPs available on a CK, all the Boolean operations in different
sub-kernels can be performed in parallel as there are no data dependencies between sub-kernels. The total
number of sub-kernels required to implement the overall FFCL module using available DSPs in a CK is
calculated as the sum, over all the logic levels, of the number of sub-kernels for each logic level. Consequently, the number of sub-kernels is a function of both the logic depth and the number of Boolean
operations in each logic level of an FFCL module. Therefore, reducing the total number of Boolean operations and the depth of the combinational logic during the logic synthesis and technology mapping steps is
paramount. Once the sub-kernels implementing an FFCL module are determined, we assign the inputs and
outputs of each sub-kernel specific numbers, representing the locations in memory where input operands
are read from and outputs are saved to. These I/O assignments represent the locations in the memory of
the CK which can be implemented using available memory resources on FPGAs, such as BRAMs, URAMs,
or look-up-tables (LUTs). Additionally, the operation corresponding to each DSP in each sub-kernel is configured. Finally, the assignment of memory locations and operation opcodes for each sub-kernel is saved
in a JSON (JavaScript Object Notation) format, which will later be used to configure the operation of each
DSP in the CK in each time instance.
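For concreteness, a minimal sketch of the kind of per-sub-kernel record the compiler produces (serialized to JSON in the actual flow) is shown below; the struct and field names are our illustrative assumptions, not the exact format used by the tool.
#include <cstdint>
#include <vector>

// Sketch of one sub-kernel configuration record emitted by the compiler
// (serialized to JSON in the actual flow); names are illustrative.
struct SubkernelConfig {
    int ffcl_id;                                 // which FFCL module this belongs to
    int logic_level;                             // logic level of the grouped operations
    std::vector<std::uint8_t>  opcodes;          // one 6-bit opcode per DSP (CU)
    std::vector<std::uint16_t> input_addrs;      // two 14-bit BRAM addresses per DSP
    std::vector<std::uint16_t> output_addrs;     // one 14-bit BRAM address per DSP
};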
In the following sections, we first describe the proposed hardware accelerator and then present the
compiler to support the efficient mapping on the proposed hardware accelerator.
4.4 Hardware Acceleration
HLS tools can significantly reduce development times through abstraction; however, they are seen as an
additional step in the design flow: they generate RTL code which must then go through the time-consuming backend implementation flow. HLS design offers better abstractions for external modules
and interfaces, making it easier and faster to implement complex control logic.
When designing systems on FPGAs, we wish to maximize the performance and efficiency of our circuits. This means making the best use of all types of resources available to us. Since designers generally
write behavioral code that is then mapped by the implementation tools, performance and efficiency are
controlled for the most part by these tools’ capabilities. As architectures evolve with more complex resources, the tools have to work harder to make full use of them. While HLS tools allow higher-level design
description, the final mapping remains the purview of the backend tools. If these cannot map general RTL
code to exploit the architecture’s capabilities, the resulting implementations can be inefficient. More importantly, the information contained in the high-level design description may help achieve this but be lost
in the translation to generic RTL.
We develop a mix of register-transfer level (RTL)/C++ descriptions of the Boolean logic expression
accelerator where we use the RTL black-boxing feature of high-level synthesis (HLS) tools. This gives
the freedom to wrap the RTL description of DSPs in optimized synthesizable C++ templates, acting as external
modules or interfaces (like on-board double data rate memories, DDR), achieving low-latency accelerator
designs on FPGAs.
The computation engine of the proposed accelerator is a custom fixed-function combinational logic
fabric. This custom fixed-function logic, generated as described in Section 4.5.1, differs for each Boolean expression. The
reconfigurability of DSP blocks helps to reuse the same resources for implementing different Boolean
expressions. Every DSP block can perform a 48-bit bitwise logic operation including AND, OR, NOT,
NAND, NOR, XOR, and XNOR. This results in a SIMD scheme where we can perform the same operation
using one opcode for 48 different inputs. We store all input vectors in double data rate (DDR) memories
and then bring them to on-chip UltraRAMs (URAMs). We then divide the required computations into several rounds.
In each round, a subset of the inputs stored in the URAM is transferred to its input block
RAM (BRAM) for computation. In addition to the input/output vectors, we also store the opcodes and the addresses
from/to which the DSP registers must be read/written. These are stored in the opcode buffers, the
address memory buffers (c.f., Addr. Mem. buffers in Fig. 4.2), and the input vector buffers, respectively.
When we have all required data in the determined BRAM, we first read the data from designated
locations for each CU, i.e., a DSP block, from the FPGA BRAMs and move them to the registers of the
DSP blocks, which serve as the input for the custom combinational logic. The addresses for reading such
data are available in the address memory buffer (c.f., Fig. 4.2). The computation output is written to the
output register file of the DSP block and then to the BRAMs using the fetched address. By iterating over the
input vector BRAM and bringing different input data vectors to registers, performing the required Boolean
operations, and storing the results from output registers to output BRAMs, the required computation for
each Boolean expression is completed. This process is orchestrated by a control unit.
4.4.1 Memory Layout and Data Placement
To utilize the off-chip memory bandwidth, we group input data and addresses as well as opcodes before
sending them to the on-chip memory. As shown in Fig. 4.2, the width of the packed data is 512 bits.
Required computations for each Boolean expression are done by iteratively reading and transferring
data from the input BRAM to registers, executing custom combinational logic functions on the data, and
storing results from output registers to BRAMs.
The width of the data stored in the registers and BRAMs for each Boolean expression is 48 to match
the SIMD lanes in the DSP block's logic unit. The width of each opcode is 6 bits, while the width of each
address of a data location is 14 bits.
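The packing ratios of Table 4.1 follow directly from these widths and the 512-bit AXI word; a small sketch of the integer arithmetic (the constant names are ours):
// Packing ratios of Table 4.1, derived from the 512-bit AXI word and the field
// widths used in this design (48-bit data, 14-bit address, 6-bit opcode).
constexpr int kAxiWidth = 512;
constexpr int kDelta  = kAxiWidth / 48;  // 10 input data words per AXI beat
constexpr int kLambda = kAxiWidth / 14;  // 36 addresses per AXI beat
constexpr int kZeta   = kAxiWidth / 6;   // 85 opcodes per AXI beat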
4.4.2 Hardware and Software Optimization
In the following sections, we briefly discuss the optimizations considered in our hardware and software designs. To realize our design on an FPGA, we use Xilinx SDAccel and Vivado HLS, which provide
a toolchain for programming and optimizing different applications on Xilinx FPGAs using a high-level
language (C, C++ or OpenCL), as well as a runtime tool based on the OpenCL API, which is used by the
host-side software to interact with the hardware accelerator known as hardware kernel.
4.4.2.1 Burst Data Transfer
We load all input data required for an FFCL's computations at once, before any computations for the FFCL
can begin. This lowers the FFCL’s processing time, especially because now loading of input data, addresses,
and opcodes can be done simultaneously (input data and addresses are stored in separate off-chip memory
banks in the target FPGA board and are thus simultaneously accessible). Furthermore, we enable the full
burst read/write of data from/to the memory banks and utilize the maximum possible burst size (512-bit
width) and burst length (256 beats) allowable on the Advanced eXtensible Interface (AXI) bus of the target
FPGA board. Specifically, to enable burst read of input data vectors, addresses, and opcodes, we allocate
a URAM on the FPGA as shown in Fig. 4.2. Employing URAM blocks for implementing large buffers both
prevents over-utilizing the BRAM blocks, and helps us achieve a more balanced utilization of resources in
the target FPGA board.
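A minimal HLS-style sketch of such a burst read is shown below. The pragma spellings follow Vivado HLS conventions but are illustrative and may need adjustment for a particular tool version; buffer sizes and names are our assumptions, and the real kernel also reads address and opcode streams from separate banks (omitted here).
#include <ap_int.h>

// Sketch of a 512-bit burst read from off-chip memory into an on-chip buffer.
void burst_read(const ap_uint<512>* gmem, ap_uint<512> buf[4096], int n) {
#pragma HLS INTERFACE m_axi port=gmem offset=slave bundle=gmem0 max_read_burst_length=256
#pragma HLS INTERFACE s_axilite port=n
    for (int i = 0; i < n; ++i) {
#pragma HLS PIPELINE II=1
        buf[i] = gmem[i];   // contiguous accesses let the tool infer an AXI burst
    }
}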
4.4.2.2 Double Buffering
To further reduce the overhead of loading new FFCL data, during an FFCL's computations, we preferably pre-fetch the data for the next FFCL so that the actual computations of the next FFCL can start earlier.
To achieve this goal, we use the idea of double-buffering where we have two sets of buffers. One set of
buffers contains all data (input data vectors, addresses, opcodes) required for the current FFCL’s computations while the other is being filled with the data of the next FFCL that are read from the off-chip memory.
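The buffer rotation can be sketched as follows. The struct and function names are placeholders we introduce for illustration; in the actual design the overlap of the two calls inside the loop is realized by task-level pipelining in hardware, so this host-level sketch only shows the ping-pong indexing.
struct Buffers { /* input-vector, address, and opcode buffers */ };

// Placeholders for the data-movement and kernel-computation tasks of
// Sections 4.4.2.2-4.4.2.3; real bodies would issue the DDR-to-URAM transfers
// and drive the compute kernel, respectively.
void load_ffcl_data(int /*ffcl_id*/, Buffers& /*b*/) {}
void compute_ffcl(int /*ffcl_id*/, const Buffers& /*b*/) {}

// Ping-pong (double) buffering: while buffer `cur` is consumed by the compute
// task, buffer `nxt` is filled with the next FFCL's data.
void run_ffcls(int num_ffcls) {
    Buffers buf[2];
    if (num_ffcls > 0) load_ffcl_data(0, buf[0]);   // prologue: fill first buffer
    for (int i = 0; i < num_ffcls; ++i) {
        const int cur = i % 2, nxt = 1 - cur;
        if (i + 1 < num_ffcls)
            load_ffcl_data(i + 1, buf[nxt]);        // overlapped with the compute below
        compute_ffcl(i, buf[cur]);
    }
}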
4.4.2.3 Task pipelining
Utilizing double-buffering offers an opportunity to pipeline tasks. We define two tasks, data movements
and hardware kernel computations. In data movement, we transfer data from off-chip memory to URAMs
and distribute addresses and opcodes to the pre-assigned BRAMs. In the kernel computation task, we
iterate over vectors of the data, bring one vector of data from the URAMs to the BRAMs, and perform the on-chip operations. The details of these tasks are explained in Section 4.7.1. Then, the control unit
sends the necessary signals to read addresses from the address memory buffers and load the registers with the
required data from pre-defined locations. Finally, the control unit sends execution signals to the CUs to perform
the required computations and store the results back to the BRAMs. The data transfer of the second task is shown
with red arrows in Fig. 4.2.
4.4.2.4 Multiple Parallel Accelerators
Utilizing multiple parallel hardware kernels (i.e., accelerators) allows temporal parallelism, where the same
hardware kernel processes different sets of data or different FFCL modules. This is made possible by enqueuing
multiple hardware kernels in a pipelined manner. To achieve this, we follow the OpenCL programming paradigm. Enqueuing multiple accelerators happens through multiple OpenCL clEnqueueTask
commands in the host code, issued in a pipelined manner. clEnqueueTask enqueues an accelerator to a
command queue. OpenCL's clCreateCommandQueue API creates a command queue which keeps track
of the queued tasks. In our design, we use an out-of-order command queue to concurrently execute multiple
hardware kernels.
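A minimal sketch of this host-side pattern is given below, assuming the usual OpenCL/SDAccel setup (context, device, and kernel creation) has already been done; error handling is omitted and the function name is ours.
#include <CL/cl.h>

// Enqueue multiple accelerator instances on an out-of-order command queue,
// following the scheme of Section 4.4.2.4.
void launch_accelerators(cl_context ctx, cl_device_id dev,
                         cl_kernel kernels[], int num_kernels) {
    cl_int err = CL_SUCCESS;
    cl_command_queue q = clCreateCommandQueue(
        ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

    // Each clEnqueueTask launches one hardware kernel (accelerator instance);
    // with an out-of-order queue they can execute concurrently.
    for (int i = 0; i < num_kernels; ++i)
        clEnqueueTask(q, kernels[i], 0, nullptr, nullptr);

    clFinish(q);                    // wait for all enqueued accelerators
    clReleaseCommandQueue(q);
}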
4.5 Compiler
As described in Section 5.4, our compiler parses, synthesizes, and levelizes the Verilog netlist to extract
the set of operations carried out by the gates in each logic level in the mapped netlist. Then, our compiler
decides on the total number of sub-kernels required to implement the overall FFCL module using available
DSPs in a CK considering the proposed hardware accelerator and available resources on the given FPGA.
Finally, the assignment of memory locations and operation opcodes for each sub-kernel is determined, which
will later be used to configure the operation of each DSP in the CK in each time instance.
In the following subsections, each of the above steps is detailed and examples are presented to further
illustrate the proposed methodology.
4.5.1 Mapping to Logic Gates
In this section, we provide details about the process of mapping an FFCL module to a set of logic gates
and the assignment of logic operations to DSP resources in the computational fabric, i.e., the FPGA. In the first
step, we use standard logic synthesis techniques to reduce the total gate count and the maximum logic
depth of a circuit. The goal of this step is to transform a Verilog description of a FFCL module to a set of
logic gates comprised of Boolean operations that can be carried out by the computational fabric, i.e., DSP
devices on the FPGA. We use the ABC synthesis tool [10] to map an FFCL module to logic gates using the
following commands:
$ resyn; resyn2; resyn2rs; compress2rs; st;
$ map; st; dch; map; st; dch; map
The first five commands reduce the size of the AND-Inverter Graph (AIG) representing the input logic
network and are heuristic methods that optimize the AIG network. The map command is a k-feasible cut
based mapper that maps the optimized AIG to a set of logic gates. The st command transforms the network
back to the AIG form. The dch command performs AIG-based synthesis by repeatedly applying a sequence
of technology-independent logic optimizations (using AIG rewriting rules).
Once the aforesaid sequence of commands is completed, the netlist is transformed to a set of 2-input
logic gates, comprised of logic gates such as AND, OR, XOR, etc. Next, we assign logic levels to each gate.
In this step, each logic gate is assigned a logic level l as follows. The logic levels of the inputs of the FFCL
module are set to 0. For each gate, the logic level is simply computed as follows:
l_i = 1 + max_{j ∈ fanin_i} l_j (4.1)
where fanin_i represents the set of fanin gates of gate i. This step is straightforward and can be accomplished using a breadth-first graph traversal starting from the primary inputs.
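A minimal sketch of this levelization step is given below, assuming the mapped netlist is available as a topologically ordered list of two-input gates; the struct layout and names are our illustrative assumptions.
#include <algorithm>
#include <vector>

// One two-input gate in the mapped netlist; fanins refer either to primary
// inputs (logic level 0) or to other gates.
struct Gate {
    std::vector<int> fanins;  // indices of fanin gates; -1 denotes a primary input
    int level = 0;
};

// Assign logic levels per Eq. (4.1), assuming `gates` is in topological order
// (fanins appear before the gates that use them).
void levelize(std::vector<Gate>& gates) {
    for (auto& g : gates) {
        int max_fanin_level = 0;
        for (int f : g.fanins)
            if (f >= 0)   // primary inputs have level 0
                max_fanin_level = std::max(max_fanin_level, gates[f].level);
        g.level = 1 + max_fanin_level;
    }
}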
Because gates at the same logic level cannot have any connections to each other (i.e., from output of
one to input of another), their operations can be carried out simultaneously. However, due to computational resource limitations i.e., the total number of available DSPs on the FPGA, it may be that not all the
operations in a logic level can be executed in the same compute cycle. In such a case, the set of logic
operations corresponding to a logic level are broken down into smaller subsets, called sub-kernels, and
operations corresponding to different sub-kernels are carried out in a sequential manner.
We then compute the total number of sub-kernels at each logic level. If there are a total of n_i Boolean operations in logic level i and a total of nDSP DSPs in the target FPGA, the
total number of sub-kernels of logic level i will be ⌈n_i / nDSP⌉. Accordingly, the total number of sub-kernels
required for mapping an FFCL module onto the FPGA is the sum (over all the logic levels) of the number of
sub-kernels for each logic level. Hence, minimizing the total number of compute cycles is an optimization
problem, involving both the total gate count and the max logic level of the circuit.
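This per-level decomposition, which anticipates Eq. (4.26) below, can be sketched in a few lines; the function name is ours.
#include <vector>

// Total number of sub-kernels for an FFCL module: one ceil(n_l / n_DSP) term
// per logic level l (cf. Eq. (4.26)).
int total_subkernels(const std::vector<int>& ops_per_level, int n_dsp) {
    int total = 0;
    for (int n_ops : ops_per_level)
        total += (n_ops + n_dsp - 1) / n_dsp;   // ceiling division
    return total;
}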
Next, Boolean operations corresponding to each sub-kernel are mapped to DSPs on the FPGA. Additionally, memory locations from which each DSP reads its inputs and locations where it writes its output
must be determined. To do so, we first create a mapping between the inputs/outputs of each logic level
with the BRAM locations on the FPGA. If there are a total number of I inputs and O outputs in a logic
level, we require a total number of I + O BRAM locations to store the data values corresponding to this
logic level. The compiler first creates a simple mapping between the inputs/outputs of each logic level
with BRAM locations on the FPGA. Next, for each sub-kernel, the assignment of Boolean operations to
DSP devices as well as the locations for which data values are obtained from and written to are determined;
Assuming there are k Boolean operations in a sub-kernel, we require k DSP devices; Since each DSP device
reads two input values and generates one output value, we require 2 × k memory locations for storing
inputs of the sub-kernel and a total number of k memory locations for storing output values. Initially, each
Boolean operation is assigned to a DSP device, starting from 0 to k −1. Next, the memory locations where
the inputs of the first DSP are obtained from are calculated using the aforementioned mapping between
inputs of the logic level and the BRAM locations. Similarly, the BRAM location where the output of the
Boolean operation should be written to is obtained using the same mapping. These memory addresses are
then saved to BRAM locations; Consequently, during execution, the memory addresses where each DSP
reads/writes data from/to is predetermined.
In the proposed methodology, the memory addresses of the inputs of the p-th DSP are saved in BRAM
locations 2p and 2p + 1. Similarly, the address where the output of the first DSP is written is saved at
BRAM location 2k. The memory address where the output of the last DSP is written is saved in BRAM
location 3k − 1. Finally, the look-up table containing the assignment of BRAM locations is transferred
to the FPGA and stored on BRAMs. At runtime, the host device reads the BRAM location assignments
and transfers the contents of corresponding BRAM locations to the inputs registers of the DSP devices.
Similarly, the outputs of DSP devices are stored to the pre-determined BRAM locations.
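The construction of this address table can be sketched as follows; the struct and function names are our illustrative assumptions, and the layout matches the 2p / 2p+1 / 2k+p convention described above.
#include <vector>

// One two-input Boolean operation inside a sub-kernel; in0, in1, and out are
// the data-vector-buffer indices assigned by the compiler to its operands.
struct Op { int in0, in1, out; };

// Build the per-sub-kernel address table: the input addresses of the p-th DSP
// go to entries 2p and 2p+1, and its output address to entry 2k + p,
// where k is the number of operations (and hence DSPs) in the sub-kernel.
std::vector<int> build_addr_table(const std::vector<Op>& ops) {
    const int k = static_cast<int>(ops.size());
    std::vector<int> table(3 * k, 0);
    for (int p = 0; p < k; ++p) {
        table[2 * p]     = ops[p].in0;   // first input operand of DSP p
        table[2 * p + 1] = ops[p].in1;   // second input operand of DSP p
        table[2 * k + p] = ops[p].out;   // output location of DSP p
    }
    return table;
}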
Consider an example where a 4-input AND gate is implemented using three 2-input AND gates. Such
a simple circuit can be broken down into two logic levels, where in the first logic level two parallel AND
operations are carried out; therefore, the first logic level has four inputs and two outputs. Initially, the
inputs (A, B, C, D) and outputs (O1 and O2) are simply mapped to BRAM locations 0–5. Next, we
assume that the first AND operation (i.e., O1 = A & B) is mapped to DSP 1. Similarly, the second
AND operation is mapped to DSP 2. Using the proposed methodology, the memory addresses of the inputs
of DSP 1 are stored in BRAM locations 0xFF00 and 0xFF01. Similarly, the memory location where the
output of DSP 1 should be written is stored in BRAM location 0xFF02. According to the mapping
between the input/output signals of this logic level and BRAM locations, the contents of BRAM locations
0xFF00–0xFF02 would be as follows: 0, 1, 4. Similarly, if the inputs/output of DSP 2 are mapped to
BRAM locations 0xFF03–0xFF05, these memory locations will contain the following values: 2, 3, 5.
In this manner, during execution, each DSP simply reads the input values of a logic level, generates the
corresponding output and writes the output to the pre-determined BRAM location.
An illustrative example is presented in Section 4.5.3 to further elaborate on the proposed methodology.
4.5.2 Compiler Optimizations
In this section, we formulate the number of compute cycles (CCs) as a function of the topology of the
neural network, i.e., its number of layers, number of filters per layer, number of Boolean operations per
filter, numbers of inputs and outputs per filter, characteristics of the input data set, and configurations of
the computational fabric, i.e., the number of available resources such as DSPs, BRAMs, URAMs, LUTs, etc.
for each CK. Subsequently, we deploy the developed model and present a methodology to minimize the
total number of CCs by optimizing the total number of DSPs on each CK.
The number of CCs for carrying out the computations of an FFCL is a function of a multitude of parameters,
including the topology of the neural network architecture, characteristics of the input data set, and number
and type of available resources in the CF, i.e., the target FPGA. As described in Section 4.5.1, the process of
mapping a FFCL to CKs involves two major tasks: (i) data movement and (ii) Boolean operations. The data
movement task, which transfers data from the host memory to the device memory and vice versa, incurs
a latency that is a function of the available resources on the CF, such as sizes of the BRAMs and URAMs,
and characteristics of the communication means of the CF, such as the PCIe bandwidth and I/O data width.
Before (and after) doing any CK computations, the input (and output) data must also be transferred from the
global CF memory to the local CK memory (and vice versa). For example, there are data transfers from/to
BRAMs to/from DSP registers. The data movement task also accounts for this latency. The operation
latency is associated only with that of processing the input vectors of the CKs and producing the output
results. The total CC count for executing an FFCL is defined as:
ncc = ndata_moves + ncompute (4.2)
One of the primary optimizations in the proposed flow is the parallel execution of the two aforesaid
tasks. That is, one can transfer input vectors and do the assignment of memory addresses and opcodes for
one CK while simultaneously carrying out the Boolean operations on the input vectors for another CK.
Consequently, we can minimize the total CCs required for doing multiple CKs (which is the case in the
NN inference) on a target CF by pipelining these two tasks so that the data movement for one CK overlaps
in time with the Boolean operations of another CK. With this pipelining scheme, the overall latency of
executing m FFCLs on a target CF may be calculated as follows:
ncc = (m + 1) × max(ndata_moves, ncompute) (4.3)
More precisely, the data movement cost consists of three parts: i) the cost of reading the input data
and transferring operation codes that specify the Boolean operations to be performed on each DSP, ii) the
cost of setting up addresses for loading from (and storing to) BRAMs before (and after) performing DSP
computations, and iii) the cost of transferring the generated results. It can be written as follows:
ndata_moves = nread_inputs_opcode_mem + nread_addr_mem + nwrite_output_mem (4.4)
In our experiments, we observe that the latency associated with storing outputs is negligible. Therefore,
it can be omitted from equation (4.4). Furthermore, the cost of data movements associated with the transfer
of memory addresses is larger than other data movement costs. So, we assign one (external) double data
rate (DDR) SDRAM bank to the input vectors and operation codes and all other available DDR SDRAM
banks to the memory addresses. Since we parallelize the two tasks of reading the input data and opcodes
and the memory addresses load, the cost of data movements may be rewritten as:
ndata_moves = max(nread_inputs_opcode_mem, nread_addr_mem) (4.5)
In the following, we present formulations to estimate each of the variables in (4.5). Data movement
consists of transferring input vectors, i.e., inputs of the FFCLs (which can be pixels of the input data set
or bit-packed vectors of the intermediate layers/filters), transferring Boolean operation codes that should
be executed by the CKs, and transferring addresses of memory locations to load from or store values in
the BRAMs which are being utilized by the CKs. The cost associated with bringing addresses of memory
locations into the global on-chip memory and then distributing them to the local on-chip memories may
be calculated as follows:
nread_addr_mem = nAM_DRAM_to_URAM + (k − 1) × nAM_URAM_to_BRAM (4.6)
where nAM_DRAM_to_URAM is the latency of bringing addresses of memory locations from external DRAM
into the URAM, k denotes the total number of available DDR banks, which is 4 in our case (see Section 4.7
for more details regarding the target FPGA board), and nAM_URAM_to_BRAM is the cost of transferring
data from URAM into BRAM. Note that we use URAM (BRAM) as the global (local) on-chip memory in
our design. The factor k − 1 multiplies nAM_URAM_to_BRAM since we have to distribute the addresses of memory
locations read from the k − 1 external DRAM banks sequentially to the local on-chip memories.
We then have:
nAM_DRAM_to_URAM = nsubkernels × nsubk_addresses (4.7)
where nsubkernels denotes the number of sub-kernels of a compute kernel and nsubk_addresses is the maximum number of addresses required for a sub-kernel.
Since there are two input and one output registers associated with each DSP, we need to bring three
addresses for each DSP. So, the maximum number of addresses is:
nsubk_addresses = 3 × nDSP (4.8)
Because the bit width of the PCIe bus is much larger than the number of bits assigned to each address, we can
pack several such addresses into one bus transaction. Let us denote this packing factor by λ (c.f. Table
4.1). Moreover, we assign several memory banks to this task to reduce its total latency.
Hence, the total latency associated with this task may be rewritten as:
nAM_DRAM_to_URAM = (nsubkernels × nsubk_addresses) / (λ × (k − 1)) = α × nsubkernels × nDSP (4.9)
where
α = 3 / (λ × (k − 1)) (4.10)
The cost of transferring addresses from URAM to BRAM may be expressed as:
nAM_URAM_to_BRAM = (1/2) × nAM_DRAM_to_URAM (4.11)
Notice that nAM_URAM_to_BRAM is half of nAM_DRAM_to_URAM because we rely on true dual-port BRAMs
with the ability to perform any combination of independent read or write operations in the same clock
cycle. Aggregating (4.6), (4.9), and (4.11) yields:
nread_addr_mem = ((k + 1)/2) × nAM_DRAM_to_URAM = β × nsubkernels × nDSP (4.12)
where
β = ((k + 1)/2) × α (4.13)
The cost associated with transferring the input vectors to each CK plus that of transferring Boolean
operation assignments to each DSP on a CK can be estimated as follows:
nread_inputs_opcode_mem = ⌈(ninput_vectors × nfanin) / δ⌉ + ⌈(nsubkernels × nDSP) / ζ⌉ (4.14)
where ninput_vectors is the total number of vectors that must be applied to the FFCL and nfanin is the number of
primary fanins in the given FFCL. Notice that δ and ζ appear in equation (4.14) to capture the effect of
data packing similar to equation (4.9). Finally, by combining equations (4.5), (4.12), and (4.14), the overall
data movement cost can be expressed as:
ndata_moves = max(⌈(1/δ) × ninput_vectors × nfanin⌉ + ⌈(1/ζ) × nsubkernels × nDSP⌉, β × nsubkernels × nDSP) (4.15)
The CC associated with computing the output of each CK for each input vector may be simplified as
follows:
ncompute_one_CK = nloop_subkernels + noutputs (4.16)
where noutputs is the cost associated with storing the generated outputs to the local memory assigned
for storing the results. Recall that each FFCL is divided into a collection of subkernels. nloop_subkernels,
which accounts for the cost of feeding the DSP registers with the proper data and then executing the logic
operations in DSPs and storing back the results to the proper locations of local memory, is calculated as
follows:
nloop_subkernels = nsubkernels × (nBRAM_to_DSP_regs + nexe_logic_ops + nDSP_reg_to_BRAM ) (4.17)
The latency of bringing the data to the input registers of the DSPs is as follows:
nBRAM_to_DSP_regs = (2/3) × nsubk_addresses (4.18)
Note that the factor 2/3 arises because two out of the three registers associated with a CU hold input data and
one holds the generated output data. We have observed that the latency of this task dominates nloop_subkernels
when we increase the number of DSPs. Therefore, it is important to perform this task very efficiently. Since
we already partition the memory for storing the addresses by a factor of λ, as seen in equation (4.11), we
are able to parallelize the data transfer by a maximum factor of λ. However, we must access λ data values
at the same time. Unfortunately, the memory for storing the input vector is not and cannot be partitioned
because there are no patterns in accessing the data which can dictate a reasonable partitioning solution.
So, we use another trick and copy the input vector to multiple on-chip memories to increase the number
of access lines. This results in a considerable reduction in the cost associated with feeding the new data to
DSPs. Hence, after this optimization, equation (4.18) may be rewritten as:
nBRAM_to_DSP_regs = (2/3) × (nsubk_addresses / λ) (4.19)
Moreover, the latency associated with copying an input vector to multiple on-chip locations must be added
to equation (4.16):
ncompute_one_CK = ncopy_mem_in + nloop_subkernels + noutputs (4.20)
where
ncopy_mem_in = nfanin (4.21)
Similar to equation (4.19), the latency of transferring the results from DSP registers to the local on-chip
memory is:
nDSP_reg_to_BRAM = (1/2) × nBRAM_to_DSP_regs (4.22)
Notice that nDSP_reg_to_BRAM is half of nBRAM_to_DSP_regs since there is only one output register assigned to each DSP.
Aggregating equations (4.17), (4.19), and (4.22), we can write:
nloop_subkernels = nsubkernels × ((2 × nDSP)/λ + nexe_logic_ops + (1/2) × (2 × nDSP)/λ) (4.23)
where nexe_logic_ops denotes the latency of a CU for executing a logic operation. If we have multiple data
vectors, then the total latency of the compute part can be summarized as follows:
ncompute = ninput_vectors × ncompute_one_CK (4.24)
= ninput_vectors × (nfanin + nsubkernels × ((2 × nDSP)/λ + nexe_logic_ops + (1/2) × (2 × nDSP)/λ) + noutputs)
By aggregating the aforementioned equations in this section, one can obtain the final CCs for a single
CK carrying out the computation on an arbitrary number of input vectors as follows:
ncc = (m + 1) × max( max(⌈(1/δ) × ninput_vectors × nfanin⌉ + ⌈(1/ζ) × nsubkernels × nDSP⌉, β × nsubkernels × nDSP),
ninput_vectors × (nfanin + nsubkernels × (⌈(2 × nDSP)/λ⌉ + nexe_logic_ops + ⌈(1/2) × ⌈(2 × nDSP)/λ⌉⌉) + noutputs) ) (4.25)
where m denotes the number of FFCLs as stated above.
Additionally, the number of subkernels associated with each FFCL is a function of the number of logic
levels, the number of Boolean operations per logic level, and the number of available DSPs as follows:
nsubkernels = Σ_{l=1}^{L} ⌈n^l_gates / nDSP⌉ (4.26)
where l = 1 . . . L denotes the logic levels for each FFCL. As seen from (4.25), the number of CCs is a
non-linear function of the number of available CUs on the CF.
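The model can be sketched directly as a function of the DSP count. The code below follows equations (4.10)-(4.25) as written above; the struct and function names, as well as the assumed single-cycle logic-operation latency, are our illustrative assumptions rather than part of the original flow.
#include <algorithm>
#include <cmath>

// Sketch of the analytical cycle-count model; parameter names follow Table 4.1.
struct ModelParams {
    int lambda = 36, delta = 10, zeta = 85;  // packing ratios (Table 4.1)
    int k = 4;                               // number of DDR banks
    int n_exe_logic_ops = 1;                 // assumed latency of one CU operation
};

long long predict_cc(int n_dsp, int n_subkernels, int n_input_vectors,
                     int n_fanin, int n_outputs, int m_ffcls,
                     const ModelParams& p) {
    auto cdiv = [](long long a, long long b) { return (a + b - 1) / b; };
    const double alpha = 3.0 / (p.lambda * (p.k - 1));               // Eq. (4.10)
    const double beta  = (p.k + 1) / 2.0 * alpha;                    // Eq. (4.13)
    const long long read_in_op =
        cdiv(1LL * n_input_vectors * n_fanin, p.delta) +
        cdiv(1LL * n_subkernels * n_dsp, p.zeta);                    // Eq. (4.14)
    const long long read_addr =
        (long long)std::ceil(beta * n_subkernels * n_dsp);           // Eq. (4.12)
    const long long data_moves = std::max(read_in_op, read_addr);    // Eq. (4.15)
    const long long per_vector =
        n_fanin +
        1LL * n_subkernels * (cdiv(2LL * n_dsp, p.lambda) + p.n_exe_logic_ops +
                              cdiv(cdiv(2LL * n_dsp, p.lambda), 2)) +
        n_outputs;                                                   // Eq. (4.24)
    const long long compute = 1LL * n_input_vectors * per_vector;
    return (m_ffcls + 1LL) * std::max(data_moves, compute);          // Eq. (4.25)
}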
4.5.3 Illustrating Examples
In this section, we outline the process of mapping two FFCL modules to a set of CUs (DSPs) on a CF (FPGA),
using the proposed methodology.
Consider two FFCL modules, each implementing a Boolean expression of 4 input values, as depicted
in Figures 4.3 and 4.4. In the proposed flow, initially, each of the designs is levelized, i.e., operations in the
design are assigned to a specific logic level based on an as-soon-as-possible scheduling strategy. As a result,
each Boolean operation is assigned a logic level l, where l denotes the smallest value larger than the logic
level of all of its input operands. Consequently, designs 1 and 2 comprise 2 and 3 logic levels, respectively. Next,
according to the number of available resources in the CF (i.e., the number of available CUs), the set of
Boolean operations in each logic level are clustered into one or multiple subkernels. In this example, there
are only two CUs available. Consequently, each level in design 1 forms a single subkernel, whereas level
1 in design 2 is divided into two subkernels. Due to data dependencies, the operations of all subkernels of
a level should be completed before subkernels of the next level can be launched. The number of subkernels
for all the levels of an FFCL module determines the number of clock cycles it takes for the CF to compute
the output result of the FFCL for one vector of its input values. Accordingly, the computations for designs
1 and 2 are completed within 2 and 4 cycles, respectively.
Now consider each of the designs g1 and g2. The contents of the input data buffer, opcode buffer, and
address memory buffer for realizing function g1 are depicted in Table 4.2.
Table 4.2: The contents of the input data buffer, opcode buffer, and address memory buffer for realizing
function g1 (c.f. fig. 4.3).
Index Data Vec. Buf. Addr. Mem. Buf. Opcode Buf.
0 0x0000 [ 2, 3, 4, 5 ] [AND,AND]
1 0xFFFF [ 6, 7, 0, 0 ] [AND, NOP]
2 a [ 6, 7, 0, 0 ] -∗
3 b [ 8, 0, 0, 0 ] -
4 c - -
5 d - -
6 w1 = a & b - -
7 w2 = c & d - -
8 out = w1 & w2 - -
9 0x0000 - -
∗Entries marked "-" indicate that there is no data at the corresponding index. Note that the address memory and opcode
buffers are smaller than the data vector buffer in this example.
In our proposed methodology, indices 0 and 1 of the input data vector are always filled with constant values of 0 and 1, representing constant values in Boolean expressions. Furthermore, the compiler
populates the next four indices with the values of inputs of the FFCL module, i.e., values of inputs a –
d. As outlined before, the compiler creates a mapping between internal values, i.e., nodes in the graph
Table 4.3: The contents of the input data buffer, opcode buffer, and address memory buffer for realizing
function g2 (c.f. fig. 4.4).
Index Data Vec. Buf. Addr. Mem. Buf. Opcode Buf.
0 0x0000 [ 3, 4, 3, 2 ] [XOR_OP, XOR_OP]
1 0xFFFF [ 6 (w1), 7 (w2), 0, 0 ] [XOR_OP, OR_OP]
2 a [ 5, 2, 5, 4 ] [XOR_OP, AND_OP]
3 b [ 8 (w3), 9 (w4), 0, 0 ] [AND_OP, NOP]
4 c [6, 8, 7, 9] -
5 d [ 10 (w5), 11 (w6), 0, 0 ] -
6 w1= b ^ c [ 11 (out), 0, 0, 0] -
7 w2 = b ^ a - -
8 w3 = d ^ a - -
9 w4 = d ∥ c - -
10 w5 = w1 ^ w3 - -
11 w6 = w2 & w4 - -
12 out = w6 & w5 - -
representing the Boolean function, and locations in the memory. Consequently, the intermediate values
corresponding to nodes w1 and w2 are mapped to locations 6 and 7 of the input data vector buffer. Finally,
the output value is stored in location 8 of the data vector buffer. As observed, the total size of the data
vector buffer for realizing an FFCL is calculated as the total number of nodes of the DAG representing the
Boolean expression.
The contents of the address memory buffer are also listed in Table 4.2. As shown, the cardinality of
each vector in the address memory buffer is equal to 2× the number of CUs, as each CU requires two
operands. In the first compute cycle, the first CU reads the value of the operands from indices 2 and 3 of
the memory while the second CU obtains its operands by reading input data vector buffer locations 4 and
5, respectively (cf. Table 4.2, Column address memory buffer, Index 0). Subsequently, CUs write the output
values into locations 6 and 7 of the data vector buffer.
In the second compute cycle, the first CU performs its operation on the values obtained from the data vector
buffer locations 6 and 7 and stores the output in data vector buffer location 8. In this cycle, the second CU
does not perform any operations. As shown in Table 4.2, the Boolean operation carried out by each CU
is stored in the opcode memory, in a similar fashion to the address memory buffer, albeit with the difference
that the cardinality of the vectors in the opcode memory is equal to the number of CUs.
Similar to design g1, the contents of the memories for design g2 are listed in Table 4.3. As shown, since
the number of subkernels and operations carried out by FFCL module g2 is larger than that of g1, the sizes of the
memories increase.
4.6 Application of Proposed Method to NN Inference
In this section, we discuss the application of the proposed method to FFCL-mapped neural networks.
As an example, we explore NullaNet as a natural fit for our framework and then present the
compiler optimizations specific to neural network inference. A summary of the NullaNet [74] flow is
given in Section 2.2.
4.6.1 Optimization Problem
When running a given data set (e.g. CIFAR-10) through a set of FFCLs and carrying out computations
associated with a given neural network (e.g. VGG-16), minimizing the total CCs (determined by the summation of CCs for each neural network layer) is of paramount importance. Assuming the computation of
each layer is carried out in a sequential manner (i.e., we finish computations of one layer before starting
computations of the next layer), the total CCs for computing the outputs of a set of FFCLs can be formulated
as follows:
n^nn_cc = Σ_{i=1}^{M} n^i_filter × n^i_cc (4.27)
In the above equation, i = 1 . . . M indexes the layers of the neural network, n^i_filter denotes the
number of filters in layer i of the network, and n^i_cc is the average CC count for applying a filter to the input
feature map of layer i of the network.
As previously mentioned, depending on the available resources on the CF, one can reduce the total
cycle count by launching multiple CKs in parallel, all executing the same Boolean function on the set of
input vectors. In this manner, the total number of CCs for a set of CKs implementing an FFCL module is
reduced by a factor equal to the number of parallel executing CKs (denoted by nparallel_factor) as follows:
n^tot_cc = n^nn_cc / nparallel_factor (4.28)
Because the number of input vectors, the number of outputs, and the fanin count corresponding to each
filter in each layer of the network are determined by the characteristics of the input data set, the network topology, and the deployed pruning algorithms, minimizing the total CCs is a multi-tier optimization. However,
assuming a fixed network, data set, and compression algorithm, one can obtain the optimal number of
CUs, i.e., DSPs, by minimizing equation (4.27):
min Σ_{i=1}^{M} ⌈(n^i_filter × n^i_cc) / nparallel_factor⌉
subject to nDSP ≤ N_DSP (4.29)
where N_DSP denotes the number of DSPs in the target FPGA.
Using the above problem formulation, the design space for the given neural network (associated with
different values of CUs) is explored by using a simple binary search algorithm in order to find the best
solution for the given neural network (see Fig. 4.5).
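A sketch of this design-space exploration is given below. The text describes a simple binary search over the (empirically unimodal, cf. Fig. 4.5) latency curve; the ternary-style narrowing used here is one way to realize that idea, and predict_cc is a placeholder for the analytical model of Section 4.5.2.
#include <functional>

// Find the DSP count minimizing the predicted cycle count (Eq. (4.29)),
// assuming the latency model n_cc(n_DSP) is unimodal in n_DSP.
int best_dsp_count(const std::function<long long(int)>& predict_cc,
                   int min_dsp, int max_dsp) {
    int lo = min_dsp, hi = max_dsp;
    while (hi - lo > 2) {                         // narrow the unimodal range
        const int m1 = lo + (hi - lo) / 3;
        const int m2 = hi - (hi - lo) / 3;
        if (predict_cc(m1) < predict_cc(m2)) hi = m2; else lo = m1;
    }
    int best = lo;
    for (int n = lo + 1; n <= hi; ++n)            // check the few remaining points
        if (predict_cc(n) < predict_cc(best)) best = n;
    return best;
}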
4.7 Simulation Results
For evaluation purposes, we targeted a high-end Virtex® UltraScale+ FPGA (Xilinx VU9P FPGA, which
is available in the cloud as the AWS EC2 F1 instance). This FPGA platform includes 64 GiB DDR4 ECC
protected memory, with a dedicated PCIe x16 connection. There are four DDR banks. This FPGA contains
approximately 2.5 million logic elements and approximately 6,800 DSP units∗.
First, we assess the proposed model described in Section 4.5.2. Then, we discuss the efficiency
of the proposed accelerator presented in Section 4.4. Finally, we evaluate our proposed method on two
well-known CNNs, i.e., VGG16 [92] and LENET-5 [58], and two commonly used computer-vision datasets
for object recognition, i.e., the CIFAR-10 [54] and MNIST [27] datasets.
4.7.1 Efficacy of Model Used in Compiler Optimization
To show the efficacy of the proposed analytical model used in the compiler optimizations, we compare
the latency predicted by our compiler to the actual results measured on the FPGA. For our comparison,
we employ the parameterized analytical model presented in Section 4.5.2 and extract the expected
latency. Then, we evaluate how closely our model matches the actual realization by comparing latency values
across different numbers of DSPs. Fig. 4.5 shows a comparison between our proposed model used in the
compiler and the actual hardware implementation in terms of the achieved performance for layer 7 of the VGG16
network. Our model predicts the actual performance within a 10% error.
As can be seen in Fig. 4.5, the design space can be explored through a binary search algorithm to
find the best solution in terms of performance. The best solution is the one that minimizes the latency,
represented by the number of cycles needed to compute all FFCLs, by varying the number of CUs used
for the computations. Reducing the number of CUs initially boosts the performance since the data movement cost
∗ https://aws.amazon.com/education/F1-instances-for-educators/
decreases. We observed, however, that the total latency increases past a certain point as the number of CUs
is reduced further, because the computation cost becomes dominant.
4.7.2 Analytical Comparison: Memory communications vs Computations
We analyze and quantify the strengths of our proposed accelerator in terms of the latency (i.e., number of
spent cycles) of memory communication versus computation. Fig. 4.6 depicts the percentage
of latency spent in the memory communication and computation phases. Our proposed compilation and
mapping algorithms achieve a balance between memory communication and computation latency
across various numbers of utilized DSPs. Computation latency outweighs the memory communication latency
when the number of DSPs is reduced. This is expected since we reduce the total number of resources, so
the computations take longer to finish.
Fig. 4.7 shows the importance of achieving a balance between data movement and computation latency in our proposed framework. If we do not consider this balance between the two tasks in our design, we
have to accept sub-optimal solutions because we pipeline the tasks as explained in Section 4.4.2.3. In other
words, since the throughput of a pipeline cannot be better than that of its slowest stage (i.e., a task latency
in our case), the designer and the compiler should try to divide the work and resources among the stages
so that the tasks take the same amount of time to complete.
4.7.3 Comparison Between MAC-based, XNOR-based, and NullaNet-based Implementations on CNNs
We use VGG-16 with the CIFAR-10 dataset as a case study. The 16 in VGG16 refers to its 16 layers that have
weights. This is a fairly large network with about 138 million parameters. We implement
the intermediate convolutional layers 2-13 of VGG-16 using the proposed framework and fixed-function combinational logic blocks.
As a baseline for the state-of-the-art generic MAC array-based accelerator for the layers realized using
conventional MAC calculations, we used the open-source implementation of [93] with some improvements
including transferring all weights required for the computation of the layer from the external memory
into BRAMs, where these weights get reused for calculations corresponding to different patches of input
feature maps. Furthermore, partial sums of accumulation for processing the output of a filter/neuron are
also stored in the register file of the same processing element. Considering these improvements, we reduce
the latency of VGG-16 inference employing the generic MAC array-based accelerator.
We used FINN [102] for our XNOR-based baseline and replaced the LUT-based XNOR unit with a
DSP-based XNOR unit in the Matrix-Vector–Threshold Unit (MVTU).
Fig. 4.8 shows the achieved performance of all three types of implementations for different numbers
of DSPs. As illustrated in the figure, the XNOR implementation achieves significant savings since all
intermediate values and weights are kept in on-chip memories and there is no cost associated with off-chip memory accesses.
This is not the case for the MAC-based and our proposed implementations, where memory communication constitutes a large proportion of the total latency. The latency of the MAC- and XNOR-based
implementations increases when the number of DSPs is reduced, while the latency of our implementation forms a Pareto-like shape (c.f. Fig. 4.5). The proposed implementation can achieve better performance
(runtime of 2.99 ms) with fewer DSPs compared to the MAC-based implementation with 1024 DSPs
(5.72 ms). Using the proposed method, the total latency for VGG16 on the CIFAR-10 dataset is reduced by
around 2x compared to employing the MAC array accelerator design. Furthermore, the obtained accuracies using both of these approaches are relatively close. The model accuracy when all layers are mapped
using the MAC array accelerator design is 93.04%, while it is 92.26% when layers 2-13
are mapped using the NullaNet method.
Furthermore, Fig. 4.9 shows the performance results of our flow for the LENET-5 network on MNIST,
alongside a comparison with the other two types of implementations. The network has 5 layers with learnable parameters, hence the name LENET-5. It has three sets of convolution layers combined with
max pooling. After the convolution and max pooling layers, there are two fully connected layers. It was
successfully applied to identifying the handwritten digits provided in the MNIST dataset.
As shown in the figure, our proposed implementation achieves better performance for LENET-5 in
terms of inference latency compared to the other types of implementations (e.g., achieving up to 20%
latency improvement for 140 DSPs), while using a comparable amount of resources on the target FPGA. The
reason that our implementation outperforms the XNOR-based implementation in the case of LENET-5 is
that the parallelization potential of the XNOR-based implementation is limited there. To explain the reason in
detail, we first describe the connection between dataflow and spatial loop unrolling in neural network
compilers.
The computational flow for a convolutional layer in CNN can be represented by a six-level nested
loop (seven-level nested loops when considering the iteration over images in a mini-batch) known as a
computational block [96]. We describe the dataflow of an accelerator through the mapping of particular
loops to the parallel computation structures. In other words, the data communication pattern is determined
by which loops are spatially unrolled in hardware, and which are not. In the XNOR-based implementation,
FINN [102] unrolls the input-channel and output-channel loops (a.k.a. the weight-stationary pattern), where
the weights stay and are reused within the same PEs, but the inputs and outputs are spatially broadcast
or accumulated. Since the number of input/output channels is limited in the LENET-5 architecture,
unrolling them and mapping them to parallel computation structures does not improve the performance.
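The loop nest in question can be sketched as follows; the array sizes are illustrative placeholders, and the comments mark which loops a weight-stationary dataflow such as FINN's MVTU unrolls spatially (their trip counts bound the attainable parallelism, which is low for LENET-5).
// Generic six-level loop nest of a convolutional layer (mini-batch loop omitted).
constexpr int K = 16, C = 6, H = 8, W = 8, R = 3, S = 3;

void conv_layer(const float in[C][H + R - 1][W + S - 1],
                const float w[K][C][R][S], float out[K][H][W]) {
    for (int k = 0; k < K; ++k)                     // output channels (spatially unrolled)
        for (int c = 0; c < C; ++c)                 // input channels  (spatially unrolled)
            for (int y = 0; y < H; ++y)             // output rows
                for (int x = 0; x < W; ++x)         // output columns
                    for (int r = 0; r < R; ++r)         // kernel rows
                        for (int s = 0; s < S; ++s)     // kernel columns
                            out[k][y][x] += in[c][y + r][x + s] * w[k][c][r][s];
}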
4.8 Conclusion
In this chapter, we presented a novel design and optimization methodology for the compilation and mapping of fixed-function neural networks to digital signal processors (DSPs) on FPGAs employing a high-level synthesis flow. The proposed methodology maps fixed-function combinational logic blocks to
a set of Boolean functions whose Boolean operations are mapped onto DSP devices rather than look-up tables (LUTs) on the FPGA, in order to take advantage of the high performance, low latency, and parallelism of DSP
blocks.
[Figure: flow chart with stages 1. Pre-processing, 2. Optimizer & Scheduler, and 3. SDAccel Code generator.]
Figure 4.1: A high-level view of a local area of the Xilinx FPGA layout.
[Figure: block diagram showing the DRAM, the input vector, opcode, and address memory buffers and URAMs, the control unit, and the CUs with their register files; data widths of 512, 48, 14, and 6 bits are annotated.]
Figure 4.2: Hardware architecture of the proposed accelerator.
[Figure: g1 = a & b & c & d mapped to two logic levels (Level 1: Subk 1; Level 2: Subk 1).]
Figure 4.3: Illustration example 1.
[Figure: g2 = ((a & d) ^ (b ^ c)) & ((a ^ b) & (c | d)) mapped to three logic levels (Level 1: Subk 1 and Subk 2; Level 2: Subk 1; Level 3: Subk 1).]
Figure 4.4: Illustration example 2.
[Figure: runtime (ms) versus number of DSPs (1000, 500, 250, 180, 100) for the actual and expected (model-predicted) latencies.]
Figure 4.5: A comparison between our proposed model used in the compiler and actual hardware implementation in terms of the achieved performance for layer 7 of VGG16 network.
[Figure: stacked-bar breakdown of the percentage of total CCs spent on memory communications versus computations for 2005, 1000, 500, 250, 120, and 100 DSPs.]
Figure 4.6: The proportional percentage of latency spent in memory communications and computations phases.
[Figure: timeline showing the data movement and computation tasks of successive FFCLs (FFCL 1, 2, 3) overlapped in a pipelined fashion.]
Figure 4.7: The overall dataflow of the proposed implementation.
[Figure: runtime (ms) versus number of DSPs (1000, 250, 180, 100) for the MAC-, NullaDSP-, and XNOR-based implementations.]
Figure 4.8: Comparison between MAC-, XNOR-, and NULLANET-based implementations for VGG16 on CIFAR-10. The main resource for the computations in all three types of implementations is DSP blocks.
[Figure: runtime (ms) versus number of DSPs (250, 140, 100) for the MAC-, NullaDSP-, and XNOR-based implementations.]
Figure 4.9: Comparison between MAC-, XNOR-, and NULLANET-based implementations for LENET-5 on MNIST. The main resource for the computations in all three types of implementations is DSP blocks.
Chapter 5
Algorithms and Hardware for Efficient Processing of Logic-based
Neural Networks
5.1 Introduction
Deep learning models, in particular new models such as transformers [103] and MLPMixers [99], offer state-of-the-art (SoA) performance in various artificial intelligence applications and have surpassed the accuracy
of conventional machine learning models in many challenging domains, including computer vision [55,
116, 44] and natural language processing [43, 28]. Comprising self-attention layers, multi-layer perceptrons
(MLPs), and skip connections, transformers have resulted in numerous breakthroughs on vision tasks [64].
Unfortunately, their computational cost is high. To reduce the transformer model complexity, MLPMixers
[99], which replace the multi-head self-attention module in transformers with a two-layer spatial MLP,
have been introduced. The emergence of deeper and more complex deep neural network (DNN) models,
which has been a key reason for their remarkable results in many application domains, has required huge
computing and memory resources.
To improve the performance and efficiency of network inference, some prior work formulates the
problem of efficient processing of NNs as a Boolean logic minimization problem where ultimately, logic
expressions compute the output of various filters/neurons. NullaNet [74] optimizes a target DNN for a
given dataset and maps all components of the computation in the DNN, including multiplications, additions,
and nonlinear activation functions, to combinational logic blocks, which can be implemented using look-up
tables (LUTs) on FPGAs. This radical optimization method creates truth tables from the enumeration of
input/output feature maps and then generates Boolean logic networks, completely eliminating the use of
arithmetic operations during NN inference. However, once the target FFCL for a specific NN model is
synthesized based on LUT resources (in an FPGA) or a standard cell library (in an ASIC), the synthesized fabric
can only be used for the inference task of that particular NN model with fixed parameters, which makes
ASIC realization of FFCL impractical.
Moreover, because neurons designed for SoA NNs include tens to hundreds of inputs, the generated
Boolean logic expression is huge. For instance, to realize the eighth convolutional layer of the VGG16 network [92] using the approach presented in [74], hundreds of thousands LUTs are needed. Our experiments
show one cannot fit the generated Boolean logic expressions for SoA networks such as VGG16 into a single
commercially available FPGA when LUTs are used to realize the logic expressions.
To address this issue, [89] presents a framework for mapping FFCL to a set of Boolean functions where
Boolean operations in each function are mapped to programmable digital signal processing (DSP) blocks
rather than LUTs. However, we recognized that [89] suffers from high data movement costs because the
data access pattern for each round of computations is highly irregular.
None of the discussed methods is feasible when application-specific integrated circuits (ASICs) are
considered as the target hardware because of the need to implement different logic functions for different
binary neural networks (BNNs), a requirement that can only be met by general-purpose programmability
of a processor or the reconfigurability of an FPGA. A Boolean processor, which is essentially a vector of Boolean
logic units and interconnects wherein each Boolean logic unit can perform the Boolean operations of
any Boolean function associated with an FFCL extracted from a BNN, as introduced in [73, 74], is critical
to doing inference with different BNNs. Therefore, designing efficient Boolean processors, as logic-based
NN inference engines, which can be used in a variety of applications, is highly desirable.
Compilation and scheduling of an arbitrary Boolean logic graph associated with a Boolean function to
be mapped onto a Boolean processor is a challenging task from dual viewpoints of the Boolean processor
design and the compiler design. The compiler needs to group the operations of all gates that can be
executed simultaneously considering hardware resource limitations (i.e., the number of Boolean logic units
per Boolean processor). The next challenge is that each node in logic level i of a given logic graph can be
connected to any node in logic level i − 1 of the graph so that a naive interconnection scheme among the
Boolean logic units may (and will likely) result in significant routing congestion and delay.
To address these challenges in this chapter, we propose a Boolean processor that can be configured for
the inference of various Boolean networks (i.e., various logic graphs). The contributions of the chapter are
as follows:
• We present the novel design of a Boolean processor that can process large Boolean functions. It
includes a programmable under-provisioned multicasting switch network, which enables the fast and
scalable dataflow required for processing logic graphs.
• We present an innovative optimization methodology for compiling and mapping BNNs utilizing
FFCL into this Boolean processor. The proposed compiler generates customized instructions for
static scheduling of all operations of the logic graph during inference.
• We demonstrate the parallelization of multiple logic processing units (LPUs) within a Boolean processor and a scheduling algorithm to increase the throughput.
• We present a circulation strategy and a routability-aware graph partitioning algorithm to efficiently
deal with very large logic graphs.
• Our experimental evaluations across several datasets and NNs demonstrate the superior performance of our framework in terms of inference latency and power efficiency compared to prior-art BNN and full-precision accelerators. We achieve 263x higher throughput (frames per second) on
a VGG-like model compared with the fastest SoA BNN accelerator, while outperforming it by 10.8x
in energy per frame. We run MLPMixer models 42x faster, on average, on the proposed Boolean
processor in comparison to XNOR- and MAC-based accelerators.
The remainder of the chapter is organized as follows. Section 5.2 presents prior work on BNNs. Section
5.3 defines terminology and notation used throughout the chapter while Section 5.4 outlines the proposed
design flow. The hardware architecture and proposed switch network are presented in Section 5.5. Section
5.6 introduces the compiler design while Section 5.7 reports our results benchmarking well-known NNs
including VGG16 and MLPMixers. The chapter is concluded in Section 5.8.
5.2 Prior Work
Many prior-art references, including, for example, XNORNet [78] and BinaryNet [23], advocate network
training with binary weights and activations to speed up the inference phase. Binary weights and activations reduce the hardware complexity and result in substantially lower inference latency and energy consumption, although they have no effect on the total number of operations that must be performed in order
to do inference. For example, a single XNOR gate can function as a multiplier in binary networks, whereas
the accumulation step of a typical convolution operation may be replaced by a low-cost popcount operation [2]. Additionally, by minimizing the number of bits per loaded item (weight or activation), the off-chip
memory bandwidth and the on-chip memory footprint are greatly reduced, resulting again in speed and
power efficiency improvements. Although BNNs are fairly well-suited to run on current general-purpose
computing platforms, dedicated and low-power hardware accelerators can more fully exploit the potential
latency reduction and energy savings offered by such aggressively quantized models [70].
Figure 5.1: An example of neuron realization with FFCL (neuron model, input/output enumeration truth table, and realized FFCL block). Weights are shown on edges; without loss of generality, a step-function nonlinearity is assumed with a threshold value of 1.
A highly efficient approach, based on the idea of converting certain layers of a target DNN to fixed-function combinational logic blocks followed by the mapping of these blocks to LUTs, was presented in
NullaNet [74] and elaborated in LogicNets [101]. More precisely, NullaNet [74] records different input
combinations and their corresponding output values for each filter/neuron in a BNN, constructs a truth
table for each such filter/neuron, and optimizes the truth tables, thereby replacing the MAC operations
involving weights and activations with Boolean logic expressions involving activations only. In this way,
all weight lookups are eliminated while expensive convolution operations in vision models are replaced by
simple Boolean function approximations. An example of the NullaNet approach is demonstrated in Fig. 5.1.
In most cases, including those used in this study, the average accuracy drop for the binary implementation
is less than 4%. However, a programmable Boolean logic-based inference engine for BNNs is lacking. In
particular, when targeting an ASIC implementation of a BNN inference accelerator, it is necessary to develop
an array of general-purpose Boolean logic processing elements that can work together to perform the logic
operations of the aforesaid FFCLs in a highly efficient way. A taxonomy overview is shown in Table 5.1.
5.3 Terminology and Notation
This section provides the terminology and notation used throughout this chapter:
Table 5.1: A taxonomy overview of existing neural network accelerators
Accelerator | Precision | Memoryless | Fabric | Programmability
ChewBaccaNN [2] | Binary | - | Arithmetic-based | ✓
Stripes [50]; UNPU [59]; DaDianNao [14] | Mixed | - | Arithmetic-based | ✓
Gemmini [37] | Full | - | Arithmetic-based | ✓
LogicNets [101]; NullaNet [74] | Binary | ✓ | Logic-based | -
Our work | Binary | ✓ | Logic-based | ✓
• Fixed-function combinational logic (FFCL) block: A netlist of a pure combinational logic circuit
written in a hardware description language, such as Verilog.
• Logic processing element (LPE): A programmable block performing (two-input) logic operations
such as AND, OR, XOR, etc.
• Logic processing vector (LPV): Hardware containing a fixed number of LPEs. LPVs are linearly
ordered relative to each other.
• Logic processing unit (LPU): Hardware containing a fixed number of LPVs. A Boolean processor
consists of an LPU or several bus-connected LPUs.
• Maximal feasible subgraph (MFG): A directed acyclic graph (where nodes are Boolean operations
and edges are data dependencies) that can be mapped onto an LPU. An MFG is the maximal subgraph
greedily extracted from an FFCL without exceeding the LPU capacity. In other words, when mapping
each logic level of an MFG to the corresponding LPV, the number of nodes in each level must not exceed
the number of LPEs within the LPV. The given FFCL is decomposed into a tree of MFGs by
iteratively applying the MFG extraction process presented in Section 5.6.
• Logic level of a gate: The maximum number of gates on any path from the primary inputs (PIs) of the
circuit to the gate. In this chapter, we use logic level and logic depth interchangeably (a small levelization sketch follows this list).
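As an illustration of this levelization, the following minimal sketch computes logic levels by a longest-path traversal; it assumes the netlist is given as a gate-to-fanin-list dictionary, which is an illustrative representation rather than the framework's actual data structure.

from functools import lru_cache

def levelize(fanins):
    """Compute the logic level of every gate in a combinational netlist.

    fanins: dict mapping each gate name to a list of its fanin gates;
            primary inputs appear with an empty fanin list.
    Returns a dict mapping each gate to its logic level (PIs are level 0).
    """
    @lru_cache(maxsize=None)
    def level(gate):
        preds = fanins[gate]
        if not preds:          # primary input
            return 0
        # logic level = max number of gates on any PI-to-gate path
        return 1 + max(level(p) for p in preds)

    return {g: level(g) for g in fanins}

# Example: two PIs feeding an AND gate that feeds a second gate
print(levelize({"a": [], "b": [], "g1": ["a", "b"], "g2": ["g1", "b"]}))
# -> {'a': 0, 'b': 0, 'g1': 1, 'g2': 2}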
5.4 Proposed Design Flow
The overall flow of the proposed framework is as depicted in Fig. 5.2. The input of the flow is a description
of an FFCL block in the Verilog language. Please note that the framework can be structured to accept any
specification of an FFCL block as the input. Yosys [110] and ABC [10] can be used to generate synthesizable
Verilog code from any behavioral specification. NullaNet [74] generates the FFCL block in Verilog format
Figure 5.2: An overview of the proposed framework.
and it is used as the upstream engine. We synthesize the circuit using standard logic optimization
techniques, primarily aimed at reducing the total gate count and depth of the circuit, and map the circuit
to a customized cell library containing 2-input logic gates supported by the LPE opcodes specified in Table
5.2. Next, the mapped circuit is levelized according to the definition provided in Section 5.3. The logic
synthesis and levelization are the same as those presented in [89].
After levelization, each gate in the mapped gate-level netlist is associated with a certain logic level, and
full path balancing is achieved by inserting relay nodes (in the Boolean processor, these are realized by
LPEs acting as BUFFERs) such that an equal number of nodes exists along all propagation paths between
any two connected nodes (i.e., different paths have the same topological length). Full path balancing guarantees that no data dependencies exist between two non-adjacent logic levels of gates, resulting in a deep-pipelined
design. Because a gate that belongs to a specific logic level in a target circuit has no connections (i.e.,
dependencies) to any other gates at the same logic level, the operations of all gates at the same logic level
can be executed simultaneously. However, these operations may have to be assigned to different compute
cycles due to hardware resource limitations, i.e., the fixed number of LPVs per LPU (the depth issue) or
the fixed number of LPEs per LPV (the width issue). Multiple LPUs can be assembled in a parallel or series
configuration for large graphs to complete the required computations for the given logic graph at extra
area/power cost. The parallel configuration will be explained in Section 5.5.3 (serially connected LPUs are
not explored in this work). Our compiler has the ability to map any logic graph to an arbitrary-size LPU.
To handle the width issue, the compiler decomposes the filter/neuron functions into MFGs, each of which
is then mapped onto the LPU one after the other. The details of this algorithm and the scheduling of these
MFGs will be discussed in Section 5.6.
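As a concrete illustration of the full path balancing described above, the sketch below inserts relay (BUFFER) nodes so that every fanin edge spans exactly one logic level. It assumes the same gate-to-fanin-list representation and the levelize() helper from the earlier sketch; these are assumptions for illustration, not the framework's actual implementation.

def path_balance(fanins, levels):
    """Insert BUFFER (relay) nodes so every fanin edge spans exactly one level.

    fanins: dict gate -> list of fanin gate names (PIs have empty lists).
    levels: dict gate -> logic level, e.g., produced by levelize().
    Returns new (fanins, levels) with relay nodes named '<src>_to_<dst>_bufK'.
    """
    new_fanins = {g: list(f) for g, f in fanins.items()}
    new_levels = dict(levels)
    for gate, preds in fanins.items():
        for i, src in enumerate(preds):
            gap = levels[gate] - levels[src]
            prev = src
            # add one BUFFER per skipped level so all paths have equal length
            for k in range(1, gap):
                buf = f"{src}_to_{gate}_buf{k}"
                new_fanins[buf] = [prev]
                new_levels[buf] = levels[src] + k
                prev = buf
            new_fanins[gate][i] = prev
    return new_fanins, new_levels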
The other challenge is that each node in the current logic level can be connected to any node of the
preceding logic level (note that the circuit graph is fully path balanced). So, naive routing between all
nodes is not possible. We propose a programmable non-blocking multicasting switch network, which can
address this challenge (cf. Section 5.5.2). Once the MFGs implementing an FFCL block are determined, the
compiler schedules them and generates instructions for programming the switch network to enable the
propagation of output values of the nodes in the current logic level to inputs of the nodes in the succeeding
logic level.
5.5 Boolean Processor Architecture
The architecture of the Boolean processor is shown in Fig. 5.3. The Boolean processor is a data-driven
architecture in the sense that the streaming data is received and processed by multiple stages of LPVs
without having to store any intermediate results in some scratchpad memories. More precisely, a Boolean
processor comprises a set of LPVs that are linearly ordered. Each LPV contains m LPEs, each of which
receives two inputs and produces one output. Therefore, each LPV receives up to 2m input operands and
produces a vector of up to m output results. To support logic operation packing and increase hardware
efficiency, each operand has a width of 2m bits, which translates into 2m Boolean variables or m 4-valued
logic variables, etc. In the case of processing an FFCL block extracted from a convolutional neural network
(CNN) model, the 2m bits of data come from different patches of an input feature volume in a CNN or from
different images for batch-based inference tasks.
The number of LPEs per LPV and the number of LPVs per LPU determine the size of the FFCL block
that can be processed by an LPU. With the parameter values described previously, an LPU can process a
logic block with a maximum width of m and a maximum depth of n, where the width refers to the number
of logic operations at any logic depth in the graph and the depth refers to the logic depth from any graph
inputs to any graph outputs.
Because of the data dependency between two consecutive logic levels of a Boolean graph, the intuitive
way of computing a Boolean graph is to process the entire predecessor logic level before starting to process
the next level, i.e., level-by-level execution. When mapping a certain logic level of the Boolean network
onto a certain LPV, considering the tight hardware budget and the desired high throughput, it is impractical to
use scratchpad memory to store temporary results (i.e., results that are produced but not yet consumed)
between two logic levels (or between two LPVs). To solve this problem, we propose a novel MFG-by-MFG execution paradigm and scheduling algorithm, detailed in Section 5.6, that make use of
Figure 5.3: The LPU architecture.
distributed snapshot registers to store temporary results as shown in Fig. 5.3. In addition, considering highly
unpredictable connection patterns between logic levels within a Boolean network, an under-provisioned
multi-stage (i.e., five-stage) interconnect (Fig. 5.4) that performs permutational multi-cast circuit switching
between two adjacent LPVs is proposed.
5.5.1 Programmable LPE unit
As shown in Fig. 5.3, each LPE contains a logic unit where an elementary Boolean operation can be performed, and two snapshot registers where each of the LPE inputs can be temporarily stored. Each LPE takes
a 9-bit instruction as defined in Table 5.2.
Table 5.2: Instruction set specification (9-bit LPE instruction)
Bit field | Meaning
bit[8:7] (Opcode) | 10: AND; 11: OR; 01: XOR/XNOR; 00: BUFFER/NOT
bit[6:5] (Opcode modifier) | for AND/OR: 00: raw, 01: ¬inputA, 10: ¬inputB, 11: ¬output; for XOR/XNOR: 0: XOR, 1: XNOR (other bit not used); for BUFFER/NOT: 0: BUFFER, 1: NOT (other bit not used)
bit[4] (Data Src A) | 0: LPE input A; 1: snapshot reg A
bit[3] (Data Src B) | 0: LPE input B; 1: snapshot reg B
bit[2] (Dest Reg A) | 0: hold data; 1: snapshot data
bit[1] (Dest Reg B) | 0: hold data; 1: snapshot data
bit[0] (Valid bit) | 0: invalidate; 1: keep (in)valid
In the instruction set specification shown in Table 5.2, the most significant two bits of the Opcode (i.e.,
bit[8:7]) specify the type of Boolean operation. Notice that two types of logic operations can be performed,
namely a multiple-input single-output (MISO) operation (AND, OR, XOR/XNOR) and a single-input single-output (SISO) operation (NOT/BUFFER). BUFFER is introduced to achieve full path
balancing. The next two bits (i.e., bit[6:5] of the Opcode) specify an optional inversion on any of the following:
input A, input B, or the output of the logic unit. By enumerating all possible combinations of the Opcode bits, the logic
operations that LPEs can perform are: ¬A, ¬B, A ∧ B, ¬A ∧ B, A ∧ ¬B, ¬(A ∧ B), A ∨ B, ¬A ∨ B,
A ∨ ¬B, ¬(A ∨ B), A ⊕ B, and ¬(A ⊕ B).
The higher two bits of Data Src&Dest (i.e., bit[4:3]) specify the two inputs of the logic unit inside the
LPE: whether they come from the inputs of the LPE or from the snapshot registers of that LPE. The remaining Data
Src&Dest bits (i.e., bit[2:1]) determine the destination of the LPE inputs, namely whether they are captured by the pair
of snapshot registers; this also determines the lifespan of a value in the snapshot registers. Bit[0] determines
whether the output of the LPE needs to be invalidated, the reason for which will be explained in Section 5.6 as part
of the depth issue handler.
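To make the encoding of Table 5.2 concrete, the minimal sketch below decodes a 9-bit LPE instruction into its fields. The field names, enum labels, and the assumption that the low modifier bit selects XNOR/NOT are illustrative interpretations of the table rather than the actual RTL interface.

def decode_lpe_instruction(word):
    """Decode a 9-bit LPE instruction per the layout of Table 5.2 (illustrative)."""
    assert 0 <= word < (1 << 9)
    op = (word >> 7) & 0b11        # bit[8:7]: 10 AND, 11 OR, 01 XOR/XNOR, 00 BUFFER/NOT
    mod = (word >> 5) & 0b11       # bit[6:5]: optional inversion / sub-opcode
    if op in (0b10, 0b11):         # AND/OR: 00 raw, 01 !A, 10 !B, 11 !out
        opcode = "AND" if op == 0b10 else "OR"
        invert = {0b00: None, 0b01: "inputA", 0b10: "inputB", 0b11: "output"}[mod]
    elif op == 0b01:               # XOR family: low modifier bit selects XNOR (assumed)
        opcode = "XNOR" if (mod & 1) else "XOR"
        invert = None
    else:                          # BUFFER family: low modifier bit selects NOT (assumed)
        opcode = "NOT" if (mod & 1) else "BUFFER"
        invert = None
    return {
        "opcode": opcode,
        "invert": invert,
        "src_a": "snapshot_reg_A" if (word >> 4) & 1 else "lpe_input_A",  # bit[4]
        "src_b": "snapshot_reg_B" if (word >> 3) & 1 else "lpe_input_B",  # bit[3]
        "snap_a": bool((word >> 2) & 1),   # bit[2]: capture input A into snapshot reg A
        "snap_b": bool((word >> 1) & 1),   # bit[1]: capture input B into snapshot reg B
        "valid": bool(word & 1),           # bit[0]: 0 invalidates the LPE output
    }

# Example: AND with inverted output, operands taken from the snapshot registers, output kept valid
print(decode_lpe_instruction(0b10_11_11_00_1))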
The Boolean network has multiple inputs and a single output, and there is a single Boolean function per
node in the Boolean network. It is worth noting that even though each LPE has two output ports, only one
is in use due to the single-output nature of a Boolean gate. Boolean networks are characterized as having
multiple fanout. We use the proposed multicast network to replicate the LPE output and deliver the copies
to the next level through multiple pipeline stages. Therefore, there is no need to use the second output port
of an LPE to replicate the output that the mapped Boolean gate generates. Furthermore, the under-provisioned
interconnection network does not mathematically guarantee strictly non-blocking multicast
capability. More details about the multicast blockage issue are given in [115]. This chapter considers
a more practical re-transmission technique to address the multicast blockage, which will be discussed in
Section 5.6.
5.5.2 Programmable non-blocking multicast switch network
To speed up the data passing in between a parent (predecessor) logic level and a child (successor) logic
level so as to increase the overall computational throughput, a statically configured circuit switching network is considered. A crossbar switch is the simplest single stage switch network providing simultaneous
connections and multi-casting capability. However, the corresponding hardware cost of crossbar becomes
unacceptably large as the number of inputs to the crossbar increases. It is known that the hardware cost
of crossbar switch grows in O(n
2
).
In an effort to reduce the hardware cost, C. Clos [21] presented an instance of a multistage switch
network. This work was generalized in [65]. As stated in [115], multistage networks can be classified into
3 categories:
• Strictly nonblocking: Guarantees that any new multicast request which initiates a connection
from one input to a number of idle outputs can always be satisfied without disturbing the existing
connections;
• Wide-sense (algorithm-dependent) nonblocking: Guarantees any new multicast request can be
satisfied without disturbing the existing connections if the paths are carefully selected so as not to
block any future connections;
• Rearrangeable nonblocking: Guarantees any new multicast request can be satisfied when paths
of existing connections can be rerouted to enable finding a path for the said new request.
A symmetrical three-stage switch network (comprising a first ingress stage, a middle stage, and a final
egress stage) may be notated as (m, n, r) [65, 115]. Parameters m, n, and r refer to the number of switch
modules in the middle stage, the number of inputs of each ingress-stage switch module, and
the number of ingress/egress-stage switch modules, respectively. Each switch module is a crossbar switch,
which provides strictly nonblocking multicast capability. The hardware cost, which is proportional to the
cross point count, is primarily determined by m.
As proved by Y. Yang and G. Masson [115], the condition to have a 3-stage wide-sense nonblocking
multicast switch network is the following: m > min_{1 ≤ x ≤ min{n−1, r}} {(n − 1)(x + r^{1/x})}. The hardware
complexity grows according to O(n · log r / log log r). In practice, however, the described nonblocking multicast
switch network is rarely implemented due to being even costlier than a single-stage crossbar switch under
a relatively small scale of the switched fabric.
Our Boolean processor implements a 5-stage switch network (see Fig. 5.4), which meets the rearrangeable nonblocking condition for unicast traffic, given as m ≥ n by Slepian–Duguid [6]. Furthermore, the network achieves an extremely low complexity at the expense of providing limited multicast
capability, i.e., simultaneously routing a number of high-fanout connections may cause network blockage.
Mathematically speaking, the proposed switch network does not guarantee multicast nonblocking
capability, but we can still use it for sparse multicast traffic as long as a re-transmission technique is
implemented. It contains an ingress stage, a three-substage middle stage, and an egress stage. Specifically,
the ingress and egress stages each consist of sixteen 4 × 4 switch modules, and the middle stage consists
of four 16 × 16 switch modules, each of which is recursively constructed in 3 substages as shown (the four
16 × 16 switch modules are shown in yellow, orange, green, and purple, respectively). The network is
notated by (m = 4, n = 4, mm = 4, nn = 4, rr = 4, r = 16), where mm, nn, and rr denote the parameters
of the four 16 × 16 middle-stage switch modules. By using the proposed 5-stage 1280-cross-point switch,
a large hardware cost reduction can be achieved compared to the single-stage 4096-cross-point crossbar
while having nearly the same throughput in both cases (see Section 5.7.3). The linear routing algorithm
of Y. Yang and G. Masson [115] is adopted and implemented as part of our compiler design, as explained
in Section 5.6.
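As a back-of-the-envelope check of the under-provisioning claim, the sketch below evaluates the Yang–Masson wide-sense nonblocking bound quoted above for the outer (n = 4, r = 16) stage and compares it with the m = 4 middle modules actually used; it merely restates the formula and is not part of the compiler or the hardware.

import math

def yang_masson_min_m(n, r):
    """Smallest integer m satisfying m > min_{1<=x<=min(n-1,r)} (n-1)*(x + r**(1/x))."""
    bound = min((n - 1) * (x + r ** (1.0 / x)) for x in range(1, min(n - 1, r) + 1))
    return math.floor(bound) + 1

if __name__ == "__main__":
    required = yang_masson_min_m(n=4, r=16)   # outer stage of the 64x64 network
    print(f"wide-sense nonblocking multicast needs m >= {required}, design uses m = 4")
    # -> wide-sense nonblocking multicast needs m >= 17, design uses m = 4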
Figure 5.4: Architecture of 64 × 64 5-stage interconnection network.
Based on our experiments on a wide range of Boolean networks, the proposed 5-stage switch network
can handle nearly all Boolean-network multicast traffic patterns. Because the non-blocking condition is
not met in our design, a re-transmission technique may be needed from time to time. The idea of re-transmission is implemented in the compiler's repartitioning phase, which is discussed in Section 5.6. The
5-stage switch network is implemented as a 5-stage pipeline. Consisting of sixteen 4 × 4 crossbars, each stage
requires a minimum of 16 × (4 × log₂ 4) bits to program. A group of 5 instruction queues for configuring
the 5-stage switch network is depicted as part of the instruction queue array in Fig. 5.3.
5.5.3 Parallelization of multiple LPUs
To further accelerate the inference of BNNs, we developed a multi-LPU design where same-size LPUs are
connected through a bus architecture. The parallelization of the hardware is exploited through the distribution of the workload. Fig. 5.5 shows an example of connecting 4 LPUs. More details about the scheduling
are provided in Section 5.6.3.
Figure 5.5: An example of parallelizing 4 LPUs (the LPUs are connected through a shared bus managed by a bus state controller).
5.6 Compiler
A compiler tailored to the Boolean processor described in Section 5.5 is adopted from [22]. The compiler
parses a technology-mapped Verilog netlist to extract the set of operations that are carried out at each
logic level of the circuit netlist, creates a DAG to represent these gate operations and their directional data
dependencies, then decomposes this DAG into MFGs such that each MFG can be fitted into the given set
of LPUs. The compiler also determines the schedule of running these MFGs on the underlying hardware.
The implemented partitioning algorithm is a hierarchical breadth-first search (BFS) over the given
graph to find all MFGs. The BFS first starts from the primary output to find the root MFG. After
that, the traversal continues by finding MFGs rooted at the input nodes of the just-extracted MFG in a similar
procedure, forming a tree of MFGs. The traversal continues until we reach the PIs of the Boolean network.
The procedure to construct an MFG rooted at some node V keeps adding nodes to the MFG (again in BFS
traversal order) until it reaches a logic level in its transitive fanin cone that has more nodes than the
number of LPEs within an LPV.
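A simplified sketch of this greedy MFG extraction is given below. It assumes a fully path-balanced netlist represented as a gate-to-fanin-list dictionary (an illustrative data structure, not the compiler's actual one) and captures only the conformity check, not the post-partitioning merging or scheduling steps.

def extract_mfg(root, fanins, lpe_per_lpv):
    """Greedily grow an MFG rooted at `root` over its (path-balanced) fanin cone.

    Descends one logic level at a time and stops as soon as the next level
    would need more nodes than one LPV has LPEs (the conformity condition).
    Returns (mfg_nodes, frontier): `frontier` holds the input nodes at which
    child MFGs must be rooted next.
    """
    mfg = {root}
    current = {root}
    while True:
        # because of full path balancing, all fanins of the current level
        # sit exactly one logic level below it
        next_level = {p for g in current for p in fanins[g]}
        if not next_level:                     # reached the primary inputs
            return mfg, set()
        if len(next_level) > lpe_per_lpv:      # conformity would be violated
            return mfg, next_level             # child MFGs are rooted here
        mfg |= next_level
        current = next_level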
Obviously, some multicast connections may be unroutable in a given cycle because the existing
connections have used up the limited number of paths connecting the ingress/egress stages to/from the middle
stage. The reason is that our partitioning algorithm aims to greedily incorporate more nodes into
an MFG, resulting in a higher LPE utilization rate in an LPV and denser traffic streaming into the switch
network. This increases the chances of encountering blockages during routing but also reduces
the number of MFGs. To overcome this blockage problem, a re-transmission mechanism is implemented.
The re-transmission of data determines the order of transmission, meaning that a group of nodes is allowed
to propagate its outputs to the next stage, whereas some other group has to wait for the next clock cycle
when all paths become idle before it can propagate its outputs to the next stage.
For most of the Boolean networks generated by NullaNet [74], the mapped netlists exceed the
predetermined number of logic levels (i.e., the LPV count) and contain many more than the predetermined number of nodes per logic level
(i.e., the number of LPEs per LPV). We refer to the former problem as the depth issue and the latter as the
width issue; both are addressed in the compiler with the aid of the snapshot registers that exist in the
hardware.
5.6.1 Addressing the width issue
Consider a fully path-balanced directed acyclic Boolean network G = (V, E) and a sub-graph H = (V′, E′) of it. The logic levels of the gates of a sub-graph H range from L_bottom(H) to L_top(H). Now, given a partitioning
solution P of graph G such that P consists of |P| sub-graphs, we have:

∀H ∈ P: 0 ≤ L_bottom(H) ≤ L_top(H) ≤ L_max

where 0 and L_max denote the levels of the primary inputs and primary outputs of G, respectively. We denote
the set of nodes in a logic level l by nodes(l). Moreover, given a set of nodes S, let inputs(S) denote the
set of distinct nodes that feed into S. An MFG H is a sub-graph of the Boolean gate network
that satisfies the following properties:
• Self-containedness:
  ∀l ∈ [L_bottom(H) + 1, L_top(H)]: inputs(nodes(l)) ⊆ H   (5.1)
• Conformity:
  ∀l ∈ [L_bottom(H), L_top(H)]: |nodes(l)| ≤ m   (5.2)

where m denotes the number of LPEs per LPV. The MFG extraction algorithm explained before stops incorporating nodes into an MFG when conformity is no longer met, i.e., when |inputs(nodes(L_bottom(H)))| > m.
A post-partitioning greedy MFG merging heuristic is performed to minimize the number of MFGs. Finally, a tree of MFGs, denoted as P, is obtained from the original DAG. The BFS automatically guarantees
the self-containedness property, wherein no incoming connections to nodes inside an MFG are
allowed unless they are bottom-level nodes whose inputs(L_bottom) come from parent MFGs. Note
that MFGs can overlap (i.e., contain the same nodes), but there are no interdependencies between overlapped MFGs (i.e., no connection from nodes within [L_bottom(H) + 1, L_top(H)] to nodes within
[L_bottom(H′) + 1, L_top(H′)]).
The MFGs can be scheduled and processed sequentially, with the snapshot registers storing the intermediate results generated by nodes(L_top) of each MFG. In a pipelined manner, each MFG requires
L_top − L_bottom + 1 LPVs for its computation. We number each LPV stage as LPV(x) from
input to output sequentially, as shown in Fig. 5.3. Precisely, the computational resources allocated to MFG
H are LPV(x) for x ∈ [L_bottom(H), L_top(H)]. In a given clock cycle, LPV(x) performs the same level l
of different MFGs (i.e., x = l) with a different set of Boolean operations (i.e., nodes(l)). The compute cycle
count is (L_top − L_bottom + 1) × t_c clock cycles, where t_c is the sum of one cycle for computation
within an LPE and t_sw cycles for data routing (steering) in the proposed switch network (t_sw = 5 cycles
for the 5-stage switch network).
The scheduling algorithm shown in Algorithm 5 determines the memory locations for writing the instructions of each MFG. Each MFG that satisfies L_bottom ≠ 0 has at least two child MFGs supporting
all the nodes in nodes(L_bottom). For an MFG H, we refer to the child MFGs as H_c = {H_c1, ..., H_cn}. Each of
the MFGs in H_c generates a subset of inputs(nodes(L_bottom(H))); therefore, we have ⋃_{i=1}^{n} nodes(L_top(H_ci)) =
inputs(nodes(L_bottom(H))). Also, L_top(H_ci) is the same for all child MFGs H_ci. We refer to the last-scheduled child MFG that generates a subset of the parent's input nodes (without having to store them
in the snapshot registers) as the most recent child. As shown in Fig. 5.6a, for instance, MFG F has three
child MFGs: MFGs C, D, and E. The most recent child of MFG F is MFG E, which means that MFG F's
computation is launched immediately after MFG E finishes its computation at L_top and passes the data
to the next LPV (i.e., LPV(4)). The other two children, MFGs C and D, finished their computations earlier and stored their outputs in the vector of snapshot registers associated with LPV(4). The schedule of
the aforementioned MFGs is shown in Fig. 5.6b. Different logic levels of the same MFG are encircled and
marked with the same color. The dotted lines represent the data dependencies between MFGs.
In the LPU, an LPV stage and the five stages of the subsequent switch network form a block that needs
a dedicated instruction queue, making the instruction queue array a 6-stage design where each stage
gets its read address from its predecessor. The queue is accessible through a read-address shift register.
The instructions corresponding to an MFG are written to the same address in all instruction queues
associated with all LPVs involved in the computations of the given MFG. For instance, the instructions for
computing MFG F, highlighted in pink in Fig. 5.6b, are written to memLoc3 of the instruction queues
associated with LPV(4)–LPV(7), as shown in Fig. 5.6c, where the color of a memory location corresponds to
the color of the MFG (cf. Fig. 5.6a).
Figure 5.6: A running example for partitioning, scheduling and instruction queue configuration. a) partitioning where L_max = 7, b) scheduling of the MFGs, and c) instruction queue configuration corresponding to the scheduled MFGs. Note: Ci denotes clock cycle i.
Algorithm 5 Scheduling algorithm
Input: A set of MFGs
Output: Memory locations to which the instructions are written
1: memLoc = INT_MAX, stack = []
2: topMFG.memLocation() = memLoc, stack.append(topMFG)
3: while stack not empty do
4:     curMFG = stack.pop()
5:     curMFG.memLocation() = memLoc
6:     if curMFG.parents ≠ None then
7:         for childMFG in curMFG.children() do
8:             stack.append(childMFG)
9:     else
10:        memLoc = memLoc - 1
11: for all MFG do
12:     MFG.memLocation() = MFG.memLocation() - memLoc
All MFGs with L_bottom = 0 obtain the primary input (PI) values from the input buffer. The compiler
ensures that the required PI values are properly stored in different locations of the input data buffers such
that the desired data is accessed correctly every cycle using a counter. This scheme simplifies the address
generation compared to a random-access addressing scheme. Therefore, scheduling the MFGs is equivalent
to arranging the instructions and placing them into the proper memory locations.
5.6.2 Addressing the depth issue
For an LPU with a fixed LPV count, a circulation mechanism must be implemented to resolve the depth
issue as described earlier. The depth issue is left to be handled by the LPU hardware. The compiler is only
responsible for detecting the potential depth issue and rearranging the instructions accordingly (cf. Fig.
5.7c).
To further explain the depth issue handler, the input data buffer and the output data buffer have to be
elaborated. For example, for an LPU with 32 LPEs per LPV, a 64-banked input data buffer is needed to store
the 64 primary inputs (PIs) of the FFCL block. We refer to the border between the PIs needed by different FFCLs
as a context switch. To differentiate FFCLs, a 2-bit tag is attached to each chunk of data in the PIs. The
context switch is detected at the last switch network; it is worth noting that the last switch network is
placed before the output data buffer and after the last LPV. If an MFG has an L_top that is larger than the LPV
count of a fixed-size LPU, the output data buffer acts as the snapshot registers for LPV L_top + 1,
which in reality does not exist but is alternatively realized by LPV 0 through the circulation
mechanism. The instruction for programming LPV L_top + 1 is relocated to the instruction queue associated
with LPV 0 (see Fig. 5.7c). Thus, the circulation mechanism ensures that no instruction bits have to be altered,
simplifying the compiler.
For each PI stored in the input data buffer, a valid bit is assigned, making the data block within the buffer
1 bit wider than the actual data width. LPEs are responsible for invalidating the input data, as
specified by opcode[0], in case an LPE is not utilized during a certain time. Considering Figs. 5.7b and
5.7c, the last few LPVs are not utilized much of the time; therefore, they keep producing invalid values
during these idle times. For example, in Fig. 5.7c, MFG A only generates valid results for memLoc0 in LPV0,
LPV1, LPV2, and LPV3. Furthermore, a valid-bit popcount stage is implemented as shown in Fig. 5.3. The
popcount is a counter that counts the number of 1's in a vector of 0/1 values. The valid-bit popcount makes the
Figure 5.7: A running example for partitioning, scheduling and instruction queue configuration with a depth issue
handler. a) partitioning where Lmax = 10, b) scheduling of the MFGs with a circulation mechanism, and c) instruction queue configuration corresponding to the scheduled MFGs with the instruction relocation. Note: Ci denotes
clock cycle i.
hardware aware of whether the chunk of data streaming out of the pipeline needs to be stored in the output
data buffer. For example, as shown in Fig. 5.7b, the output data buffer scans all 64 bits of a data chunk and
decides not to store this data because the valid-bit popcount keeps returning 0 until the outputs of J10
emerge from the switch network that follows LPV 10.
In addition to the context switch detector, the compiler should notify the hardware when it is time
to circulate the intermediate data stored in the output data buffer back to LPV 0. Such data goes
through the pipeline for another round of computation to complete the computation of all logic levels
of the given FFCL. We refer to the number of times the Boolean network corresponding to the given
FFCL needs to travel through the LPU as the round count. The round count is calculated by the compiler as
Round_count = ⌈L_max / LPV_count⌉.
An example of the depth issue is presented in Fig. 5.7b, mapping the same set of MFGs of the running
example (cf. Fig. 5.7a) to a smaller LPU, limiting the LPV count to 6. The circulation mechanism is enabled
when round_count ≠ 0 and a context switch are both detected. When the circulation criterion is asserted, the
depth-issue control unit reads multiple chunks of intermediate values from the output data buffer and feeds
them back to LPV 0.
5.6.3 Utilizing multiple LPUs
The parallelization of multiple LPUs is exploited through the distribution of the workload. Given that
the computations of different neurons within a layer are independent of each other, we may parallelize the
execution of different neurons. For this purpose, we propose Algorithm 6, where we map a set of Boolean
graphs to an arbitrary number of LPUs. First, we sort the Boolean graphs corresponding to the neurons based
on their estimated initiation interval (II), i.e., the interval for sending a new set of data (lines 3–5). We aim
to balance the workload of each LPU. Hence, we statically dispatch the sorted Boolean graphs onto different LPUs
through a bus architecture in a round-robin order (lines 8–10).
The increased II due to the pipeline flushing caused by the "depth issue" (as shown in Fig. 5.7b) can be
mitigated by passing the intermediate results in the output buffer to other LPUs through the bus (lines 13–17).
In other words, after the first round of computations of all Boolean graphs (as defined
in Section 5.6.2), the compiler checks the II corresponding to the second round of each Boolean graph and
tries to equalize the workload by moving the residual computations of the Boolean graph to the LPUs
with the lowest workloads. An example of this procedure is shown in Fig. 5.8. In this figure, parallelograms
represent groups of MFGs, as shown in Fig. 5.7b, with different IIs. There are three Boolean graphs mapped
to LPU1 in the first round of computation; the round_workload values associated with them are [6,1], [14,5],
and [30,14,1], respectively. Note that the compiler only moves small workloads (i.e., when the II of
that computation round for a Boolean graph is less than 3, a threshold we determined empirically) to avoid
possible congestion on the communication bus. For instance, it moves the second round of the Boolean graph
with [6,1] to LPU2 to alleviate the workload on LPU1. However, this comes at the cost of a higher memory
footprint for storing a larger amount of intermediate values.
Algorithm 6 Scheduling MFGs to multiple LPUs (pseudo code)
Input: G: a set of Boolean graphs corresponding to a layer of the given NN; LPU_cnt: number of available LPUs; moving_th: the II threshold for moving workloads
Output: Schedule of the mapping of MFGs to LPU_cnt LPUs
1: schedule_dict = {}, i = 0
2: max_round = 0
3: for each graph g in G do
4:     g.workload_sum = sum(g.round_workload)
5:     max_round = max(max_round, len(g.round_workload))
6: sortedG = sort(G, workload_sum)
7: for each graph g in sortedG do
8:     schedule_dict[i % LPU_cnt].append(g.round_workload[0])
9:     i = i + 1
10: for round from 2 to max_round do
11:     i = 0
12:     for each graph g in sortedG do
13:         if g.II_round_set[round] < moving_th then
14:             schedule_dict[minarg(schedule_dict)].append(g.round_workload[round])
15:         else
16:             schedule_dict[i % LPU_cnt].append(g.round_workload[round])
17: return schedule_dict
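One possible Python rendering of Algorithm 6 is sketched below. The BooleanGraph attributes (round_workload, II_round_set) mirror the pseudocode, and the exact sort order, load-tracking, and index handling are assumptions of this interpretation rather than the framework's actual implementation.

from dataclasses import dataclass

@dataclass
class BooleanGraph:
    name: str
    round_workload: list            # workload (cycles) of each computation round
    II_round_set: list              # initiation interval of each round

def schedule_mfgs(graphs, lpu_cnt, moving_th=3):
    """Statically dispatch per-round workloads of Boolean graphs onto LPUs.

    Round 1 is distributed round-robin over the graphs sorted by total
    workload; later rounds with a small II are moved to the least-loaded LPU
    to balance the work (cf. Algorithm 6).
    """
    schedule = {i: [] for i in range(lpu_cnt)}
    load = {i: 0 for i in range(lpu_cnt)}
    max_round = max(len(g.round_workload) for g in graphs)
    ordered = sorted(graphs, key=lambda g: sum(g.round_workload), reverse=True)

    # first round: plain round-robin over the sorted graphs
    for i, g in enumerate(ordered):
        lpu = i % lpu_cnt
        schedule[lpu].append((g.name, 0))
        load[lpu] += g.round_workload[0]

    # remaining rounds: small-II workloads go to the least-loaded LPU
    for rnd in range(1, max_round):
        for i, g in enumerate(ordered):
            if rnd >= len(g.round_workload):
                continue
            if g.II_round_set[rnd] < moving_th:
                lpu = min(load, key=load.get)
            else:
                lpu = i % lpu_cnt
            schedule[lpu].append((g.name, rnd))
            load[lpu] += g.round_workload[rnd]
    return schedule

# Example with the three graphs of Fig. 5.8 ([6,1], [14,5], [30,14,1]); II values are illustrative
gs = [BooleanGraph("g1", [6, 1], [6, 1]),
      BooleanGraph("g2", [14, 5], [14, 5]),
      BooleanGraph("g3", [30, 14, 1], [30, 14, 1])]
print(schedule_mfgs(gs, lpu_cnt=2))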
5.7 Experimental Results
For evaluation purposes, we targeted a high-end Virtex® UltraScale+ FPGA (Xilinx VU9P FPGA, which is
available in the cloud as the AWS EC2 F1 instance). Note that we also synthesized our design using the Synopsys
Design Compiler tool based on the NCSU FreePDK 45nm technology and reached a maximum frequency of
544 MHz. We include the FPGA prototyping results since the SoA implementations use the same FPGA.
The resource utilization and the achieved frequency are reported in Table 5.3 for an LPV count of 16.
Our benchmarked models can be categorized into two groups: i) models with high accuracy requirements
(i.e., large models), and ii) models with high throughput requirements (i.e., tiny models). In the first group, we
use VGG-16, the VGG-like model used in ChewBaccaNN [2], and MLPMixers [99] with the CIFAR-10 dataset as
case studies. We also study LENET5 performance on the MNIST dataset. We also evaluate our Boolean
Figure 5.8: An example of workload balancing in 2 LPUs implementation.
Table 5.3: Resource utilization of design of LPV count = 16.
FF(%) LUT(%) BRAM(%) FREQ
478K(20.2%) 433K(36.7%) 12240K (15.8%) 544MHz
processor against extreme-throughput tasks in physics and cybersecurity, such as jet substructure classification (JSC) [31] and network intrusion detection (NID) [72]. We used the UNSW-NB15 dataset to compare our
Boolean processor with other implementations. We employed the same preprocessed training and testing
data as Murovic et al. [72], which has 593 binary features corresponding to 49 original features and
two output classes.
For MLPMixers, the resolution of the input image is 32×32, and the patch size used in the experiments
is 4×4. Thus, we have 64 non-overlapping image patches that are mapped to a hidden dimension C.
DS and DC are tunable hidden widths in the token-mixing and channel-mixing MLPs, respectively. A
summary of the design specifications is shown in Table 5.4.
5.7.1 Ablation study with the gate fanout limitation
To further evaluate the influence of the limited fanout on the latency of a given FFCL, we performed
an ablation study with fanout bounds in [2 : 62]. As expected and shown in Fig. 5.9, the performance improves
Figure 5.9: Ablation study with the different number of gates’ fanout allowance.
Table 5.4: Summary of design specs for MLPMixers used in this chapter. The "S" and "B" (small and base) model
scales follow Tolstikhin et al. [99]. The notation "B/4" means the model of base scale with patches of resolution
4 × 4.
Specification | Hidden size C | MLP dimension DC | MLP dimension DS | Number of layers
S/4 | 128 | 512 | 64 | 8
B/4 | 192 | 768 | 96 | 12
Table 5.5: Normalized throughput (%) with respect to a design using a strictly nonblocking single-stage crossbar to
demonstrate the switch network bottleneck.
VGG16: 99.36% | LENET5: 97.14% | NID: 90.06% | MLP: 98.46% | JSC_M: 98.54% | JSC_L: 98.96%
as the fanout bound is increased; however, it saturates after a while. The reason is that the number of
nodes increases as we limit the fanout bound, whereas allowing more fanout results in a higher number
of overlapping MFGs, which in turn results in the re-computation of logic gates. Note also that the generated
netlist remains unchanged after a certain value of the fanout bound is reached.
5.7.2 LPE utilization calculation
Given the multiple-input single-output nature of the Boolean network, the utilization is reduced as computation reaches
the end of the network, so an LPU of size m × n cannot be fully utilized every clock cycle. We evaluate
the LPE utilization by calculating the percentage of LPEs that are active (i.e., allocated to perform
a Boolean operation) per LPV, and the percentage of LPVs configured to produce valid outputs. As
Table 5.6: Comparison with SoA ASIC BNN accelerators.
Metric | BinarEYE [70] (scaled*) | ChewBaccaNN [2] (scaled) | XNORBIN [3] (scaled) | Huang et al. [45] | P. C. Knag et al. [53] (scaled) | LPU (ours)
Computational Paradigm | XNOR | XNOR | XNOR | XNOR | XNOR | Logic-based
Technology (nm) | 28 (45) | 22 (45) | 65 (45) | 32 | 10 (45) | 45
Voltage (V) | 0.66 (1.1) | 0.4 (1.1) | 1.2 (1.1) | - | 0.75 (1.1) | 1.1
Power (mW) | 0.6 (1.8) | 2.9 (29.8) | 56.0 (63.8) | - | 607.0 (424.8) | 731.2
Freq (MHz) | 1.5 (1.0) | 154 (101) | 476 (910) | 500 | 622 (45) | 544
Throughput (TOP/s = TOPS) | 0.09 (0.06) | 0.24 (0.16) | 0.75 (1.43) | 8.1 | 163.0 (11.8) | 17.80
Power Efficiency (TOPS/W = TOP/J) | 145.0 (32.5) | 8.2 (5.3) | 13.0 (22.3) | 19.9 | 269.0 (19.5) | 24.3
FPS (VGG-like) | 150 (100) | 2286.2 (1502.7) | - | - | - | 601500
Energy per Inference, CIFAR-10 (µJ/frame) | 14.4 | 1.3 (13.0) | - | - | - | 1.2
* Scaled power and frequency values are calculated by scaling the technology node size and supply voltage.
Figure 5.10: LPE utilization rate.
shown in Fig. 5.7c, instructions with diagonal stripe patterns produce invalidated outputs. Therefore, the
utilization goes up when the LPV count is decreased.
5.7.3 Blocking vs nonblocking switch network
Because we implement a 5-stage switch network where nonblocking operation is not guaranteed, it is also
important to benchmark against the throughput of using a strictly nonblocking single-stage crossbar as the
switch network. Using the proposed partitioning algorithm, we observed cases where the re-partitioning
of the merged MFGs causes extra computation clock cycles. This is not the case if a nonblocking crossbar
switch is implemented (at the cost of increasing area and power consumption) since these clock cycles are
due to the blockage in the deployed switch network.
The achieved normalized throughput (%) for the different benchmarks is reported in Table 5.5, where 100% corresponds
to using a strictly nonblocking single-stage crossbar as the switch network. As can be seen,
for most of the benchmarked models, almost the same throughput is achieved. The drops are caused by 1)
network blockages and 2) four additional stages compared to the single-stage crossbar. However, for smaller
models with fewer FFCL blocks, such as NID, the throughput drop is noticeable (i.e., around 10%). By
replacing the 5-stage switch network with a single-stage network, which trades off higher latency for a smaller
initiation interval, a smaller model that contains fewer Boolean graphs can be processed faster.
5.7.4 Comparison with SoA ASIC BNN Accelerators
To the best of our knowledge, this is the first ASIC-implemented logic-based NN accelerator. Since there
are no ASIC designs in the same category, we include baseline results for the state-of-the-art ASIC design
of XNOR-based BNN inference accelerators. We compare our performance to this baseline by appropriately scaling their design to the same CMOS technology node and supply voltage level as ours. Table 5.6
provides a comparison between the proposed accelerator and other SoA accelerators. The proposed LPU
accelerator can process 601K frames per second, which is 263x higher throughput than
ChewBaccaNN [2], the fastest SoA BNN accelerator. Note that ChewBaccaNN [2] deployed standard-cell
memories (SCMs) instead of SRAMs, along with design techniques including hierarchical clock gating and
address/data silencing mechanisms, to lower the power consumption. Even though the power consumption
of ChewBaccaNN is much lower, our logic-based accelerator outperforms it by 10.8x in energy efficiency.
We also achieve 1.2 µJ/frame in CIFAR-10 classification.
5.7.5 Comparison Between MAC-based, XNOR-based and Nullanet-based FPGA
Implementation of NNs
VGG16 is a huge network with about 138 million parameters. We implement the intermediate convolutional layers 2–13 of VGG16 using the proposed framework and fixed-function combinational logic
functions. As a baseline SoA generic MAC-array-based accelerator, for the layers realized using conventional MAC calculations, we used the open-source implementation of [93] with some improvements
proposed in [89]. We use FINN [102] for our XNOR-based baseline and improve this implementation by
packing operations. We also use the NullaDSP model [89] as another baseline for mapping the FFCL blocks generated by
NullaNet to programmable DSP blocks; it can fit an FFCL of any size. We use the best results
of each implementation reported in [89] and compare them to our proposed architecture in Tables 5.7 and
5.8. In the case of JSC and NID, we use the implementations and the associated performance reported in
LogicNets [101], Google and CERN's optimized implementation [48], and the implementation presented
in [1].
As illustrated in Table 5.7, our implementation shows superior performance compared to the other implementations. The LPU and XNOR implementations achieve significant savings, especially for a large DNN model
like VGG16, as there is no cost associated with off-chip memories, while this is not the case for the MAC-based
and NullaDSP implementations. Our LPU with 16 LPVs has a total of 328KB of on-chip SRAM. We achieve
14.01x (33.43x), 4.86x (3.93x), and 1.95x (4.89x) performance improvements on VGG16 (LENET5) inference
compared to the MAC-based, NullaDSP, and XNOR-based implementations, respectively.
LogicNets [101] has higher FPS than our implementation. However, it cannot use the same hardware for other models since each model is realized as a customized, hard-wired network of logic gates
(as in random logic blocks), whereas our design offers programmable logic processors that can perform
the required logic gate operations of any logic (computation) graph. The former realization is ideal for
building a highly efficient, yet unchangeable, inference engine, whereas the latter is desirable
for accelerating the training process and for building inference engines that have to be updated after being deployed in the field. Note that NullaNet Tiny [73], which is our upstream for generating FFCL blocks, presents
a similar implementation to LogicNets [101] and outperforms LogicNets in similar settings on the same
benchmarks.
5.7.6 Ablation study with LPV count
To determine the influence of the LPV count on the performance of the presented Boolean processor, we
conducted experiments with different LPV counts. As shown in Fig. 5.11, the performance increases as
Table 5.7: FPS comparison between different implementations of models with high accuracy requirements. The LPV
count in the LPU is 16.
Model | MAC | NullaDSP | XNOR | LPU
VGG16 | 0.12K | 0.33K | 0.83K | 103.99K
LENET5 | 0.48K | 4.12K | 3.31K | 1035.60K
MLPMixer-S/4 | 4.17K | - | 50.00K | 179.23K
MLPMixer-B/4 | 0.88K | - | 16.67K | 102.01K
Table 5.8: FPS comparison between different implementations of models with high throughput requirements. The LPV
count is 16.
Model | LogicNets [101] | Google+CERN [48] | [1] | LPU
NID | 95.24M | - | 49.58M | 8.39M
JSC-M | 2995.00M | - | - | 0.69M
JSC-L | 76.92M | 76.92M | - | 0.21M
Figure 5.11: Inference time of VGG16 and LENET5.
we increase the LPV count. The influence of the LPV count saturates after a while; we see similar trends
in Section 5.7.1. Since the LPE utilization is reduced as we increase the LPV count, we will carry out
a study to find the trade-off and the Pareto frontier between LPE utilization and performance. To conduct a
comparative analysis of the NullaDSP presented in [89] against the Boolean processor, we benchmark the
effective LPV threshold, which is defined as the minimum LPV count that an LPU must have in order to
achieve at least the performance of the NullaDSP. According to Fig. 5.11, at least 2 LPVs are required to
achieve such performance in the case of VGG16.
5.7.7 Multiple LPUs
To assess the efficiency of our multiple LPUs within a Boolean processor and its associated scheduler,
We conducted our experiments on bus-connected LPUs each with 16 LPVs and achieved reductions in
the computing cycle count as shown in Fig. 5.12. We increase the number of LPUs and notice that the
improvements saturate after a while. Over the benchmarking models, we observed that a design with
10 LPUs can achieve almost the highest throughput (i.e., the improvements is negligible after this design
point).
As seen in Fig. 5.13, the throughput increases by more than 10x when we parallelize 10 LPUs. This
improvement is achieved due to the scheduling algorithm presented in Section 5.6.3. The compiler prioritizes the Boolean graphs with the highest II and also defers the second round of computations of the Boolean
networks. In other words, prioritizing the computations with higher II boosts the throughput. We
validate this proposition by comparing the throughput of the same design (i.e., utilizing 1 LPU) with different
scheduling algorithms and present the results in Fig. 5.13. The throughput-efficient scheduler is presented
in Algorithm 6, while the memory-efficient scheduler is introduced in Algorithm 5. The memory-efficient
scheduler finishes all rounds of computation associated with a Boolean network before moving to another
Boolean network; hence, it does not need to store the intermediate values, resulting in a memory-efficient
design.
5.8 Conclusion
The proposed Boolean processor offers a novel hardware design approach, enabled by the presented compiler, which results in efficient partitioning and mapping of a given neural network model expressed in the format of
Figure 5.12: The computation cycle count of different models for different numbers of LPUs within the underlying
Boolean processor.
Figure 5.13: FPS Comparison between models mapped to a Boolean processor with 1 LPU and 10 LPUs using different scheduling algorithms.
fixed-function combinational logic. The proposed LPU accelerator outperforms the fastest SoA BNN accelerator by 263x in throughput. In future work, we intend to explore a heterogeneous
architecture where the number of LPEs per LPV and the subsequent switch networks are not the same
for all LPVs.
Chapter 6
Sparse Periodic Systolic Dataflow for Lowering Latency and Power
Dissipation of Convolutional Neural Network Accelerators
6.1 Introduction
Convolutional Neural Networks (CNNs) exhibit state-of-the-art performance in many computer vision
applications such as image classification, object recognition, and scene labeling [55]. However, the high
performance of deep CNNs is achieved at the cost of high computation. This makes it challenging for networks to be deployed to resource-constrained edge devices with strict storage and energy limits. Therefore,
developing CNN architectures with reduced computation and storage costs is of great importance. At the
algorithmic level, methods such as weight quantization [68, 73], model pruning [40, 29], and knowledge
distillation [42] have gained recent popularity.
In particular, model pruning is a widely practiced approach for reducing the memory footprint and
computational cost of neural networks. By removing redundant weights of a network that does not harm
the model accuracy, the model is compressed from a dense to a sparse computational graph. With the
progress in weight pruning methods, pattern-based pruning [75, 57] has emerged as a promising avenue
that seeks to find a sweet spot between the two conventional pruning schemes: 1) structured pruning [109]
which has high regularity and is hardware-friendly but suffers from accuracy degradation; 2) unstructured pruning, which retains high accuracy but suffers from a large hardware overhead to manage irregular
weight indices. The pattern-based pruning method compromises between these two pruning schemes by imposing higher levels of regularity through pre-defined patterns. This ameliorates the hardware overhead
compared to unstructured pruning, but it still necessitates a series of auxiliary buffers to manage a unique
set of indexing scenarios. At its core, the hardware overhead caused by indexing sparse weights constitutes a fundamental design limitation that prevents the accelerator from further optimizing
latency, power, and memory requirements.
In this chapter, we advance the state of the art in sparse neural network accelerator design by exploiting
the concept of periodicity in pattern-based pruning for the first time in hardware. Prior art [57] mainly
explores the software stack of the periodic pattern-based pruning approach and demonstrates that the added
periodicity has negligible accuracy loss. Here, we view periodicity as an opportunity in hardware to
avoid indexing overhead thanks to its added regularity. We first present our compiler, which reorders the weights
according to the periodicity, optimizing for maximum parallelism. Then we present the Sparse Periodic Systolic (SPS) dataflow, which computes convolutions in a systolic array of processing elements, commonly seen
in Field Programmable Gate Arrays (FPGAs). Finally, a dedicated indexing method is designed in hardware
to fetch the pre-defined locations of nonzero weights with significantly smaller memory requirements. The main contributions of this chapter are summarized as follows:
• We present a novel Sparse Periodic Systolic (SPS) dataflow that exploits the periodic pattern-based
sparsity in neural networks to achieve an FPGA-friendly accelerator architecture.
• Using a compiler tailored to the SPS dataflow, we effectively solve the long-standing indexing overhead problem for unstructured pruning. We introduce the Period-Pattern-Weight (PPW) compact
storage format and a bespoke accelerator architecture for efficiently fetching weights and activations
in hardware.
• We perform the next layer reordering (NLR) optimization method enabled by the periodic pattern-based design to further reduce data movement cost.
6.2 Preliminaries and Background
This section includes background on deep neural network (DNN) processing and details the periodic pattern-based pruning method, which is the weight reduction technique used for DNN compression in this chapter.
6.2.1 DNN processing and Compilers
A convolutional layer receives input feature maps (IFMs) of size w_in × h_in × c_in and convolves them
with c_out different filters, each of size w_k × h_k × c_in, to generate output feature maps (OFMs) of
size w_out × h_out × c_out. Here, w_x, h_x, and c_x represent the width, height, and depth of tensor x, which can
be the 3D input or output feature map. The IFMs of the next convolutional layer are the OFMs
of the current convolutional layer. Such computations can be represented by a six-level nested
loop (a seven-level nested loop when considering iteration over images in a mini-batch); i.e., loops over
w_out, h_out, c_out, w_k, h_k, and c_in, known as a computational block, describe the computational flow of
a convolutional layer in a CNN.
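The computational block described above corresponds to the following reference loop nest; this is a naive dense sketch with unit stride and no padding, for illustration only.

import numpy as np

def conv_layer(ifm, weights):
    """Naive dense convolution: ifm is (cin, hin, win), weights is (cout, cin, hk, wk)."""
    cout, cin, hk, wk = weights.shape
    _, hin, win = ifm.shape
    hout, wout = hin - hk + 1, win - wk + 1          # unit stride, no padding
    ofm = np.zeros((cout, hout, wout))
    for oc in range(cout):                           # loop over output channels
        for oh in range(hout):                       # loop over output rows
            for ow in range(wout):                   # loop over output columns
                for ic in range(cin):                # loop over input channels
                    for kh in range(hk):             # loop over kernel rows
                        for kw in range(wk):         # loop over kernel columns
                            ofm[oc, oh, ow] += weights[oc, ic, kh, kw] * ifm[ic, oh + kh, ow + kw]
    return ofm

# The OFM of this layer becomes the IFM of the next layer.
out = conv_layer(np.random.rand(3, 8, 8), np.random.rand(4, 3, 3, 3))
print(out.shape)   # (4, 6, 6)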
6.2.2 Periodic Pattern-based Pruning
Pattern-based pruning will first be explained before the introduction of periodicity. In pattern-based pruning, a pattern is a pre-defined 2D kernel that constrains the locations of nonzero entries, also referred to
as a kernel variant (KV). Thus, any given kernel of a 3D filter can be classified as one of the KVs, since the
locations of the weights are restricted to form a pattern while pruning. The number of nonzero entries (the
kernel support) in a w_k × h_k kernel is also referred to as the kernel support size (KSS). The KSS is fixed across all
patterns to support high regularity, which reduces workload imbalance between processing elements (PEs)
Figure 6.1: Illustration of periodic pattern-based sparsity. Two pre-defined patterns (kernel variants KV1 and KV2) repeat across the six kernels of each of three filters. KSS = 4 and P = 2.
in the systolic array. It is useful to note that, unlike unstructured pruning, which prunes at the granularity of
individual weights, pattern-based pruning prunes at the granularity of patterns, which adds regularity yet
less flexibility. The regularity helps with achieving higher hardware performance, but the reduced flexibility poses
a relative challenge in retaining model performance.
Periodic Pattern-based pruning is a variant of the pattern-based pruning approach, where a concept of
periodicity further constrains the sequence in which the patterns occur in a given filter. Fig. 6.1 illustrates
an example of periodic pattern-based sparsity with a periodicity of 2 (i.e., KVS = 2 while KSS = 4). Rotation
due to periodicity occurs in two directions, across the kernels and across the filters. Each KV appears in
a repeating sequence of [KV1, KV2] for the first filter. Each filter also rotates the starting KV, where the
Figure 6.2: Overall flow of the SPS acceleration framework, consisting of pattern-based periodic pruning (pre-defined patterns, periodically pruned model), sparsity-aware compiler optimization (kernel reordering, filter reordering, systolic padding given the FPGA resources and systolic array dimensions), and hardware code generation (SPS dataflow, HLS templates, SDAccel code generator, accelerator configuration files).
second filter repeats a sequence of [KV2, KV1]. This heterogeneity adds flexibility to improve network
accuracy.
We rotate the sequence of kernel variants, beginning each filter (of P consecutive filters) with a different
kernel variant, to maintain periodicity across distinct 3D filters while offering some heterogeneity. If the
first 3D filter starts with KV1 (i.e., the kernel where the non-zero weights are highlighted in orange), then
KV2 (i.e., the kernel where the non-zero weights are highlighted in green), and then repeats the sequence,
the second 3D filter should start with KV2 to generate a repeating sequence of [KV2, KV1]. As a result,
the sequence of repeating kernels is preserved modulo rotation.
A key insight here is the simplicity with which the KV can be indexed in any given filter. Due to the
modulo rotation with the interval of the periodicity, the burden of storing the locations of patterns is reduced
to a single scalar value P, which is also the number of KVs. This means the weights associated with each
KV can also be accessed using P with minimal overhead. Thus, every KV of the same type can be indexed
by iterating across the filter with offset P, which is significantly simpler than iterating over the indices of
each KV type and its respective locations that occur irregularly across the filter.
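The indexing simplification can be illustrated as follows: assuming P kernel variants rotated across kernels and filters as in Fig. 6.1 (the specific (filter + kernel) mod P rotation rule is an illustrative assumption consistent with the description above), the variant of any kernel, and hence the locations of its KSS nonzero weights, follows from a single modulo expression. The data structures are illustrative and not the accelerator's indexing unit.

def kernel_variant(filter_idx, kernel_idx, P):
    """Return which of the P kernel variants governs kernel `kernel_idx` of
    filter `filter_idx`, assuming filter f starts its repeating KV sequence
    at KV (f mod P), as in the rotation scheme of Fig. 6.1."""
    return (filter_idx + kernel_idx) % P

def nonzero_positions(filter_idx, kernel_idx, kv_patterns, P):
    """Look up the (row, col) positions of the KSS nonzero weights of a kernel.

    kv_patterns: list of P patterns, each a list of (row, col) tuples.
    Only the scalar P and the P small patterns need to be stored; no
    per-kernel index lists are required."""
    return kv_patterns[kernel_variant(filter_idx, kernel_idx, P)]

# Example with P = 2 and KSS = 4 (3x3 kernels; the positions are illustrative)
kvs = [[(0, 1), (1, 1), (1, 2), (2, 1)],   # KV1
       [(0, 1), (1, 0), (1, 1), (2, 1)]]   # KV2
print(nonzero_positions(filter_idx=1, kernel_idx=0, kv_patterns=kvs, P=2))  # second filter starts at KV2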
6.3 Overall Flow
Given the golden opportunity to design hardware that does not suffer from indexing overhead while preserving the network accuracy, we propose a novel end-to-end FPGA-friendly DNN acceleration framework
that can fully exploit the new periodic pattern-based dimension in its dataflow.
Thanks to the added regularity through pre-defined patterns and periodicity, some prior works have explored their
promise. The authors in [75] explored a pattern-based pruning approach without periodicity to design
an entirely new weight storage format and solve the indexing overhead problem fairly well, but their design is still
limited by a new set of auxiliary buffers for indexing that degrades hardware performance. On the other
hand, [57] employs the concept of periodicity in addition to the pattern-based approach to achieve good
model accuracy and compression ratio; it explored how periodicity can help reduce the storage demand
of conventional formats such as compressed sparse row (CSR).
That said, we first delve into the essence of the indexing overhead problem, namely the efficiency of
hardware operations. Through an architecture-first approach, we develop a highly tailored dataflow and
a much simpler indexing unit, taking maximum advantage of the high parallelism that naturally occurs
with higher regularity. Fig. 6.2 shows the overall flow of the proposed SPS acceleration framework that
consists of three stages. First is the model pruning stage, where we employ a pattern-based periodic pruning method developed by [57]. Second is the sparsity-aware compiler optimization (cf. Section 6.4) that
performs a series of periodicity-driven weight reshaping operations. Given the highly constrained nature
of this pruning scheme, our compiler provides the maximum degree of flexibility for model compression
parameters, namely the pattern shape, periodicity, and KSS. Systolic padding is also applied to sustain the
parallelism that occurs across the two dimensions of the systolic array. Last is the periodic sparsity-aware
architecture (cf. Section 6.6), which carries out the SPS dataflow on FPGA-friendly hardware.
Table 6.1: Parameters used in the proposed compiler

Symbol | Calculation | Description
WNUM | P × KSS | total # of predefined nonzero weights
ICp | ⌈cin/P⌉ | # of input channels per group
INCp | ⌈cin/P/sysw⌉ | # of tiled input-channel iterations per group
OCp | ⌈cout/P⌉ | # of output channels per group
ONCp | ⌈cout/P/sysh⌉ | # of tiled output-channel iterations per group
6.4 Compiler Tailored to Periodic Pattern-based Sparsity
Once the model is trained using pattern-based periodic pruning, the sparse weights must be stored in an efficient format. Otherwise, the indexing of nonzero weights causes significant memory and latency overhead, and the expected benefits of pruning are lost. As such, this section introduces a compiler that reshapes the initial weight tensor into a compact format that we call the Period-Pattern-Weight (PPW) format. Starting from the initial 4D weight tensor that is pattern-pruned with periodicity P and KSS nonzero weights per KV, we apply a series of transformations to arrive at the sparsity parameters summarized in Table 6.1.
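As a concrete illustration of the quantities in Table 6.1, the sketch below (our own paraphrase of the table's formulas; the argument names cin, cout, sysw, and sysh are assumptions that mirror the table's notation) derives the compiler parameters from the layer shape, the periodicity P, the kernel support size KSS, and the systolic array dimensions:

import math

def compiler_params(cin, cout, P, KSS, sysw, sysh):
    # Compute the PPW compiler parameters listed in Table 6.1.
    WNUM = P * KSS                      # total # of predefined nonzero weights
    ICp  = math.ceil(cin / P)           # input channels per group
    INCp = math.ceil(cin / P / sysw)    # tiled input-channel iterations per group
    OCp  = math.ceil(cout / P)          # output channels per group
    ONCp = math.ceil(cout / P / sysh)   # tiled output-channel iterations per group
    return dict(WNUM=WNUM, ICp=ICp, INCp=INCp, OCp=OCp, ONCp=ONCp)

# Example: a 256x256-channel layer with P = 8, KSS = 2, and a 16x16 systolic array.
print(compiler_params(cin=256, cout=256, P=8, KSS=2, sysw=16, sysh=16))
# {'WNUM': 16, 'ICp': 32, 'INCp': 2, 'OCp': 32, 'ONCp': 2}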
6.4.1 Kernel and Filter Ordering
Key challenges of hardware acceleration for unstructured pruning can be reduced to heavy control-flow
instructions, as well as thread divergence and load imbalance [25]. These are largely solved by promoting
parallelism during the dataflow. Grouping filters with similar kernel sequences achieves better inter-thread parallelization, while grouping the same patterns within a filter improves intra-thread parallelization [75].
Taken together, these provide a key insight as to how the patterns should be rearranged by the compiler
to maximize parallelism.
Figure 6.3: Overview of the PPW storage format. [The figure shows the original 4D weight tensor W_{OC,KH,KW,IC} with its pattern-pruned 3x3 kernels (KV1, KV2, KV3) reshaped into a compact layout of dimensions OCp by (ICp × KSS), with systolic PAD entries appended and the kernel spatial indices hk and wk stored separately.]
Kernel and filter orderings both group the KVs according to the periodicity. A running example with three distinct KVs grouped together is illustrated in Fig. 6.2. A periodicity of three results in a first group for KV1 whose channel indices increment in multiples of three with an offset of one (the KV number). Similarly, the KV2 group increments by three with an offset of two, and so on. This yields three arrays of input channels (ICs) of the form [1+P*0, 1+P*1, 1+P*2, ...], [2+P*0, 2+P*1, ...], and [3+P*0, 3+P*1, ...], assuming 1-based indexing. Each array has a size of ⌈cin/P⌉, which is denoted by ICp.
After the same KVs are grouped together through kernel ordering, filters with the same sequence of kernels are grouped together as well. The number of KVs determines the number of unique filter-variant sequences; in the running example, three groups of output channels (OCs) are produced, each of size ⌈cout/P⌉, which is denoted by OCp.
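A minimal sketch of the two orderings described above (illustrative only, under the assumption of a 4D weight tensor of shape (cout, cin, kh, kw); it is not the framework's actual implementation): input channels are permuted so that channels carrying the same KV become contiguous, and output channels are likewise grouped by the KV their filter starts with.

import numpy as np

def periodic_reorder(weights, P):
    # Kernel ordering: ICs [0, P, 2P, ...], then [1, 1+P, ...], ... (0-based).
    # Filter ordering: OCs grouped by the KV their filter starts with.
    cout, cin = weights.shape[:2]
    ic_perm = [c for off in range(P) for c in range(off, cin, P)]
    oc_perm = [f for off in range(P) for f in range(off, cout, P)]
    return weights[np.ix_(oc_perm, ic_perm)], ic_perm, oc_perm

w = np.random.randn(6, 6, 3, 3)
w_grouped, ic_perm, oc_perm = periodic_reorder(w, P=3)
print(ic_perm)   # [0, 3, 1, 4, 2, 5]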
6.4.2 Systolic Padding
To support the SPS dataflow (cf. Section 6.5), which processes groups of ICp input channels and OCp output channels together,
padding is applied when the cin or cout dimension is not fully occupied by the systolic array dimension.
It is worth noting that padding is usually interpreted as a wasteful operation in hardware which should
be avoided whenever possible, for example by pipelining. More advanced optimizations may be applied
by computing two different patterns if the resources are to be wasted due to excessive padding. However,
in our experience, the number of patterns required to achieve high performance is 6-8. This means that if
the dimensions of the weight matrix (cin, cout) are divisible by the number of patterns, say 8 patterns,
no padding will be necessary. Fortunately, most well-known networks such as VGG16 and ResNets have
weight dimensions in multiples of 8, with popular ones being 64, 128, 256, and 512. Thus, the cost incurred
from padding is usually small enough (if present at all) so as not to degrade the hardware performance.
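The sketch below illustrates the padding rule (a simplified model with assumed names; the actual compiler operates on the reshaped PPW tensor): each periodic group of channels is padded up to the next multiple of the corresponding systolic array dimension, mirroring the PAD entries of Fig. 6.2.

import math
import numpy as np

def systolic_pad(group, sys_dim):
    # Pad one periodic group of channel indices up to a multiple of the
    # systolic array dimension; NaN marks the PAD slots.
    padded_len = math.ceil(len(group) / sys_dim) * sys_dim
    return np.pad(np.asarray(group, dtype=float),
                  (0, padded_len - len(group)),
                  constant_values=np.nan)

# Example: a group of ICp = 6 channels and a systolic width of 4 -> 2 PAD slots.
print(systolic_pad(list(range(6)), sys_dim=4))
# [ 0.  1.  2.  3.  4.  5. nan nan]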
Algorithm 7 Sparse Periodic Systolic Dataflow
Input:
  The nonzero weight array W;
  The input activation A;
  The kernel size wk × hk;
  The output feature map size hout × wout × cout;
Output:
  Result stored in OFM;
 1: for oh = 0; oh < hout; oh++ do
 2:   for ow = 0; ow < wout; ow++ do
 3:     Begin Convolution
 4:     for g = 0; g < P; g++ do
 5:       for cc = 0; cc < ONCp; cc++ do
 6:         for kv = 0; kv < P; kv++ do
 7:           for w = 0; w < KSS; w++ do
 8:             fetch nonzero weight indices wh and ww
 9:             for rr = 0; rr < INCp; rr++ do
10:               // systolic operation begins
11:               for i = 0; i < sysw; i++ do
12:                 #pragma unroll(i)
13:                 for j = 0; j < sysh; j++ do
14:                   #pragma unroll(j)
15:                   do MAC()
16:     AdderTree() activated
6.5 SPS Dataflow
The SPS Dataflow guides compact weights to be run in the FPGA-friendly systolic architecture. It has two
major functionalities: 1) fetching the activations that correspond to the compressed PPW weight tensor and 2) tiling
across the INCp and ONCp by concurrently executing all MAC operations in the systolic array. First,
the temporal component of the SPS Dataflow may be understood by looking at a single PE unit in the 2D
systolic array of the hardware accelerator. As Section 6.6 describes, weights are stored in the BRAMs of each PE, and the output of the associated MAC unit is stored in a local register inside the PE. Thus, the dataflow is a combination of weight-stationary and output-stationary dataflows. This arrangement maximizes the
reuse of weights as well as outputs, while paying some cost to stream the activations to the computation
units.
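A behavioral sketch of a single PE under this hybrid scheme (purely illustrative; the real PEs are DSP+BRAM pairs, as described in Section 6.6): the weights stay resident in the PE, the partial sum stays in a local register, and only the activations stream through.

class PE:
    # Weight-stationary: weights preloaded into local storage (BRAM in hardware).
    # Output-stationary: the partial sum accumulates in a local register.
    def __init__(self, weights):
        self.weights = list(weights)
        self.psum = 0.0

    def step(self, activation, widx):
        # One MAC per cycle; only the activation is streamed in.
        self.psum += activation * self.weights[widx]

    def drain(self):
        out, self.psum = self.psum, 0.0
        return out

pe = PE(weights=[0.5, -1.0])
for a, widx in [(2.0, 0), (3.0, 1), (1.0, 0)]:
    pe.step(a, widx)
print(pe.drain())   # 2*0.5 + 3*(-1.0) + 1*0.5 = -1.5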
The spatial component of the SPS Dataflow is mapped to match the dimensions of the systolic array.
The compiler has already grouped the input and output channels according to weight patterns, and two
additional inner loop nests (the i and j iterators in Algorithm 7) further tile the subgroups within ICp by sysw and within OCp by sysh, resulting in INCp and ONCp iterations, respectively.
The crux of the SPS dataflow lies in the simplicity of decoding the compressed PPW format to fetch
the corresponding input activations. Many state-of-the-art sparsity-supported accelerators use storage
formats such as the coordinate list (COO), compressed sparse row (CSR), and compressed sparse column
(CSC) [15], where the storage requirement for nonzero indexing grows polynomially with the network size. In contrast, the PPW storage is network-architecture-agnostic: its indexing storage is constant and depends only on the pruning parameters P and KSS, regardless of how deep or wide the network is. The method for matching activations to weights during MAC operations is as follows. We create two indexing buffers of size WNUM, each storing one of the two spatial dimensions hk and wk. Similar to the COO format, each weight can be indexed by a single iterator that fetches the height and width of the weight inside the kernel. Thanks to the highly regular occurrence of the patterns, the buffer index of a nonzero weight identified by its group g, kernel variant kv, and nonzero weight number w is ((g + kv) * KSS + w) % WNUM. The modulus operation wraps around the buffer so that the periodically occurring patterns continue.
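To make the lookup concrete, the sketch below (an illustration with hand-picked buffer contents; in the real design the buffers hold the coordinates of the pre-defined patterns) builds the two WNUM-entry index buffers and retrieves the spatial position of a nonzero weight with the modulo expression from the text.

# Assumptions: P = 2 kernel variants, KSS = 2 nonzero weights per kernel,
# and example (row, col) coordinates for a 3x3 kernel.
P, KSS = 2, 2
WNUM = P * KSS

hk_buf = [0, 1, 1, 2]   # rows:  KV1 -> rows 0,1   KV2 -> rows 1,2
wk_buf = [1, 1, 0, 2]   # cols:  KV1 -> cols 1,1   KV2 -> cols 0,2

def nonzero_position(g, kv, w):
    # Spatial (row, col) of the w-th nonzero weight of kernel variant `kv`
    # in periodic group `g`, via the index ((g + kv) * KSS + w) % WNUM.
    idx = ((g + kv) * KSS + w) % WNUM
    return hk_buf[idx], wk_buf[idx]

print(nonzero_position(g=0, kv=1, w=0))   # (1, 0): first weight of the second pattern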
Next Layer Reordering (NLR): After the convolution operation is completed, the results of the MAC operations are stored. Due to the compiler's reordering, the unnatural ordering of output channels produces results in increments of P. A naive solution is to simply sort them back to the natural order (increasing from channel 0 to cout). Yet, this causes a nontrivial amount of data movement from scanning the entire set of output channels. Here, we observe that the SPS dataflow uniquely produces the results in increments of P. Hence, the compiler can expect the channels to be grouped in strides of P and proactively reorder the channels for the next layer so that they match the channel ordering of the incoming activations. This is highly efficient because it avoids the execution time and energy cost of reordering after each convolutional layer, effectively imitating the dataflow of the dense format, where such channel indexing problems do not occur.
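A sketch of the NLR idea (illustrative only; the function and variable names are assumptions): instead of sorting the P-strided outputs back to natural order, the compiler permutes the next layer's input-channel axis so that it consumes the activations in exactly the order the SPS dataflow produces them.

import numpy as np

def sps_output_order(cout, P):
    # Order in which the SPS dataflow emits output channels: grouped by KV, stride P.
    return [c for off in range(P) for c in range(off, cout, P)]

def reorder_next_layer_weights(next_weights, produced_order):
    # Permute the next layer's input-channel axis to match the produced order,
    # avoiding the data movement of sorting the activations themselves.
    return next_weights[:, produced_order]

order = sps_output_order(cout=8, P=4)        # [0, 4, 1, 5, 2, 6, 3, 7]
w_next = np.random.randn(16, 8, 3, 3)        # next layer has 8 input channels
w_next_nlr = reorder_next_layer_weights(w_next, order)
print(order, w_next_nlr.shape)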
6.6 Proposed Periodic Sparsity Architecture
In this section, we describe our FPGA-tailored architecture customized for the proposed periodic pattern-based sparsity.
The accelerator contains (i) a 2D array of PEs, i.e., a systolic array which is responsible for executing
the MAC operations associated with convolutional calculations, (ii) a memory hierarchy feeding data to
the said array, which consists of register files, on-chip memory (Block RAMs on FPGA devices), and external off-chip memory (DRAM), and (iii) an input matching unit (IMU) that fetches the nonzero weights
128
DRAM
Input Buffer
Output Buffer
Tree Adders
Vector Processing Unit
weight flow
data flow
Weight Buffer
ALU Instruction Queue
2D Array of Processing
Elements
(Systolic Array)
Input Matching Unit Weight
Index Buffer
Reg
File
Reg
File
BRAM
Reg
File
index flow
Figure 6.4: High-level overview of systolic array accelerator design.
129
in the input feature maps. The systolic array is followed by a vector processing unit, which includes multiple ALUs that conduct neural network operations such as nonlinear activation functions and maximum
pooling, as illustrated in Fig. 6.4.
The available hardware resources in an FPGA device, such as digital signal processing units (DSPs), configurable logic blocks (CLBs) that contain several look-up tables (LUTs), and Block RAMs (BRAMs), are placed as resource groups in a column-wise manner. Consequently, all resources are uniformly distributed across the FPGA chip, and one should place data used by a DSP in a BRAM that is physically close to that DSP. Hence, a PE in our design comprises one DSP and its adjacent BRAM. We also use some of the CLBs as distributed memories to store the indices of nonzero weights in the KVs. Note that these indices are low precision (e.g., 4 bits for a 2D kernel of size 3 × 3, which is common in well-known computer vision models).
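For example (a hypothetical encoding that is merely consistent with the 4-bit figure quoted above), the (row, col) position of a nonzero weight within a 3 × 3 kernel can be packed into a single 4-bit value:

def pack_index(row, col, kw=3):
    # Pack a (row, col) position of a 3x3 kernel into one 4-bit code (0..8).
    code = row * kw + col
    assert 0 <= code < 16
    return code

def unpack_index(code, kw=3):
    return code // kw, code % kw

code = pack_index(1, 2)            # middle row, right column -> 5
print(code, unpack_index(code))    # 5 (1, 2)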
The IFMs are initially cached in an input buffer, then passed through the IMU to drop pruned weights,
and sequentially transmitted onto the first row of PEs in the systolic array. In addition, input data is simply
shifted into the PE array and between neighboring PEs on the same row of the systolic array. This technique does away with the need for global interconnections between the input buffers and all PEs, as well as the costly multiplexers. We also fetch the indices associated with the KVs in parallel with the weights. This is feasible
since input data, weights, and indices are stored in separate off-chip memory banks in the target FPGA
board and are thus simultaneously accessible. Finally, the registered partial sum results that reside in the
PEs of one row are passed to the adder tree to conduct the required summation and generate the final OFM
value when all computations for one OFM are completed.
6.7 Experimental Results
In this section, we first describe the configurations used in our experiments. Then, we assess the storage required by the proposed format compared to well-known sparsity formats. Finally, we present the hardware
utilization of our accelerator and compare its energy consumption to state-of-the-art accelerators.
6.7.1 Experimental Configuration
For the storage format experiments, 8-bit unsigned integers are assumed when calculating the weight storage. For a fair comparison, the connectivity pruning that allows the FKW format to store fewer weights is accounted for in the calculation. Sparsity constants of KSS = 2 and P = 8 are used for all experiments; with this configuration, the model accuracy (91.2%) [57] shows less than 1% degradation compared to the non-pruned version. For evaluating hardware performance, we targeted a Xilinx VU9P FPGA using the AWS EC2 F1 instance and implemented the design on the Xilinx Virtex UltraScale+ FPGA board using the Vivado HLS design suite 2019.1. We evaluate our SPS dataflow on a well-known CNN, VGG16, with a real-life dataset, CIFAR-10. We also assess our approach on ResNet18 with the TinyImageNet dataset.
6.7.2 Storage Comparison
Figure 6.5: Total storage comparison for unique VGG16 CONV layers.
Figure 6.6: Indexing storage comparison for unique VGG16 CONV layers.
Figure 6.7: Total storage comparison for unique ResNet18 CONV layers.
Figure 6.8: Indexing storage comparison for unique ResNet18 CONV layers.
Fig. 6.5 shows that PPW achieves storage reductions of 77.8% (4.5x), 72.8% (3.67x), 61.2% (2.57x), 52.9% (2.12x), and 10.5% (1.12x) compared to dense, COO, CSR, CSC, and FKW, respectively, over the unique convolutional layers in VGG16. Note that layers 8-10 and 11-13 in VGG16 have the same weight matrix size and are represented by L8 and L9 in our figures. A similar selection has been adopted for ResNet18. The dense model represents the non-pruned baseline.
Fig. 6.6 illustrates that the PPW format requires 22044x less indexing storage than the FKW format, which is the most competitive storage format. As seen in the graph, PPW enjoys a constant indexing storage requirement of a small number of bits across the different convolutional layers of VGG16, while the others grow on the order of megabytes. Similar effects are observed in the selected convolutional layers of ResNet18, where PPW consistently achieves the lowest total storage while its indexing storage remains near zero.
At layers with a larger weight matrix, more space is occupied by the indexing buffers than by the actual weights (see Fig. 6.9). The halfway point (50%) of the total storage is marked by a dotted line, and we
can observe that most storage formats easily exceed this threshold. Also, the influence of indexing storage
is more pronounced for bigger layers. For example, COO uses 73% of its storage at the largest layer just
for indexing, compared to 58% in the smallest layer.
Figure 6.9: Percent Storage for unique VGG16 CONV layers.
To compare a single storage format against the others, we benchmark the effective sparsity threshold, defined as the minimum sparsity rate that a weight format must achieve in order to realize a lower storage requirement. This answers the question: given the pruning ratio and the corresponding storage requirement of the benchmarked format, how much pruning must the other formats perform in order to save more storage?
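As an illustration of how such a threshold can be computed (a deliberately simplified linear storage model with assumed per-entry byte costs; the dissertation's exact accounting for each format may differ), the sketch below solves for the minimum fraction of weights a competing format must prune before its storage fits under a reference budget:

def effective_sparsity_threshold(total_weights, bytes_per_kept_weight, reference_bytes):
    # Minimum fraction of weights that must be pruned so that a format costing
    # `bytes_per_kept_weight` per retained weight stays under `reference_bytes`.
    # (Real formats add fixed and per-row overheads not modeled here.)
    max_kept = reference_bytes / bytes_per_kept_weight
    kept_fraction = min(1.0, max_kept / total_weights)
    return 1.0 - kept_fraction

# Toy example: 1M weights, a COO-like cost of 1 weight byte + 2 index bytes per
# retained entry, benchmarked against a 1 MB dense baseline (1 byte per weight).
print(effective_sparsity_threshold(total_weights=10**6,
                                   bytes_per_kept_weight=3,
                                   reference_bytes=10**6))   # ~0.667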
Figure 6.10: Benchmarking total storage against the dense baseline for different sparsity.
Figure 6.11: Benchmarking total storage against PPW for different sparsity.
Table 6.2: Hardware Utilization for VGG16 on CIFAR-10

Hardware resource | DSP48E | LUT | BRAM_18K | Frequency (MHz)
Usage in our architecture | 1038 (15%) | 115290 (10%) | 512 (12%) | 342
Baseline architecture∗ | 1038 (15%) | 115290 (10%) | 2942 (68%) | 342

∗ The baseline architecture is used for handling the dense format.
Our results from Fig. 6.10 show that most traditional sparse weight storage formats have a relatively stringent constraint on the sparsity requirement: COO, for instance, can keep at most 22.2% of the weights unpruned, in other words requiring more than 77.8% of the weights to be pruned in order to begin saving more storage than the baseline dense format.
We also benchmark PPW and observe, as shown in Fig. 6.11, that the highly compact PPW format enforces a strict effective sparsity threshold on all other weight storage formats. The effective sparsity threshold for FKW, the most competitive format, is 0.1, which means that FKW would need a pruning rate of more than 10x to begin saving more storage than PPW. Yet, reference [75] reports an 8x pruning rate, suggesting that the PPW format saves more storage at a similar network accuracy (the sparsity configuration used for training determines the pruning rate). FKW could employ harsher pruning to exceed a 10x pruning rate and realize lower storage, but this would drop the network accuracy below 91%, i.e., more than 1% degradation.
6.7.3 Hardware Utilization and Energy Efficiency Comparisons
In this section, we evaluate the aforementioned sparsity storage formats on the FPGA platform. First, the hardware utilization of the proposed accelerator, tailored to the SPS dataflow and running VGG16 on the CIFAR-10 dataset, is reported in Table 6.2. The baseline architecture is similar to the architecture shown in Fig. 6.4, but with the IMU and the weight index buffer removed. As shown in the table, when employing the accelerator design discussed in Section 6.6, periodic pattern-based pruning, which eliminates 77.8% of the weights stored in the BRAMs, together with the PPW storage format, which requires minimal indexing support in hardware, leads to efficient usage of hardware resources on the FPGA.

Figure 6.12: Energy savings over the dense baseline.
Next, we evaluate the energy efficiency of our proposed architecture and dataflow compared to the other formats. CSR and FKW are implemented as they are the most competitive formats that exist today. The relative energy savings, normalized to the dense baseline architecture, are reported in Fig. 6.12. Our PPW format executed by the SPS dataflow achieves 4.9× energy savings over the dense baseline, while CSR achieves 1.4× and FKW achieves 3.1×. To understand the energy savings, we classify the sources of energy cost in four ways: 1) bringing in the weights, 2) running the MAC operations, 3) reading/writing from/to the weight index buffers, and 4) the data reordering cost for pattern-based dataflows such as FKW. The first two costs are roughly the same across formats, as they are directly proportional to the number of weights being moved around the hardware (and are thus modeled by the weight density). However, the third cost poses a nontrivial challenge to CSR, while FKW and PPW are relatively immune to the indexing overhead that occurs while supporting the MAC operation; this is consistent with Fig. 6.6. Cost 4) is unique to pattern-based formats, where FKW pays the cost of reordering the output channels that have been mixed during compiler optimizations. Such reordering overhead occurs in every layer, since the output feature map of one layer is the input feature map of the next layer, and the dataflow expects it to be in the original order. On the other hand, the SPS dataflow allows outputs to be grouped together without the need for data reordering, eliminating the fourth type of cost.
6.8 Conclusion
The SPS dataflow offers a novel hardware design approach afforded by periodic pattern-based pruning, resulting in neural network weights with a higher degree of regularity and, consequently, parallelism. Thanks to its architecture-first compiler approach, the SPS dataflow achieves higher levels of parallelism and high accuracy while avoiding excessive indexing cost. The SPS dataflow and its neural network compiler outperform state-of-the-art sparsity formats in convolutional neural network (CNN) accelerator designs targeting FPGAs.
Chapter 7
Conclusions & Future directions
This dissertation provides compiler and runtime support for hybrid arithmetic and logic processing of neural networks. The key idea behind the logic processing part is to replace the expensive multiply-accumulate operations needed to compute the various filter/neuron functions in the DNN with binary logic expressions. The Boolean logic expressions are then mapped to low-cost computation blocks, e.g., the native LUTs of FPGAs. In the first part of this dissertation, we presented F2N3, an optimization framework for building resource-constrained, energy-efficient, ultra-low-latency FPGA-based neural network accelerators. FPGAs have been used as the target hardware since FPGA devices offer lower hardware costs and very low latencies.
We also presented a framework for efficiently compiling and mapping fixed-function combinational logic on DSPs using high-level synthesis. The described methodology maps fixed-function combinational logic blocks to a set of Boolean functions whose internal Boolean operations are mapped to DSP devices instead of LUTs, to gain the advantage of the high performance, low latency, and parallelism of DSP blocks. We also described a new optimization and design method for NN compilation and mapping that uses fixed-function combinational logic on FPGA DSPs through high-level synthesis flows. This flow can be integrated into F2N3 to map some Boolean logic expressions to the DSPs when the total number of LUTs on the target FPGA is limited. In addition, we presented the novel
design of a Boolean processor consisting of a programmable, under-provisioned multicasting switch network, which enables the fast and scalable dataflow required to process logic graphs. Following the introduction of the Boolean processor, we presented an innovative optimization methodology for compiling and mapping NNs utilizing FFCL onto this Boolean processor. The proposed compiler generates customized instructions for the static scheduling of all operations of the logic graph during inference.
We also expanded our compiler and runtime support to arithmetic-based inference engines. For this purpose, we advanced the state of the art in sparse neural network accelerator design by exploiting the concept of periodicity in pattern-based pruning for the first time in hardware. We presented a compiler that reorders the weights according to the periodicity, optimizing for maximum parallelism. Then, a dedicated indexing method was introduced in hardware to fetch the pre-defined locations of nonzero weights with significantly smaller memory requirements. Finally, we presented the novel SPS dataflow, which exploits periodic pattern-based sparsity in neural networks to achieve an FPGA-friendly architecture. Using a compiler tailored to the SPS dataflow, we effectively solved the long-standing indexing overhead problem for unstructured pruning. We co-designed the period-pattern-weight (PPW) compact storage format and the corresponding architecture to efficiently fetch weights and activations. We also applied the next-layer reordering (NLR) optimization, enabled by the periodic pattern-based design, to further reduce the data movement cost between layers.
Bibliography
[1] Syed Asad Alam, David Gregg, Giulio Gambardella, Michael Preußer, and Michaela Blott. “On the
RTL Implementation of FINN Matrix Vector Compute Unit”. In: CoRR abs/2201.11409 (2022).
arXiv: 2201.11409. url: https://arxiv.org/abs/2201.11409.
[2] Renzo Andri, Geethan Karunaratne, Lukas Cavigelli, and Luca Benini. “ChewBaccaNN: A Flexible
223 TOPS/W BNN Accelerator”. In: IEEE International Symposium on Circuits and Systems, ISCAS
2021, Daegu, South Korea, May 22-28, 2021. IEEE, 2021, pp. 1–5. doi:
10.1109/ISCAS51556.2021.9401214.
[3] Andrawes Al Bahou, Geethan Karunaratne, Renzo Andri, Lukas Cavigelli, and Luca Benini.
“XNORBIN: A 95 TOp/s/W hardware accelerator for binary convolutional neural networks”. In:
2018 IEEE Symposium in Low-Power and High-Speed Chips, COOL CHIPS 2018, Yokohama, Japan,
April 18-20, 2018. IEEE Computer Society, 2018, pp. 1–3. doi: 10.1109/CoolChips.2018.8373076.
[4] Avi Baum, Or Danon, and Daniel Chibotero. “Structured Weight Based Sparsity In An Artificial
Neural Network Compiler”. US20200285950A1. Sep. 10, 2020.
[5] Avi Baum, Or Danon, Hadar Zeitlin, Daniel Ciubotariu, and Rami Feig. “Neural Network
Processor Incorporating Separate Control And Data Fabric”. US20180285719A1. Oct. 4, 2018.
[6] Václav E Beneš et al. Mathematical theory of connecting networks and telephone traffic. Academic
press, 1965.
[7] Stephen P. Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. “Distributed
Optimization and Statistical Learning via the Alternating Direction Method of Multipliers”. In:
Foundations and Trends in Machine Learning 3.1 (2011), pp. 1–122.
[8] John Brady, Marco Mecchia, Patrick F. Doyle, and Stanislaw Jan Maciag. “Hardware agnostic deep
neural network compiler”. US20190392296A1. Dec. 26, 2019.
[9] John Brady, Marco Mecchia, Patrick F. Doyle, Meenakshi Venkataraman, and
Stanislaw Jan Maciag. “Control of scheduling dependencies by a neural network compiler”.
US20190391796A1. Dec. 26, 2019.
[10] Robert K. Brayton and Alan Mishchenko. “ABC: An Academic Industrial-Strength Verification
Tool”. In: International Conference on Computer Aided Verification. Vol. 6174. Lecture Notes in
Computer Science. Springer, 2010, pp. 24–40.
[11] John W. Brothers and Joohoon Lee. “Neural network processor”. US20170011288A1. Jan. 12, 2017.
[12] Kurt F. Busch, III Jeremiah H. Holleman, Pieter Vorenkamp, and Stephen W. Bailey. “Pulse-width
modulated multiplier”. US20190050721A1. Feb. 14, 2019.
[13] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. “Eyeriss: A Spatial Architecture for
Energy-Efficient Dataflow for Convolutional Neural Networks”. In: International Symposium on
Computer Architecture. IEEE Computer Society, 2016, pp. 367–379.
[14] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen,
Zhiwei Xu, Ninghui Sun, and Olivier Temam. “DaDianNao: A Machine-Learning
Supercomputer”. In: 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture.
2014, pp. 609–622. doi: 10.1109/MICRO.2014.58.
[15] John Cheng, Max Grossman, and Ty McKercher. Professional CUDA c programming. John Wiley &
Sons, 2014.
[16] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and
Yuan Xie. “PRIME: A Novel Processing-in-Memory Architecture for Neural Network
Computation in ReRAM-Based Main Memory”. In: International Symposium on Computer
Architecture. IEEE Computer Society, 2016, pp. 27–39.
[17] Pi-Feng Chiu, Won Ho Choi, Wen Ma, and Martin Lueker-Boden. “Shifting architecture for data
reuse in a neural network”. US20200117982A1. Apr. 16, 2020.
[18] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang,
Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. “PACT: Parameterized Clipping
Activation for Quantized Neural Networks”. In: CoRR abs/1805.06085 (2018).
[19] Yoo Jin Choi, Mostafa El-Khamy, and Jungwon Lee. “Method and apparatus for neural network
quantization”. US20180107926A1. Apr. 19, 2018.
[20] Dan C. Ciresan, Ueli Meier, and Jürgen Schmidhuber. “Multi-column deep neural networks for
image classification”. In: Conference on Computer Vision and Pattern Recognition. IEEE Computer
Society, 2012, pp. 3642–3649.
[21] Charles Clos. “A study of non-blocking switching networks”. In: Bell System Technical Journal
32.2 (1953), pp. 406–424.
[22] Jason Cong and Yuzheng Ding. “FlowMap: an optimal technology mapping algorithm for delay
optimization in lookup-table based FPGA designs”. In: IEEE Trans. Comput. Aided Des. Integr.
Circuits Syst. 13.1 (1994), pp. 1–12. doi: 10.1109/43.273754.
[23] Matthieu Courbariaux and Yoshua Bengio. “BinaryNet: Training Deep Neural Networks with
Weights and Activations Constrained to +1 or -1”. In: CoRR abs/1602.02830 (2016). arXiv:
1602.02830. url: http://arxiv.org/abs/1602.02830.
[24] William J. Dally, Angshuman Parashar, Joel Springer Emer, Stephen William Keckler, and
Larry Robert Dennison. “Sparse convolutional neural network accelerator”. US10860922B2. Dec.
8, 2020.
[25] Shail Dave, Riyadh Baghdadi, Tony Nowatzki, Sasikanth Avancha, Aviral Shrivastava, and
Baoxin Li. “Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A
Survey and Insights”. In: Proceedings of the IEEE 109.10 (2021), pp. 1706–1752. doi:
10.1109/JPROC.2021.3098483.
[26] James J. Davis, Joshua M. Levine, Edward A. Stott, Eddie Hung, Peter Y. K. Cheung, and
George A. Constantinides. “STRIPE: Signal selection for runtime power estimation”. In: 27th
International Conference on Field Programmable Logic and Applications, FPL 2017, Ghent, Belgium,
September 4-8, 2017. Ed. by Marco D. Santambrogio, Diana Göhringer, Dirk Stroobandt,
Nele Mentens, and Jari Nurmi. IEEE, 2017, pp. 1–8.
[27] Li Deng. “The mnist database of handwritten digit images for machine learning research”. In:
IEEE Signal Processing Magazine 29.6 (2012), pp. 141–142.
[28] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding”. In: Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies.
Association for Computational Linguistics, 2019, pp. 4171–4186.
[29] Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. “Global
Sparse Momentum SGD for Pruning Very Deep Neural Networks”. In: Advances in Neural
Information Processing Systems. 2019, pp. 6379–6391.
[30] Zidong Du et al. “ShiDianNao: shifting vision processing closer to the sensor”. In: International
Symposium on Computer Architecture. ACM, 2015, pp. 92–104.
[31] Javier M. Duarte, Song Han, Philip C. Harris, Sergo Jindariani, Edward Kreinar, Benjamin Kreis,
Jennifer Ngadiuba, Maurizio Pierini, Ryan Rivera, Nhan Tran, and Zhenbin Wu. “Fast inference of
deep neural networks in FPGAs for particle physics”. In: CoRR abs/1804.06913 (2018). arXiv:
1804.06913.
[32] Thomas J. Duerig, Hongsheng Wang, and Scott Alexander Rudkin. “Systems and Methods for
Performing Knowledge Distillation”. US20200401929A1. Dec. 24, 2020.
[33] Ali Farhadi and Mohammad Rastegari. “System and methods for efficiently implementing a
convolutional neural network incorporating binarized filter and convolution operation for
performing image classification”. US10311342B1. Jun. 4, 2019.
[34] Laura Fick, David T. Blaauw, Dennis Sylvester, Michael B. Henry, and David Alan Fick.
“Floating-gate transistor array for performing weighted sum computation”. US9760533B2. Sep.
12, 2017.
[35] Takashi Fukuda, Samuel Thomas, and Bhuvana Ramabhadran. “Soft label generation for
knowledge distillation”. US20190205748A1. Jul. 4, 2019.
[36] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. “TETRIS: Scalable and
Efficient Neural Network Acceleration with 3D Memory”. In: International Conference on
Architectural Support for Programming Languages and Operating Systems. Ed. by Yunji Chen,
Olivier Temam, and John Carter. ACM, 2017, pp. 751–764.
[37] Hasan Genc, Ameer Haj-Ali, Vighnesh Iyer, Alon Amid, Howard Mao, John Charles Wright,
Colin Schmidt, Jerry Zhao, Albert J. Ou, Max Banister, Yakun Sophia Shao, Borivoje Nikolic,
Ion Stoica, and Krste Asanovic. “Gemmini: An Agile Systolic Array Generator Enabling
Systematic Evaluations of Deep-Learning Architectures”. In: CoRR abs/1911.09925 (2019). arXiv:
1911.09925. url: http://arxiv.org/abs/1911.09925.
[38] Vinayak Gokhale et al. “A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks”. In:
Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2014, pp. 696–701.
[39] Richard HR Hahnloser, Rahul Sarpeshkar, Misha A Mahowald, Rodney J Douglas, and
H Sebastian Seung. “Digital selection and analogue amplification coexist in a cortex-inspired
silicon circuit”. In: Nature 405.6789 (2000), pp. 947–951.
[40] Song Han, Jeff Pool, John Tran, and William J. Dally. “Learning both Weights and Connections for
Efficient Neural Networks”. In: CoRR abs/1506.02626 (2015). arXiv: 1506.02626.
[41] Kazuma Hashimoto, Caiming Xiong, and Richard Socher. “Deep neural network model for
processing data through multiple linguistic task hierarchies”. US20180121788A1. May 3, 2018.
[42] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. “Distilling the Knowledge in a Neural
Network”. In: CoRR abs/1503.02531 (2015). arXiv: 1503.02531.
[43] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Computation
9.8 (1997), pp. 1735–1780.
[44] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. “Densely Connected
Convolutional Networks”. In: Conference on Computer Vision and Pattern Recognition. IEEE
Computer Society, 2017, pp. 2261–2269.
[45] Xinming Huang and Yuteng Zhou. “A 20 TOp/s/W Binary Neural Network Accelerator”. In: IEEE
International Symposium on Circuits and Systems, ISCAS 2019, Sapporo, Japan, May 26-29, 2019.
IEEE, 2019, pp. 1–5. doi: 10.1109/ISCAS.2019.8702686.
[46] Julian Ibarz, Yaroslav Bulatov, and Ian Goodfellow. “Sequence transcription with deep neural
networks”. US9454714B1. Sep. 27, 2016.
[47] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training
by Reducing Internal Covariate Shift”. In: International Conference on Machine Learning. Ed. by
Francis R. Bach and David M. Blei. Vol. 37. JMLR Workshop and Conference Proceedings.
JMLR.org, 2015, pp. 448–456. url: http://proceedings.mlr.press/v37/ioffe15.html.
[48] Claudionor José Nunes Coelho Jr., Aki Kuusela, Shan Li, Hao Zhuang, Jennifer Ngadiuba,
Thea Klaeboe Aarrestad, Vladimir Loncar, Maurizio Pierini, Adrian Alan Pol, and Sioni Summers.
“Automatic heterogeneous quantization of deep neural networks for low-latency inference on the
edge for particle detectors”. In: Nat. Mach. Intell. 3.8 (2021), pp. 675–686. doi:
10.1038/s42256-021-00356-5.
[49] Claudionor N. Coelho Jr., Aki Kuusela, Hao Zhuang, Thea Aarrestad, Vladimir Loncar,
Jennifer Ngadiuba, Maurizio Pierini, and Sioni Summers. “Ultra Low-latency, Low-area Inference
Accelerators using Heterogeneous Deep Quantization with QKeras and hls4ml”. In: CoRR
abs/2006.10159 (2020). arXiv: 2006.10159. url: https://arxiv.org/abs/2006.10159.
[50] Patrick Judd, Jorge Albericio, and Andreas Moshovos. “Stripes: Bit-Serial Deep Neural Network
Computing”. In: IEEE Comput. Archit. Lett. 16.1 (2017), pp. 80–83. doi: 10.1109/LCA.2016.2597140.
[51] Duckhwan Kim, Jaeha Kung, Sek M. Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay.
“Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D
Memory”. In: International Symposium on Computer Architecture. IEEE Computer Society, 2016,
pp. 380–392.
[52] Diederik P. Kingma and Jimmy Ba. “Adam: A Method for Stochastic Optimization”. In: 3rd
International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9,
2015, Conference Track Proceedings. Ed. by Yoshua Bengio and Yann LeCun. 2015.
[53] Phil C. Knag, Gregory K. Chen, H. Ekin Sumbul, Raghavan Kumar, Mark A. Anders,
Himanshu Kaul, Steven K. Hsu, Amit Agarwal, Monodeep Kar, Seongjong Kim, and
Ram K. Krishnamurthy. “A 617 TOPS/W All Digital Binary Neural Network Accelerator in 10nm
FinFET CMOS”. In: 2020 IEEE Symposium on VLSI Circuits. 2020, pp. 1–2. doi:
10.1109/VLSICircuits18222.2020.9162949.
[54] Alex Krizhevsky, Geoffrey Hinton, et al. “Learning multiple layers of features from tiny images”.
In: (2009).
[55] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep
Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems. 2012,
pp. 1106–1114.
[56] Alexey Kruglov. “Channel pruning of a convolutional network based on gradient descent
optimization”. US20200394520A1. Dec. 17, 2020.
[57] Souvik Kundu, Mahdi Nazemi, Massoud Pedram, Keith M. Chugg, and Peter A. Beerel.
“Pre-Defined Sparsity for Low-Complexity Convolutional Neural Networks”. In: IEEE Trans.
Computers 69.7 (2020), pp. 1045–1058. doi: 10.1109/TC.2020.2972520.
[58] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. “Gradient-based learning applied to document
recognition”. In: Proceedings of the IEEE 86.11 (1998), pp. 2278–2324. doi: 10.1109/5.726791.
[59] Jinmook Lee, Changhyeon Kim, Sanghoon Kang, Dongjoo Shin, Sangyeob Kim, and Hoi-Jun Yoo.
“UNPU: An Energy-Efficient Deep Neural Network Accelerator With Fully Variable Weight Bit
Precision”. In: IEEE J. Solid State Circuits 54.1 (2019), pp. 173–185. doi: 10.1109/JSSC.2018.2865489.
[60] Seungjin Lee, Sung Hee Park, and Elaina Chai. “Compiling and scheduling transactions in neural
network processor”. US20190340010A1. Nov. 7, 2019.
[61] Yu Li, Zhongxiao Li, Lizhong Ding, Peng Yang, Yuhui Hu, Wei Chen, and Xin Gao. “SupportNet:
solving catastrophic forgetting in class incremental learning with support data”. In: CoRR
abs/1806.02942 (2018).
[62] Dexu Lin, Venkata Sreekanta Reddy Annapureddy, David Edward Howard,
David Jonathan Julian, Somdeb Majumdar, and II William Richard Bell. “Fixed point neural
network based on floating point neural network quantization”. US10373050B2. Aug. 6, 2019.
[63] Shikun Liu, Zhe Lin, Yilin Wang, Jianming Zhang, and Federico Perazzi. “Neural network
architecture pruning”. US20210264278A1. Aug. 26, 2021.
[64] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo.
“Swin Transformer: Hierarchical Vision Transformer using Shifted Windows”. In: International
Conference on Computer Vision. Computer Vision Foundation / IEEE, 2021. doi:
10.1109/ICCV48922.2021.00986.
[65] Gerald M Masson and BW Jordan Jr. “Generalized multi-stage connection networks”. In: Networks
2.3 (1972), pp. 191–209.
[66] Warren S McCulloch and Walter Pitts. “A logical calculus of the ideas immanent in nervous
activity”. In: The bulletin of mathematical biophysics 5.4 (1943), pp. 115–133.
[67] Asit K. Mishra and Debbie Marr. “Apprentice: Using Knowledge Distillation Techniques To
Improve Low-Precision Network Accuracy”. In: International Conference on Learning
Representations. OpenReview.net, 2018.
[68] Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. “WRPN: Wide
Reduced-Precision Networks”. In: International Conference on Learning Representations.
OpenReview.net, 2018.
[69] Pavlo Molchanov, Stephen Walter Tyree, Tero Tapani Karras, Timo Oskari Aila, and Jan Kautz.
“Systems and methods for pruning neural networks for resource efficient inference”.
US20180114114A1. Apr. 26, 2018.
[70] Bert Moons, Daniel Bankman, Lita Yang, Boris Murmann, and Marian Verhelst. “BinarEye: An
always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm
CMOS”. In: 2018 IEEE Custom Integrated Circuits Conference, CICC 2018, San Diego, CA, USA, April
8-11, 2018. IEEE, 2018, pp. 1–4. doi: 10.1109/CICC.2018.8357071.
[71] Maryam Moosaei, Guy Hotson, Parsa Mahmoudieh, and Vidya Nariyambut Murali. “Brake light
detection”. US10853673B2. Dec. 1, 2020.
[72] Tadej Murovič and Andrej Trost. “Massively parallel combinational binary neural networks for
edge processing”. In: Elektrotehniski Vestnik/Electrotechnical Review 86 (Jan. 2019), pp. 47–53.
[73] Mahdi Nazemi, Arash Fayyazi, Amirhossein Esmaili, Atharva Khare, Soheil Nazar Shahsavani,
and Massoud Pedram. “NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function
Combinational Logic”. In: 29th IEEE Annual International Symposium on Field-Programmable
Custom Computing Machines, FCCM 2021, Orlando, FL, USA, May 9-12, 2021. IEEE, 2021,
pp. 266–267. doi: 10.1109/FCCM51124.2021.00053.
[74] Mahdi Nazemi, Ghasem Pasandi, and Massoud Pedram. “Energy-efficient, low-latency realization
of neural networks through boolean logic minimization”. In: Proceedings of the 24th Asia and
South Pacific Design Automation Conference, ASPDAC 2019, Tokyo, Japan, January 21-24, 2019.
Ed. by Toshiyuki Shibuya. ACM, 2019, pp. 274–279.
[75] Wei Niu, Xiaolong Ma, Sheng Lin, Shihao Wang, Xuehai Qian, Xue Lin, Yanzhi Wang, and
Bin Ren. “PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based
Weight Pruning”. In: ASPLOS ’20: Architectural Support for Programming Languages and Operating
Systems, Lausanne, Switzerland, March 16-20, 2020. Ed. by James R. Larus, Luis Ceze, and
Karin Strauss. ACM, 2020, pp. 907–922. doi: 10.1145/3373376.3378534.
[76] Antonio Polino, Razvan Pascanu, and Dan Alistarh. “Model compression via distillation and
quantization”. In: International Conference on Learning Representations. OpenReview.net, 2018.
[77] Mansi Rankawat, Jian Yao, Dong Zhang, and Chia-Chih Chen. “Determining drivable free-space
for autonomous vehicles”. US20190286153A1. Sep. 19, 2019.
[78] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. “XNOR-Net: ImageNet
Classification Using Binary Convolutional Neural Networks”. In: European Conference on
Computer Vision. Vol. 9908. Lecture Notes in Computer Science. Springer, 2016, pp. 525–542.
[79] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. “iCaRL:
Incremental Classifier and Representation Learning”. In: Conference on Computer Vision and
Pattern Recognition. IEEE Computer Society, 2017, pp. 5533–5542.
[80] Bajaj Ronak and Suhaib A. Fahmy. “Efficient mapping of mathematical expressions into DSP
blocks”. In: 24th International Conference on Field Programmable Logic and Applications, FPL 2014,
Munich, Germany, 2-4 September, 2014. IEEE, 2014, pp. 1–4. doi: 10.1109/FPL.2014.6927419.
[81] Bajaj Ronak and Suhaib A. Fahmy. “Experiments in Mapping Expressions to DSP Blocks”. In:
22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines,
FCCM 2014, Boston, MA, USA, May 11-13, 2014. IEEE Computer Society, 2014, p. 101. doi:
10.1109/FCCM.2014.34.
[82] Bajaj Ronak and Suhaib A. Fahmy. “Mapping for Maximum Performance on FPGA DSP Blocks”.
In: IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 35.4 (2016), pp. 573–585. doi:
10.1109/TCAD.2015.2474363.
[83] Bajaj Ronak and Suhaib A. Fahmy. “Minimizing DSP block usage through multi-pumping”. In:
2015 International Conference on Field Programmable Technology, FPT 2015, Queenstown, New
Zealand, December 7-9, 2015. IEEE, 2015, pp. 184–187. doi: 10.1109/FPT.2015.7393146.
[84] Bajaj Ronak and Suhaib A. Fahmy. “Multipumping Flexible DSP Blocks for Resource Reduction
on Xilinx FPGAs”. In: IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 36.9 (2017),
pp. 1471–1482. doi: 10.1109/TCAD.2016.2629421.
[85] Frank Rosenblatt. “The perceptron: a probabilistic model for information storage and
organization in the brain.” In: Psychological review 65.6 (1958), p. 386.
[86] Jonathan Ross and Andrew Everett Phelps. “Computing convolutions using a neural network
processor”. US10438117B1. Oct. 8, 2019.
[87] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov,
James Hegeman, Roman Levenstein, Bert Maher, Nadathur Satish, Jakob Olesen, Jongsoo Park,
Artem Rakhov, and Misha Smelyanskiy. “Glow: Graph Lowering Compiler Techniques for Neural
Networks”. In: CoRR abs/1805.00907 (2018). arXiv: 1805.00907.
[88] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan,
Miao Hu, R. Stanley Williams, and Vivek Srikumar. “ISAAC: A Convolutional Neural Network
Accelerator with In-Situ Analog Arithmetic in Crossbars”. In: International Symposium on
Computer Architecture. IEEE Computer Society, 2016, pp. 14–26.
[89] Soheil Nazar Shahsavani, Arash Fayyazi, Mahdi Nazemi, and Massoud Pedram. “Efficient
Compilation and Mapping of Fixed Function Combinational Logic onto Digital Signal Processors
Targeting Neural Network Inference and Utilizing High-Level Synthesis”. In: ACM Trans.
Reconfigurable Technol. Syst. (Aug. 2022). Just Accepted. issn: 1936-7406. doi: 10.1145/3559543.
[90] Yakun Shao, Rangharajan Venkatesan, Miaorong Wang, Daniel Smith, William James Dally,
Joel Emer, Stephen W. Keckler, and Brucek Khailany. “Efficient neural network accelerator
dataflows”. US20200293867A1. Sep. 17, 2020.
[91] Hardik Sharma et al. “From high-level deep neural models to FPGAs”. In: International
Symposium on Microarchitecture. IEEE Computer Society, 2016, 17:1–17:12.
[92] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale
Image Recognition”. In: International Conference on Learning Representations. 2015.
[93] Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. “End-to-End Optimization of Deep Learning
Applications”. In: FPGA ’20: The 2020 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, Seaside, CA, USA, February 23-25, 2020. Ed. by
Stephen Neuendorffer and Lesley Shannon. ACM, 2020, pp. 133–139.
[94] Dave Steinkrau, Patrice Y. Simard, and Ian Buck. “Using GPUs for Machine Learning Algorithms”.
In: International Conference on Document Analysis and Recognition. IEEE Computer Society, 2005,
pp. 1115–1119.
[95] Xinyao Sun, Xinpeng Liao, Xiaobo Ren, and Haohong Wang. “System and method for
vision-based flight self-stabilization by deep gated recurrent Q-networks”. US10241520B2. Mar.
26, 2019.
[96] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. Efficient Processing of Deep Neural
Networks. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2020.
doi: 10.2200/S01004ED1V01Y202004CAC050.
[97] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. “Efficient Processing of Deep Neural
Networks: A Tutorial and Survey”. In: Proceedings of the IEEE 105.12 (2017), pp. 2295–2329.
[98] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. “Faster gaze prediction with
dense networks and Fisher pruning”. In: CoRR abs/1801.05787 (2018). arXiv: 1801.05787.
[99] Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai,
Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic,
and Alexey Dosovitskiy. “MLP-Mixer: An all-MLP Architecture for Vision”. In: Advances in
Neural Information Processing Systems 34: Annual Conference on Neural Information Processing
Systems, NeurIPS. 2021.
[100] Frederick Tung and Gregory Mori. “System and method for knowledge distillation between
neural networks”. US20200302295A1. Sep. 24, 2020.
[101] Yaman Umuroglu, Yash Akhauri, Nicholas James Fraser, and Michaela Blott. “LogicNets:
Co-Designed Neural Networks and Circuits for Extreme-Throughput Applications”. In: 30th
International Conference on Field-Programmable Logic and Applications, FPL 2020, Gothenburg,
Sweden, August 31 - September 4, 2020. Ed. by Nele Mentens, Leonel Sousa, Pedro Trancoso,
Miquel Pericàs, and Ioannis Sourdis. IEEE, 2020, pp. 291–297.
[102] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Heng Wai Leong,
Magnus Jahre, and Kees A. Vissers. “FINN: A Framework for Fast, Scalable Binarized Neural
Network Inference”. In: Proceedings of the 2017 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, FPGA 2017, Monterey, CA, USA, February 22-24, 2017. Ed. by
Jonathan W. Greene and Jason Helge Anderson. ACM, 2017, pp. 65–74. url:
http://dl.acm.org/citation.cfm?id=3021744.
[103] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. “Attention Is All You Need”. In: CoRR abs/1706.03762 (2017).
arXiv: 1706.03762.
[104] Stylianos I. Venieris and Christos-Savvas Bouganis. “fpgaConvNet: Mapping Regular and
Irregular Convolutional Neural Networks on FPGAs”. In: IEEE Transaction on Neural Networks
and Learning Systems 30.2 (2019), pp. 326–342.
[105] Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. “Toolflows for Mapping
Convolutional Neural Networks on FPGAs: A Survey and Future Directions”. In: ACM Comput.
Surv. 51.3 (June 2018).
[106] Naiyan Wang. “Method and apparatus for neural network pruning”. US20190279089A1. Sep. 12,
2019.
[107] Yu Wang, Fan Jiang, Xiao Sheng, Song Han, and Yi Shan. “Method of pruning convolutional
neural network based on feature map variation”. US20200311549A1. Oct. 1, 2020.
[108] Xuechao Wei, Yun Liang, Xiuhong Li, Cody Hao Yu, Peng Zhang, and Jason Cong. “TGPA:
tile-grained pipeline architecture for low latency CNN inference”. In: Proceedings of the
International Conference on Computer-Aided Design, ICCAD 2018, San Diego, CA, USA, November
05-08, 2018. Ed. by Iris Bahar. ACM, 2018, p. 58.
[109] Wei Wen, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. “Learning Structured Sparsity in
Deep Neural Networks”. In: Advances in Neural Information Processing Systems. 2016,
pp. 2074–2082. url:
http://papers.nips.cc/paper/6504-learning-structured-sparsity-in-deep-neural-networks.
[110] Clifford Wolf. Yosys open synthesis suite. 2016. url: http://www.clifford.at/yosys/.
[111] Y Xiang and XG Gong. “Efficiency of generalized simulated annealing”. In: Physical Review E 62.3
(2000), p. 4473.
[112] Xilinx. Vivado Design Suite User Guide: High-Level Synthesis. Accessed: 9-7-2020.
[113] Seung-Soo Yang. “Neural network system for reshaping a neural network model, application
processor including the same, and method of operating the same”. US20190080239A1. Mar. 14,
2019.
[114] Xuan Yang, Mingyu Gao, Jing Pu, Ankita Nayak, Qiaoyi Liu, Steven Bell, Jeff Setter, Kaidi Cao,
Heonjae Ha, Christos Kozyrakis, and Mark Horowitz. “DNN Dataflow Choice Is Overrated”. In:
CoRR abs/1809.04070 (2018). arXiv: 1809.04070.
[115] Yuanyuan Yang and Gerald M. Masson. “Nonblocking broadcast switching networks”. In: IEEE
Transactions on Computers 40.9 (1991), pp. 1005–1015.
[116] Sergey Zagoruyko and Nikos Komodakis. “Wide Residual Networks”. In: British Machine Vision
Conference. BMVA Press, 2016.
[117] Gang Zhang. “Method and apparatus for compressing neural network”. US20190205759A1. Jul. 4,
2019.
[118] Michael Zhu and Suyog Gupta. “To Prune, or Not to Prune: Exploring the Efficacy of Pruning for
Model Compression”. In: International Conference on Learning Representations. OpenReview.net,
2018.
Abstract
Deep neural networks (DNNs) are deployed widely to solve challenging data analytics, classification, and forecasting problems. To improve their output quality, DNNs are growing in size and complexity, demanding ever more compute cycles, memory footprint, and I/O bandwidth during their training and inference. Given their performance, flexibility, and energy efficiency, field-programmable gate array (FPGA)-based DNN accelerators are gaining traction as a serious contender to replace graphics processing unit- and central processing unit-based platforms. This dissertation aims to provide compiler and runtime support for hybrid arithmetic and logic processing of neural networks. The key idea of the logic processing part is to replace expensive multiply-and-accumulate operations that are required to compute various filter/neuron functions in a DNN with Boolean logic expressions, which are subsequently mapped to native look-up tables (LUTs) of the FPGA device, resulting in low hardware cost and ultra-low latency. In this dissertation, we present F2N3, an across-the-stack design and optimization framework for the construction of resource-constrained, energy-efficient, ultra-low-latency FPGA-based neural network accelerators. Our experimental evaluations across several datasets and DNN architectures demonstrate the superior performance of F2N3 in terms of inference latency, energy efficiency, and output accuracy compared to prior-art FPGA-based DNN accelerators.
We also present a framework for efficient compilation and mapping of fixed-function combinational logic on digital signal processors (DSPs) utilizing high-level synthesis. Mapping large Boolean functions with many input variables and product terms to DSPs on field-programmable gate arrays (FPGAs) requires a new framework that considers the structure and reconfigurability of the DSP blocks during this process. The proposed methodology maps fixed-function combinational logic blocks to a set of Boolean functions whose internal Boolean operations are mapped onto DSP devices rather than look-up tables (LUTs) on the FPGA, to take advantage of the high performance, low latency, and parallelism of DSP blocks.
We also introduce a novel reconfigurable Boolean processor consisting of multiple logic processing units for processing logic-based NN models comprising large Boolean functions with many input variables and product terms. The Boolean processor is accompanied by a mapping framework for the compilation and mapping of NNs utilizing FFCL onto this Boolean processor. We also present a scheduling approach that includes iterative modulo scheduling of the maximal feasible subgraphs (MFGs) and a place-and-route of the scheduled MFGs onto the Boolean processor's architectural resources. A circulation strategy is presented to handle FFCL blocks that cannot straightforwardly fit into the Boolean processor.
Finally, we improved the performance of the arithmetic processing of NNs. In particular, this dissertation introduces the Sparse Periodic Systolic (SPS) dataflow, which advances state-of-the-art hardware accelerators supporting lightweight neural networks. Specifically, the SPS dataflow enables a novel hardware design approach unlocked by periodic pattern-based pruning, resulting in neural network weights with characteristically higher regularity and thus higher degrees of parallelism. We achieve this by addressing the central challenge of reducing the overhead incurred by the irregularity of weights. Our sparsity-aware compiler optimally reorders the weights and uses a simple indexing unit in hardware to create matches between the weights and the corresponding activations.