An FPGA-Friendly, Mixed-Computation Inference Accelerator for Deep Neural Networks
by
Amirhossein Esmaili
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)
December 2023
Copyright 2023 Amirhossein Esmaili
Dedication
To my fiancée, my love, Mahsa, for her unwavering love, support, and understanding throughout the entire
duration of this thesis.
To my beloved parents, Ashraf and Mohsen. Their unconditional love, unwavering support, and constant
encouragement have been the driving force behind my accomplishments. From the earliest stages of my
education to this momentous milestone, they have been my guiding lights, offering guidance and wisdom
every step of the way.
Acknowledgements
I would like to express my deepest gratitude to Professor Pedram for his invaluable support and guidance
throughout the entire research process of this thesis. His expertise, dedication, and encouragement have
been instrumental in shaping the direction and outcome of this work.
Table of Contents
Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiii
Chapter 1: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
Chapter 2: F2N2: An FPGA-Friendly Framework for Designing High-Performance Neural Network
Accelerators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 CNN Processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Compiler Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.3 Architectural Techniques for Exploiting Data Reuse . . . . . . . . . . . . . . . . . . 18
2.3.4 Dataflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.5 Accelerator Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.3.6 SDAccel Environment and Host Code . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 F2N2: Overall Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Accelerator Design Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.1 Accelerator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 Memory Layout and Data Placement . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.3 Low Latency Accelerator Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.3.1 Burst mode transfer and efficient utilization of memory hierarchy: . . . . 27
2.5.3.2 Pre-fetching and double buffering: . . . . . . . . . . . . . . . . . . . . . . 28
2.6 F2N2 Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.6.1 Quantization and Computation Graph Construction . . . . . . . . . . . . . . . . . 29
2.6.2 Optimizer and Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.6.2.1 Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.6.2.2 Cost Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.6.2.3 Generating Instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6.2.4 Reducing Overhead of Batch Normalization Layer . . . . . . . . . . . . . 36
2.7 Operation-packing for lower-precision weights and activations . . . . . . . . . . . . . . . . 38
2.8 HW/SW optimization techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.8.1 Hardware (Kernel) Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.8.1.1 Loop transformation: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.8.1.2 Exploiting task-level parallelism: . . . . . . . . . . . . . . . . . . . . . . 43
2.8.2 Software (Host) Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.8.2.1 Hiding data transfer time by software pipelining: . . . . . . . . . . . . . 44
2.8.2.2 Concurrent accelerator executions by software parallelization: . . . . . . 45
2.9 F2N2 Mixed-Computation Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
2.10 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.10.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
2.10.2 Effect of Accelerator Design Optimizations . . . . . . . . . . . . . . . . . . . . . . 48
2.10.3 Comparison with State-of-the-art Designs . . . . . . . . . . . . . . . . . . . . . . . 50
2.10.4 Results for Multiplication-packing for Lower-precision Weights and Activations . . 52
2.10.5 Software Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.10.6 Results of F2N2 Mixed-computation Design . . . . . . . . . . . . . . . . . . . . . . 55
2.10.6.1 Mixed-computation with XNOR-based convolutions . . . . . . . . . . . . 55
2.11 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
Chapter 3: SynergicLearning: Neural Network-Based Feature Extraction for Highly-Accurate
Hyperdimensional Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.2 Preliminaries & Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4 Proposed Hardware Architecture & Compiler . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.4.1 NN Processing Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4.2 HD Processing Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.5 Results & Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5.1.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.5.1.2 Training Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5.1.3 Neural Network Training . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.5.1.4 Hardware Emulation Framework . . . . . . . . . . . . . . . . . . . . . . 71
3.5.2 The Impact of NNs on the Quality of HD Features . . . . . . . . . . . . . . . . . . . 72
3.5.3 Comparison of Classification Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.5.4 Incremental Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5.5 The Hardware Cost of NN & HD Processing Modules . . . . . . . . . . . . . . . . . 74
3.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Chapter 4: Modeling Processor Idle Times in MPSoC Platforms to Enable Integrated DPM, DVFS,
and Task Scheduling Subject to a Hard Deadline . . . . . . . . . . . . . . . . . . . . . . 82
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.2 Models and Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.1 Voltage and Frequency Change Overhead . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.2 Task Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.2.3 Energy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.2.4 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.3.1 Constraints of the Proposed Scheduling Model . . . . . . . . . . . . . . . . . . . . 87
4.3.2 Modeling Idle Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3.3 Objective Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
4.4.2 Effect of Modeling Idle Intervals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4.3 A Heuristic Approach to Solve the Model . . . . . . . . . . . . . . . . . . . . . . . 96
4.5 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Chapter 5: Energy-Aware Scheduling of Task Graphs with Imprecise Computations and End-to-End Deadlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 Prior Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.3 Models and Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.1 Task Model and Imprecise Computation . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.2 Energy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3.3 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.1 Determining the Number of Processor Cycles Assigned to Optional Workloads of
Non-Exit Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.2 Scheduling Tasks on an MPSoC for Maximizing QoS Subject to Energy and
Deadline Constraints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.4.3 MILP formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.4.4 Complexity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5.1 Simulation Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5.2 Evaluating the Effect of Energy Budget on the obtained QoS . . . . . . . . . . . . . 123
5.5.3 Evaluating the Performance of the Proposed Heuristic versus MILP . . . . . . . . . 124
5.5.4 Evaluating the Effect of imp_label algorithm . . . . . . . . . . . . . . . . . . . . . . 128
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
Chapter 6: Energy-aware scheduling of jobs in heterogeneous cluster systems using deep
reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
6.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
6.2.1 Cluster Model and the Objective Function . . . . . . . . . . . . . . . . . . . . . . . 136
6.2.2 Deep RL Formulation for Deep-EAS Agent . . . . . . . . . . . . . . . . . . . . . . . 138
6.2.2.1 State Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.2.2.2 Action Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.2.2.3 Rewards Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.2.3 Training Deep-EAS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.1 Cluster Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.3.2 Energy Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.3.3 Deep-EAS Training Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.3.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.3.5 Deep-EAS Training Curve and Overhead Analysis . . . . . . . . . . . . . . . . . . 146
6.3.6 Analyzing Why Deep-EAS is Advantageous . . . . . . . . . . . . . . . . . . . . . . 148
6.3.7 Examining the effect of β and c . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
6.4 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
Chapter 7: Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
List of Tables
2.1 Summary of notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Effective PCIe Bandwidth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 PU Operation Instruction Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4 Systolic Array Configuration Instruction Set . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5 Resource Utilization and Frequency for our Accelerator Designs . . . . . . . . . . . . . . . 51
2.6 Comparison with Prior Work for MobileNet . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.7 Results for 2-bit packing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.8 Comparison with Prior Work for VGG-16 on ImageNet∗ . . . . . . . . . . . . . . . . . . . 54
2.9 Comparison with Prior Work for ResNet-18, -50, and -152 Networks on ImageNet . . . . . 55
2.10 F2N2 Mixed-computation Results for Various Models on CIFAR-10 Dataset Using Xilinx
VU9P Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1 Comparison of different characteristics of NNs, HD learning systems, and SynergicLearning. 62
3.2 Top accuracy reported for NNs, HD learning systems, and SynergicLearning on HAR and
ISOLET datasets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.3 Comparison of the effect of incremental learning on the accuracy of different models on
the ISOLET dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.4 Comparison between the hardware metrics of SynergicLearning (d_h = 16) with pure
HD (d_h = 10,240) over the ISOLET dataset on Xilinx UltraScale+ VU9P FPGA. The
improvements of our approach compared to other approaches are shown in parentheses. . 75
4.1 Task Graphs Characteristics and Corresponding Energy Consumption Values Obtained from iSCT versus iSC+T 94
4.2 Idle Intervals Characteristics and No. of Used Processors for iSCT versus iSC+T . . . . . . . . . . . . . . 96
5.1 Comparison of characteristics of prior work on imprecise scheduling compared to this
work (SP stands for single-processor while MP stands for multi-processor). . . . . . . . . . 105
5.2 Summary of Key Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.3 The number of task graphs a feasible schedule was found for using the proposed heuristic,
alongside the average of their obtained QoS, in each ϵmax value for man_low, man_med,
and man_high case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.4 Comparison between the QoS values obtained with the proposed heuristic and MILP for
each task graph from TGFF0 to TGFF19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
5.5 The number of task graphs a feasible schedule was found for using the proposed heuristic
versus the baseline approach, alongside the average of their obtained QoS, in each ϵmax
value for the man_mixed case. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
6.1 In hendrerit gravida rutrum quisque non tellus orci ac. Iaculis urna id volutpat lacus
laoreet non curabitur gravida arcu. Mauris ultrices eros in cursus turpis massa. Sed
tempus urna et pharetra pharetra massa massa. Eget sit amet tellus cras adipiscing enim
eu turpis egestas. Morbi blandit cursus risus at ultrices. . . . . . . . . . . . . . . . . . . . . 152
List of Figures
2.1 High-level overview of F2N2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 High-level overview of systolic array accelerator design. The pipe-shape objects are
first-in-first-out buffers (FIFOs) used to distribute the weights. . . . . . . . . . . . . . . . . 24
2.3 High-level view of a local area of the Xilinx FPGA layout. . . . . . . . . . . . . . . . . . . . 26
2.4 A given CNN and its corresponding DAG. . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.5 INT4-packing for performing four parallel multiplications on a single DSP. Here, $ denotes
the (extended) sign bits and # denotes the padding bits. . . . . . . . . . . . . . . . . . . . . 39
2.6 INT8-packing for performing two parallel multiplications on a single DSP. Here, $ denotes
the (extended) sign bits and # denotes the padding bits. . . . . . . . . . . . . . . . . . . . . 42
2.7 INT2-packing for performing six parallel multiplications on a single DSP. Here, $ denotes
the (extended) sign bits and # denotes the padding bits. . . . . . . . . . . . . . . . . . . . . 43
2.8 Timing diagram of running two accelerators with software pipelining. IFMs and weights
are loaded in WBUF_1 and WBUF_2, respectively. Final results are stored in RBUF. . . . . 46
2.9 Timing diagram of running two concurrent kernels. . . . . . . . . . . . . . . . . . . . . . . 47
2.10 Timing diagrams for layer 5 of VGG-16 (starting at the blue marker and ending at the
yellow marker), when we use the accelerator design shown in Section 2.5.1. Compute
latency of the layer in this case is ≈ 526 µs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
2.11 Timing diagrams for layer 5 of VGG-16 (starting at the blue marker and ending at the
yellow marker), when using the optimizations discussed in Section 2.5.3. Compute latency
of the layer in this case is ≈ 107 µs. The timestamp of the blue marker is earlier compared
to that of Fig. 2.12, as the blue marker shows the beginning of the processing of layer 5 in
VGG-16, and the processing of previous 4 layers has finished sooner in Fig. 2.13 compared
to Fig. 2.12, using the optimizations discussed in Section 2.5.3. . . . . . . . . . . . . . . . . 49
2.12 Timing diagrams for layer 5 of VGG-16 (starting at the blue marker and ending at the
yellow marker), when we use the accelerator design shown in Section 2.5.1. Compute
latency of the layer in this case is ≈ 526 µs. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.13 Timing diagrams for layer 5 of VGG-16 (starting at the blue marker and ending at the
yellow marker), when using the optimizations discussed in Section 2.5.3. Compute latency
of the layer in this case is ≈ 107 µs. The timestamp of the blue marker is earlier compared
to that of Fig. 2.12, as the blue marker shows the beginning of the processing of layer 5 in
VGG-16, and the processing of previous 4 layers has finished sooner in Fig. 2.13 compared
to Fig. 2.12, using the optimizations discussed in Section 2.5.3. . . . . . . . . . . . . . . . . 53
2.14 Software Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.1 The mean and standard deviation of the normalized absolute error between the input
features and the decoded features of their encoded hypervectors for different values of
d_h and q. Ideally, this error should be zero everywhere. However, the error has a non-zero
value even at extremely high dimensions (d_h ≃ 10,000). . . . . . . . . . . . . . . . . . . . 66
3.2 A high-level overview of the SynergicLearning framework. First, an encoder-aware NN is
trained to extract high-quality, high-level features (top row of the figure). Next, encoded
NN features are provided to train an HD classifier. Finally, during inference, the feature
extraction layers of the NN and the HD classifier are both utilized to predict each test
sample’s label. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3 Architectural view of the NN processing module which includes a systolic array, on-chip
memories, tree adders, and ALUs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
3.4 Architectural overview of the HD processing module which includes lookup tables that
store hypervectors representing quantized levels, binding/unbinding units, majority
counters, comparators, tree adders, and tree comparators. . . . . . . . . . . . . . . . . . . . 80
3.5 Two-dimensional (t-SNE) representation of the encoded hypervectors of the HAR dataset
for three different designs: HDL, NN followed by HDL, and encoder-aware NN followed
by HDL. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
3.6 Classification accuracy of different models on HAR and ISOLET datasets for different
values of d_h and q. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.7 The LUT utilization and latency of HD processing modules for different values of d_h . . . . .
4.1 Energy Consumption obtained from the proposed heuristic approach and iSCT for different task graphs . . . 99
5.1 Task graphs of (a) base case 1 and (b) base case 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.2 Algorithmic flow chart for determining the number of processor cycles assigned to optional workloads of
non-exit tasks: (a) Step 1, and (b) Step 2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.3 QoS values obtained for different values of ϵmax for TGFF8 for the man_low case (solid
line), the man_med case (dashed line), and the man_high case (dotted line). . . . . . . . . 126
5.4 Comparison between QoS values obtained for different values of ϵmax for TGFF2 using
the proposed heuristic versus MILP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.5 QoS values obtained for different values of ϵmax for TGFF8 using the proposed heuristic
(solid line) and the baseline approach (dashed line) for the man_mixed case. . . . . . . . . 131
6.1 A high-level view of the reinforcement learning with the policy represented by a deep neural
network. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2 An illustrative example of the state representation of the cluster system in the middle of the job
arrival process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.3 Comparison of Deep-EAS and ESJF at different job arrival rates, when β = 0.5. . . . . . . . . . . 147
6.4 Deep-EAS learning curve indicating the policy improvement over the training iterations. . . . . . 149
6.5 CDF plots of µ_{j,0} of jobs Deep-EAS is not energy-delay conserving for (hold_e), alongside the jobs
Deep-EAS is not work conserving for (hold_w), when λ = 0.9 and β = 0.5. . . . . . . . . . 152
Abstract
Presently, machine learning (ML) models, including deep neural networks, find extensive utilization across
diverse industries, such as banking, finance, analytics, security, drug design, high-tech industry, IC design,
visual tasks, language understanding, healthcare, and business. However, conducting inference, such as
object recognition or language understanding, with state-of-the-art DNNs poses significant challenges due
to the limited computational, memory, and energy resources available, and it necessitates substantial advancements beyond the current state of the art. Essentially, there is a need for a lightweight and highly
energy-efficient inference accelerator, which is capable of achieving inference accuracy comparable to using full precision for all the inference computations. Numerous studies have indicated
that employing full-precision computations is unnecessary for many applications. Nevertheless, extremely
low-precision models often lead to a noticeable decrease in accuracy, which can be as significant as 10-15%
when compared to full-precision computation, such as 32-bit floating point or 16-bit fixed-point models.
While many methods have been proposed to improve the accuracy of the low-precision models, so
far, no one has found a solution to this intrinsic accuracy loss. Stepping back from uniformly ultra-low-precision models, mixed-precision models have been proposed to serve as a better trade-off. Effective
ways have been found to train accurate models with some layers being processed in ultra-low precision
while other layers are processed in high precision [132, 33, 6]. Another important consideration (which
has been the focus of many academic and industry efforts) is the cost of accessing pre-trained weights
from external memory, the memory cost of storing the weights on-chip, and finally, the cost of
weight data transfers on-chip, which is a substantial bottleneck based on benchmarks across various
hardware platforms [133, 134, 148].
By examining the range of deep learning models concerning number systems and numerical
precision, we can conclude that achieving a trade-off between hardware efficiency and inference
accuracy is possible through the utilization of mixed-computation models. These models should
be appropriately trained and deployed on a heterogeneous hardware platform, commonly referred to as
a mixed-computation accelerator fabric. To accommodate various types of neural network models, it is
crucial to have a specially designed computational fabric that supports multiple number systems, multiple
precision computations, and seamless data conversions between different precision levels. Additionally,
support for distillation, compilation, and runtime optimizations is essential to ensure optimal performance.
In this thesis, we focus on the energy-efficient and low-latency implementation of the neural network
inference, and present F2N2, an end-to-end FPGA-friendly Framework for designing Neural Network (NN)
accelerators for NN models by leveraging the count and intrinsic arrangement of computing and memory
resources of the target FPGA. We apply optimizations to reduce the cost of data movement in cloud FPGAs, which are typically equipped with extensive on-chip memory resources. Furthermore, we employ a
software/hardware co-optimization flow in order to achieve an efficient communication method between
the host CPU and the FPGA accelerator in order to maximize performance. Compared to the state-of-the-art work, F2N2 achieves a factor of three reduction in end-to-end inference latency under the same
experimental setup while achieving a clock frequency of 342 MHz on a Xilinx VU9P FPGA device.
In addition, we provide an efficient streaming accelerator architecture for carrying out the inference of mixed-precision deep neural networks. In this architecture, we consider packing the operations
associated with low-precision weights to perform multiple operations using the same resources on the target FPGA. This technique, combined with the streaming architecture, can greatly enhance the throughput
of neural network inference.
In addition to utilizing neural networks, we also propose employing brain-inspired hyperdimensional
(HD) learning models for some cognitive tasks. Neural networks (NNs) are well known for their
high accuracy, owing to the quality of their automatic feature extraction, while brain-inspired HD learning
models are known for their quick training, computational efficiency, and adaptability. This thesis presents
a hybrid, synergic machine learning model that excels at all the said characteristics and is suitable for
incremental, on-line learning on a chip. The proposed model comprises an NN and a classifier. The NN
acts as a feature extractor and is specifically trained to work well with the classifier that employs the HD
computing framework. We use the proposed accelerator mentioned above and present a parameterized
hardware implementation of the said feature extraction and classification components while introducing
a compiler that maps any arbitrary NN and/or classifier to the aforementioned hardware. The proposed
hybrid machine learning model has the same level of accuracy (i.e. ±1%) as NNs while achieving at least
a 10% improvement in accuracy compared to HD learning models. Additionally, the end-to-end hardware
realization of the hybrid model improves power efficiency by 1.60x compared to state-of-the-art, high-performance HD learning implementations while improving latency by 2.13x. These results have profound
implications for the application of such synergic models in challenging cognitive tasks.
Before shifting my focus to the acceleration of mixed-precision neural networks and brain-inspired
hyperdimensional (HD) learning models on FPGAs, I was actively involved in developing energy-aware
scheduling strategies for real-time, deadline-constrained tasks across different computing devices. These
devices spanned from portable embedded systems to servers in data centers. To provide a comprehensive
view of my research journey, I will incorporate my previous work on energy-aware scheduling strategies
into the last three chapters of my thesis.
Chapter 1
Introduction
Deep neural networks (DNNs) have surpassed the accuracy of conventional machine learning models in
many challenging domains including computer vision [69, 112, 124, 46, 142, 52] and natural language processing [50, 9, 129, 29]. Major advancements in building both general-purpose and custom hardware have
been among the key enablers for shifting deep neural networks from rather theoretical concepts to practical solutions for a wide variety of problems [121, 12, 97, 23]. Alarmingly, the success of DNNs comes
at the cost of high latency and enormous hardware resources which, in turn, prevent their deployment
in latency-critical applications, especially on resource-constrained platforms. The high latency and huge
hardware cost are due to the fact that practical, high-quality deep learning models entail billions of arithmetic operations and millions of parameters, which exert considerable pressure on both processing and
memory subsystems.
To sustain the ubiquitous deployment of deep learning models and cope with their computational and
memory complexities, numerous effective methods operating at different levels of the design hierarchy
have been developed. At the algorithm level, methods such as model quantization [98, 56, 152, 71, 149, 154,
83, 20], model pruning [45, 155, 145, 28, 32], and knowledge distillation [49, 82, 93, 125] have gained more
popularity. At the compiler level, domain-specific optimizations, memory-related optimizations (e.g. instruction scheduling, static memory allocation, and copy elimination), and device-specific code generation
are employed [101, 15, 108, 130]. At the architecture level, different dataflow architectures which encourage data reuse are utilized to reduce data movement [43, 34, 143, 18, 139]. Finally, at the circuit and device
level, various energy-efficient digital and analog processing elements which contribute to vector-matrix
multiplications are designed [66, 38, 107, 19].
Convolutional neural networks (CNNs), a specific type of Deep Neural Networks (DNNs), are widely
employed for computer vision applications. In order to enhance the efficiency and performance of network inference, researchers have dedicated their efforts to developing specialized accelerators tailored for
CNNs. These accelerators are designed to optimize CNN computations on different hardware platforms
such as Graphics Processing Units (GPUs), Field-Programmable Gate Arrays (FPGAs), and Application-Specific Integrated Circuits (ASICs). The aim is to leverage the strengths of each platform and create
customized solutions that deliver improved efficiency and performance for CNN-based tasks [16, 135, 17,
140]. In this context, FPGAs constitute an excellent hardware platform because they can bridge the gap
between power-hungry GPU-based accelerators and power-efficient (but costly) ASIC-based designs. The
reconfiguration capabilities in FPGAs allow the generation of high-performance, power-efficient hardware
accelerator designs that can be configured to meet system-level requirements such as throughput, latency
and power for diverse applications ranging from forecasting to classification and deployment scenarios
ranging from embedded systems to data centers.
Chapter 2 presents an end-to-end FPGA-friendly framework for building and implementing high-performance neural network accelerators on FPGA devices, leveraging the underlying architecture of the target
FPGA device and its available resources. By developing optimizations that target data movement cost reduction in FPGA devices equipped with extensive on-chip memory resources (the so-called cloud FPGAs),
we achieve very low end-to-end inference latency on a variety of neural network models. Furthermore, several
research studies have demonstrated that employing lower precision weights and activations for specific
layers, combined with a quantization-aware training approach, leads to only marginal accuracy degradation. In Chapter 2, we introduce hardware support for packing multiply-and-accumulate operations
dedicated to layers utilizing lower-precision weights and activations on a shared compute resource. This
packing technique effectively reduces computational latency, enhancing overall performance.
Furthermore, we consider a SW/HW co-optimization flow, achieving an efficient host-FPGA data communication method using techniques such as software pipelining and parallelization. More details are provided in Chapter 2. Moreover, in addition to supporting mixed-precision multiply-and-accumulate (MAC)
operations, F2N2 has the ability to map a subset of the neural network layers using XNOR-based operations, whereby the weight-activation multiplication is replaced with a single Boolean XNOR operation
between binary weight and activation [95, 98, 128]. Notice that, following this mixed-computation realization, the NN is retrained in order to preserve the inference accuracy of the mixed-computation model.
Furthermore, while most prior work maps the XNOR operations as random logic to look-up tables
(LUTs) in the target FPGA, F2N2 packs XNOR operations into long words and does the required XNOR
operations on DSP resources of the FPGA device. To the best of our knowledge, this is the first work that
maps XNOR operations to DSPs instead of LUTs by packing multiple XNOR operands into long words
that are operated on in parallel. In addition, we will provide an efficient streaming accelerator architecture
for carrying out the inference for mixed-precision deep neural networks. In this architecture, we consider packing the operations associated with low-precision weights to perform multiple operations using
the same resources on the target FPGA. This technique, associated with the streaming architecture, can
greatly enhance the throughput of the inference of the neural network inference.
Furthermore, there are hyperdimensional (HD) learning models that train quickly, are highly adaptable,
and computationally efficient compared to NNs, but suffer from lower levels of accuracy compared to NNs
[65]. HD learning uses randomly generated, high-dimensional vectors to project training data into HD
space such that samples belonging to the same class are placed in close proximity of each other, forming
a cluster in the HD space. It then defines HD centroids that represent different classes. This relatively
simple training process only requires one pass over the training data. It also enables efficient incremental,
lifelong learning because updating the model with new training data is as simple as updating the cluster
centroids. The major disadvantage of HD learning is that it works with raw or handcrafted input features,
which are inferior to the ones extracted by NNs.
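To make the HD training and inference flow described above concrete, the following C++ sketch shows record-based encoding with randomly generated ID and level hypervectors, single-pass centroid training, and similarity-based classification. The dimensionality, quantization scheme, and similarity metric are illustrative choices for exposition only and do not reflect the exact configuration used in Chapter 3.

// Minimal sketch of single-pass hyperdimensional (HD) learning; all parameters
// (dimension D, number of levels, similarity metric) are illustrative only.
#include <algorithm>
#include <array>
#include <cstdlib>
#include <limits>
#include <vector>

constexpr int D = 1024;            // hypervector dimensionality (illustrative)
using HV = std::array<int, D>;     // bipolar hypervector with entries in {-1, +1}

HV random_hv() {                   // randomly generated basis (ID or level) hypervector
    HV v;
    for (int i = 0; i < D; ++i) v[i] = (std::rand() & 1) ? 1 : -1;
    return v;
}

// Encode a feature vector (values assumed normalized to [0, 1]): bind each
// feature's quantized level hypervector with its position (ID) hypervector,
// then bundle (sum) the bound vectors across all features.
HV encode(const std::vector<double>& x,
          const std::vector<HV>& id, const std::vector<HV>& level) {
    HV enc{};                                             // zero-initialized accumulator
    for (size_t f = 0; f < x.size(); ++f) {
        int q = std::min<int>((int)level.size() - 1, (int)(x[f] * level.size()));
        for (int i = 0; i < D; ++i) enc[i] += id[f][i] * level[q][i];
    }
    return enc;
}

// Single-pass training: each class centroid is simply the sum of the encoded
// hypervectors of its training samples; incremental updates reuse this step.
void train_sample(std::vector<HV>& centroids, const HV& enc, int label) {
    for (int i = 0; i < D; ++i) centroids[label][i] += enc[i];
}

// Inference: return the class whose centroid has the highest dot-product
// similarity with the encoded query hypervector.
int classify(const std::vector<HV>& centroids, const HV& enc) {
    int best = 0;
    long long best_sim = std::numeric_limits<long long>::lowest();
    for (size_t c = 0; c < centroids.size(); ++c) {
        long long sim = 0;
        for (int i = 0; i < D; ++i) sim += (long long)centroids[c][i] * enc[i];
        if (sim > best_sim) { best_sim = sim; best = (int)c; }
    }
    return best;
}

Because updating a model only requires adding newly encoded samples to the corresponding centroid, this structure is what makes incremental, lifelong learning inexpensive in the HD setting.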
The complementary characteristics of NNs and HD models encourage the introduction of a mixed-computation, synergic machine learning model that builds on their strengths while avoiding their shortcomings. However, simply employing NNs for feature extraction and HD models for classification so as
to enable on-chip learning has the following challenges. Not only is the training of NNs for feature extraction an iterative, energy-consuming process, but it also requires access to both previous training data
and newly provided data to avoid catastrophic forgetting. Therefore, frequent weight updates of NNs can
be extremely costly in the context of learning on a chip. Additionally, the HD learning models that work
well for solving cognitive tasks have a huge number of dimensions, e.g., 10,000, which requires their hardware implementation to time-share resources and therefore incur a relatively high latency. This prevents
real-time fine-tuning of the model when new training data becomes available. Moreover, training NNs
for feature extraction separately from the design of the HD learning model produces suboptimal results
because it does not account for the effect of HD classification layers on the NN feature extraction layers
and vice versa. This means that the prediction/classification accuracy of the overall mixed-computation
solution will suffer. Therefore, in Chapter 3, we present SynergicLearning, a mixed-computation learning
framework for incremental, on-line learning on a chip.
Lastly, energy consumption is one of the most important design criteria of computing devices, ranging
from portable embedded systems to servers in data centers. In embedded systems, with growing demand
for high performance, architectures such as multiprocessor system-on-chip (MPSoC) are becoming more
popular for many real-time applications. In order to reduce energy consumption in such embedded systems, two main techniques are used, namely, dynamic voltage and frequency scaling (DVFS) and dynamic
power management (DPM). In DVFS, operating voltage and clock frequency of processors are adjusted
based on workload characteristics. With DPM, processors are switched to a low power state (sleep mode)
when they are not used for execution of any tasks (idle time/interval). This leads to the reduction of static
power consumption. However, switching to a sleep mode has non-negligible time and energy overhead,
and it only causes energy savings when the idle time of a processor is longer than a threshold called break-even time [42]. In this thesis, by proposing a method for modeling idle intervals in a multiprocessor system,
we present an energy optimization MILP formulation integrating both DVFS and DPM with scheduling of
real-time tasks with precedence and time constraints. By solving the MILP, for each task, we obtain the
optimum processor assignment, execution start time, and the distribution of its workload among available
frequencies of the processor. More details are explained in Chapter 4.
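As an illustration of the break-even criterion mentioned above, the following C++ sketch evaluates whether switching a processor to a sleep state during a predicted idle interval saves energy. The power and overhead values are generic placeholders rather than measurements from the platform used in Chapter 4, and the linear energy balance is the common textbook formulation, not the full model of that chapter.

// Illustrative break-even check for dynamic power management (DPM);
// all power and timing parameters are placeholders.
struct SleepParams {
    double p_idle;        // static power while idling in the active state (W)
    double p_sleep;       // power drawn in the sleep state (W)
    double e_transition;  // energy overhead of entering and leaving sleep (J)
    double t_transition;  // time overhead of entering and leaving sleep (s)
};

// Break-even time: the shortest idle interval for which sleeping saves energy.
// Staying idle for t costs p_idle * t, while sleeping costs
// e_transition + p_sleep * (t - t_transition); equating the two gives t below.
double break_even_time(const SleepParams& p) {
    double t = (p.e_transition - p.p_sleep * p.t_transition) /
               (p.p_idle - p.p_sleep);
    return (t > p.t_transition) ? t : p.t_transition;  // must also cover the transition
}

// Switch to sleep only if the predicted idle interval exceeds the break-even time.
bool should_sleep(const SleepParams& p, double predicted_idle) {
    return predicted_idle >= break_even_time(p);
}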
One other technique used to reduce energy consumption of tasks in embedded systems is to utilize approximate computations when possible. To elaborate, in many real-time applications,
it is often preferred for a task to produce an approximate (aka imprecise) result by its deadline rather than
producing an exact (aka precise) result late [141]. In imprecise computations, a real-time task is allowed
to return intermediate and imprecise results of poorer quality as long as it processes a predefined chunk
of work that defines its baseline quality. This work presents a heuristic for scheduling task graphs with
potentially imprecise computations, aiming at maximizing QoS subject to a hard deadline and an energy
bound. It also considers the fact that tasks can be interdependent and the imprecise output of one task
affects the input quality of its child tasks. Therefore, the proposed heuristic takes account of potential
extension in the workload of each task based on the quality of its inputs. More details are explained in
Chapter 5.
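The task model underlying this discussion can be sketched as follows; the linear quality and workload-extension relations in this C++ sketch are simplified placeholders chosen for illustration and are not the exact model developed in Chapter 5.

// Simplified sketch of the imprecise-computation task model: each task has a
// mandatory workload plus an optional workload, and the fraction of the
// optional workload that is executed determines its output quality.
struct ImpreciseTask {
    double mandatory_cycles;   // must always run (defines baseline quality)
    double optional_cycles;    // may be partially skipped under tight budgets
    double executed_optional;  // portion of the optional workload actually run
};

// Output quality in [0, 1]: share of the optional workload that was executed.
double output_quality(const ImpreciseTask& t) {
    return t.optional_cycles > 0.0 ? t.executed_optional / t.optional_cycles : 1.0;
}

// A child task fed by imprecise parents may need extra work to compensate;
// here its mandatory workload grows linearly with the worst parent's error
// (an illustrative assumption, not the exact extension model of Chapter 5).
double extended_mandatory_cycles(const ImpreciseTask& child,
                                 double min_parent_quality,
                                 double extension_factor) {
    double input_error = 1.0 - min_parent_quality;
    return child.mandatory_cycles * (1.0 + extension_factor * input_error);
}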
In addition to embedded systems, energy efficiency in cluster systems is also an important design factor,
as it not only can reduce the operational electricity cost, but also can increase system reliability. Furthermore, these platforms are becoming more popular for many computing-intensive real-time applications
such as image or signal processing, weather forecasting, and so forth [39, 118, 156]. A major portion of
this trend is due to rapid progress in computing power of commodity hardware components and their
relatively low cost [156]. Therefore, developing scheduling strategies that achieve promising performance
metrics for real-time workloads while yielding low energy costs is of great necessity. Inspired by recent
advances in employing reinforcement learning (RL) for addressing resource management problems, in this
work, we examine building intelligent systems that learn on their own to achieve energy-aware scheduling strategies, as an alternative to using manually tuned heuristics. The obtained scheduling strategy can
be employed in an online scheduling environment and be efficient under varying workload conditions as
we see in more detail in Chapter 6.
1.1 Contributions
The main contributions of this thesis are as follows:
• This thesis focuses on three key contributions. Firstly, we introduce an FPGA-friendly inference accelerator designed specifically for deep neural networks. This accelerator leverages mixed-computation
techniques to enhance efficiency and performance.
Secondly, we present SynergicLearning, a novel learning framework that supports mixed-computation
for incremental and on-line learning directly on a chip. This framework enables efficient and adaptable learning processes.
Lastly, we propose a set of across-the-stack methods for task scheduling in embedded systems and
heterogeneous cloud systems. These methods prioritize energy efficiency while maintaining high
performance, leading to optimized resource allocation and task management across different computing environments.
Chapter 2 Detailed Contributions:
• We present an end-to-end framework called F2N2, which generates high-performance NN accelerators suited to a target FPGA device from a high-level description of the NN model in PyTorch, while
making effective use of the available FPGA resources and memory bandwidth. The focus of F2N2 is
on CNNs.
• We develop optimized synthesizable C++ templates achieving low-latency accelerator designs on
cloud FPGAs, which have large amounts of on-chip memory, by focusing on optimizations that
lower the overhead of data transfers.
• We introduce a powerful and flexible compiler (F2N2 Compiler) for mapping a given CNN inference
running on any dataset onto our optimized accelerator design, by converting the network model to
a computational graph, scheduling the graph’s execution, and optimizing its nodes by leveraging
intrinsic fusions of convolution and batch normalization layers (which can be pre-calculated).
• Compared to the state-of-the-art work, using our optimizations focused on reducing data transfers
from/to off-chip memory and double-buffering, we achieve significant improvement in the end-to-end latency in the same experimental setup.
• We offer efficient hardware support for MAC computations involving quantized weights and activations of precision lower than 16 bits. This is achieved by packing multiple multiplications onto a
single DSP, optimizing the computational efficiency of the hardware.
• We employ a SW/HW co-optimization flow, specifically focusing on developing efficient host-FPGA
data communication methods that, by carefully managing and optimizing the host-FPGA interface, maintain and deliver the performance and energy-efficiency gains achieved by optimizing the CNN hardware accelerator.
• We support mixed-computation neural network realizations in which the computations for a subset
of layers are carried out by using XNOR-based Boolean operations. XNOR-based computations are
packed and mapped to arrays of DSPs. The mixed-computation network architecture is retrained
to retain the model accuracy as much as possible, while reducing the end-to-end latency even more
compared to when the computations for all layers are carried out using fixed-point computations.
Chapter 3 Detailed Contributions:
• We introduce SynergicLearning, a hybrid learning framework for incremental, on-line learning on
a chip, combining the capabilities of HD and NN models.
• We present a two-step training approach that combines neural network (NN) training with components of the hyperdimensional (HD) learning system. This approach enables automatic feature
extraction and reduces the dimensionality of the HD classifier.
• We develop an on-chip learning module consisting of parameterized NN and HD processing modules.
The NN processing module utilizes a systolic array and ALU for efficient vector-matrix multiplications and operations such as batch normalization and pooling. The HD processing module supports
operations defined in the HD computing framework.
• We implement a custom compiler that performs code optimizations and generates instructions for
efficient scheduling of operations on the target platform, including vector-matrix multiplications
and data movement.
Chapter 4 Detailed Contributions:
• By proposing a method for modeling idle intervals in MPSoCs, we present an energy optimization
MILP formulation integrating both DVFS and DPM with scheduling of real-time tasks with precedence and time constraints.
• We also present a heuristic approach for solving the MILP, which provides results close to the
optimum results obtained by solving the MILP directly.
Chapter 5 Detailed Contributions:
• We present a heuristic for scheduling task graphs with potentially imprecise computations, aiming
at maximizing QoS subject to a hard deadline and an energy bound.
• The proposed heuristic takes account of input-quality-dependent workload extension in energy-constrained scheduling of imprecise, interdependent tasks on multiprocessor system-on-chip (MPSoC) platforms. To the best of our knowledge, this is the most comprehensive work in this domain
to date.
• We present a mixed integer linear program (MILP) formulation of the same problem, enabling comparison of the proposed heuristic with optimal solutions.
• The proposed heuristic is in some cases capable of finding the same QoS as that found by MILP.
Furthermore, for those task graphs for which MILP outperforms the proposed heuristic, QoS values obtained with the proposed heuristic are, on average, within 1.24% of the optimal solutions while
improving the runtime by a factor of 100 or so.
Chapter 6 Detailed Contributions:
• We present Deep-EAS, an intelligent online energy-aware scheduler for cluster systems that have
multiple machines with heterogeneous energy profiles, taking into account the uncertainties associated with the workloads of arriving jobs.
• The Deep-EAS agent starts from knowing nothing about the scheduling task at hand and learns nontrivial scheduling policies by modeling different aspects of the system, such as the arrival rate,
duration and resource-demand profile of incoming jobs, current occupation state of servers and energy profile of using each one for scheduling any of the waiting jobs, and so forth.
• We compare Deep-EAS with manual heuristics under varying workload conditions and examine the
situations where using Deep-EAS is advantageous.
Chapter 2
F2N2: An FPGA-Friendly Framework for Designing High-Performance
Neural Network Accelerators
2.1 Introduction
Deep neural networks (DNNs) provide state-of-the-art performance in various artificial intelligence applications and have surpassed the accuracy of conventional machine learning models in many challenging
domains including computer vision [69, 142, 52] and natural language processing [50, 29]. The emergence of deeper and more complex DNN models has significantly contributed to the impressive performance achieved across various application domains. However, these advanced models demand substantial
computational and memory resources. Additionally, many applications require low end-to-end inference
latency, high throughput, and energy efficiency, all while maintaining high accuracy.
Convolutional neural networks (CNNs) are a subclass of DNNs which are mostly used for computer
vision tasks. To improve the performance and efficiency of network inference, researchers have focused on
designing customized accelerators for CNNs targeting various hardware platforms, including graphics processing units (GPUs), field-programmable gate array (FPGA) devices, and application-specific integrated
circuits (ASICs) [16, 135, 17, 140]. In this context, FPGAs constitute an excellent hardware platform because
they can bridge the gap between power-hungry GPU-based accelerators and power-efficient (but costly)
ASIC-based designs. The reconfiguration capabilities in FPGAs allow the generation of high-performance,
power-efficient hardware accelerator designs that can be configured to meet system-level requirements
such as throughput, latency and power for diverse applications ranging from forecasting to classification
and deployment scenarios ranging from embedded systems to data centers.
High-level synthesis (HLS) tools have shown substantial progress in generating FPGA-based hardware
accelerator designs from a high-level specification of the DNN/CNN models [64]. Existing HLS tools (e.g.,
Xilinx’s Vivado HLS and Intel’s FPGA OpenCL SDK) employ commonly-used programming languages
such as C, C++, and OpenCL in order to facilitate the development and design of neural network hardware accelerators. However, these tools primarily focus on generating efficient designs by mapping and
scheduling low-level primitive operations of the network model onto hardware. Yet, they often overlook
the precise count and intrinsic arrangement of computing and memory resources available on the target
FPGA device. Developing a high-performance neural network accelerator on an FPGA device presents a
challenging endeavor, as it requires thorough exploration of the design space to select a configuration that
closely approaches an optimal accelerator design.
Many approaches for automated mapping of neural networks to FPGA devices have been developed
[131]. Employing these approaches, a neural network accelerator can be constructed starting from a high-level specification of the neural network (e.g., with Caffe∗, PyTorch†, and TensorFlow‡). The integration of
these accelerator generator tools into the existing high-level deep learning software frameworks enables
the application developers to realize custom hardware implementation of neural network inference engines
with little or no hardware design expertise. This enhances easy and streamlined utilization of FPGA devices
by deep learning systems.
∗http://caffe.berkeleyvision.org/
†http://pytorch.org/
‡https://www.tensorflow.org/
Most existing NN-to-FPGA tool flows employ accelerator designs that are mainly suitable for ASIC
implementation without considering the intrinsic arrangement of computing and memory resources in
FPGA devices. This leads to an architecture-device mismatch which can affect the performance of the
FPGA-based NN accelerator adversely [109]. In addition, most prior work merely considers hardware
perspectives for their accelerator design guidelines.§ However, the performance gains of the FPGA-based
NN accelerators are often offset by the overhead of data communication between the host CPU and FPGA
device, a topic that has not been adequately considered in the prior art.
§Design choices regarding configurations of hardware instances on FPGAs are referred to as hardware perspectives.
This chapter presents F2N2, an end-to-end FPGA-friendly Framework for building and implementing
high-performance Neural Network accelerators on FPGA devices, leveraging the underlying architecture
of the target FPGA device and its available resources. By developing optimizations that target data movement cost reduction in FPGA devices equipped with extensive on-chip memory resources (the so-called
cloud FPGAs), we achieve very low end-to-end inference latency on a variety of neural network models. Furthermore, we consider a SW/HW co-optimization flow, achieving an efficient host-FPGA data communication
method using techniques such as software pipelining and parallelization.
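To illustrate the kind of host-side software pipelining referred to here, the C++ sketch below overlaps the transfer of the next input tile with the accelerator's computation on the current one by ping-ponging between two device buffers. The Tile type and the transfer_to_fpga(), launch_kernel_async(), and wait_kernel() calls are hypothetical placeholders standing in for the actual host runtime API (e.g., OpenCL), not functions provided by F2N2.

// Sketch of double-buffered (software-pipelined) host execution. The helper
// functions below are hypothetical placeholders for the real host runtime API.
#include <vector>

struct Tile {};                                    // placeholder for one input tile
void transfer_to_fpga(const Tile& t, int buffer);  // blocking host-to-device copy
void launch_kernel_async(int buffer);              // non-blocking kernel launch
void wait_kernel();                                // wait for the launched kernel

void run_pipelined(const std::vector<Tile>& tiles) {
    if (tiles.empty()) return;
    int ping = 0;
    transfer_to_fpga(tiles[0], ping);              // prime the pipeline
    for (size_t i = 0; i < tiles.size(); ++i) {
        launch_kernel_async(ping);                 // compute on the current tile
        int pong = 1 - ping;
        if (i + 1 < tiles.size())
            transfer_to_fpga(tiles[i + 1], pong);  // overlap next transfer with compute
        wait_kernel();                             // current tile finished
        ping = pong;                               // swap buffers for the next iteration
    }
}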
Moreover, numerous studies have demonstrated that employing lower precision weights and activations, combined with an appropriate quantization-aware training approach, leads to minimal degradation
in accuracy [133, 134, 148]. In particular, it has been observed that utilizing 2-bit, 4-bit, or 8-bit precision
instead of 16-bit fixed-point weights and activations in certain layers of a neural network yields nearly
identical accuracy levels. In this chapter, we introduce hardware support for packing Multiply-Accumulate
(MAC) operations specifically for layers trained with lower-precision weights and activations on the target compute resource, such as DSPs in FPGAs. By implementing packing for these layers, the latency
associated with the operations can be significantly improved, nearly proportional to the packing factor.
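The effect of operation packing can be illustrated with the INT8 case: by placing two weights in disjoint bit fields of one wide operand that shares a common activation, a single wide multiplication yields both products, which is how two (or, for INT4 and INT2, four or six) multiplications can share one DSP. The bit positions and the software emulation below are illustrative; the actual mapping uses the DSP ports and the sign/padding bits depicted in Figures 2.5 through 2.7.

// Software emulation of packing two signed INT8 multiplications that share a
// common activation into one wide multiplication. Bit positions are illustrative.
#include <cstdint>

void packed_mul_int8(int8_t a, int8_t w0, int8_t w1, int32_t& p0, int32_t& p1) {
    // Place w1 18 bits above w0 so the two partial products cannot overlap.
    int64_t packed = (int64_t(w1) << 18) + int64_t(w0);
    int64_t wide = packed * int64_t(a);      // one multiplication, two products

    int32_t low  = int32_t(wide & 0x3FFFF);  // lower 18 bits hold a*w0
    int32_t high = int32_t(wide >> 18);      // upper bits hold a*w1, possibly off by one
    if (low & 0x20000) {                     // a*w0 is negative: sign-extend the
        low  -= 0x40000;                     // 18-bit field and return the borrow
        high += 1;                           // taken from the upper field
    }
    p0 = low;                                // equals a * w0
    p1 = high;                               // equals a * w1
}

With narrower fields, the same idea packs more products into one wide operation, which is the source of the packing-factor speedup referred to above.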
In addition to supporting MAC operations, F2N2 has the ability to map a subset of the neural network
layers using XNOR-based operations, whereby the weight-activation multiplication is replaced with a single Boolean XNOR operation between binary weight and activation [95, 98, 128]. After implementing the
mixed-computation approach, the NN is subsequently retrained to ensure the preservation of inference
accuracy achieved by the mixed-computation model. Moreover, while previous studies often map XNOR
operations as random logic to Look-Up Tables (LUTs) within the target FPGA, F2N2 adopts a different
strategy. It packs multiple XNOR operands into long words and performs the necessary XNOR operations
using the DSP resources available on the FPGA device. To the best of our knowledge, this is the first work
that maps XNOR operations to DSPs rather than LUTs by leveraging the parallel processing capabilities
enabled by packing multiple XNOR operands into long words.
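A word-level view of the XNOR packing idea is sketched below: when weights and activations are constrained to {-1, +1} and stored one bit per value, a bitwise XNOR followed by a population count replaces many individual multiply-accumulates. The 48-bit word width is chosen here only to evoke a DSP-sized operand; the actual mapping of these packed XNOR words onto the DSP resources, as done in F2N2, is not shown.

// Sketch of a packed binary (XNOR) dot product: 48 weight/activation pairs are
// processed per word. Word width and storage layout are illustrative only.
#include <bitset>
#include <cstdint>

int xnor_dot48(uint64_t w_bits, uint64_t a_bits) {
    const int N = 48;                                    // packed binary values per word
    const uint64_t mask = (1ULL << N) - 1;               // keep only the 48 packed lanes
    uint64_t agree = ~(w_bits ^ a_bits) & mask;          // XNOR: 1 where the signs match
    int matches = int(std::bitset<64>(agree).count());   // population count
    return 2 * matches - N;                              // sum of the +1/-1 products
}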
The main contributions of this chapter may be summarized as follows.
• We present an end-to-end framework called F2N2, which generates high-performance NN accelerators suited to a target FPGA device from a high-level description of the NN model in PyTorch, while
making effective use of the available FPGA resources and memory bandwidth. The focus of F2N2 is
on CNNs.
• We develop optimized synthesizable C++ templates achieving low-latency accelerator designs on
cloud FPGAs, which have large amounts of on-chip memory, by focusing on optimizations that
lower the overhead of data transfers.
• We introduce a powerful and flexible compiler (F2N2 Compiler) for mapping a given CNN inference
running on any dataset onto our optimized accelerator design, by converting the network model to
a computational graph, scheduling the graph’s execution, and optimizing its nodes by leveraging
intrinsic fusions of convolution and batch normalization layers (which can be pre-calculated).
• Compared to the state-of-the-art work, using our optimizations focused on reducing data transfers from/to off-chip memory and on double-buffering, we achieve a significant improvement in the end-to-end latency under the same experimental setup.
• We offer efficient hardware support for MAC computations involving quantized weights and activations of precision lower than 16 bits. This is achieved by packing multiple multiplications onto a
single DSP, optimizing the computational efficiency of the hardware.
• We employ a SW/HW co-optimization flow, specifically focusing on developing efficient host-FPGA data communication methods, which preserve the performance and energy-efficiency gains achieved by optimizing the CNN hardware accelerator through careful management of the host-FPGA interface.
• We support mixed-computation neural network realizations in which the computations for a subset
of layers are carried out by using XNOR-based Boolean operations. XNOR-based computations are
packed and mapped to arrays of DSPs. The mixed-computation network architecture is retrained
to retain the model accuracy as much as possible, while reducing the end-to-end latency even more
compared to when the computations for all layers are carried out using fixed-point computations.
2.2 Related Work
DNNWeaver [108], FpgaConvNet [130], TGPA [135], Cloud-DNN [17], and FlexCNN [114] are some of the prior art references that introduce end-to-end tool flows specifically designed for neural network acceleration on FPGA devices. They describe their accelerators in C/C++ using an HLS coding style, except DNNWeaver, which utilizes customized RTL templates. FpgaConvNet employs the synchronous dataflow paradigm [70] for mapping and scheduling neural network operations on FPGAs. This is achieved by translating neural network operations to a synchronous dataflow hardware graph, applying various transformations on the resulting graph, and solving an optimization problem that maximizes throughput or minimizes latency. FpgaConvNet presents a streaming architecture, while the other works utilize a single computation engine.
In a streaming architecture, there is one distinct hardware component for each layer and each component
is optimized separately, while in the single computation engine scheme, a generic accelerator architecture is reused for performing the required computations of all layers. TGPA, a tile-grained pipeline architecture [135], adopts a similar architecture design methodology to improve throughput. Their design aims to optimize throughput by supporting pipelined execution of tiles corresponding to different computation layers of an input image on multiple heterogeneous accelerators. Additionally, they deliver an end-to-end automation flow to generate the accelerator from the high-level CNN model description.
Some prior works aim to provide a framework that automates structural optimizations during the FPGA implementation and creates the network description with a pre-designed C++ template library. Cloud-DNN [17] targets cloud FPGAs and proposes specialized optimizations tailored to cloud-platform characteristics along with support for an easier and streamlined implementation, while FlexCNN [114] presents a flow that can be called within the TensorFlow framework. FlexCNN also delivers high computation efficiency for different types of convolution layers using techniques including dynamic tiling and data layout optimization. HybridDNN [140] proposes a framework for building high-performance FPGA-based hardware implementations using techniques such as a scalable architecture with a hybrid Spatial/Winograd convolution processing engine, a comprehensive design space exploration tool, and an automated design flow to support accelerator design and implementation. FCNNLib [137] proposes a convolution algorithm library for CNN inference on FPGAs, which uses various scheduling algorithms (e.g., spatial, temporal) to coordinate, on FPGAs, multiple CNN-implementation algorithms that are diverse in arithmetic complexity, resource requirements, etc.
The authors of [123] provide a comprehensive survey of recent advances toward enabling efficient processing of DNNs using hardware design solutions or joint hardware design and DNN algorithm solutions.
2.3 Preliminaries
This section includes background on CNN processing, compiler optimizations, and a description and taxonomy of hardware architectural approaches for designing CNN accelerators. The section also describes
the SDAccel environment, which is used for developing CNN accelerator hardware and host software.
2.3.1 CNN Processing
Table 2.1 summarizes this work's notation with regard to neural networks. The computational flow for a convolutional layer in a CNN can be represented by a six-level nested loop (a seven-level nested loop when considering the iteration over images in a mini-batch) known as a computational block; see Algorithm 1. Indeed, a convolutional layer receives input feature maps (IFMs) of size win × hin × cin and convolves them with cout different filters, each of size wk × hk × cin, to generate output feature maps (OFMs) of size wout × hout × cout. The convolution stride for each filter is represented by s. The set of OFMs of the current convolutional layer constitutes the IFMs for the next convolutional layer.
Algorithm 1 MAC computations of a convolutional layer
1: for m in 0 .. cout − 1 do
2: for y in 0 .. hout − 1 do
3: for x in 0 .. wout − 1 do
4: Y [m][x][y] = 0
5: for n in 0 .. cin − 1 do
6: for ky in 0 .. hk − 1 do
7: for kx in 0 .. wk − 1 do
8: Y[m][x][y] += X[n][x + kx][y + ky] · W[n][m][kx][ky]
9: end for
10: end for
11: end for
12: end for
13: end for
14: end for
Table 2.1: Summary of notation
Symbol Meaning
x/x/X input (scalar/vector/2-, 3-, or 4-D tensor)
y/y/Y output (scalar/vector/2-, 3-, or 4-D tensor)
w/w/W weight (scalar/vector/2-, 3-, or 4-D tensor)
L number of layers
win/hin input width/height
wout/hout output width/height
wk/hk kernel width/height
ctin/ctout number of input/output channels of a tile
cpin/cpout number of padded input/output channels to be multiple of wsa/hsa
wt/ht tile width/height
wsa/hsa systolic array width/height
cin/cout number of input/output channels
s stride
p padding
wp/hp width/height of pooling window
2.3.2 Compiler Optimizations
Compilers are responsible for performing a variety of optimizations to efficiently schedule and map the operations defined in neural networks onto general-purpose or custom computing processors. Because the accelerator for each layer should perform the same computation, i.e., implement the aforesaid six-level nested loop, when mapping the convolutional layers of a CNN to a systolic array of MAC units, the search space for the accelerator may be formally specified by how it transforms (i.e., tiles, reorders, and parallelizes) the nested loop structure for that layer. Although we reuse the same systolic array for the computations of all convolutional layers, each layer has its own unique set of loop transformations. Notice that fully connected layers perform similar computations, but with only a two-level nested loop.
2.3.3 Architectural Techniques for Exploiting Data Reuse
Processing neural networks involves a large number of MAC operations for calculating outputs of filters/neurons according to the computational block. The MAC operations can be easily parallelized by using
spatial architectures, which include an array of ALUs and a memory hierarchy that is comprised of centralized, large memory blocks in addition to distributed, small memory sub-blocks. While accesses to the
large memory blocks incur rather large latency and come with high energy consumption cost, accesses to
small memories are fast and energy-efficient. For large and complex CNN models, it is unlikely that the
complete model (including weights) can be mapped onto the chip. Due to the limited off-chip bandwidth,
it is critically important to increase the on-chip data reuse and reduce the off-chip data accesses to improve
the computing efficiency.
At the high level, a CNN accelerator design on a target FPGA device typically comprises several components, namely, the core compute fabric, the memory hierarchy, and on-/off-chip interconnect. Data to
be processed by the accelerator is typically stored in an off-chip (external) memory. To utilize burst access
to the off-chip memory, data is first cached in on-chip buffers before being fed to the computation engine.
The on-chip interconnect is used for data communication between the computation engine and on-chip
buffer banks. By employing different types of computational engines and different designs for the memory
hierarchy, we can realize different accelerator designs as is explained below.
2.3.4 Dataflow
The dataflow (or data reuse pattern) of a CNN inference is in the form of a directed acyclic graph (DAG),
which can be accelerated in hardware without exerting excessive pressure on the memory resources. More
precisely, to avoid frequent data transfers between large and small memory blocks and to reuse data in
each level of hierarchy as much as possible, the inference dataflow is optimized to determine what data
gets read into which level of the memory hierarchy and when each piece of data is processed. Based on
how different data components (e.g., weights, activations, etc.) are reused, various dataflows have been
proposed including weight stationary [43], output stationary [34], and row stationary [18] data flows.
2.3.5 Accelerator Architecture
Generally, CNN accelerator designs on FPGAs may be divided into two categories [131]: single computation engine and streaming architectures. The first class of accelerator designs employs a single computation engine that is used for the computation of all neural network layers. This approach takes one input image at a time and executes the computations of each neural network layer sequentially; it has been used in many prior works, including [108, 114]. The streaming architecture, on the other hand, typically comprises one distinct hardware resource for each neural network layer, where each resource is optimized separately to exploit the parallelism that exists in its assigned layer; see [130, 135]. The tradeoff is that one can use the complete set of FPGA hardware resources to process each neural network layer one at a time or partition these hardware resources into (non-overlapping) parts and assign each hardware resource part to exactly one layer of the network.
2.3.6 SDAccel Environment and Host Code
SDAccel is a development environment for OpenCL applications targeting Xilinx FPGA-based accelerator
cards. The SDAccel environment provides a framework for developing and delivering FPGA accelerated
applications using standard programming languages. In the SDAccel framework, an application program is
split between a host application and hardware accelerated kernels with a communication channel between
them. The host application, which is written in C/C++ and uses Application Programming Interface (API)
abstractions such as OpenCL, runs on a CPU while the kernels run on the FPGA device(s). Communication
between the host CPU and the FPGA accelerator board takes place via the PCIe bus. The host memory is
only accessible by the host application whereas the global memory, which is used to transfer data between
the host application and the accelerated kernels, is accessible by both the host processor and hardware
accelerators. Host code, which provides an interface to allow data transfer from the host machine to
accelerated kernels, follows the OpenCL programming paradigm and is structured into three code sections
for (a) setting the environment, (b) enqueuing kernels for their executions, and (c) post-processing and
releasing the resources.
We ran an experiment to investigate the PCIe bandwidth for transferring data between the host and
FPGA device. Table 2.2 shows that the bandwidth varies widely as a function of the number and size of
buffers used for the data transfer. In general, utilizing a few large buffers is more efficient than many small
buffers. Also, results confirm the significance of data movement cost between the host and FPGA device.
Table 2.2: Effective PCIe Bandwidth
Number of Buffers 1024 256 64 4
Buffer Size (KB) 4 1024 2048 2048 16384 262144
Effective PCIe Bandwidth (MB/s) 5 881 1442 1570 2418 2266
The flow of the host code is as follows: (1) The host application writes the data needed by a kernel into
the global memory of the attached device through the PCIe interface. (2) The host application sets up the
kernel with its input parameters. (3) The host application triggers the execution of the kernel function on
the FPGA. (4) The kernel performs the desired computations while reading data from global memory. (5)
The kernel writes data back to global memory and notifies the host. (6) The host application reads data
back from global memory into the host memory and continues processing as needed.
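As an illustration of this six-step flow, the following is a minimal host-side sketch using the standard OpenCL C API. The kernel name, argument order, buffer sizes, and data types are illustrative placeholders rather than the actual F2N2 host code, and error checking is omitted.

#include <CL/cl.h>
#include <vector>

// Hypothetical single-shot host flow; ctx, q, and krnl are assumed to have been
// created during the platform-setup section of the host code.
void run_inference_once(cl_context ctx, cl_command_queue q, cl_kernel krnl,
                        const std::vector<short>& ifm, std::vector<short>& ofm) {
    cl_int err;
    cl_mem in_buf  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,
                                    ifm.size() * sizeof(short), nullptr, &err);
    cl_mem out_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                    ofm.size() * sizeof(short), nullptr, &err);
    // (1) Write the input data into the device global memory over PCIe.
    clEnqueueWriteBuffer(q, in_buf, CL_TRUE, 0, ifm.size() * sizeof(short),
                         ifm.data(), 0, nullptr, nullptr);
    // (2) Set up the kernel with its input parameters.
    clSetKernelArg(krnl, 0, sizeof(cl_mem), &in_buf);
    clSetKernelArg(krnl, 1, sizeof(cl_mem), &out_buf);
    // (3) Trigger the kernel execution on the FPGA; steps (4) and (5), reading from
    // and writing back to global memory, happen on the device side.
    clEnqueueTask(q, krnl, 0, nullptr, nullptr);
    // (6) Read the results back from global memory into host memory.
    clEnqueueReadBuffer(q, out_buf, CL_TRUE, 0, ofm.size() * sizeof(short),
                        ofm.data(), 0, nullptr, nullptr);
    clFinish(q);
    clReleaseMemObject(in_buf);
    clReleaseMemObject(out_buf);
}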
2.4 F2N2: Overall Flow
F2N2 provides an end-to-end design framework, which takes a high-level description of the CNN specified
in PyTorch and generates a high-performance CNN accelerator design suited to the architecture of the
target FPGA in three steps (cf. Fig. 2.1):
1. Step 1: F2N2 performs quantization/padding on the test data and pre-trained weights and activations. It may also retrain the quantized network to limit accuracy loss due to quantization. Next, it
generates a (computational) inference graph from the network. Details are given in Section 2.6.1.
2. Step 2: F2N2 performs optimizations tailored to the employed accelerator design for the implementation of the inference graph. Next, F2N2 compiles the information from the optimizer, extracts the accelerator design parameters, and generates a static schedule for performing various operations. Details are presented in Section 2.6.2.
3. Step 3: Finally, using optimized accelerator component templates, F2N2 generates hardware code
(i.e., synthesizable C-level descriptions), which will be used for generating the FPGA bit stream.
Details are provided in Section 2.5.3.
2.5 Accelerator Design Optimizations
This section focuses on the question of how to improve the efficiency and performance of a CNN accelerator using FPGA device-aware optimizations. Solutions include data placement optimizations to reduce
the number of accesses to DRAM and use of the on-chip memory to balance computation vs. memory
bandwidth.
2.5.1 Accelerator Design
The accelerator based on a systolic array of MAC units comprises (i) a 2D array of processing elements
(PEs), which are responsible for executing the MAC operations associated with the convolutional calculations in a DNN, and (ii) a memory hierarchy feeding data to the said array, comprising register
files, the on-chip memory (Block RAMs and Ultra RAMs on FPGA devices), and external off-chip memory
(DRAM). Fig. 2.2 shows the employed design for the systolic array of MAC units. As shown in Fig. 2.2, the
systolic array is followed by a processing unit (PU), which in turn comprises several ALUs, which perform
neural network calculations such as the nonlinear activation function application, average or maximum
pooling, etc.
Figure 2.1: High-level overview of the F2N2 flow: (1) pre-processing (quantization, padding, and reshaping of the pre-trained weights, biases, and test data, with optional retraining of the quantized network), (2) optimizer and scheduler (construction of the computational graph, memory-usage analysis, tiling size/order optimization for all MAC-based layers, scheduling of off-chip memory accesses and PE operations, and generation of computational instructions for the ALUs), and (3) SDAccel code generation (synthesizable C-level description of the accelerator from HLS templates, accelerator configuration files, and C/OpenCL host code for platform setup, global-memory management, and running the accelerator).
Figure 2.2: High-level overview of the systolic array accelerator design. The pipe-shaped objects are first-in-first-out buffers (FIFOs) used to distribute the weights.
We present an FPGA device-aware accelerator design that utilizes the following salient features of common FPGA devices: (i) The available hardware resources in an FPGA device, i.e., digital signal processing units (DSPs), Configurable Logic Blocks (CLBs), which comprise several look-up tables (LUTs), Block RAMs (BRAMs), which are the widely used on-chip memory components in FPGA devices, and optional Ultra RAMs (URAMs), which are cascadable, two-port, synchronous on-chip memory blocks available in UltraScale+ FPGA devices, are placed as resource groups in a column-wise manner. As seen in Fig. 2.3, there is a column of DSPs, followed by a column of CLBs, a column of BRAMs, and a second column of CLBs. On an FPGA device, this resource placement pattern is repeated many times. Consequently, the BRAMs are uniformly
distributed on the FPGA chip, and one should place data that is used by a DSP in a BRAM that is physically
close to the DSP. (ii) The resource ratio of DSP count to BRAM count is one (a common ratio in most FPGA
device families), so we can use one BRAM as the on-chip storage unit for at least one DSP. (iii) Although
addition and nonlinear functions can be mapped to the DSP, it is more efficient to map them to a custom
PU which itself is implemented using the CLB resources of the FPGA device.
A PE in our design comprises one DSP and its adjacent BRAM. The IFMs are first cached in an input
buffer and then sequentially passed onto the first row of the systolic array of the said PEs. In addition,
to avoid the need for costly multiplexers, input data is simply shifted into the PE array and between
the neighboring PEs on the same row of the systolic array. This scheme eliminates the need for having
global interconnections that connect the input buffers to all PEs (recall that the delay of an interconnect
scales quadratically with its length). We also implement pipelined data transfer controllers (PDTCs) for
transferring weights to the PEs, where each PDTC only connects to BRAMs for the designated set of PEs in
a row to reduce the critical path delay. PDTCs are chained together in a linear arrangement using FIFOs.
To implement FIFOs, we use LUTRAMs instead of BRAMs. This implementation choice creates a more
balanced utilization of the underlying FPGA resources. After all computations for one OFM are done, the
registered partial sum results that reside in the PEs of one row are sent to the tree adder to do the required
summation and produce the final OFM value.
The employed data flow for the systolic array accelerator design is a combination of both weight stationary and output stationary data flows mentioned in Section 2.3.4, because (i) all weights required for the
computation of the layer are transferred from the external memory into BRAMs and are reused for different patches of input feature maps, and (ii) partial sums computed for generating an OFM are accumulated
and stored in the register file of the same PE. More details on the employed data flow will be discussed in
Section 2.6.2.1.
Figure 2.3: High-level view of a local area of the Xilinx FPGA layout.
2.5.2 Memory Layout and Data Placement
To utilize the off-chip memory bandwidth, we group input/output data as well as weights before sending
them to the on-chip memory. Considering weights, we group hsa of them in the output channel dimension. So, given a convolutional layer with a 4D weight tensor of size < cout, cin, hk, wk >, we first pad
the weights to match the systolic array dimensions by adding 0’s to the cout and cin dimensions. The
padded cpout and cpin are multiples of hsa and wsa, respectively. Next, we split the cout dimension into two
dimensions with sizes of cpout/hsa and hsa. The final shape of the weight tensor for each layer is then a
5D tensor of < cpout/hsa, hk, wk, cpin, hsa >, where cpout and cpin are the padded versions of cout and cin,
respectively.
Similarly, we pad and group activations along the input channel dimension. Since we bring the input
data from BRAM and distribute the said data across different PEs within a row of the systolic array, we
group wsa of this data. The output data will be grouped similarly to the weights where each hsa of this
data is grouped together. Consequently, with a 16-bit fixed point representation for both activations and
weights, the width of the data stored in output registers and BRAMs for weights and output data of each
layer is hsa ×16, whereas the width of data stored in input registers and BRAMs for each layer is wsa ×16.
Note that relative widths of the integer and fractional parts of the fixed-point representation are chosen
after training based on the given CNN and the targeted dataset.
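To make the weight reshaping concrete, the sketch below computes the padded channel counts and the flat offset of a weight element in the 5D layout described above; the struct and its member names are our own illustrative helpers, not part of the F2N2 code base.

#include <cstddef>

// Hypothetical helper for the <cpout/hsa, hk, wk, cpin, hsa> weight layout.
struct WeightLayout {
    int cout, cin, hk, wk;   // original 4-D weight tensor dimensions
    int hsa, wsa;            // systolic array height and width
    int cpout() const { return ((cout + hsa - 1) / hsa) * hsa; }  // pad cout to a multiple of hsa
    int cpin()  const { return ((cin  + wsa - 1) / wsa) * wsa; }  // pad cin to a multiple of wsa
    // Flat offset of original element (co, ci, ky, kx): co is split into
    // (co / hsa, co % hsa), which become the outermost and innermost dimensions.
    std::size_t offset(int co, int ci, int ky, int kx) const {
        std::size_t idx = static_cast<std::size_t>(co / hsa);
        idx = idx * hk     + ky;
        idx = idx * wk     + kx;
        idx = idx * cpin() + ci;
        idx = idx * hsa    + (co % hsa);
        return idx;
    }
};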
2.5.3 Low Latency Accelerator Design
Long latencies associated with accessing the external memory constitute a major performance bottleneck
for CNN accelerator designs not only because neural network models have tens of millions of parameters
that need to be read from memory, but also because of the frequent reading and writing of intermediate
partial sum values generated during the CNN processing itself. We apply the following optimizations to
decrease the number of read/write accesses from/to the external memory and reduce the latency overhead
associated with these accesses. These techniques are applicable in FPGA devices with a large amount of
on-chip memory (e.g., a cloud FPGA device).
2.5.3.1 Burst mode transfer and efficient utilization of memory hierarchy:
We load all input activations required for a layer’s computations at once before any computations for
the layer can begin. In this way, we avoid iteratively loading activations from the off-chip memory and
writing the calculated partial sums back to the off-chip memory. This reduces the layer processing time,
especially because now the loading of weights and input activations can be done simultaneously.
This is because weights and activations are stored in separate off-chip memory blocks in the target system
(FPGA board) and are thus simultaneously accessible. Furthermore, because intermediate results (partial
sums) will be saved in on-chip memory, we do not need to store them to the off-chip memory. We also
perform full burst read/write of data from/to the memory banks and utilize the maximum possible burst
size (512-bit width) and burst length (256 beats) allowable on the Advanced eXtensible Interface (AXI) bus
of the target FPGA board. Specifically, to enable burst read of weights, we allocate a global weight buffer
on the FPGA device, read all the weights continuously from the off-chip memory bank for weights to this
buffer, and then distribute the said weights from the buffer to the distributed BRAM blocks adjacent to the
PEs (see Fig. 2.2). This is more efficient compared to directly reading weights from the memory bank into
distributed BRAM blocks, which cannot make good use of the burst mode of DRAM because the required
indices tend to obstruct the continuous burst read of weights from the off-chip memory. Of course, in this
scheme the indexing and distribution of weights to BRAMs will have to be done on-chip. Similarly, we create
global buffers for burst reading (and writing) of the input (and output) activations. The said global buffers
are realized with URAM blocks available on the FPGA device.
2.5.3.2 Pre-fetching and double buffering:
To further reduce the overhead of loading weights, when doing computations of a current layer, we prefetch the weights for the next layer so that the actual computations of the next layer can start earlier. To
achieve this goal, we employ two global weight buffers, namely buffers A and B, where one contains the
weights required for the current layer’s computations while the other is being filled with the weights of the
next layer that are read from the off-chip memory. Notice that roles of buffers A and B are interchanged as
we move from layer i to layer i + 1 processing (this is the notion of double-buffering). Similarly, by using
two on-chip activation buffers, and employing the double-buffering technique, we eliminate the need to
first store the computed output activations of a layer in the external memory, and subsequently, load the
said activations as input activations for the computations of the next layer from the external memory. In
other words, we do away with the time-consuming data transfers from/to the external memory during the
inference computations (except for loading the input activations of the first layer and storing the output
activations of the last layer). The benefit of applying the pre-fetching and double buffering optimizations
will be discussed in Section 2.10.2.
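The following is a minimal software model of the ping-pong (double-buffering) scheme for the global weight buffers, assuming hypothetical helpers load_weights_from_dram() and compute_layer() and illustrative sizes; the actual F2N2 templates realize the overlap in hardware (separate AXI masters operating on URAM/BRAM buffers), which is not captured by this sequential sketch.

#include <cstdint>

typedef int16_t wt_t;                 // 16-bit fixed-point weight (illustrative)
constexpr int WBUF_DEPTH = 1 << 16;   // illustrative global weight-buffer depth
constexpr int NUM_LAYERS = 16;        // illustrative layer count

void load_weights_from_dram(wt_t* dst, int layer)  { /* placeholder: burst-read the layer's weights */ }
void compute_layer(const wt_t* weights, int layer) { /* placeholder: run the systolic array */ }

void run_all_layers() {
    static wt_t buf_a[WBUF_DEPTH];
    static wt_t buf_b[WBUF_DEPTH];
    load_weights_from_dram(buf_a, 0);          // prologue: fetch the weights of layer 0
    bool current_is_a = true;
    for (int layer = 0; layer < NUM_LAYERS; ++layer) {
        wt_t* cur  = current_is_a ? buf_a : buf_b;
        wt_t* next = current_is_a ? buf_b : buf_a;
        if (layer + 1 < NUM_LAYERS)
            load_weights_from_dram(next, layer + 1);   // pre-fetch the next layer's weights
        compute_layer(cur, layer);                     // compute with the current weights
        current_is_a = !current_is_a;                  // swap the roles of buffers A and B
    }
}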
To minimize the overhead of data transfers, a technique can be employed involving two BRAM banks
that are toggled between consecutive layers to store the weights required for computations separately.
This approach proves especially beneficial when there is a significant difference in the number of weights
between two consecutive layers, resulting in varying processing times for transferring weights from global
buffers (implemented with URAMs) to the corresponding BRAMs. An example of such a scenario is found
in networks like MobileNet [51], where depthwise convolution and pointwise convolution layers are consecutive, with the latter type typically having a higher number of weights. In Section 2.10.3, a more detailed
evaluation of this technique will be presented.
2.6 F2N2 Compiler
This section describes the F2N2 compiler which optimizes and runs a neural network model on the designed hardware accelerator. The proposed compiler takes a neural network model, converts it to a computational graph, schedules its operations, and more importantly, optimizes its nodes by leveraging intrinsic
fusion of different required operations such as convolution or fully-connected layer computations with
batch normalization.
2.6.1 Quantization and Computation Graph Construction
The F2N2 compiler receives a CNN model with pre-trained weights and test data. First, it tries to quantize the weights to reduce the computational overhead. It runs the quantized network on the hardware
accelerator and checks the inference accuracy on the test data. If the accuracy degradation is more than a
user-specified threshold, the compiler invokes a retraining step by using the QPyTorch engine [144].¶ Finally, it passes the quantized and optimized CNN model to the next step for constructing the computational
graph.
¶QPyTorch is a low-precision arithmetic simulation framework compatible with PyTorch that provides a convenient interface
for minimizing the effort needed to reliably convert an existing training code to perform low-precision training.
A CNN can be thought of as a directed acyclic graph (DAG) of various types of layers, each layer performing a specific function, e.g., convolution, activation function application, pooling, batch normalization, and flattening (done before the last few fully-connected layers). We use an in-house translator to capture the structure of the CNN by a DAG where each node in the DAG is a macro node comprising a convolutional or a fully-connected layer and its subsequent processing layer(s), such as activation function and pooling layers, up to (but not including) the next convolutional (or fully-connected) layer. Note that the batch normalization is fused into the computations performed by the convolutional (or fully-connected) layers later in the proposed flow. Fig. 2.4 shows a given CNN topology and its corresponding DAG. ResNet models incorporate skip connections ("shortcuts") between two non-adjacent layers to compute the residual functions and enable more complex interconnections between layers. These characteristics make the ResNet structure irregular and more complex compared to other CNNs such as VGG models. To handle such a structure, the F2N2 compiler's translator first extracts the layers on the skip connection path as a macro node, and adds an element-wise adder operation as the last macro node of the forward path.
2.6.2 Optimizer and Scheduler
As explained in Section 2.5.1, computations are scheduled on the systolic array of MAC units of the target FPGA device, followed by the PU. More precisely, we use the single computation engine scheme and
the MAC-based accelerator design of Fig. 2.2 for implementing each convolutional or fully-connected
layer. However, the compiler can choose a different set of loop optimizations for each layer. Therefore, the
proposed compiler generates custom instructions to determine and distinguish loop alterations and loop bounds for optimizing the performance of each individual layer.
Figure 2.4: A given CNN and its corresponding DAG.
2.6.2.1 Tiling
Our hardware design takes one input image at a time to be sequentially operated on by all nodes in the
DAG. After inference on the current image is concluded, the next image is fetched for processing and so on
until all images are processed. For each layer, due to the limited amount of resources that are available on
the target FPGA device and in view of much larger volume of required computations and data movements
for that layer, our design first divides an input feature map into tiles. Next, it loads the tiles one after the
other from the off-chip memory to the on-chip memory before processing the said tiles sequentially. We
use all the PE units in our systolic array to process each layer (according to the single computation engine
architecture). To partially avoid the resource inefficiency caused by a fixed tile size, our accelerator design
allows the use of a dynamic tile size for each DAG node.
After extracting the DAG from the given CNN topology, the F2N2 compiler calculates the memory
requirements for each DAG node. When the memory requirements are less than the available resources,
the compiler does not perform any tiling. Otherwise, it calls its optimizer to apply loop tiling to the loops
in the computational block. We make some assumptions for our loop blocking (e.g., tiling) optimization:
(i) the compiler does not apply tiling on kernel width and height (wk and hk in the computational block)
because they are usually small, e.g., 3; (ii) The compiler may change the order of the outer loops as shown
in Algorithm 2, but maintains the fixed order of the inner loops as shown in Algorithm 3. Note that the
order of the inner loops has been chosen so as to minimize the data movement considering the internal
architecture and resource placement of the target FPGA device; and (iii) For the size of the systolic array
(wsa ×hsa), we search among 16x16, 32x32, and 64x64 size array candidates and choose the one that would
yield the minimum cost function as explained in Section 2.6.2.2.
Algorithm 2 Tiled computations for a convolutional layer
1: Fill_weight_BRAMs( )
2: for m in 0 .. ⌈cout/ctout⌉ − 1 do
3: for y in 0 .. ⌈hout/ht⌉ − 1 do
4: for x in 0 .. ⌈wout/wt⌉ − 1 do
5: for n in 0 .. ⌈cin/ctin⌉ − 1 do
6: Load_data( )
7: Do_inner_loops( )
8: Store_data( )
9: end for
10: end for
11: end for
12: end for
Algorithm 3 Inner loop computations
1: for mt in 0 .. ⌈ctout/hsa⌉ − 1 do
2: for yt in 0 .. ht − 1 do
3: for xt in 0 .. wt − 1 do
4: for ky in 0 .. hk − 1 do
5: for kx in 0 .. wk − 1 do
6: for nt in 0 .. ⌈ctin/wsa⌉ − 1 do
7: for i in 0 .. wsa − 1 do
8: #pragma unroll(i)
9: for j in 0 .. hsa − 1 do
10: #pragma unroll(j)
11: Y[(m·ctout + mt)·hsa + j][x·wt + xt][y·ht + yt] += X[(n·ctin + nt)·wsa + i][x·wt + xt + kx][y·ht + yt + ky] · W[(n·ctin + nt)·wsa + i][(m·ctout + mt)·hsa + j][kx][ky]
12: end for
13: end for
14: end for
15: end for
16: end for
17: end for
18: end for
19: end for
Another key point in the tiling optimization step is the management and scheduling of the memory transfer operations of a convolutional (or a linear) layer. As our accelerator design in Fig. 2.2 illustrates, we bring weights and fill the BRAMs next to each PE before the computational loops are started
(cf. Fill_weight_BRAMs() on line 1 of Algorithm 2). Input/output feature maps are loaded before the
computation engine corresponding to inner loops starts (cf. Load_data() on line 6 of Algorithm 2), and
the generated output feature maps are written back to the main memory (cf. Store_data() on line 8 of
Algorithm 2). The reason that the (partially computed) output feature maps may have to be loaded by
Load_data() is that we also support tiling along the input channel dimension. Indeed, when we tile along
this dimension, the partial products that are calculated based on a first subset of the input channels are
accumulated and written into the main memory; later when a second subset of the input channels are
brought in to continue the pixel value computation for an output feature map, the previously stored pixel
value must be read and subsequently combined with the new accumulated value corresponding to the
second subset of the input channels. Finally, we search over different tile sizes and, based on our computational and memory access models, find the tile size that yields the lowest latency. The computational
performance can be calculated as explained in the next subsection.
The data flow used in our design is a combination of weight stationary and output stationary data flows, where the weights are stored in BRAMs and the outputs are stored in registers associated with the PEs. We also map cout and cin partially to the parallel computation units, i.e., the PEs in the 2-D systolic array of MAC units. More precisely, the sizes of the spatial unrolling factors of the loops over cout and cin are determined by the height hsa and width wsa of the systolic array, respectively.
2.6.2.2 Cost Function
The systolic array accelerator comprises an array of PEs, an on-chip memory hierarchy, and on-/off-chip interconnects for accessing activations and weights. Therefore, we model the performance of the
accelerator in terms of computation and communication cost functions, which in turn depend on the tile
sizes. The space of all feasible tile sizes can be expressed as:
\[
1 \le w_t \le w_{out}, \quad
1 \le h_t \le h_{out}, \quad
1 \le h_{sa} \le c_{t,out} \le c_{out}, \quad
1 \le w_{sa} \le c_{t,in} \le c_{in}. \tag{2.1}
\]
The loop ordering in Algorithm 2 is not fixed in advance, and is instead optimized by the proposed
compiler. First, we formulate the computation and data movement costs as follows. Given a specific tile
size combination of wt, ht, ctout , ctin, the computational latency of a convolutional layer may be calculated
as:
\[
T_{comp} = \left\lceil \frac{c_{out}}{c_{t,out}} \right\rceil \cdot \left\lceil \frac{c_{in}}{c_{t,in}} \right\rceil \cdot \left\lceil \frac{w_{out}}{w_t} \right\rceil \cdot \left\lceil \frac{h_{out}}{h_t} \right\rceil \cdot \left( h_k \cdot w_k \cdot \left\lceil \frac{c_{t,out}}{h_{sa}} \right\rceil \cdot \left\lceil \frac{c_{t,in}}{w_{sa}} \right\rceil \cdot w_t \cdot h_t \cdot u + t_{sa} \right), \tag{2.2}
\]
where u denotes the initiation interval of the pipeline, which is defined as the number of clock cycles that
must elapse between issuing two consecutive input feature maps into the systolic array (u = 1 in our
design), and tsa is the (cycle count) latency of passing one input feature map through the systolic array.
Another important factor in the accelerator performance is the data movement (transfer) latency, which
is proportional to the total amount of external data accesses from the off-chip memory and thus is calculated as:
\[
T_{datmov} = DA_{ext} \cdot t_{ext}, \qquad
DA_{ext} = \alpha_{in} \cdot B_{in} + \alpha_{w} \cdot B_{w} + \alpha_{out} \cdot B_{out}, \tag{2.3}
\]
where αin, Bin, αw, Bw, αout, and Bout denote the trip counts and buffer sizes of memory accesses to the input feature maps, weights, and output feature maps, respectively, and t_ext denotes the latency of accessing the off-chip memory. The trip count is the total number of transfers made between the off-chip memory and the buffers. These trip counts and buffer sizes are calculated as follows:
\[
\begin{aligned}
\alpha_{in} = \alpha_{out} &= \left\lceil \frac{c_{out}}{c_{t,out}} \right\rceil \cdot \left\lceil \frac{c_{in}}{c_{t,in}} \right\rceil \cdot \left\lceil \frac{w_{out}}{w_t} \right\rceil \cdot \left\lceil \frac{h_{out}}{h_t} \right\rceil, \qquad \alpha_{w} = 1,\\
B_{in} &= \frac{c_{t,in} \cdot (s\,h_t + h_k - s) \cdot (s\,w_t + w_k - s)}{w_{sa}},\\
B_{w} &= \frac{c_{out} \cdot c_{in} \cdot h_k \cdot w_k}{h_{sa}}, \qquad
B_{out} = \frac{c_{t,out} \cdot h_{out} \cdot w_{out}}{h_{sa}}.
\end{aligned}
\tag{2.4}
\]
The reason for dividing all buffer sizes by wsa (hsa) is described in section 2.5.2. αw is 1 since we bring all
the required weights in one external memory trip. By enumerating all possible loop orders and tile sizes,
one can generate a set of computational and data transfer latency pairs. The compiler then selects the
design with the minimum Tdatmov + Tcomp.
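A simplified sketch of this search is given below. It evaluates Eqs. (2.2)-(2.4) for candidate tile sizes and returns the one minimizing Tcomp + Tdatmov; the Layer and Tile structs, the restriction of the channel tiles to multiples of the array dimensions, and the t_ext/t_sa constants are illustrative simplifications (the actual compiler also enumerates loop orders and uses its own memory-access models).

#include <cmath>
#include <limits>

struct Layer { int cout, cin, hout, wout, hk, wk, s; };
struct Tile  { int wt, ht, ctout, ctin; };

constexpr int    WSA = 32, HSA = 32;    // systolic array width/height (assumed)
constexpr double U = 1.0, T_SA = 64.0;  // initiation interval and array traversal latency in cycles (assumed)
constexpr double T_EXT = 4.0;           // cycles per external word transferred (assumed)

static double ceil_div(double a, double b) { return std::ceil(a / b); }

// Total cost Tcomp + Tdatmov of one layer for a given tile configuration.
double layer_cost(const Layer& L, const Tile& t) {
    double trips = ceil_div(L.cout, t.ctout) * ceil_div(L.cin, t.ctin)
                 * ceil_div(L.wout, t.wt)    * ceil_div(L.hout, t.ht);
    // Eq. (2.2): computation latency.
    double t_comp = trips * (L.hk * L.wk * ceil_div(t.ctout, HSA) * ceil_div(t.ctin, WSA)
                             * t.wt * t.ht * U + T_SA);
    // Eq. (2.4): buffer sizes; alpha_w = 1 (all weights fetched in a single trip).
    double b_in  = t.ctin * (L.s * t.ht + L.hk - L.s) * (L.s * t.wt + L.wk - L.s) / double(WSA);
    double b_w   = double(L.cout) * L.cin * L.hk * L.wk / double(HSA);
    double b_out = double(t.ctout) * L.hout * L.wout / double(HSA);
    double da_ext = trips * b_in + 1.0 * b_w + trips * b_out;   // Eq. (2.3)
    return t_comp + da_ext * T_EXT;                             // Tcomp + Tdatmov
}

// Exhaustive search over tile sizes, assuming channel counts already padded
// to multiples of the systolic array dimensions (Section 2.5.2).
Tile best_tile(const Layer& L) {
    Tile best{1, 1, HSA, WSA};
    double best_cost = std::numeric_limits<double>::max();
    for (int wt = 1; wt <= L.wout; ++wt)
        for (int ht = 1; ht <= L.hout; ++ht)
            for (int co = HSA; co <= L.cout; co += HSA)
                for (int ci = WSA; ci <= L.cin; ci += WSA) {
                    Tile t{wt, ht, co, ci};
                    double c = layer_cost(L, t);
                    if (c < best_cost) { best_cost = c; best = t; }
                }
    return best;
}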
The compiler also has the capability to build a roofline model of the accelerator [94, 143], where the total number of operations (Nop), the computational roof (Rcomp), and the computation-to-data-movement (C2DM) ratio are defined as∥
\[
N_{op} = c_{t,out} \cdot c_{t,in} \cdot h_{out} \cdot w_{out} \cdot h_k \cdot w_k, \qquad
R_{comp} = \frac{N_{op}}{T_{comp}}, \qquad
C2DM = \frac{N_{op}}{DA_{ext}}. \tag{2.5}
\]
2.6.2.3 Generating Instructions
The proposed compiler generates two sets of instructions, namely, PU compute instructions and systolic
array/tiling configuration instructions for each layer of the CNN as explained below. As discussed in
section 2.5.1, the PU comprises ALUs where each ALU carries out arithmetic and logic operations on
the operands. The compiler takes operations like max pooling and provides a series of ALU and data
∥Details of the roofline model are not included in this thesis for brevity.
movement instructions to support such operations. Each of these instructions is 32 bits wide. Configuration
instructions are used to dynamically configure tiles for each layer as well as specify addresses for data
movements, while computational instructions are used in the ALUs to configure the ALU to perform the
desired operation, e.g., maximum operation for a max pooling layer or addition for an average pooling
layer.
Supported ALU instructions are shown in Table 2.3. The immediate operands are 13 bits wide; this width is also used for memory addresses, e.g., addresses of output buffers. The next 12 bits are used to represent
the addresses for the source and destination registers. The 25th bit enables writing back to register files
whereas the 26th bit is used to denote whether the second operand is from the second source register file
or an immediate. The next three bits are opcode bits which identify the operation to be performed. The
last two bits represent the instruction type, i.e., whether the instruction is a systolic array/tiling configuration instruction or an ALU compute instruction. Instructions used for configuring the systolic array are shown in Table 2.4. They are of two types: (i) instructions for addressing DRAM when we want to load (store) data (results) from (to) DRAM to (from) global buffers; and (ii) instructions for setting the accelerator parameters, including the tiling data size and tiling counts.
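The sketch below packs one 32-bit PU compute instruction using the bit layout just described (and detailed in Table 2.3); the helper function itself and its argument names are illustrative, not part of the F2N2 compiler.

#include <cstdint>

uint32_t pack_pu_inst(uint32_t inst_type, uint32_t opcode, bool reg_src1,
                      bool write_back, uint32_t des, uint32_t src0,
                      uint32_t src1, uint32_t imm) {
    uint32_t word = 0;
    word |= (inst_type & 0x3u)        << 30;  // [31:30] instruction type (1 = PU compute)
    word |= (opcode    & 0x7u)        << 27;  // [29:27] opcode (e.g., 1 = ADD, 5 = MAX)
    word |= (reg_src1   ? 1u : 0u)    << 26;  // [26] second operand from register (1) or immediate (0)
    word |= (write_back ? 1u : 0u)    << 25;  // [25] write the result back to the register file
    word |= (des  & 0xFu)             << 21;  // [24:21] destination register address
    word |= (src0 & 0xFu)             << 17;  // [20:17] first source register address
    word |= (src1 & 0xFu)             << 13;  // [16:13] second source register address
    word |= (imm  & 0x1FFFu);                 // [12:0] 13-bit immediate / memory address
    return word;
}

// Example: MAXi r2, r1, #0, i.e., max(r1, 0), as used when applying ReLU or max pooling.
const uint32_t maxi_r2_r1_0 =
    pack_pu_inst(/*inst_type=*/1, /*opcode=*/5, /*reg_src1=*/false,
                 /*write_back=*/true, /*des=*/2, /*src0=*/1, /*src1=*/0, /*imm=*/0);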
2.6.2.4 Reducing Overhead of Batch Normalization Layer
The general idea of our approach for reducing the overhead of the batch normalization layer is to fuse
the batch normalization layer with the preceding convolutional (or linear) layer into a single layer simply
by modifying the layer’s parameter values. This simplification also reduces the chances of encountering
overflow in our implementations because the parameters of this fused layer will directly generate the
normalized results. Algorithm 4 shows the well-known batch normalization algorithm where the two
parameters γ and β are learned during the training process. In addition, ϵ is a small constant value used
to ensure that a division-by-zero error is never encountered.
Table 2.3: PU Operation Instruction Set (bit fields: [31:30] INST-TYPE, [29:27] OPCODE, [26] register/immediate select, [25] write-back enable, [24:21] DES-ADDR, [20:17] SRC0-ADDR, [16:13] SRC1-ADDR, [12:0] Immediate; INST-TYPE = 1 for all PU instructions)
PU-R (bit 26 = 1, bit 25 = 1; operands: DES-ADDR, SRC0-ADDR, SRC1-ADDR): ADD (OPCODE = 1), SUB (OPCODE = 2), MUL (OPCODE = 3), MVHI (OPCODE = 4), MAX (OPCODE = 5), RSHIFT (OPCODE = 7)
PU-I (bit 26 = 0, bit 25 = 1; second operand taken from the 13-bit Immediate field): ADDi (OPCODE = 1), SUBi (OPCODE = 2), MULi (OPCODE = 3), MAXi (OPCODE = 5), RSHIFTi (OPCODE = 7)
MEM-ALU: LOAD (OPCODE = 0, bit 26 = 0, bit 25 = 1; operand: DES-ADDR), STORE (OPCODE = 6, bit 26 = 0, bit 25 = 0; operand: SRC-ADDR)
Table 2.4: Systolic Array Configuration Instruction Set
DRAM-ADDR: [31:30] INST-TYPE = 0, [29:28] BUF-TYPE, [27:0] STARTING-ADDR-POINTER
CONFIG: [31:30] INST-TYPE = 0, [29:16] CONFIG-VAL1, [15:0] CONFIG-VAL2
Batch normalization after a linear layer: When a BN layer is fused into a linear layer (one performing a linear transformation on the input features), the fused layer’s computation may be described
as,
\[
y'_j = \gamma_j \cdot \frac{\left(\sum_i w_{ij} \cdot x_i + b_j\right) - \mu_{B_j}}{\sqrt{\sigma^2_{B_j} + \epsilon}} + \beta_j \tag{2.6}
\]
\[
y'_j = \sum_i w'_{ij} \cdot x_i + b'_j \tag{2.7}
\]
Hence, the new fused parameters can be written as:
\[
w'_{ij} = \frac{\gamma_j \cdot w_{ij}}{\sqrt{\sigma^2_{B_j} + \epsilon}}, \qquad
b'_j = \beta_j + \gamma_j \cdot \frac{b_j - \mu_{B_j}}{\sqrt{\sigma^2_{B_j} + \epsilon}} \tag{2.8}
\]
Algorithm 4 Batch normalization
Require: B = {y1, y2, ..., ym}; γ; β
Ensure: y′i = BNγ,β(yi)
1: µB ← (1/m) Σ_{i=1}^{m} yi ▷ mini-batch mean
2: σ²B ← (1/m) Σ_{i=1}^{m} (yi − µB)² ▷ mini-batch variance
3: ŷi ← (yi − µB) / √(σ²B + ϵ) ▷ normalize
4: y′i ← γ · ŷi + β ≡ BNγ,β(yi) ▷ scale and shift
When the BN layer is fused into a fully-connected layer, different neurons undergo different normalizations, i.e., γ, β, σB, and µB are vectors whose sizes are equal to the neuron vector size of the layer (i.e., the cardinality of the output vector for the layer).
Batch normalization after a convolutional layer: With a convolutional layer, the same normalization is applied to all neurons within an output channel. Using Algorithm 4 and proceeding as in the previous case, the weights and bias of the resulting fused convolutional layer may be expressed as:
\[
w' = \frac{\gamma \cdot w}{\sqrt{\sigma^2_{B} + \epsilon}}, \qquad
b' = \beta + \gamma \cdot \frac{b - \mu_{B}}{\sqrt{\sigma^2_{B} + \epsilon}} \tag{2.9}
\]
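For concreteness, the sketch below folds the batch-normalization parameters into the weights and bias of the preceding layer following Eqs. (2.8)-(2.9); the flat weight layout and floating-point types are illustrative (in F2N2 the fused parameters are subsequently quantized), and one scale per output channel or neuron is assumed.

#include <cmath>
#include <vector>

// Fold BN parameters (gamma, beta, running mean mu, running variance var) into
// the weights w and biases b of the preceding convolutional or linear layer.
// w is stored row-major as cout rows of (cin * hk * wk) elements; b has cout entries.
void fuse_bn(std::vector<float>& w, std::vector<float>& b,
             const std::vector<float>& gamma, const std::vector<float>& beta,
             const std::vector<float>& mu, const std::vector<float>& var,
             float eps, int cout) {
    const int per_out = static_cast<int>(w.size()) / cout;
    for (int m = 0; m < cout; ++m) {
        const float scale = gamma[m] / std::sqrt(var[m] + eps);
        for (int k = 0; k < per_out; ++k)
            w[m * per_out + k] *= scale;              // w' = gamma * w / sqrt(var + eps)
        b[m] = beta[m] + scale * (b[m] - mu[m]);      // b' = beta + gamma * (b - mu) / sqrt(var + eps)
    }
}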
2.7 Operation-packing for lower-precision weights and activations
UltraScale FPGAs offer several key features for arithmetic operations. These include a 27-bit preadder, an
18 × 27-bit multiplier, and a 48-bit accumulator [138]. Each DSP48E2 block within the FPGA can effectively
implement functions of the following form:
P = B × (A + D) + C + Pin (2.11)
Figure 2.5: INT4-packing for performing four parallel multiplications on a single DSP. Here, $ denotes the
(extended) sign bits and # denotes the padding bits.
For implementing arithmetic circuits, DSP hard blocks offer significant advantages in speed, area, and energy efficiency over the standard programmable FPGA fabric; hence, designs should maximize the utilization of the available DSP blocks [1]. Because DSP blocks are scarce on an FPGA chip, utilizing them efficiently is crucial. This is challenging for small bit-width arithmetic operations, where mapping a single operation to a DSP block often leaves most of the block underutilized. This issue is especially relevant in domains such as image processing and machine learning, where quantized data with small bit widths (e.g., 8 bits or less) is common [3, 2]. To enhance the utilization of DSP resources for low-precision arithmetic, several techniques have been proposed that map multiple small bit-width multiplications to a single DSP block, thereby reducing waste and improving performance and resource utilization [3, 2]. In this work, we build upon the packing strategy initially introduced by Xilinx for INT4 and INT8 packing, and we extend this approach by enabling packed multiplications on a single DSP for operands of 2-bit precision as well. Fig. 2.5 shows the packing scheme provided by Xilinx for INT4 packing.
INT4-packing basically computes the outer product of two vectors a and w, with both vectors having
two elements each, a containing a1 and a2, and w containing w1 and w2. As illustrated in Fig. 2.5, by
employing the packing approach, instead of executing four separate multiplications, it becomes possible
to pack these multiplications onto a single DSP. This is achievable when a1 and a2 are unsigned 4-bit
integers, and w1 and w2 are signed 4-bit integers. The strategy is to rearrange the individual inputs a1, a2,
w1, w2 as described in the following equation:
\[
(a_2 \cdot 2^{11} + a_1) \cdot (w_2 \cdot 2^{22} + w_1) = a_2 w_2 \cdot 2^{33} + a_1 w_2 \cdot 2^{22} + a_2 w_1 \cdot 2^{11} + a_1 w_1 \tag{2.12}
\]
In Eq. 2.12, multiplications with 2^n can be implemented by fixed shift operations that only require a rewiring of the individual bits. The computation in Eq. 2.12 can be mapped to the DSP48E2 as follows: The operand a1 is mapped to the B-Port (see Fig. 2.5) with an offset of 0, and a2 is also mapped to the B-Port but with an offset of 11. This is a hardware-efficient way of implementing (a2 · 2^11 + a1). Furthermore, input w1 is mapped to the preadder port A with an offset of zero. Since w1 is signed, the sign bit has to be repeated for all most significant bits (MSBs) to perform sign extension. Input w2 cannot be mapped to the same port as w1 because it is signed. Therefore, w2 is mapped to the preadder port D with an offset of 22. The four results of the outer product can be extracted from the P-Port. For instance, the result a2w2 can be extracted from bit 33 to bit 40 of the P-Port. The individual results are separated by # = 3 padding bits (see Fig. 2.5). This is important when multiple DSPs are chained together using the carry ports (Pin, Pcout) in order to accumulate their results. Thus, with # padding bits, a maximum of 2^# results can be accumulated without error. When no results are accumulated, no padding is needed.
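The identity in Eq. (2.12) can be checked on the host with the short program below, which packs a1, a2 and w1, w2 into two wide operands and performs a single multiplication, as the DSP does. Recovering the four individual 4-bit-by-4-bit products from the packed result additionally requires the lane and sign corrections of the DSP flow, which this sketch does not reproduce.

#include <cassert>
#include <cstdint>

int main() {
    const int64_t a1 = 9,  a2 = 13;   // unsigned 4-bit activations
    const int64_t w1 = -3, w2 = 6;    // signed 4-bit weights
    const int64_t packed_a = a2 * (1LL << 11) + a1;   // B port:       a2*2^11 + a1
    const int64_t packed_w = w2 * (1LL << 22) + w1;   // preadder A+D: w2*2^22 + w1
    const int64_t p = packed_a * packed_w;            // one wide multiplication on the DSP
    const int64_t expected = a2 * w2 * (1LL << 33) + a1 * w2 * (1LL << 22)
                           + a2 * w1 * (1LL << 11) + a1 * w1;
    assert(p == expected);                            // Eq. (2.12) holds
    return 0;
}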
Similar to the scheme in Fig. 2.5 for INT4 packing, Xilinx also provides support for INT8 packing. Fig.
2.6 shows the configuration of operands for INT8 packing including the position of operands, number of
(extended) sign bits, and number of padding bits. In INT8 packing, two multiplications are packed on the
DSP.
Here, motivated by the formulas presented in [115] for general INT-N packing, we present the scheme shown in Fig. 2.7 for 2-bit packing. Using this scheme, we are able to pack six multiplications on one DSP. This packing factor of 6 improves the obtained latency in cases where, for accuracy reasons, at least 2-bit quantization (rather than binary) must be used for a subset of layers. Results in Section 2.10.4 show the effect of 2-bit packing in terms of both resource usage and obtained latency.
2.8 HW/SW optimization techniques
To realize our accelerator design in hardware, we use the Xilinx SDAccel and Vivado HLS, which provide
a toolchain for programming and optimizing different applications on Xilinx FPGAs using a high-level
language (C, C++ or OpenCL) or hardware description languages (VHDL, Verilog and SystemVerilog), as
well as a runtime tool based on the OpenCL API, which is used by the host-side software to interact with the accelerator.
Figure 2.6: INT8-packing for performing two parallel multiplications on a single DSP. Here, $ denotes the (extended) sign bits and # denotes the padding bits.
2.8.1 Hardware (Kernel) Optimizations
We employ the following techniques in our synthesizable C++ templates.
2.8.1.1 Loop transformation:
Loop transformation includes loop unrolling and pipelining. Loop unrolling, which is used to increase the
utilization of the computational resources in the FPGA device, forces the parallel execution of the instructions in the loop at the cost of an area overhead. Loop pipelining is a technique to improve the throughput
by overlapping the execution of operations from different loop iterations. The maximum throughput that
can be achieved is limited both by resource constraints of the FPGA device and data dependencies in the
application.
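As an illustrative (not actual F2N2 template) Vivado HLS fragment, the MAC loop below is fully unrolled across the systolic-array width while the surrounding loop is pipelined with an initiation interval of one; WSA, the trip count, and the array shapes are placeholders.

#include <ap_int.h>

#define WSA 32

void mac_rows(const ap_int<16> act[256][WSA], const ap_int<16> wgt[WSA],
              ap_int<48>& acc) {
#pragma HLS ARRAY_PARTITION variable=act complete dim=2
#pragma HLS ARRAY_PARTITION variable=wgt complete dim=1
    ROW_LOOP: for (int t = 0; t < 256; ++t) {
#pragma HLS PIPELINE II=1
        MAC_LOOP: for (int i = 0; i < WSA; ++i) {
#pragma HLS UNROLL
            acc += act[t][i] * wgt[i];   // WSA parallel multiply-accumulates per iteration
        }
    }
}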
Figure 2.7: INT2-packing for performing six parallel multiplications on a single DSP. Here, $ denotes the
(extended) sign bits and # denotes the padding bits.
2.8.1.2 Exploiting task-level parallelism:
Most operations within a neural network layer can be executed concurrently. As a result, a large number
of operations can be done in parallel (limited only by the available FPGA resource counts). We employ a
double-buffering scheme to hide the external memory access latency.
2.8.2 Software (Host) Optimizations
The optimizations done on the "compute kernel" (i.e., the CNN accelerator) are often offset by the overhead of host-FPGA data communication; this results in only a moderate system-wide speedup or sometimes even a slowdown [25, 114]. In this section, the terms compute kernel and CNN accelerator are used interchangeably. As an example, FlexCNN [114] reports that the time for CNN processing using their accelerator takes
11.8% of the total run-time whereas the remainder of the run-time is taken by data transfers and synchronization. This fact motivates us to develop an efficient host-FPGA data communication method that will
maintain the benefits of optimizations that are performed on the accelerator. Therefore, we use software
pipelining and software parallelization techniques to optimize the host-FPGA interface. Precisely, after performing accelerator optimizations and scheduling the operations within the accelerator, the F2N2 compiler sets out to optimize the host code. This optimization is done based on the available resources on the FPGA device, as detailed next.
2.8.2.1 Hiding data transfer time by software pipelining:
By default, the accelerator can start processing new data only when it has finished processing the current data. By enqueuing data to the device (e.g., FPGA) global memory ahead of the accelerator execution, the data transfer latency can be hidden by such software pipelining. This is achieved by using OpenCL's clEnqueueMigrateMemObjects command for data transfer, which enables the OpenCL API to hide the data transfer time by enqueuing a new set of data while the accelerator is operating on the current set. It is possible to further improve the performance of the system by calling a replica of the accelerator with new data while the first accelerator is still processing the current data. The accelerator is implemented using the AP_CTRL_CHAIN (pipelined accelerator) execution model, in which the accelerator is designed in such a way that multiple accelerator executions can overlap and run in a pipelined fashion; this is called software parallelization and is explained in Section 2.8.2.2. The longer the accelerator takes to process a set of data from start to finish, the greater the opportunity to use host-to-kernel dataflow to improve performance.
Fig. 2.8 shows the timing diagram for running two accelerators utilizing the software pipelining technique. The host application starts by calling OpenCL built-in functions to allocate buffers for transferring data from the host memory to the global memory. This is indicated by W_BUF_1 and W_BUF_2. Once the transfer is completed, the accelerator kernel, which is queued in the command queue, is notified to start executing its operations. Meanwhile, the data required for the second accelerator execution is transferred using the OpenCL buffers from the host memory to the global memory ahead of its execution. Thus, as
soon as the first accelerator execution is completed, the second accelerator execution can start. In addition, the processed data (R_BUF) of the first accelerator can be read back from the global memory while the second accelerator is executing.
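A hedged host-side sketch of this scheme follows; it assumes an out-of-order command queue (or separate queues for transfers and execution), two pre-created buffer pairs, and a kernel whose arguments are the input and output buffers, and it omits error checking and event releases.

#include <CL/cl.h>

void pipelined_inference(cl_command_queue q, cl_kernel krnl,
                         cl_mem in_buf[2], cl_mem out_buf[2], int num_batches) {
    cl_event write_done[2] = {nullptr, nullptr};
    cl_event run_done[2]   = {nullptr, nullptr};
    // Prologue: migrate the first batch's inputs to device global memory (WBUF).
    clEnqueueMigrateMemObjects(q, 1, &in_buf[0], 0, 0, nullptr, &write_done[0]);
    for (int b = 0; b < num_batches; ++b) {
        const int cur = b & 1, other = 1 - cur;
        // Launch the accelerator once its inputs are resident and the previous
        // execution (if any) has finished, so executions run back to back.
        cl_event deps[2] = {write_done[cur], run_done[other]};
        clSetKernelArg(krnl, 0, sizeof(cl_mem), &in_buf[cur]);
        clSetKernelArg(krnl, 1, sizeof(cl_mem), &out_buf[cur]);
        clEnqueueTask(q, krnl, (b > 0) ? 2u : 1u, deps, &run_done[cur]);
        // Overlap: migrate the next batch's inputs while the kernel is running.
        if (b + 1 < num_batches)
            clEnqueueMigrateMemObjects(q, 1, &in_buf[other], 0, 0, nullptr, &write_done[other]);
        // Read the current batch's results (RBUF) back once the kernel finishes.
        clEnqueueMigrateMemObjects(q, 1, &out_buf[cur], CL_MIGRATE_MEM_OBJECT_HOST,
                                   1, &run_done[cur], nullptr);
    }
    clFinish(q);
}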
2.8.2.2 Concurrent accelerator executions by software parallelization:
This allows temporal parallelism, where different stages of the host-to-kernel dataflow process different sets of data. This is made possible by enqueueing multiple accelerators with multiple OpenCL clEnqueueTask commands in a pipelined manner. clEnqueueTask is used to enqueue an accelerator, i.e., a task, to a command queue. OpenCL's clCreateCommandQueue API creates a command queue which keeps track of the queued tasks, i.e., writing to/reading from the buffers and the accelerator executions. In our design, we use an out-of-order command queue to run tasks in an out-of-order fashion, thus enabling concurrent execution of multiple accelerators.
Fig. 2.9 shows the timing diagram of concurrent kernel executions. As seen, there are two kernel enqueues, which enable two concurrent kernel executions. As explained above, as soon as the data required by the first accelerator is transferred from the host memory to the global memory, the kernel execution starts. Meanwhile, the data for the second kernel execution is transferred to the global memory, and the respective accelerator starts its execution as soon as its data is available. Because clEnqueueTask and clEnqueueMigrateMemObjects are asynchronous calls that enqueue tasks, event synchronization methods are used to wait on and resolve dependencies for all the write events, accelerator execution events, and read events. To achieve such synchronization in the host application program, the clWaitForEvents API is used to wait for an event and ensure that the corresponding task is finished.
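A brief sketch of the corresponding host calls is shown below; it assumes that two kernel objects (one per compute unit, e.g., for row 0 and row 1 of the input images) have already been created with their arguments set, and error checking is omitted.

#include <CL/cl.h>

void run_two_accelerators(cl_context ctx, cl_device_id dev,
                          cl_kernel krnl0, cl_kernel krnl1) {
    cl_int err;
    // Out-of-order queue: enqueued tasks may execute concurrently.
    cl_command_queue ooq = clCreateCommandQueue(
        ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
    cl_event done[2];
    clEnqueueTask(ooq, krnl0, 0, nullptr, &done[0]);   // first accelerator instance
    clEnqueueTask(ooq, krnl1, 0, nullptr, &done[1]);   // second instance, overlapping the first
    clWaitForEvents(2, done);                          // synchronize with both executions
    clReleaseEvent(done[0]);
    clReleaseEvent(done[1]);
    clReleaseCommandQueue(ooq);
}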
The F2N2 compiler tries to map as many accelerated kernels as possible on the FPGA device in order to
take advantage of concurrent kernel executions. To accomplish this goal, we first estimate the resource
usage of a target compute kernel (i.e., the accelerator) using our resource estimation models and then keep
increasing the number of concurrent accelerators until the available resources are exhausted. Our models are deployed to calculate the resource utilization (e.g., DSPs, BRAMs, etc.) of the designed accelerator using some parameters (e.g., the systolic array size and the DNN model).
Figure 2.8: Timing diagram of running two accelerators with software pipelining. IFMs and weights are loaded in WBUF_1 and WBUF_2, respectively. Final results are stored in RBUF.
2.9 F2N2 Mixed-Computation Design
Binary networks, in which weights and activations are quantized to bipolar values so that the multiplication of each weight and its corresponding activation can be replaced with a single XNOR operation, are common in the state-of-the-art [95, 98, 128]. This is why we have extended the F2N2 compiler to map such networks to the FPGA device as well. In fact, we envision that, for accuracy reasons, one may want to hybridize the mapping process so that some convolutional layers are still realized using mixed-precision MAC operations while other layers are realized using binary XNOR operations. We use an architecture similar to Fig. 2.2 for the computational layers of the CNN that are to be mapped using XNOR-based computations. By packing the 16 one-bit values required for an XNOR operation, 16 XNORs can be mapped to a single DSP on the target FPGA and performed in parallel. Similar to the MAC-based layers, wsa · hsa DSPs perform the packed XNOR operations simultaneously.
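A software model of the packed XNOR arithmetic is given below: 16 binary activations and 16 binary weights (bit 1 encoding +1 and bit 0 encoding -1) are packed into 16-bit words, a single bitwise XNOR replaces the 16 multiplications, and a population count turns the matches into the bipolar dot product. On the FPGA the wide XNOR itself is carried out on a DSP, which this host-side sketch only emulates.

#include <bitset>
#include <cstdint>

// Bipolar dot product of 16 {-1, +1} pairs encoded as bits.
int xnor_dot16(uint16_t act_bits, uint16_t wgt_bits) {
    const uint16_t matches = static_cast<uint16_t>(~(act_bits ^ wgt_bits)); // XNOR: 1 where signs agree
    const int pop = static_cast<int>(std::bitset<16>(matches).count());
    return 2 * pop - 16;   // (+1) per match, (-1) per mismatch
}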
Figure 2.9: Timing diagram of running two concurrent kernels.
2.10 Results and Discussion
2.10.1 Experimental Setup
For evaluation purposes, F2N2 targets a Xilinx VU9P FPGA in the cloud (available on the AWS EC2 F1 instance). This FPGA card includes 64 GiB of ECC-protected DDR4 memory organized in four DDR banks, with a dedicated PCIe x16 connection to the host. The device itself contains approximately 2.5 million logic elements and approximately 6,800 Digital Signal Processing (DSP) units.∗∗ In our implementation, we use three DDR banks, assigning the input feature maps, the weights (including biases), and the generated instructions to three separate banks, respectively. Input images are sent from the host CPU to the on-board DDR4 (over PCIe), and the output results are sent back to the host CPU.
We evaluate the F2N2 framework on well-known CNNs such as VGG-16, ResNets, and GoogLeNet, with two different input sizes that are representative of real-life CNN applications. More precisely, we use two datasets, CIFAR-10 and ImageNet, with 32 × 32 and 224 × 224 input sizes, respectively.
∗∗https://aws.amazon.com/education/F1-instances-for-educators/
Our experiments show that the F2N2 accelerator design achieves a post-place-and-route operating frequency of 342 MHz when using an HLS coding style for mapping to FPGAs. To the best of our knowledge, this is the highest frequency reported for an accelerator design implemented in synthesizable C code.
2.10.2 Effect of Accelerator Design Optimizations
As mentioned in Section 2.5.3, the timing overhead associated with accessing the off-chip external memory can account for a major portion of the end-to-end inference latency of a given CNN. For instance, when implementing the inference computations of VGG-16 on the CIFAR-10 dataset using Accelerator Design Choice #1 with a 32×32 systolic array of PEs, the end-to-end latency is 22.1 ms, more than 95% of which is due to data transfers from/to the off-chip memory. This includes the time spent bringing in the weights needed for the computations of each layer and the time spent in the outer loops reading/writing tiles of activations from/to the off-chip memory.
We use the VGG-16 model as a case study to illustrate the effect of the data-transfer-related optimizations in the F2N2 accelerator design. Fig. 2.12 shows the timing diagrams of the computations and data/weight transfers corresponding to layer 5 of VGG-16 when using Accelerator Design Choice #1 (i.e., without the optimizations discussed in Section 2.5.3). As shown in the figure, data transfers account for a large portion of the total latency of this layer. Fig. 2.13 shows the timing diagrams for the same layer after applying the optimizations discussed in Section 2.5.3. As shown in this figure, the simultaneous burst read of weights and activations significantly improves the computational latency of the layer. Furthermore, because the weights for the next layer are pre-fetched during the weight-load step and the weights required for the current layer are already present in the on-chip global weight buffer, transferring the weights to the distributed BRAMs can start at the beginning of the layer's computations. Employing these optimizations, the end-to-end latency of VGG-16 on the CIFAR-10 dataset is reduced to 5.05 ms. Table 2.5 reports the frequency and resource utilization of the two configurations.
Using the available URAM resources on the FPGA device, instead of BRAMs, for the input, weight, and output buffers prevents exceeding the available BRAM resources on the target FPGA and results in a more balanced utilization of the device's resources.
2.10.3 Comparison with State-of-the-art Designs
We compare the performance obtained by F2N2 on different CNN models with that of prior work. For this study, we consider a cloud FPGA as the target device. To have meaningful and fair comparisons, we compare our results only with works that used the same or a similar FPGA board as ours, with a comparable amount of resources, and we also report the resource usage (DSPs, LUTs, etc.) on the target FPGA device.
Table 2.8 shows the performance comparison between F2N2 and prior work on VGG-16 with the ImageNet dataset (ImageNet is widely used for evaluating automated CNN-to-FPGA tool flows). We employ a 64×64 systolic array of PEs to deal with the larger width and height of the IFMs for ImageNet. From the table, one can see that F2N2 achieves about a 3x end-to-end latency improvement over the best latency reported by state-of-the-art work, while using the same number of DSPs (or fewer) and comparable on-chip memory. This shows that our optimizations for reducing data-transfer costs (discussed in Section 2.10.2) are highly effective. Furthermore, Table 2.9 shows the performance results of F2N2 for ResNet networks on ImageNet, alongside a comparison with state-of-the-art work. As shown in the table, we achieve lower inference latency for the ResNet variants than the prior art (e.g., up to a 2x latency improvement for ResNet-50), while using a comparable amount of resources on the target FPGA device.
Table 2.6 shows the performance and resource utilization results of F2N2 for the MobileNetV1 and MobileNetV2 networks, along with a comparison with previous work. While the latency achieved in [5] for MobileNetV1 is lower (better) than ours, their design assumes that
Table 2.5: Resource Utilization and Frequency for our Accelerator Designs
                      Baseline Accelerator    Cloud-targeted Accelerator
Device                Xilinx VU9P on AWS EC2 F1 instance
Model                 VGG-16
Dataset               CIFAR-10
Precision             fixed 16-bit
DSP (%)               15                      15
LUT (%)               20                      22
BRAM (%)              24                      29
URAM (%)              0                       33
Frequency (MHz)       342                     342
Latency/Image (ms)    22.1                    5.05
All the architectural choices described above are parameterizable and can be adjusted
based on the target FPGA device and neural network model.
all the weights required for the computations of all layers are already stored on-chip. This is not always a scalable solution for bigger networks. Therefore, the 2.4 ms achieved by the authors of [5] does not account for weight transfers from DRAM to the on-chip memory. For a fair comparison, if we eliminate that time overhead from our design, we achieve a 1.3 ms computation time, which is almost twice as fast as the 2.4 ms achieved by [5].
To reduce the time needed for data transfers from DRAM, we can train MobileNet with lower-precision weights and activations and also apply the packing approach introduced in Section 2.7. Using lower-precision weights (even without lowering the precision of the activations) reduces the time overhead of transferring weights from DRAM to the global buffers. The reason is that the data burst width for transfers from DRAM is 512 bits, and thus using 8-bit weights instead of 16-bit weights transfers 64 weights per burst instead of 32. For MobileNetV1, using 8-bit weights with no packing (and thus keeping the activations at 16-bit precision), we achieve a 4.6 ms end-to-end latency instead of 6.0 ms. Similarly, with 4-bit weights, we achieve 3.8 ms.
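As a software-level illustration of the burst arithmetic above, the sketch below packs already-quantized w-bit weights into 512-bit burst words (modelled here as 64-byte groups); the quantization step and the actual DMA interface are outside its scope.

```c
// Sketch: pack w-bit weights into 512-bit DRAM burst words (64 bytes each).
#include <stdint.h>
#include <stddef.h>

enum { BURST_BITS = 512 };

// Number of weights carried by one 512-bit burst for a given weight bit width.
static inline int weights_per_burst(int bits) { return BURST_BITS / bits; }

// Pack quantized weights (each value fits in `bits` bits) into a byte stream.
// `bursts` must be zero-initialized and large enough for n * bits bits.
void pack_weights(const uint16_t *w, size_t n, int bits, uint8_t *bursts) {
    for (size_t i = 0; i < n; ++i) {
        size_t base = i * (size_t)bits;                 // absolute bit offset
        for (int b = 0; b < bits; ++b)
            if ((w[i] >> b) & 1u)
                bursts[(base + b) / 8] |= (uint8_t)(1u << ((base + b) % 8));
    }
}
// With bits = 8, weights_per_burst(8) == 64; with bits = 16, it is 32.
```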
Table 2.6: Comparison with Prior Work for MobileNet
                      F2N2           F2N2           Anupreetham et al. [5]
Device                Xilinx VU9P on AWS EC2 F1 instance    Stratix 10 GX2800
Model                 MobileNetV1    MobileNetV2    MobileNetV1
Dataset               CIFAR-10                      COCO
Precision             fixed 16-bit
DSP                   1035           1035           5200
Frequency (MHz)       342            342            350
Latency/Image (ms)    6.0            7.8            2.4
2.10.4 Results for Multiplication-packing for Lower-precision Weights and Activations
Table 2.7 shows the resource utilization and the obtained latency for 2-bit packing. For this evaluation, we use the MobileNet network. As elaborated in Section 2.7, in the 2-bit case, the multiplications of 3 weights and 2 activations (6 multiplications in total) are mapped onto the DSPs. Since the weights are packed along the channel dimension and the activations along the width dimension, for the 32 × 32 systolic array we can pack the operations of a convolutional layer efficiently when the number of channels and filters is 32 × 3 = 96. Therefore, to have a fair comparison, we evaluate the results where the number of channels is 96. As shown in Table 2.7, the speedup we obtain from packing is almost a factor of 6, i.e., the packing factor, which shows the efficiency of the approach in terms of latency. However, we observe an increase in LUT usage. This is due to the additional logic needed to extract the individual products from the 48-bit DSP output and keep track of their accumulation.
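The arithmetic behind this kind of DSP packing can be illustrated in software as follows. The sketch packs two small unsigned activations into one operand with guard bits, so a single wide multiplication by one weight yields both products in disjoint bit fields; it is a simplified, unsigned illustration of the idea, not the exact 3-weight-by-2-activation mapping used by F2N2.

```c
// Sketch: two 2-bit unsigned multiplications computed with one wide multiply.
// a0, a1, w are 2-bit values (0..3); each product fits in 4 bits, so placing
// a1 eight bits above a0 keeps the two partial products from overlapping.
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static void packed_mul2(uint32_t a0, uint32_t a1, uint32_t w,
                        uint32_t *p0, uint32_t *p1) {
    uint64_t packed = (uint64_t)a0 | ((uint64_t)a1 << 8);  // pack both activations
    uint64_t prod   = packed * w;                          // one multiplication
    *p0 = (uint32_t)(prod & 0xF);                          // bits [3:0]  = a0 * w
    *p1 = (uint32_t)((prod >> 8) & 0xF);                   // bits [11:8] = a1 * w
}

int main(void) {
    uint32_t p0, p1;
    packed_mul2(3, 2, 3, &p0, &p1);
    assert(p0 == 9 && p1 == 6);     // 3*3 and 2*3 recovered from one product
    printf("p0=%u p1=%u\n", p0, p1);
    return 0;
}
```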
2.10.5 Software Optimizations
In this section, we study the impact of the software optimizations, again relying on VGG-16 as the case-study model. Fig. 2.14 shows the end-to-end inference latency with and without the various optimizations (including the case with no software optimization), normalized to the case without any software optimization. As Fig. 2.14 shows, enabling software pipelining yields a significant improvement in inference latency, and the latency is further reduced when concurrent kernel execution (multiple accelerators) is also enabled. It can
Figure 2.12: Timing diagrams for layer 5 of VGG-16 (starting at the blue marker and ending at the yellow marker) when we use the accelerator design shown in Section 2.5.1. The compute latency of the layer in this case is ≈ 526 µs.
Figure 2.13: Timing diagrams for layer 5 of VGG-16 (starting at the blue marker and ending at the yellow marker) when using the optimizations discussed in Section 2.5.3. The compute latency of the layer in this case is ≈ 107 µs. The timestamp of the blue marker is earlier than that of Fig. 2.12 because the blue marker shows the beginning of the processing of layer 5 of VGG-16, and the processing of the previous four layers finishes sooner in Fig. 2.13 than in Fig. 2.12 thanks to the optimizations discussed in Section 2.5.3.
Table 2.7: Results for 2-bit packing
                      2-bit, no packing    2-bit, with packing
LUTs (%)              33                   58
FFs (%)               15                   16
BRAMs (%)             95                   95
DSPs (%)              15                   15
Frequency (MHz)       342                  342
Latency/Image (ms)    1.98                 0.346
Table 2.8: Comparison with Prior Work for VGG-16 on ImageNet ∗
                         F2N2             TGPA             Cloud-DNN        HybridDNN
Device                   Xilinx VU9P
Model                    VGG-16           VGG-19           VGG-16           VGG-16
Dataset                  ImageNet
Precision                fixed 16-bit / fixed 12-bit
DSP                      4096 (60%)       4096 (60%)       5349 (78.2%)     5163 (75.5%)
LUT                      1046452 (88%)    493000 (42%)     764909 (64.7%)   706353 (59.8%)
BRAM                     2534 (59%)       3380 (78%)       3456 (80.2%)     3169 (73.4%)
URAM                     793 (82.60%)     140 (15.6%)      810 (84.4%)      0
Frequency (MHz)          342              210              125              167
Latency/Image (ms)       7.4              22.35            28.96            -
CNN Performance (GOPS)   2667             1510             1068             3375
∗Values represented with "-" are either not reported or cannot be obtained from reported results
in the corresponding work.
be observed that the number of concurrent accelerators should be set to 4. The reason is that the latency of the data movements for multiple concurrent accelerators exceeds the kernel execution time once the number of kernels goes above 4.
Table 2.9: Comparison with Prior Work for ResNet-18, -50, and -152 Networks on ImageNet
                         F2N2            F2N2            F2N2            Cloud-DNN        FCNNLib
Device                   Xilinx VU9P
Model                    ResNet-18       ResNet-50       ResNet-152      ResNet-50        ResNet-152
Dataset                  ImageNet
Precision                fixed 16-bit                                    fixed 16-bit     fixed 16-bit
DSP                      4096 (60%)                                      5349 (78.2%)     -∗
LUT                      1046452 (88%)                                   706353 (59.8%)   -
BRAM                     2534 (59%)                                      3456 (80.2%)     -
URAM                     793 (82.60%)                                    810 (84.4%)      -
Frequency (MHz)          342                                             125              200
Latency/Image (ms)       4.43            6.79            14.28           13.9             14.6
CNN Performance (GOPS)   406             559             791             721              1547
∗Values represented with "-" are either not reported or cannot be obtained from reported results in
the corresponding work.
2.10.6 Results of F2N2 Mixed-computation Design
2.10.6.1 Mixed-computation with XNOR-based convolutions
In this section, we report the latency results of realizing a subset of the neural network layers as XNOR-based convolutions and compare them with the corresponding values when all layers are mapped using fixed-point computations. All evaluations are done on the CIFAR-10 dataset and on the Xilinx VU9P device used in the previous comparisons. As shown in Table 2.10, mixed-computation models achieve significant end-to-end latency improvements over their fixed-point counterparts, up to 10x for some models, while using only up to 2x more computing resources.
It has been shown that the loss of accuracy of binary networks can be partially avoided by training binary networks with wider layers [128]. Therefore, we also use this technique for our mixed-computation models. As shown in Table 2.10, much of the accuracy loss in our mixed-computation models (which is due to using XNOR-based computations) can be recovered by widening the first few layers of the mixed-computation networks. To evaluate the effect of widening the layers, we use ResNet-20 as a
Figure 2.14: Normalized latency of running 100 images with VGG-16 when applying different software
optimization techniques.
case study. The accuracy of the mixed-computation implementation of ResNet-20 without widening the XNOR-based layers (16-bit fixed-point for layers 1 and 20, XNOR for layers 2-19) is 88.38%. By increasing the number of filters in layers 2-19 by a factor of 4, the accuracy rises to 90.35%, as shown in Table 2.10.
Table 2.10: F2N2 Mixed-computation Results for Various Models on CIFAR-10 Dataset Using Xilinx VU9P
Device
Model       Imp. Style                                                                                               Acc. (%)   DSP Count   Latency/Inf. (ms)   Thruput (Inf./s)
VGG-16      16-bit fixed-point for all layers                                                                        93.04      1,024       5.03                200
VGG-16      mixed-computation: 16-bit fixed-point for layers 1 and 16 + XNOR for layers 2-15 (512 filters per layer) 92.20      1,728       0.42                5,882
ResNet-20   16-bit fixed-point for all layers                                                                        91.35      1,024       1.51                662
ResNet-20   mixed-computation: 16-bit fixed-point for layers 1 and 20 + XNOR for layers 2-19 (4x filters per layer)  90.35      1,280       0.58                1,724
GoogLeNet   16-bit fixed-point for all layers                                                                        95.05∗     1,035       11.86               80
GoogLeNet   mixed-computation: XNOR for the first two inception modules + 16-bit fixed-point for the rest            92.30∗     2,068       0.91                130
∗The accuracy gap can be reduced by training GoogLeNet with widened layers.
2.11 Conclusion
In this chapter, to bridge the gap between the resource-intensive nature of advanced DNN models and the
need for low latency, high throughput, and energy efficiency, we introduced the F2N2 framework. F2N2
is designed to harness the potential of field-programmable gate arrays (FPGAs) as a hardware platform
to accelerate neural network inferences. Its unique strength lies in optimizing the utilization of FPGA
resources and memory bandwidth, focusing on CNNs, and tackling the challenges associated with building
efficient neural network accelerators.
The contributions of this chapter are multifaceted. First, F2N2 provides an end-to-end solution for generating high-performance neural network accelerators tailored to a specific FPGA device. By optimizing data-transfer costs on FPGAs with substantial on-chip memory resources, we achieve strong end-to-end inference performance across various neural network models. This optimization targets cloud FPGAs and reduces the communication overhead from/to the off-chip memory.
Furthermore, F2N2 introduces support for mixed-precision operations, which enable lower-precision weights and activations for specific layers without sacrificing accuracy. This approach reduces computational latency and improves overall performance. F2N2 also explores the incorporation of XNOR-based operations for specific layers, replacing weight-activation multiplications with single Boolean XNOR operations. By packing multiple XNOR operands into long words and leveraging FPGA DSP resources, we achieve substantial inference acceleration.
The F2N2 framework not only focuses on hardware improvements but also considers software/hardware co-optimization. By enhancing host-FPGA data communication methods, using techniques such as
software pipelining and parallelization, F2N2 maintains the gains in performance and energy efficiency
achieved at the hardware level.
In conclusion, F2N2 represents a significant step forward in the quest to make DNNs more accessible
and efficient, particularly on FPGA devices. Its contributions span hardware optimization, mixed-precision
support, efficient data communication, and a holistic approach to building high-performance neural network accelerators. We believe that F2N2’s innovative strategies and optimizations will pave the way for
more energy-efficient, low-latency, and high-throughput neural network inferences in diverse application
scenarios, from embedded systems to data centers.
Chapter 3
SynergicLearning: Neural Network-Based Feature Extraction for
Highly-Accurate Hyperdimensional Learning
3.1 Introduction
Machine learning models have proven successful in solving a wide variety of challenging problems such
as computer vision and speech recognition. They are commonly characterized by their level of accuracy,
computational/memory complexity, training time, and adaptability among other features. One can categorize machine learning models according to the aforesaid characteristics. For example, neural networks
(NNs) typically achieve high accuracy [47], are computationally expensive [122], have long training times
[76], and tend to forget previously learned information upon learning new information (aka catastrophic
forgetting) [80, 104, 77]. A machine learning model is more viable for on-chip learning (also called learning
on-a-chip which refers to designing a custom chip that can be used for both training and inference) when it
has low computational/memory complexity and supports one-pass training/fine-tuning while maintaining
a high level of accuracy.
The main reason behind the high accuracy of NNs is their ability to automatically extract high-quality,
high-level features from labeled data. AlexNet [68] is an outstanding example that clearly demonstrates
the gap between the quality of features extracted by NNs compared to handcrafted features extracted by
experts in the domain (in the ImageNet Large Scale Visual Recognition Challenge [102], AlexNet was able
to achieve 10.8% higher accuracy compared to the runner up, which used handcrafted features). Unfortunately, the high accuracy of NNs is accompanied by an enormous computational/memory cost during
training and inference. Training an NN is a time-consuming, iterative process where in each iteration, all
training data is applied to the model and the parameters of the model are updated according to stochastic
gradient descent.
As another example, hyperdimensional (HD) learning models train quickly, are highly adaptable and
computationally efficient (compared to NNs), but suffer from lower levels of accuracy compared to NNs
[65]. HD learning uses randomly generated, high-dimensional vectors to project training data into HD
space such that samples belonging to the same class are placed in close proximity of each other, forming
a cluster in the HD space. It then defines HD centroids that represent different classes. This relatively
simple training process only requires one pass over the training data. It also enables efficient incremental,
lifelong learning because updating the model with new training data is as simple as updating the cluster
centroids. The major disadvantage of HD learning is that it works with raw or handcrafted input features,
which are inferior to the ones extracted by NNs.
The complementary characteristics of NNs and HD models encourage the introduction of a hybrid,
synergic machine learning model that builds on their strengths while avoiding their shortcomings. However, simply employing NNs for feature extraction and HD models for classification so as to enable on-chip
learning presents the following challenges. Not only is the training of NNs for feature extraction an iterative, energy-consuming process, but it also requires access to both previous training data and newly provided data to avoid catastrophic forgetting. Therefore, frequent weight updates of NNs can be extremely costly in the context of learning on-a-chip. Additionally, the HD learning models that work well for solving cognitive tasks have a huge number of dimensions, e.g., 10,000, which forces their hardware implementations to time-share resources and therefore incur a relatively high latency. This prevents real-time fine-tuning
of the model when new training data becomes available. Moreover, training NNs for feature extraction
separately from the design of the HD learning model produces suboptimal results because it does not account for the effect of HD classification layers on the NN feature extraction layers and vice versa. This
means that the prediction/classification accuracy of the overall hybrid solution will suffer.
This work presents SynergicLearning, a hybrid learning framework for incremental, on-line learning
on a chip. SynergicLearning is comprised of three components which enable end-to-end learning:
1. A Two-step Training Approach: This training approach first trains an NN while including some
components of the HD learning system in the NN’s training loop to learn high-quality, high-level
features that are specifically tailored for the HD learning system. It then passes training data (including the initial data as well as the ones that are generated during the lifetime of the model) through the
feature extraction layers of the NN to provide features for training/fine-tuning of the HD classifier
(the neural network parameters are fixed at this step). Such a two-level training approach enables
automatic feature extraction while reducing the number of dimensions in the HD classifier by two
to three orders of magnitude.∗
2. An On-chip Learning Module: This module is comprised of parameterized NN and HD processing modules, which respectively execute operations required by the NN feature extraction layers
and operations required by the HD classifier. The NN processing module includes a systolic array
which performs vector-matrix multiplications and an ALU which supports operations such as batch
normalization, pooling, and ReLU. The HD processing module supports the arithmetic operations
defined in the HD computing including binding, bundling, and distance calculation (Section 3.2 details these operations). The parameterized hardware implementation enables efficient exploration
of the design space to find configurations that satisfy the design constraints such as energy and
resource utilization.
∗While the term hyperdimensional learning is no longer applicable to such a classifier, we keep using the same term to highlight the fact that the operations used in the classifier are based on those defined in the hyperdimensional computing framework.
3. A Compiler: The custom compiler performs code optimizations and generates instructions that
efficiently schedule different operations required by the NN feature extraction and HD classification
steps (e.g., vector-matrix multiplications and data movement) on the target platform.
Table 3.1 compares different characteristics of NNs, HD learning systems (HDL), and the proposed SynergicLearning approach. It is observed that SynergicLearning enjoys automatic feature extraction and
high accuracy because it employs an NN that is tailored for HDL. Furthermore, it only requires one pass
to train/fine-tune its HD classifier and last but not least, it does not require accessing previous training
samples to update the model when new data becomes available.
The remainder of this chapter is organized as follows. Section 3.2 explains the preliminaries on HD
computing, discusses some of its shortcomings, and motivates the presented solution. Next, Section 3.3
details the proposed learning framework while Section 3.4 explains the proposed hardware architecture
and compiler for inference. After that, Section 3.5 presents the experimental results while Section 3.6
briefly reviews the related work on HD computing. Finally, Section 3.7 concludes the chapter.
3.2 Preliminaries & Motivation
HD computing defines a new computation framework that relies on high-dimensional random vectors
(aka hypervectors) and the arithmetic operations that manipulate such large random patterns. An HD
system starts by randomly generating d_h-dimensional, holistic seed hypervectors with independent and identically distributed (i.i.d.) elements. This means that the information encoded into each hypervector is
Table 3.1: Comparison of different characteristics of NNs, HD learning systems, and SynergicLearning.
Model              Automatic Feature   High       One-pass Training/   Adaptable w/o Accessing
                   Extraction          Accuracy   Fine-tuning          Previous Training Samples
NN                 ✓                   ✓          ✗                    ✗
HDL                ✗                   ✗          ✓                    ✓
SynergicLearning   ✓                   ✓          ✓                    ✓
uniformly distributed over all its elements. Therefore, unlike the conventional computing framework, elements in different bit positions in hypervectors are equally significant. The seed hypervectors are typically
stored in a memory called the cleanup memory. The arithmetic operations defined on the seed hypervectors, e.g. binding and bundling, enable meaningful computations in the corresponding hyperspace. The
focus of this chapter is on binary hypervectors where each element is equally likely to be a zero or one.
Binary hypervectors enjoy simplified, hardware-friendly arithmetic operations.
The distance between two binary hypervectors is measured as the normalized Hamming distance, i.e., the number of bit positions where the values of the hypervectors differ, divided by d_h. Consequently, the distance is always in the range zero to one, inclusive. However, because the distance between two randomly generated hypervectors follows a binomial distribution, most hypervectors are about 0.5 apart from one another (when d_h is large) and therefore are nearly orthogonal (aka unrelated). Additionally, flipping the values of a relatively large portion of the elements in a hypervector, e.g., one-third of all elements, results in a hypervector that is still closer to the original hypervector than to its unrelated hypervectors. This results in considerable tolerance to noise and approximation. When the cleanup memory is queried with a noisy hypervector, it returns the seed hypervector that is closest to the input query, hence the name cleanup.
Two of the commonly used arithmetic operations in HD computing are binding and bundling. The binding operation is used for variable-value association. Assume variable z and its corresponding value z0 are represented by unrelated hypervectors z and z0, respectively. Then, the bound pair z = z0 can be represented by z ∗ z0, where element-wise multiplication (∗) reduces to element-wise XOR for binary hypervectors. The resulting hypervector is unrelated to both z and z0. However, each original hypervector can be recovered from the resulting hypervector given the other, e.g., z0 = S((z ∗ z0) ∗ z), where S(.) looks up the cleanup memory. This process is called unbinding. The bundling operation condenses a list of hypervectors into a single representative hypervector that is similar to all of its constituents. This is achieved by summing up all hypervectors, followed by the comparison of each element in the resulting
(summation) hypervector with half the number of original hypervectors to create a binary hypervector.
If the original hypervectors are bound, their variables and/or values can be found through unbinding the
bundled hypervector.
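For concreteness, the sketch below implements binding (XOR), bundling (element-wise majority), and the normalized Hamming distance for binary hypervectors packed into 64-bit words. It is an illustrative software model of the operations described above; the dimension DH is chosen arbitrarily and ties in the majority are rounded down.

```c
// Sketch: core HD operations on packed binary hypervectors.
#include <stdint.h>
#include <string.h>

#define DH 1024                 /* hypervector dimension (illustrative) */
#define WORDS (DH / 64)

typedef struct { uint64_t w[WORDS]; } HV;

/* Binding: element-wise XOR of two hypervectors. */
HV hv_bind(const HV *a, const HV *b) {
    HV r;
    for (int i = 0; i < WORDS; ++i) r.w[i] = a->w[i] ^ b->w[i];
    return r;
}

/* Bundling: element-wise majority over n hypervectors. */
HV hv_bundle(const HV *vs, int n) {
    HV r; memset(&r, 0, sizeof r);
    for (int bit = 0; bit < DH; ++bit) {
        int ones = 0;
        for (int k = 0; k < n; ++k)
            ones += (int)((vs[k].w[bit / 64] >> (bit % 64)) & 1u);
        if (2 * ones > n) r.w[bit / 64] |= (uint64_t)1 << (bit % 64);
    }
    return r;
}

/* Normalized Hamming distance between two hypervectors. */
double hv_distance(const HV *a, const HV *b) {
    int diff = 0;
    for (int i = 0; i < WORDS; ++i)
        diff += __builtin_popcountll(a->w[i] ^ b->w[i]);
    return (double)diff / DH;
}
```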
The HD computing framework can be used to solve cognitive tasks such as speech recognition and
activity recognition [65, 60, 88]. Alg. 5 summarizes different steps of training an HD model. The inputs to
the algorithm are the d_l-dimensional input features, their corresponding labels/classes, the dimension of the hypervectors (d_h), and the number of quantization levels used to discretize the input values, while the outputs are the HD centroids representing each class. The training starts with the generation of seed hypervectors for all d_l features as well as for the quantized values they can assume. While the seed hypervectors for the features are generated randomly, the ones for the quantized values are obtained by randomly flipping a specific number of bits of a seed hypervector, to ensure the similarity of the hypervectors representing nearby values. Next, each feature and its value are bound, and the set of all bound hypervectors is bundled into a single hypervector (aka encoding). Finally, the encoded hypervectors are grouped according to their labels, and the set of hypervectors belonging to a class is bundled to find a representative centroid. During inference, the closest centroid to an encoded test sample (in terms of normalized Hamming distance) determines the model's prediction.
By fixing d_h and increasing q, the hypervectors representing different quantization levels become more
similar because fewer bits are flipped across consecutive hypervectors. This, in turn, complicates the
unbinding process because the cleanup memory may return the wrong values. Fig. 3.1 clearly illustrates
this phenomenon by depicting the mean and standard deviation of the normalized absolute error between
the input features and the decoded features of their encoded hypervectors. Ideally, decoding encoded
hypervectors should return the exact same low-dimensional features as the original inputs (i.e., zero error),
but this does not happen in practice. We believe this phenomenon is the main reason for the relatively poor
performance of HD models compared to some other machine learning models such as NNs. Therefore,
Algorithm 5 Training an HD Model
Require:
  X_{n×d_l} = x_{1..n}, x_i ∈ R^{d_l}       ▷ the low-dimensional input features
  y = y_{1..n}, 1 ≤ y_i ≤ c                 ▷ the target labels/classes
  d_h                                        ▷ the number of hyperspace dimensions
  q                                          ▷ the number of quantization levels
Ensure:
  T_{c×d_h} = t_{1..c}                       ▷ the HD centroids
 1: generate S_{d_l×d_h} = s_{1..d_l}        ▷ seed hypervectors for features
 2: p = ⌊d_h / q⌋                            ▷ number of bits to flip
 3: generate q_1 randomly
 4: for i = 2..q do
 5:     q_i = randomly pick p unflipped bits and flip them in q_{i−1}
 6: end for
 7: Q_{q×d_h} = q_{1..q}                     ▷ seed hypervectors for levels
 8: for each x_i do                          ▷ encode all samples
 9:     x_i^q = quantize(x_i, q)             ▷ quantize real values to integers
10:     X_i = ∅
11:     for j in 1..d_l do                   ▷ bind feature-value pairs
12:         X_i = X_i ∪ bind(s_j, q_{x_{ij}^q})
13:     end for
14:     x_i^enc = bundle(X_i)                ▷ bundle bound hypervectors
15: end for
16: T_1 = T_2 = ... = T_c = ∅
17: for each x_i^enc do                      ▷ group encoded inputs by labels
18:     T_{y_i} = T_{y_i} ∪ x_i^enc
19: end for
20: for each T_k do                          ▷ bundle all members of each class
21:     t_k = bundle(T_k)
22: end for
23: return T
creating input features that are aware of the error due to very similar quantization levels can
improve classification accuracy significantly, especially at lower d_h.
3.3 Proposed Method
Fig. 3.2 demonstrates a high-level overview of the proposed hybrid learning framework. The proposed
framework comprises two major components: an encoder-aware NN for high-quality feature extraction
and an HD classifier.
Figure 3.1: The mean and standard deviation of the normalized absolute error between the input features and the decoded features of their encoded hypervectors for different values of d_h and q. Ideally, this error should be zero everywhere. However, the error has a non-zero value even at extremely high dimensions (d_h ≃ 10,000).
The NN includes feature extraction layers, an HD encoder-decoder pair (i.e. HD codec), and classifier
layer(s). It takes the input features, passes them through the said components (aka forward propagation),
and calculates a loss value by comparing the predicted labels with the expected ones. It then updates the
model parameters, i.e. weights and biases, by backpropagating the loss value using the derivative of the
operations defined in the NN. Because the operations defined in the codec are not differentiable, and because an ideal codec should behave like the identity function, the codec's derivative is approximated with that of the identity function during backpropagation.
Pre-processing input features with an NN has numerous important advantages. First, including the
codec in the training loop encourages the NN to adjust its parameters such that it minimizes the
impact of the codec’s error on classification accuracy. Training two identically initialized NNs, one
including a codec and the other without a codec, would result in a completely different set of parameters.
Second, the number of features extracted by the NN (d_NN) can be much lower than the number of low-dimensional input features (d_l), which in turn reduces the complexity of the HD classifier. Third, because
(Figure 3.2 notation: y_i: input label; ŷ_i: NN's predicted label; x_i: input features; x_i^NN: encoder-aware NN features; x_i^enc: encoded NN features; x̃_i^NN: reconstructed NN features. The depicted pipeline consists of feature extraction layers, the HD codec (encoder/decoder), classifier layer(s) with a loss function, and an HD classifier holding c class centroids of dimension d_h.)
Figure 3.2: A high-level overview of the SynergicLearning framework. First, an encoder-aware NN is
trained to extract high-quality, high-level features (top row of the figure). Next, encoded NN features are
provided to train an HD classifier. Finally, during inference, the feature extraction layers of the NN and
the HD classifier are both utilized to predict each test sample’s label.
the NN extracts encoder-aware features, d_h can be reduced by two to three orders of magnitude compared to the existing HD systems. In other words, because the degree of similarity of hypervectors representing quantization levels is less concerning, lower d_h values can work equally well. Fourth, there is a large body
of work on reducing the complexity of NNs through quantization [21, 153], pruning [146], and knowledge
distillation [48], to name but a few. This allows training NNs that are lightweight, thereby adding little
overhead to the overall hardware cost.
The HD classifier is very similar to the one described in Alg. 5. The only difference is that it takes the output of the NN's feature extraction layers (x_i^NN) instead of the original input features (x_i). Therefore, it not only benefits from the inherent strength of NNs in feature extraction but also enjoys features that are specifically tailored for the HD encoder.
3.4 Proposed Hardware Architecture & Compiler
The proposed hardware architecture provides an end-to-end, fully-parameterized implementation of SynergicLearning for inference. It consists of two major hardware components: an NN processing module,
which includes a systolic array and an ALU, and a fully-parallel HD processing module which supports
various operations such as binding, bundling, and distance calculation.
3.4.1 NN Processing Module
Fig. 3.3 demonstrates a high-level overview of the NN processing module, which comprises the following
components:
• the systolic array, which consists of a two-dimensional array of processing elements,
• on-chip memories (i.e. weight, input, and output buffers), which act as an intermediate storage
between the DRAM and the systolic array,
• tree adders, each of which performs a summation over a row of the systolic array, and
• ALUs, which support activation functions, batch normalization, pooling, etc.
Processing each layer of the NN requires the following operations. First, the weights are read from the
external memory (DRAM) and stored in the weight buffer while inputs are read either from the DRAM or
the output buffer. Next, the systolic array and tree adders calculate the neurons’ pre-activation values by
implementing vector-matrix multiplications. Then, ALUs apply batch normalization, activation function,
pooling, etc. to pre-activations to generate the output features. Finally, the output features are either
rerouted to the input buffers or written back to the DRAM.
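The per-layer flow just described can be mimicked in software by a weight-stationary, tiled vector-matrix multiplication followed by a simple ALU stage. The sketch below is only a functional model of the dataflow; the tile sizes and the ReLU-only ALU are illustrative assumptions, not the generated hardware.

```c
// Functional sketch of one fully connected layer processed tile by tile:
// each weight tile is reused across its whole input tile (weight-stationary),
// and an ALU stage then applies the activation function.
#include <stddef.h>

#define H_SYS 16   /* systolic array rows (output tile height), illustrative */
#define W_SYS 16   /* systolic array columns (input tile width), illustrative */

static float relu(float x) { return x > 0.0f ? x : 0.0f; }

/* out[d_out] = relu(W[d_out][d_in] * in[d_in] + bias[d_out]) */
void fc_layer(const float *W, const float *bias, const float *in, float *out,
              size_t d_in, size_t d_out) {
    for (size_t o0 = 0; o0 < d_out; o0 += H_SYS) {            /* output tiles */
        for (size_t o = o0; o < o0 + H_SYS && o < d_out; ++o)
            out[o] = bias[o];                                  /* init accumulators */
        for (size_t i0 = 0; i0 < d_in; i0 += W_SYS) {          /* input tiles */
            /* the weights of this H_SYS x W_SYS tile stay "stationary" here */
            for (size_t o = o0; o < o0 + H_SYS && o < d_out; ++o)
                for (size_t i = i0; i < i0 + W_SYS && i < d_in; ++i)
                    out[o] += W[o * d_in + i] * in[i];         /* PE MACs + adders */
        }
        for (size_t o = o0; o < o0 + H_SYS && o < d_out; ++o)
            out[o] = relu(out[o]);                             /* ALU stage */
    }
}
```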
The systolic array implements a weight-stationary dataflow [122], which reuses each weight value
in different computations involved in vector-matrix multiplication and therefore, reduces the overhead
associated with data movement. In this dataflow, the number of cycles it takes to process each layer of the
NN is approximated by
\[
\left( \frac{d_{l_i}}{w_{sys}} + \log_2 w_{sys} \right) \times \frac{d_{l_{i+1}}}{h_{sys}},
\]
where d_{l_i} is the number of neurons in the i-th layer, w_sys (h_sys) is the number of columns (rows) of the systolic array, and log2(w_sys) is the depth of each tree adder.
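As a hedged numeric illustration of this estimate (with hypothetical layer and array sizes, not the configuration used in our experiments), consider a fully connected layer with d_{l_i} = 512 inputs, d_{l_{i+1}} = 256 outputs, and a 16 × 16 systolic array:
\[
\left( \frac{512}{16} + \log_2 16 \right) \times \frac{256}{16} = (32 + 4) \times 16 = 576 \ \text{cycles}.
\]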
For mapping a neural network presented in high-level languages to the target FPGA, we developed an
in-house compiler called SynergicCompiler. Since all hardware designs presented in this chapter perform
the same computation, i.e. a three-level nested loop for fully connected layers, the space explorations are
defined by transformations (e.g., block, reorder, and parallelize) on the nested loop. Therefore, the compiler
tries various choices of loop ordering and hardware parallelism for computing these nested loops of NNs
and finds the most efficient one in terms of latency. The SynergicCompiler also generates a static schedule
for the data movements between hierarchies of memories, e.g., between external memories and buffers,
and buffers and registers within PEs. Static scheduling mitigates the need for complex handshaking and
improves the scalability and performance of the processing modules. Finally, the compiler delivers a set of
instructions that efficiently schedule different operations such as vector-matrix multiplications and data
movement on the target platform. More details about the compiler are not included in this chapter for
brevity.
3.4.2 HD Processing Module
Fig. 3.4 demonstrates a high-level overview of the pipelined HD processing module, which comprises the
following components:
• lookup tables (LUTs) that store hypervectors representing quantized levels,
• binding/unbinding units, which perform parallel XOR operations,
• majority counters, which compute the population count of a bit vector by incrementing a (log2(d_l + 1) + 1)-bit counter when a set bit is encountered and decrementing it when a reset bit is seen,
• comparators, which produce binary hypervectors from integer hypervectors, and
• tree adders and tree comparators, which implement a fully-parallel Hamming distance calculation
and therefore, produce outputs in constant time.
The architecture of the proposed HD processing module has the lowest achievable latency but suffers
from high resource consumption at large d_h values compared to other possible architectures such as the ones explained in [105]. However, because SynergicLearning allows the utilization of HD learning systems with extremely low d_h values, the resource usage of the HD processing module will be negligible.
Furthermore, because all the aforementioned components produce their results in constant time, the final
output of the HD processing module will be produced in constant time too. Additionally, because of the
pipelined implementation of the HD processing module, it can produce an output every cycle, hence very
high throughput.
3.5 Results & Discussion
3.5.1 Experimental Setup
3.5.1.1 Datasets
To study the effectiveness of SynergicLearning, we use two publicly available datasets: Human Activity
Recognition (HAR) [4] and ISOLET [24]. HAR includes 10,299 samples, each of which contains 561 handcrafted features and a label that corresponds to one of six possible activities. ISOLET, on the other hand,
contains 7,797 samples, each of which includes 617 handcrafted features and a label that corresponds to
one of the 26 characters in the English alphabet. The goal is to take the input features and their labels and
train classifiers that predict labels of unseen samples accurately.
3.5.1.2 Training Framework
We implement a PyTorch-compatible [92] HD computing library that includes operations such as binding/unbinding, bundling, encoding, and decoding. Because of the compatibility with PyTorch, the operations can be mapped efficiently to either CPUs or GPUs. Additionally, they can be easily integrated into
existing PyTorch designs such as NNs.
We also implement a training ecosystem that takes a user-defined (possibly existing) NN architecture
and the parameters of the HD learning system (e.g., d_h and q) and automatically glues the different components
together to enable encoder-aware training of the neural network. Similarly, it includes easy-to-use HD
training modules. This training ecosystem allows us to quickly explore different designs and compare
their accuracy.
3.5.1.3 Neural Network Training
We train all NNs by minimizing a cross-entropy loss function for 120 epochs, with a batch size of 256, and
an l2 regularizer. Additionally, we use a learning rate scheduler similar to the one described in [113] where
the maximum learning rate is set to 0.01 while the number of steps per epoch is 25.
3.5.1.4 Hardware Emulation Framework
To implement the NN and HD processing modules, we use the Xilinx SDAccel which provides a toolchain
for programming and optimizing different applications on Xilinx FPGAs using a high-level language (C,
C++ or OpenCL) and/or hardware description languages (VHDL, Verilog and SystemVerilog), as well as
a runtime based on the OpenCL APIs that can be used by the host-side software to interact with the
accelerator. We evaluate our proposed architecture using SDAccel on the ISOLET dataset targeting the
Xilinx UltraScale+ VU9P FPGA on AWS EC2 F1 instances. We also use the Vivado power report provided
by Xilinx to assess the power consumption of each design.
3.5.2 The Impact of NNs on the Quality of HD Features
In this section, we study the impact of NNs on the quality of encoded HD features by visualizing different
samples of the HAR dataset in two-dimensional (2D) space. The feature extraction layers of the NNs consist
of two fully-connected layers, each of which has 561 neurons. We deliberately keep the number of neurons
in the final feature extraction layer the same as that of the input features (i.e., d_NN = d_l) to ensure the
difference across the results of various experiments is only due to the introduction of NNs. We use ReLU
and PACT [21] for the activation functions of the first and second layer, respectively.
Fig. 3.5 shows the 2D representation of the encoded hypervectors of the test set for three different
designs: HDL, NN followed by HDL, and encoder-aware NN followed by HDL (i.e. the proposed flow). To
obtain the 2D representation, we employ t-distributed stochastic neighbor embedding (t-SNE) [78], which
is a technique used for visualizing high-dimensional data. t-SNE tends to provide good visualizations
because it tries to keep the similarities in HD space in the 2D representation as well. The 2D representations
of hypervectors belonging to different classes are shown using different colors. For figures 3.5a-3.5c, we
use d_h = 16, and for figures 3.5d-3.5f, we use d_h = 10,240. For all experiments, q = 4.
For small values of d_h (e.g., 16), it is observed that HDL performs poorly in separating points in the HD space (Fig. 3.5a). On the other hand, adding an NN to the flow helps separate the data points more properly (Fig. 3.5b), while introducing an encoder-aware NN leads to a near-perfect clustering of the data (Fig. 3.5c). The accuracy values reported in Figs. 3.5a-3.5c further support this observation. For large values of d_h (e.g., 10,240), it is observed that HDL performs relatively well while the models that include NNs
still outperform the HDL model by a large margin (Fig. 3.5d-3.5f). In this configuration, the model that
includes an NN and the one that has an encoder-aware NN perform almost equally well.
3.5.3 Comparison of Classification Accuracy
Table 3.2 compares the highest values of accuracy reported for NNs and HD learning systems with the proposed SynergicLearning approach on the HAR and ISOLET datasets. It is observed that on these datasets,
the proposed hybrid model outperforms both NNs and HD learning systems used in the prior work.
Fig. 3.6 compares classification accuracy of three different models (HDL, NN followed by HDL, and
encoder-aware NN followed by HDL) for different values of d_h and q on the HAR and ISOLET datasets. It
is observed that the model that includes an NN consistently outperforms HDL while the model that has
an encoder-aware NN outperforms the other two in almost all experiments. On the HAR dataset, the
difference between the model with an encoder-aware NN and the HDL model is as large as about 63% at
d_h = 16, while it decreases to about 14% at d_h = 10,240. Similarly, on the ISOLET dataset, the difference between the model with an encoder-aware NN and the HDL model is as large as about 83% at d_h = 16, while it decreases to about 10% at d_h = 10,240.
Another key observation is that the model with an encoder-aware NN achieves almost the same level
of accuracy at different values of d_h. This is particularly interesting from a hardware-cost perspective, because we can pick the lowest value of d_h (16 in these experiments) and achieve a significant reduction in
resource utilization while maintaining high accuracy.
We also study the effect of different random seeds for initialization of NN weights and randomly generated seed hypervectors on classification accuracy. Based on our experiments, the difference between the
lowest and highest values of classification accuracy across designs that use different seeds is at most 1%.
We believe such variation in classification accuracy is acceptable.
†They also reported higher accuracy of 97.6 % when they added statistical features and data centering methods to their
convolutional neural network.
Table 3.2: Top accuracy reported for NNs, HD learning systems, and SynergicLearning on HAR and ISOLET
datasets.
Dataset   Machine Learning Model   Accuracy (%)
HAR       NN [59]‡†                95.31
          HDL [63]                 93.4
          SynergicLearning         96.44
ISOLET    NN [11, 61]∗             95.9
          HDL [61]                 93.8
          SynergicLearning         96.67
‡Uses a convolutional neural network.
∗Uses a fully-connected network with 48 hidden layers.
3.5.4 Incremental Learning
Table 3.3 compares the accuracy of HD learning models and SynergicLearning when a portion of data is
initially used for training while the remaining data is used for fine-tuning the model on a chip. Because
on-chip-learning is extremely costly for NNs, we do not consider them in this comparison. As expected,
the HDL model is insensitive to whether the training data is provided incrementally or all at once and
therefore, its accuracy remains constant and relatively low. For the SynergicLearning model, on the other
hand, the accuracy keeps increasing when more data is provided to the NN in the initial training phase
because it allows the NN to find higher quality features. This encourages less frequent, off-line updates to
the NN for increasing the accuracy of the model.
3.5.5 The Hardware Cost of NN & HD Processing Modules
Fig. 3.7 shows the LUT utilization and latency of HD processing modules for different values of d_h while limiting the number of adders in each stage of the tree adders to 16. It is observed that the latency grows very
Table 3.3: Comparison of the effect of incremental learning on the accuracy of different models on the
ISOLET dataset.
Machine Learning Model   Accuracy (%) at ratio of initial training data
                         0.25      0.5       0.75      1
HDL                      85.76     85.76     85.76     85.76
SynergicLearning         86.21     91.21     94.03     95.77
Table 3.4: Comparison between the hardware metrics of SynergicLearning (d_h = 16) and pure HD (d_h = 10,240) over the ISOLET dataset on the Xilinx UltraScale+ VU9P FPGA. The improvements of our approach compared to the other approaches are shown in parentheses.
Approach           Implementation   BRAMs-18K (%)   DSPs-48E (%)   FFs (%)      LUTs (%)     Latency (µs)   Power (W)
SynergicLearning   NN+HD            1.8             15.0           0.8          5.1          23.3           5.3
Pure HD [105]      Parallel         0 (N/A)         0 (N/A)        11.0 (93%)   15.0 (66%)   49.5 (53%)     8.5 (38%)
Pure HD [105]      Sequential       0 (N/A)         0 (N/A)        11.0 (93%)   9.0 (43%)    788.7 (97%)    7.7 (31%)
NN [11, 61]        Systolic Array   1.7 (-6%)       15.0 (0%)      0.7 (-14%)   3.6 (-42%)   835.9 (97%)    5.1 (-4%)
rapidly when increasing d_h to the values required for meeting the accuracy requirements. Additionally, to reduce the resource utilization for large values of d_h, we can change the architecture from a fully-parallel architecture to a vector-sequential architecture in which all adders and counters operate sequentially (compare the Sequential and Parallel implementation entries in Table 3.4). While our parameterized architecture is capable of generating both the parallel and the vector-sequential HD processing module for the SynergicLearning approach, we report the results for the parallel implementation, which delivers higher performance. Thanks to the extremely low d_h value in SynergicLearning, the hardware overhead of the parallel implementation is minimal.
Table 3.4 compares the area utilization, latency, and power consumption of SynergicLearning at d_h = 16 with those of the pure HD processing module at d_h = 10,240. SynergicLearning outperforms the fully-parallel pure HD processing module in terms of latency by a factor of 2.13x while yielding 1.60x lower power consumption. Compared to the vector-sequential implementation of the HD processing module, SynergicLearning achieves a 33.89x improvement in latency while yielding 1.45x lower power consumption.
It is worth mentioning that our designs are capable of achieving high clock rates (i.e. 344 MHz). The
breakdown of different metrics between the NN and HD processing modules is as follows. The NN processing module consumes 93%, 100%, 87%, and 71% of the total consumed BRAMs-18K, DSPs-48E, FFs, and
LUTs, respectively. The latency of the NN processing module is 23.12µs and the power consumption of
the HD processing module is negligible compared to the NN processing module (i.e. less than 4% of total
power consumption).
3.6 Related Work
Kanerva [65] explains the advantages and mathematical properties of HD computing, and how data patterns should correspond in a systematic way to the entities they represent in the real world for achieving
brain-like computing. Some prior works attempt to improve the performance of HD computing, either by increasing the obtained accuracy on complex tasks or by enabling it to maintain accuracy at lower dimensions. The authors of [60] propose a hierarchical HD computing framework, which enables HD to improve its performance using multiple encoders without increasing the cost of classification. In [88], the authors utilize the mathematics of hyperdimensional spaces to split each class hypervector into separate components and combine them into a reduced-dimensional model. However, these works have not explored the effect of feature extraction on low-dimensional input features.
Several studies in the literature explore hardware optimizations for implementing HD computing for
different application domains. Authors in [96] propose a memory-centric architecture for the HD classifier
with modular and scalable components, and demonstrate its performance on a language identification
task. In [27], authors develop a programmable and scalable architecture for energy-efficient supervised
classification using HD computing, and compare it with traditional architectures for a few conventional
machine learning algorithms. The work in [62] explores architectural designs for the cleanup memory to
facilitate energy-efficient, fast, and scalable search operation, and the proposed designs are evaluated for
a language recognition application.
3.7 Conclusions
In this chapter, we introduced SynergicLearning, a novel hybrid learning framework designed for incremental, on-line learning on a chip. We acknowledged the unique strengths and limitations of different
machine learning models, particularly NNs and HD learning models. NNs excel at automatic feature extraction and achieve high accuracy, but they lack the adaptability required for efficient on-chip learning. In contrast, HD models are computationally efficient and train quickly, making them suitable for real-time fine-tuning, but they suffer from lower accuracy. Recognizing the complementary nature of these models, we presented a
synergistic approach that harnesses their advantages while mitigating their shortcomings.
The core components of SynergicLearning include a two-step training approach, an on-chip learning
module, and a custom compiler. This approach offers a unique way of combining the strengths of NNs
and HD models to create an end-to-end learning framework that addresses the needs of on-chip learning.
The two-step training approach leverages an NN to automatically extract high-quality, high-level features
specifically tailored for the HD learning system. It also reduces the dimensionality of the HD classifier,
ensuring efficient processing. The on-chip learning module includes parameterized hardware implementations for both NN and HD processing, offering flexibility and efficient resource utilization. The custom
compiler optimizes code and generates instructions that schedule various operations, improving performance on the target platform.
SynergicLearning offers automatic feature extraction, high accuracy, and the capability for one-pass
training and fine-tuning of the HD classifier. Most importantly, it achieves these feats without the need to
access previous training samples when updating the model with new data. By designing NNs that include
some components of the HD models in their training loop, we trained high-quality feature extraction layers
tailored to the HD learning model. By passing the input low-dimensional features through these layers
before encoding them into the HD space, the number of dimensions of the HD space was reduced by two
to three orders of magnitude while maintaining high classification accuracy, which led to a less complex HD classifier.
Using our proposed hardware architecture for an end-to-end, fully-parameterized implementation of SynergicLearning inference, we achieved a 2.13x improvement in latency while yielding 1.60x lower power consumption compared to pure HD computing.
Acknowledgements
This research was sponsored in part by a grant from the Software and Hardware Foundations program of
the National Science Foundation.
Figure 3.3: Architectural view of the NN processing module which includes a systolic array, on-chip memories, tree adders, and ALUs.
Figure 3.4: Architectural overview of the HD processing module which includes lookup tables that store
hypervectors representing quantized levels, binding/unbinding units, majority counters, comparators, tree
adders, and tree comparators.
(Panels: (a) HDL, q = 4, d_h = 16, accuracy = 37.60%; (b) NN followed by HDL, q = 4, d_h = 16, accuracy = 77.13%; (c) encoder-aware NN followed by HDL, q = 4, d_h = 16, accuracy = 96.00%; (d) HDL, q = 4, d_h = 10,240, accuracy = 80.90%; (e) NN followed by HDL, q = 4, d_h = 10,240, accuracy = 95.05%; (f) encoder-aware NN followed by HDL, q = 4, d_h = 10,240, accuracy = 96.17%.)
Figure 3.5: Two-dimensional (t-SNE) representation of the encoded hypervectors of the HAR dataset for
three different designs: HDL, NN followed by HDL, and encoder-aware NN followed by HDL.
Figure 3.6: Classification accuracy of the different models (HDL, NN + HDL, and encoder-aware NN + HDL) on the HAR and ISOLET datasets as a function of log2(d_h), for q = 4 and q = 16.
Figure 3.7: LUT utilization and latency (in cycles) of the HD processing module for different values of d_h.
Chapter 4
Modeling Processor Idle Times in MPSoC Platforms to Enable
Integrated DPM, DVFS, and Task Scheduling Subject to a Hard Deadline
4.1 Introduction
Energy consumption is one of the most important design criteria of computing devices, ranging from
portable embedded systems to servers in data centers. Furthermore, with growing demand for high performance in embedded systems, architectures such as multiprocessor system-on-chip (MPSoC) are becoming
more popular for many real-time applications. In order to reduce energy consumption in such embedded systems, two main techniques are used, namely, dynamic voltage and frequency scaling (DVFS) and
dynamic power management (DPM). In DVFS, operating voltage and clock frequency of processors are adjusted based on workload characteristics. With DPM, processors are switched to a low power state (sleep
mode) when they are not used for execution of any tasks (idle time/interval). This leads to the reduction
of static power consumption. However, switching to a sleep mode has non-negligible time and energy
overhead, and it only causes energy savings when the idle time of a processor is longer than a threshold
called break-even time [42].
There have been many research studies regarding reducing the energy consumption using DVFS and/or
DPM. A major portion of these studies only considers DVFS for the energy optimization on single and
multiprocessor platforms [8, 55, 41]. Reference [90] has focused on DPM and has proposed an energy-efficient scheduling approach relying on minimizing the number of processor switchings and maximizing the usage of energy-efficient cores in heterogeneous platforms. Some other research studies have integrated scheduling of tasks with DVFS and then, in a final phase, have applied DPM wherever possible [116]. However, with
the increase in static power portion of the total power consumption of systems [54], both DPM and DVFS
should be integrated with scheduling of tasks for the sake of energy optimization. Reference [100] has
combined DPM and DVFS for minimizing energy consumption of a uniprocessor platform performing
periodic hard real-time tasks with precedence constraints. A major challenge of integrating DPM with the
scheduling of tasks in a multiprocessor platform is formulating idle intervals and their associated energy
consumption in the total energy consumption of the these platforms. The authors in [13] have developed an
energy-minimization formulation for a multiprocessor system considering both DVFS and DPM and solves
it via mixed integer linear programming (MILP). However, one major assumption in their formulation is
that the processor assignment for the tasks to be scheduled is known in advance. Furthermore, they only
consider inter-task DVFS, i.e., the frequency of the processor stays constant for the entire duration of the
execution of a task. However, when there is a set of discrete frequencies available for task execution (as
that is the case in [13] and also our work as we will see in Section 5.3.3), allowing intra-task DVFS and the
usage of a combination of discrete frequencies for execution of tasks can result in more energy savings
[42].
In this chapter, by proposing a method for modeling idle intervals in a multiprocessor system, we
present an energy optimization MILP formulation integrating both DVFS and DPM with scheduling of
real-time tasks with precedence and time constraints. By solving the MILP, for each task, we obtain the
optimum processor assignment, execution start time, and the distribution of its workload among available
frequencies of the processor. To the best of our knowledge, this is the first work that integrates both DVFS
and DPM with scheduling of real-time periodic dependent tasks in a formulation that provides optimum
values for all the aforementioned results simultaneously in a multiprocessor platform. We also present
a heuristic approach for solving the model and compare its results with those obtained from solving the
MILP. The rest of the chapter is organized as follows: Section 4.2 explains the models used for the problem
formulation and presents the formal problem statement. Section 4.3 presents the proposed method and
MILP formulation. Section 4.4 provides the results. Finally, Section 4.5 concludes the chapter and discusses
future work.
4.2 Models and Problem Definition
4.2.1 Voltage and Frequency Change Overhead
The frequency change for modern processors takes on the order of tens of microseconds, depending on the amount and direction (up or down) of the frequency change. According to [91], frequency downscaling for the Intel Core2 Duo E6850 processor takes approximately 10 to 60 microseconds, depending on the amount of the frequency change. In contrast, the transition to and from sleep modes of modern processors usually takes on the order of a few milliseconds. Therefore, in our modeling, we ignore the latency overhead of
switching frequencies compared to that of transition to and from sleep modes of a processor. The energy
overhead associated with frequency change is also small and neglected in our modeling.
4.2.2 Task Model
Tasks to be scheduled are modeled as a task graph which itself is a directed acyclic graph (DAG) represented by G(V, E, Td), in which V denotes the set of tasks (we have a total of n tasks), E denotes data
dependencies among tasks, and Td denotes the period of the task graph (i.e., tasks in the task graph are
repeated after Td). Each task graph should be scheduled before the arrival of the next one (i.e., Td acts
as a hard deadline for scheduling of tasks). In this work, the workload of each task is represented by the
total number of processor cycles required to perform that task completely. For task u (u = 1, 2, ..., n), this
workload is represented by Wu.
4.2.3 Energy Model
For modeling the processor power consumption during the execution of a task at frequency f, similar to [41], the following model is used:

P = a f^{α} + b f + c,   (4.1)

in which a f^{α} represents the dynamic power portion, and b f + c represents the static power portion of the total processor power consumption. Here, α indicates the technology-dependent dynamic power exponent, which is usually ≈ 3, and a is a constant that depends on the average switched capacitance and the average activity factor.
Therefore, energy consumption in one clock cycle, when executing a task with frequency f, is obtained via the following formulation:

E_{cycle} = a f^{α−1} + b + c/f.   (4.2)
For modeling the processor energy consumption during an idle time, the E_{idle} function is used according to the formulation presented in (4.3). Here, for illustration purposes, we only use one sleep mode to switch to and wake up from, and the power consumption during this sleep mode is considered to be zero (it is straightforward to extend the work to support multiple sleep modes, each associated with a different non-zero power consumption):

E_{idle}(I) = c × I, if 0 ≤ I < T_{be};   E_{sw}, if T_{be} ≤ I < T_d;   0, if I = T_d.   (4.3)
where I represents the idle time, c is the frequency-independent component of power consumption (obtained by setting f to zero in (4.1)), and E_{sw} is the switching energy overhead for both switching to the sleep mode and waking up from it. T_{be} represents the break-even time and is obtained as follows:

T_{be} = max(T_{sw}, E_{sw}/c),   (4.4)

where E_{sw}/c represents the minimum idle time required so that switching to the sleep mode and waking up from it causes energy savings, and T_{sw} is the physical time needed for both switching to the sleep mode and waking up from it. T_{be} is the maximum of these two values. Furthermore, the third case in
(4.3) conveys the fact that if no task is assigned to a processor or equivalently I = Td, that processor is not
used for scheduling of tasks and thus does not contribute to the total energy consumption at all. Therefore,
our model explores the possibility of scheduling the task graph on a subset of K available processors if it
results in energy savings.
4.2.4 Problem Statement
Using the combination of DVFS and DPM, where each of these techniques can be done for each processor independently, we are looking for energy-optimized scheduling of the task graph represented by
G(V, E, Td) on a platform comprising K homogeneous processors, subject to a hard deadline. Each processor supports a set of m distinct frequencies: {f_1, f_2, ..., f_m}. We consider a non-preemptive scheduling method; therefore, once the execution of a task starts on a processor, it continues until the task completes without any interruption. Consequently, for each task, we are looking for optimum values of: the processor assignment for the task, the task execution start time, and the distribution of the total number of processor cycles required for the complete execution of the task among the m available frequencies.
4.3 Proposed Method
4.3.1 Constraints of the Proposed Scheduling Model
In this section, we formulate constraints of the proposed scheduling model. Duration of task u (u =
1, 2, ..., n) is formulated as follows:
Dur_u = Σ_{i=1}^{m} N_{u,i}/f_i,   (4.5)

where N_{u,i} indicates the number of processor cycles performed at f_i (i = 1, 2, ..., m) for the execution of task u. Therefore:

Σ_{i=1}^{m} N_{u,i} = W_u,   N_{u,i} ≥ 0.   (4.6)
According to (4.2) and (4.6), the energy consumption during the execution of task u can be formulated as follows:

E_{task}(u) = Σ_{i=1}^{m} N_{u,i} · (a f_i^{α−1} + b + c/f_i).   (4.7)
To ensure each task finishes its execution before T_d, for u = 1, 2, ..., n, we have:

S_u + Dur_u ≤ T_d,   S_u ≥ 0,   (4.8)

where S_u represents the start time of the execution of task u. Furthermore, the precedence constraint is formulated as follows:

S_u + Dur_u ≤ S_v,   ∀e(u, v) ∈ E.   (4.9)
Here, we do not consider any inter-task communication cost associated with e(u, v) for sending output
data of task u to input data of task v (The model can be easily extended to incorporate this cost).
For the assignment of task u to processor k (k = 1, 2, ..., K), we introduce the decision variable P_{k,u}, which is defined as follows:

P_{k,u} = 1 if task u is assigned to processor k, and 0 otherwise.   (4.10)
Therefore, we have the following constraint:
Σ_{k=1}^{K} P_{k,u} = 1,   for u = 1, 2, ..., n.   (4.11)
One other important constraint that needs to be satisfied is that the execution of tasks assigned to the
same processor shall not overlap each other (non-preemptive scheduling). For this, we define an auxiliary
decision variable called O_{k,u,v} representing ordering of tasks. For k = 1, 2, ..., K; u = 1, 2, ..., n; v = 1, 2, ..., n, v ≠ u; we define:

O_{k,u,v} = 1 if task u is scheduled immediately before task v on processor k, and 0 otherwise.   (4.12)
In addition, if task v is the first task assigned to processor k, we define Ok,0,v to be 1 (and is 0 otherwise).
On the other hand, if task u is the last task assigned to processor k, we define Ok,u,n+1 to be 1 (and is 0
otherwise). Furthermore, if there is no task assigned to processor k, we define Ok,0,n+1 to be 1 (and is 0
otherwise). Accordingly, using (4.12) and the definitions provided for O_{k,0,v}, O_{k,u,n+1}, and O_{k,0,n+1}, we
have the following constraints for k = 1, 2, ..., K:
Σ_{v=1, v≠u}^{n+1} O_{k,u,v} = P_{k,u},   for u = 0, 1, ..., n,   (4.13)

Σ_{u=0, u≠v}^{n} O_{k,u,v} = P_{k,v},   for v = 1, 2, ..., n + 1.   (4.14)
According to (4.13), if task u is assigned to processor k (P_{k,u} = 1), either there is one and only one task scheduled immediately after task u on processor k, or task u is the last task assigned to processor k. Similarly, according to (4.14), if task v is assigned to processor k (P_{k,v} = 1), either there is one and only one task scheduled immediately before task v on processor k, or task v is the first task assigned to processor k. In both (4.13) and (4.14), P_{k,0} and P_{k,n+1} are defined as 1 for all k = 1, 2, ..., K.
For non-preemptive scheduling we should have:
Σ_{k=1}^{K} Σ_{u=1, u≠v}^{n} (S_u + Dur_u) · O_{k,u,v} ≤ S_v,   for v = 1, 2, ..., n,   (4.15)

which can be formulated as the following linear constraint:

S_u + Dur_u − (1 − O_{k,u,v}) × T_d ≤ S_v,   for u = 1, 2, ..., n; v = 1, 2, ..., n, v ≠ u; k = 1, 2, ..., K.   (4.16)
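For readers who prefer to see these constraints in executable form, the sketch below builds a small instance of the assignment, ordering, and non-preemption constraints (4.11), (4.13), (4.14), and (4.16) using the open-source PuLP modeler. This is only an illustration: the thesis solves the full MILP with CPLEX, the task durations, edges, and objective below are made-up placeholders, and the durations are fixed at their maximum-frequency values instead of carrying the N_{u,i} variables of (4.5)-(4.7).

from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary

n, K, Td = 3, 2, 12.0                        # tasks 1..n, processors 1..K, period (ms)
Dur = {1: 2.0, 2: 3.0, 3: 1.5}               # fixed task durations (ms), illustrative
E = [(1, 2), (1, 3)]                         # precedence edges of the task graph
tasks, procs = range(1, n + 1), range(1, K + 1)
nodes = range(0, n + 2)                      # 0 and n+1 are the dummy first/last slots

prob = LpProblem("idle_aware_scheduling_subset", LpMinimize)
S = LpVariable.dicts("S", tasks, lowBound=0)                    # start times
P = LpVariable.dicts("P", (procs, nodes), cat=LpBinary)         # processor assignment
O = LpVariable.dicts("O", (procs, nodes, nodes), cat=LpBinary)  # immediate-ordering variables

for k in procs:                              # dummy tasks 0 and n+1 are "assigned" to every processor
    prob += P[k][0] == 1
    prob += P[k][n + 1] == 1
for u in tasks:                              # (4.11): each real task goes to exactly one processor
    prob += lpSum(P[k][u] for k in procs) == 1
for k in procs:                              # (4.13)/(4.14): per-processor chains from 0 to n+1
    for u in range(0, n + 1):
        prob += lpSum(O[k][u][v] for v in range(1, n + 2) if v != u) == P[k][u]
    for v in range(1, n + 2):
        prob += lpSum(O[k][u][v] for u in range(0, n + 1) if u != v) == P[k][v]
for u in tasks:                              # (4.8): finish before the deadline
    prob += S[u] + Dur[u] <= Td
for (u, v) in E:                             # (4.9): precedence constraints
    prob += S[u] + Dur[u] <= S[v]
for k in procs:                              # (4.16): linearized non-preemption constraint
    for u in tasks:
        for v in tasks:
            if u != v:
                prob += S[u] + Dur[u] - (1 - O[k][u][v]) * Td <= S[v]

prob += lpSum(S[u] for u in tasks)           # placeholder objective; the full model would use (4.19)
prob.solve()

Fixing the durations keeps this sketch linear without the lemma-based linearization discussed later in Section 4.3.3.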
4.3.2 Modeling Idle Intervals
Using Ok,u,v variables introduced in Section 4.3.1, we can conveniently model idle intervals in an MPSoC
platform. Specifically, for each task v (v = 1, 2, ..., n), we formulate the amount of the idle time before
servicing task v on the processor to which task v is assigned. When task v is not the first task scheduled
on the processor to which it is assigned, the idle time before servicing task v can be written as follows:
I_v = (1 − Σ_{k=1}^{K} O_{k,0,v}) × (S_v − Σ_{k=1}^{K} Σ_{u=1, u≠v}^{n} (S_u + Dur_u) · O_{k,u,v}).   (4.17)
If task v is the first task scheduled on any of the K processors, the first factor of the multiplication in (4.17) causes I_v to be zero. In that case, the idle time before servicing task v on the processor k to which the task is assigned is obtained using the following:
I'_k = (T_d − Σ_{u=1}^{n} (S_u + Dur_u) · O_{k,u,n+1}) + Σ_{v=1}^{n} O_{k,0,v} × S_v.   (4.18)
In (4.18), the second term in the summation represents the idle time on processor k before servicing its first assigned task in the current period. On the other hand, the first term in the summation in (4.18) represents the idle time on that processor after servicing its last assigned task in the previous period. This interval should also be taken into account when calculating the amount of idle time before servicing the first task scheduled on processor k. If there is no task assigned to processor k at all, (4.18) gives the value of T_d for I'_k.
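The following small sketch evaluates what (4.17) and (4.18) compute for a fixed, hypothetical schedule (two processors, three tasks); it only illustrates the bookkeeping that the O_{k,u,v} variables encode and is not part of the MILP itself.

Td = 12.0                                      # period / hard deadline (ms)
start = {1: 0.0, 2: 3.0, 3: 1.0}               # S_u (ms)
dur   = {1: 2.0, 2: 4.0, 3: 2.5}               # Dur_u (ms)
proc_order = {1: [1, 2], 2: [3]}               # execution order of tasks on each processor

idle_before_task = {}                          # I_v of (4.17) for non-first tasks
idle_around_first = {}                         # I'_k of (4.18) for each used processor

for k, order in proc_order.items():
    first, last = order[0], order[-1]
    # (4.18): idle after the last task of the previous period plus idle before the
    # first task of the current period on processor k.
    idle_around_first[k] = (Td - (start[last] + dur[last])) + start[first]
    # (4.17): idle between consecutive tasks on the same processor.
    for u, v in zip(order, order[1:]):
        idle_before_task[v] = start[v] - (start[u] + dur[u])

print(idle_before_task)    # {2: 1.0}
print(idle_around_first)   # {1: 5.0, 2: 9.5}; an unused processor would get T_d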
4.3.3 Objective Function
Subject to constraints formulated so far, we are trying to minimize the following objective function which
represents the total energy consumption:
Σ_{u=1}^{n} E_{task}(u) + Σ_{v=1}^{n} E_{idle}(I_v) + Σ_{k=1}^{K} E_{idle}(I'_k).   (4.19)
The objective function of (4.19), alongside the formulated constraints, forms a mixed integer program over the positive real variables S_u and N_{u,i} and the Boolean decision variables P_{k,u} and O_{k,u,v}. The numbers of these variables in our problem are n, nm, nK, and (n + 1)^2 K − nK, respectively.
However, due to the formulations presented for the idle time intervals in (4.17) and (4.18), and the concave piecewise behavior of the E_{idle} function in (4.3) in a minimization problem, it is a non-linear, non-convex program (E_{task}(u) in (4.19) is linear with respect to the positive real variables N_{u,i}, so this term does not contribute to the nonlinearity of the problem).
For linearizing (4.17) and (4.18), we use the lemma from [13], which is stated as follows: given constants s_1 and s_2, if P_1 and P_2 are two constraint spaces where P_1 = {[t, b, x] | t = bx, −s_1 ≤ x ≤ s_2, b ∈ {0, 1}} and P_2 = {[t, b, x] | −b s_1 ≤ t ≤ b s_2, t + b s_1 − x − s_1 ≤ 0, t − b s_2 − x + s_2 ≥ 0, b ∈ {0, 1}}, then P_1 and P_2 are equivalent. The proof of this lemma is given in [13]. With this lemma, we can substitute the multiplication of a Boolean decision variable and a bounded real variable with a newly introduced bounded real variable and the three added linear constraints indicated in P_2. Applying this lemma multiple times, we eventually reach linear representations of the idle-interval formulations in (4.17) and (4.18).
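As a small illustration of how this lemma is applied, the helper below (a sketch in PuLP-style Python with hypothetical variable names; the thesis itself uses CPLEX) replaces the product t = b·x of a binary variable b and a bounded real variable x with one new bounded real variable and the linear constraints of P_2.

from pulp import LpProblem, LpMinimize, LpVariable, LpBinary

def linearized_product(prob, b, x, s1, s2, name):
    # Returns a real variable t constrained to equal b * x, where b is binary and
    # -s1 <= x <= s2, using the constraint space P2 of the lemma.
    t = LpVariable(name, lowBound=-s1, upBound=s2)
    prob += t <= s2 * b                  # upper half of -b*s1 <= t <= b*s2
    prob += t >= -s1 * b                 # lower half; together they force t = 0 when b = 0
    prob += t + b * s1 - x - s1 <= 0     # with b = 1 this gives t <= x
    prob += t - b * s2 - x + s2 >= 0     # with b = 1 this gives t >= x, hence t = x
    return t

# Example: linearizing one product (S_u + Dur_u) * O_{k,u,v} appearing in (4.17),
# with the bounded term modeled here as a stand-alone variable in [0, T_d].
prob = LpProblem("lemma_demo", LpMinimize)
O_kuv = LpVariable("O_kuv", cat=LpBinary)
su_plus_dur = LpVariable("su_plus_dur", lowBound=0, upBound=12.0)
t_kuv = linearized_product(prob, O_kuv, su_plus_dur, 0.0, 12.0, "t_kuv")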
Furthermore, Eidle(Iv) in (4.19) can be written as follows:
E_{idle}(I_v) = S_v · E_{sw} + (1 − S_v) · (c × I_v),   (4.20)
where S_v is a Boolean decision variable which is 1 when T_{be} ≤ I_v < T_d and is 0 otherwise (I_v < T_{be}). Therefore, this decision variable indicates whether or not we switch the processor to the sleep mode during I_v. Since I_v represents the amount of idle time before servicing task v on the processor to
which task v is assigned when task v is not the first task on that processor, Iv can never be Td. Therefore, we
do not need to formulate the third term of (4.3) in (4.20). Corresponding constraint for Sv (v = 1, 2, ..., n),
is written as follows:
(I_v − T_{be}) / T_d ≤ S_v ≤ I_v / T_{be},   S_v ∈ {0, 1}.   (4.21)
For E_{idle}(I'_k), a formulation similar to (4.20) can be used, except that we need another Boolean decision variable, U_k, which represents whether we assign any task to processor k or not. When U_k is 0, it means processor k is not used at all for scheduling the task graph and thus does not contribute to the energy consumption in (4.19). Therefore, U_k is 1 when we assign one or more tasks to processor k (I'_k < T_d) and is 0 otherwise (I'_k = T_d, or equivalently, T_d ≤ I'_k ≤ T_d). Accordingly, E_{idle}(I'_k) in (4.19) can be written as follows:
E_{idle}(I'_k) = U_k · [S'_k · E_{sw} + (1 − S'_k) · (c × I'_k)],   (4.22)
where S'_k represents whether or not we switch the processor to the sleep mode during I'_k (similar to S_v). The usage of U_k in (4.22) allows formulating the third case of (4.3). The corresponding constraints for S'_k and U_k (k = 1, 2, ..., K) are written as follows:
(I'_k − T_{be}) / T_d ≤ S'_k ≤ I'_k / T_{be},   S'_k ∈ {0, 1},   (4.23)
(I'_k − T_d) / T_d ≤ U_k ≤ I'_k / T_d,   U_k ∈ {0, 1}.   (4.24)
In order to linearize (4.20) and (4.22), we again use the aforementioned lemma. However, for (4.22), where we have a multiplication of two Boolean decision variables, we also need the following lemma: if P_1 and P_2 are two constraint spaces where P_1 = {[z, x, y] | z = xy, x ∈ {0, 1}, y ∈ {0, 1}} and P_2 = {[z, x, y] | z ≤ x, z ≤ y, x + y − z ≤ 1, x ∈ {0, 1}, y ∈ {0, 1}}, then P_1 and P_2 are equivalent. Using these lemmas and methods for linearizing the objective function of (4.19), the energy-optimized scheduling problem stated in Section 4.2.4 is modeled as an MILP formulation.
4.4 Results
4.4.1 Experiment Setup
In order to solve the formulated MILP, we use IBM ILOG CPLEX Optimization Studio [58]. The platform
on which simulations are performed is a computer with a 3.2 GHz Intel Core i7-8700 Processor and 16
GB RAM. Using [13] for obtaining the energy model parameters, the frequency-independent component of processor power consumption, which is represented by c in (4.1), is obtained as 276 mW. Each processor can operate independently of the other processors at one of f_1 = 1.01 GHz, f_2 = 1.26 GHz, f_3 = 1.53 GHz, f_4 = 1.81 GHz, or f_5 = 2.1 GHz. For these frequencies, the frequency-dependent component of processor power consumption, which is represented by a f^{α} + b f in (4.1), is 430.9 mW, 556.8 mW, 710.7 mW, 896.5 mW, and 1118.2 mW, respectively. Using curve fitting, we obtain a = 23.8729, b = 401.6654, and α = 3.2941 in (4.1). E_{sw} and T_{sw} are set to 385 µJ and 5 ms, respectively. Here, we consider an architecture with 4
processors. Simulations are performed on 8 task graphs randomly generated using TGFF [30], which is a
randomized task graph generator widely used in the literature to evaluate the performance of scheduling
algorithms. Detailed information for each task graph is presented in Table 4.1. For the studied task graphs, the average workload of each task is set to 2 × 10^6 cycles (around 1 ms of execution time at the maximum frequency). The maximum in-degree and out-degree of each node are set to 2 and 3, respectively. The number of tasks in the studied random task graphs ranges from 7 to 28.
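The curve fit mentioned above can be reproduced with a few lines of Python; the use of SciPy here is our own choice (the thesis does not state which fitting tool was used), while the frequency and power values are the ones listed in this section.

import numpy as np
from scipy.optimize import curve_fit

f_ghz = np.array([1.01, 1.26, 1.53, 1.81, 2.10])            # available frequencies (GHz)
p_dyn_mw = np.array([430.9, 556.8, 710.7, 896.5, 1118.2])   # frequency-dependent power (mW)

def freq_dependent_power(f, a, b, alpha):
    # Frequency-dependent component a*f^alpha + b*f of the power model in (4.1).
    return a * f**alpha + b * f

(a, b, alpha), _ = curve_fit(freq_dependent_power, f_ghz, p_dyn_mw, p0=[20.0, 400.0, 3.0])
print(a, b, alpha)   # expected to land near a = 23.87, b = 401.67, alpha = 3.29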
To evaluate the advantage of our modeling of idle intervals in multiprocessor systems, we consider two cases: 1) a baseline case which uses only the first term of (4.19) as the objective function alongside the scheduling constraints of Section 4.3.1; in other words, in this baseline case, we do not use any idle-time-related terms in the objective function or constraints; and 2) using (4.19) as the objective function alongside all constraints and linearization techniques mentioned in Section 4.3 (this case is our proposed method). In the baseline case, switching to the sleep mode during an idle time is done, if possible, after the scheduling is finished (i.e., DPM is not integrated with DVFS and scheduling in the baseline case). Therefore, the baseline case is an integrated Scheduling and Clock-and-voltage scaling followed by mode Transition algorithm (iSC+T). The second case, which is our proposed method, is referred to as an integrated Scheduling, Clock-and-voltage scaling, and mode Transition algorithm (iSCT).
4.4.2 Effect of Modeling Idle Intervals
According to Table 4.1, including the energy consumption of modeled idle intervals in the objective function causes an average energy saving of 15.34% (up to 25.21%) for iSCT versus iSC+T. To better observe
the contribution of modeling idle intervals in an MPSoC platform, for each scheduled task graph, the total
number of idle intervals on all processors, and total time of these idle intervals are shown in Table 4.2 for
both iSCT and iSC+T. Furthermore, for each scheduled task graph, the number of used processors for the
Table 4.1: Task Graphs Characteristics and Corresponding Energy Consumption Values Obtained from iSCT versus iSC+T

Task Graph | No. of Tasks | Total Workload of Tasks (×10^6 processor cycles) | Td (ms) | Total Energy Consumption, iSC+T (mJ) | Total Energy Consumption, iSCT (mJ) | Energy Saving, iSCT vs. iSC+T (%)
TGFF1 | 7 | 15.89 | 8 | 12.67 | 10.45 | 17.52
TGFF2 | 11 | 18.69 | 12 | 13.60 | 12.06 | 11.32
TGFF3 | 14 | 34.39 | 10 | 25.99 | 22.70 | 12.66
TGFF4 | 15 | 31.89 | 12 | 28.08 | 21.00 | 25.21
TGFF5 | 16 | 34.88 | 12 | 27.35 | 23.02 | 15.83
TGFF6 | 18 | 33.46 | 14 | 25.38 | 21.98 | 13.40
TGFF7 | 22 | 44.94 | 22 | 33.74 | 29.39 | 12.89
TGFF8 | 28 | 56.81 | 18 | 42.95 | 37.00 | 13.85
scheduling of that task graph, out of maximum 4 processors, is also presented in Table 4.2 for both iSCT
and iSC+T.
According to Table 4.2, while for all scheduled task graphs the total time of idle intervals is higher or the same for iSCT compared to iSC+T, the number of idle intervals for iSCT is notably smaller than the number of idle intervals for iSC+T (on average fewer than half). Therefore, by including the energy consumption of modeled idle intervals in the objective function of (4.19), instead of having a number of distributed short idle intervals, we obtain fewer, merged, longer idle intervals. This results in more opportunities for switching the processors to the sleep mode during idle intervals and thus more energy savings (as indicated in Table 4.1). In fact, for the task graphs studied in this work, the percentage of idle intervals that are longer than T_{be}, and during which we can therefore switch the processor to the sleep mode, is 91.67% and 32.31% for iSCT and iSC+T, respectively.
On the other hand, as indicated in Table 4.2, iSCT explores the possibility of using a subset of the 4 processors if it results in energy savings. For the task graphs discussed in this work, iSCT always uses fewer than 4 processors for the scheduling. The unused processors do not contribute to the energy consumption at all (we do not need to switch them to the sleep mode and wake them up in every time period). This can be helpful in terms of energy efficiency, particularly when the energy overhead of switching processors is relatively high. Reference [13] cannot take advantage of this since the processor assignment for the tasks to be scheduled is assumed to be known in advance and is not integrated into their MILP formulation.
On the platform on which we performed simulations, the iSCT and iSC+T approaches on average generated results for the studied task graphs in less than 69 minutes and 1 minute, respectively. Since we are considering the scheduling of a periodic task graph, these simulations are done offline only once for one period of the task graph. The obtained schedule can be programmed onto an MPSoC for real-time scheduling of each arriving period of the task graph.
Table 4.2: Idle Intervals Characteristics and No. of Used Processors for iSCT versus iSC+T

Task Graph | No. of Idle Intervals (iSC+T) | No. of Idle Intervals (iSCT) | Total Idle Time (ms), iSC+T | Total Idle Time (ms), iSCT | No. of Used Processors (iSC+T) | No. of Used Processors (iSCT)
TGFF1 | 5 | 3 | 21.62 | 24.00 | 3 | 1
TGFF2 | 4 | 3 | 35.79 | 36.00 | 4 | 1
TGFF3 | 6 | 3 | 17.53 | 21.22 | 4 | 2
TGFF4 | 10 | 3 | 27.16 | 29.00 | 4 | 2
TGFF5 | 7 | 3 | 25.21 | 29.00 | 4 | 2
TGFF6 | 8 | 3 | 34.13 | 34.13 | 4 | 2
TGFF7 | 10 | 3 | 58.63 | 58.64 | 4 | 2
TGFF8 | 10 | 3 | 34.87 | 36.96 | 4 | 2
4.4.3 A Heuristic Approach to Solve the Model
Here, we propose a two-stage heuristic algorithm to solve the formulated model: 1) We first determine
and fix the values of Ok,u,v and Pk,u variables using a polynomial-time list scheduling algorithm. 2) Using
fixed Ok,u,v and Pk,u values, the number of variables in the original MILP problem reduces significantly.
Also, (4.17) and (4.18) become linear formulations outright, and we no longer need to apply the first lemma presented in Section 4.3.3 multiple times to linearize them. This further reduces the number of variables of the MILP problem considerably (since each use of this lemma adds one set of real variables plus three sets of constraints). Then, the newly formulated problem with a considerably smaller number of variables, which is still an MILP due to the usage of (4.20) and (4.22) in the objective function of (4.19), is solved to obtain the values of S_u and N_{u,i}.
For the first stage, we use a variant of the heterogeneous earliest finish time (HEFT) algorithm [127]. While this algorithm targets heterogeneous platforms, it can be applied to a homogeneous platform such as ours as well. Basically, in this algorithm, tasks are ordered according to their upward rank, which is defined recursively for each task as follows:

rank_{up}(u) = Dur*_u + max_{v ∈ succ(u)} rank_{up}(v),   (4.25)

where Dur*_u is the duration of task u when all of its workload is executed at the maximum available frequency, and succ(u) is the set of immediate successors of task u. Ranks of the tasks are computed recursively, starting from the exit tasks of the task graph (exit tasks are the ones with an out-degree of zero). The upward ranks of exit tasks are equal to their corresponding Dur* values. Basically, rank_{up}(u) indicates the length of the critical path from task u to the exit tasks, including Dur*_u itself.
After the calculation of ranks for all the tasks, a task list is generated by sorting the tasks in decreasing order of their ranks. Tie-breaking is done randomly. Then, tasks are scheduled on processors based on the order of the task list. Each task can only be scheduled after a time called the ready_time of that task, which indicates the time by which the execution of all immediate predecessors of that task has completed. For each task, we look for the first idle interval on each processor after the task's ready_time with a length of at least the Dur* of that task, and assign the task to the processor which gives us the earliest finish time. Since the task list sorted in decreasing order of ranks gives a topological sorting of the DAG [127], when we choose a task for scheduling, its predecessors have already been scheduled. The time complexity of the HEFT algorithm is O(|E| × K), where |E| denotes the number of edges of the DAG and K denotes the number of processors [127].
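A compact way to see the rank computation of (4.25) is the memoized recursion below; the example graph, durations, and function names are illustrative and not taken from the thesis experiments.

from functools import lru_cache

succ = {1: [2, 3], 2: [4], 3: [4], 4: []}          # successor lists; task 4 is the exit task
dur_max_freq = {1: 2.0, 2: 3.0, 3: 1.0, 4: 2.5}    # Dur*_u at the maximum frequency (ms)

@lru_cache(maxsize=None)
def rank_up(u):
    # Eq. (4.25): rank_up(u) = Dur*_u + max over successors v of rank_up(v);
    # an exit task simply gets its own Dur*_u.
    if not succ[u]:
        return dur_max_freq[u]
    return dur_max_freq[u] + max(rank_up(v) for v in succ[u])

# Sorting by decreasing rank yields a valid topological order for list scheduling.
task_list = sorted(succ, key=rank_up, reverse=True)
print([(u, rank_up(u)) for u in task_list])        # [(1, 7.5), (2, 5.5), (3, 3.5), (4, 2.5)]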
Using HEFT in the first stage of our heuristic approach, we determine the processor assignment for
each task (P_{k,u}) and the ordering of tasks on each processor (O_{k,u,v}). Note that the start times obtained for tasks after the first stage only indicate the relative ordering of tasks on each processor. Also, we only use the maximum frequency in the first stage. Next, in the second stage, we solve the newly derived MILP, which is obtained after fixing the O_{k,u,v} and P_{k,u} values in the first stage and has a considerably smaller number of variables compared to the original MILP. This gives us the values of the S_u and N_{u,i} variables. On average, on the platform on which we performed simulations, solving the newly derived MILP provided results for the studied task graphs in less than 2 seconds, which is considerably less than the time needed to solve the original MILP.
Fig. 4.1 shows a comparison between the energy consumption obtained from iSCT, and the energy
consumption obtained from solving the problem using the proposed heuristic approach. According to
Fig. 4.1, the heuristic method provides close estimates compared to the optimum solution. The values of
energy consumption obtained from the heuristic approach are on average 5.66% higher than the optimum
solution.
4.5 Conclusions and Future Work
In this chapter, we have addressed the critical issue of energy consumption in computing devices, particularly in the context of real-time applications running on multiprocessor system-on-chip (MPSoC) architectures. Two essential techniques for energy reduction, dynamic voltage and frequency scaling (DVFS) and
dynamic power management (DPM), were discussed. DVFS adjusts processor voltage and clock frequency
based on workload characteristics, while DPM involves transitioning processors to low power states during idle intervals. However, DPM comes with non-negligible time and energy overhead and is effective
only when idle times exceed a certain threshold known as the break-even time. Balancing these techniques
is a complex challenge due to the trade-offs they entail.
In this chapter, by proposing a method for modeling idle intervals in multiprocessor systems, we presented an energy optimization MILP formulation integrating both DVFS and DPM with scheduling of
real-time tasks with precedence and time constraints. By solving the MILP, for each task, we obtain the
optimum processor assignment, execution start time, and the distribution of its workload among available
frequencies of the processor. Results show the effectiveness of our modeling of idle intervals in MPSoCs
in terms of energy efficiency. Our approach represents a novel contribution in this field, as it is the first to
simultaneously integrate both DVFS and DPM into the scheduling of real-time periodic dependent tasks,
providing optimal solutions for processor assignments, start times, and frequency combinations in a multiprocessor platform.
We also presented a heuristic approach for solving the MILP which provided close results compared
to optimum results. It is worth mentioning that although our proposed model focuses on MPSoCs, it can
also be applicable to servers in data centers by using proper energy model parameters of those platforms.
Figure 4.1: Energy consumption obtained from the proposed heuristic approach and iSCT for different task graphs.
For future work, the workload of tasks can be extended to represent more than just the processor cycle count; e.g., the memory requirement or the possibility of executing all or part of a task on GPUs can be modeled and investigated. Also, obtaining a variant of the proposed model for heterogeneous processors could be another potential future direction.
Chapter 5
Energy-Aware Scheduling of Task Graphs with Imprecise Computations
and End-to-End Deadlines
5.1 Introduction
In many real-time applications, it is often preferred for a task to produce an approximate (aka imprecise)
result by its deadline rather than producing an exact (aka precise) result late [141]. In imprecise computations, a real-time task is allowed to return intermediate and imprecise results of poorer quality as long as it
processes a predefined chunk of work that defines its baseline quality. Imprecise computations increase the
flexibility of scheduling algorithms developed for real-time systems by allowing them to trade off output
quality with utilization of system resources such as processor cycles and/or energy.
There are many real-world applications that encourage deployment of imprecise computations. For
example, in video streaming applications, poor quality images and voices may be tolerable, but video frame
freezes or lags are often not tolerated. Similarly, a self-driving car that can predict the approximate location
of an obstacle quickly and adjusts its speed and direction accordingly is preferred over one that predicts
the exact location of the obstacle much later. Newton-Raphson’s root finding algorithm is another example
where approximate computations may be beneficial. In this iterative algorithm which has a convex error
function, one can find a root close enough to the exact value without performing all required iterations,
hence saving processor cycles and energy [74].
In
imprecise computations, tasks are usually characterized by their mandatory and optional workloads [87,
151, 119, 99, 141, 110, 73, 111]. The number of processor cycles required for a task to provide its minimum
acceptable quality is referred to as the mandatory workload of the task. Mandatory workloads of all tasks
in a task graph should be completed before a hard deadline. Assigning a larger number of processor cycles
to a task beyond its mandatory workload leads to an increase in its quality of results.
The workload of a task beyond its mandatory workload is referred to as the optional workload, which
can be executed partially. When the full workload of a task, both mandatory and optional, is entirely
executed, the results produced by that task are considered precise. The quality of service (QoS) is usually
evaluated as a linear or concave function of the number of processor cycles assigned to optional workloads
of tasks [119].
This work presents a heuristic for scheduling task graphs with potentially imprecise computations,
aiming at maximizing QoS subject to a hard deadline and an energy bound. It also considers the fact that
tasks can be interdependent and the imprecise output of one task affects the input quality of its child tasks.
Therefore, the proposed heuristic takes account of potential extension in the workload of each task based
on the quality of its inputs.
The major contributions of this work can be summarized as follows:
• It takes account of input-quality-dependent workload extension in energy-constrained scheduling
of imprecise, interdependent tasks on multiprocessor system-on-chip (MPSoC) platforms. To the
best of our knowledge, this is the most comprehensive work in this domain to date.
• It presents a mixed integer linear program (MILP) formulation of the same problem, enabling comparison of the proposed heuristic with optimal solutions.
• The proposed heuristic in some cases is capable of finding the same QoS values as those found by the MILP. Furthermore, for those task graphs for which the MILP outperforms the proposed heuristic, the QoS values obtained with the proposed heuristic are, on average, within 1.24% of the optimal solutions, while improving the runtime by a factor of roughly 100.
The rest of the chapter is organized as follows. Section 5.2 reviews prior work while Section 5.3 explains the models used in this work, formally characterizes tasks with potentially imprecise computations,
and presents the problem statement. Next, Section 5.4 explains the proposed heuristic for scheduling task
graphs with imprecise computation on an MPSoC platform. It also presents a comprehensive MILP formulation of the same problem, which allows comparing the proposed heuristic with exact solutions. After
that, Section 5.5 details experimental results. Finally, Section 5.6 concludes the chapter.
5.2 Prior Work
There has been a large body of research on finding efficient solutions for the scheduling problem [86, 35,
117, 151, 89, 26, 116, 103, 7]. The goal of such research is to schedule given tasks on single- and/or multiprocessor platforms considering a possibly hard deadline, energy consumption, and/or QoS. In prior work that focuses on imprecise computation, the objective is to maximize QoS while meeting a hard deadline. The work by Chen et al. [14] is an example of such work, where the existence of mandatory and optional workloads allows trading off QoS with deadlines.
Some of the prior work takes account of energy consumption in addition to a hard deadline and QoS. This energy awareness may be considered in terms of inter-task DVFS [141, 22, 26, 99, 87], intra-task DVFS [141, 26, 99], and/or heterogeneous processors working at different energy levels [87, 86, 99, 141, 26, 151].
Tasks that need to be scheduled on the target platform are either represented using a set of independent tasks or a directed acyclic graph (DAG). The latter is more realistic for real-time applications such as video compression and speech recognition, where the outputs of some tasks are consumed by subsequent tasks [37]. Some of the prior work that models tasks with DAGs considers the effect of input quality on such interdependent tasks. The work by Feng et al. is one of the earliest works that takes this effect into consideration [37]. It does so by increasing the processing times of mandatory and optional workloads whenever inputs are imprecise. This leads to the introduction of input-quality-dependent mandatory and optional workloads for the whole task graph. Such modification of workloads is subsequently considered in
[57, 119]. For example, Stavrinides and Karatza employ workload extension when introducing alternative
versions of common scheduling policies for distributed real-time systems [119].
Among prior work, Ravindran et al. [99] propose a method for scheduling DAGs on multi-processor
platforms considering QoS, energy consumption, and input quality. However, in their model, the effect
of input quality on exit tasks (aka leaf nodes) is found using a recursive function. The introduction of such a recursive function makes the proposed solution practically infeasible for relatively large DAGs. One of the major differences between this work and [99] is that the effect of input quality is studied locally, which allows finding feasible solutions quickly even for large DAGs. Furthermore, this work considers the fact that the minimum number of processor cycles allocated to a task for producing an acceptable level of quality increases as its input quality decreases [57].
In summary, prior work can be classified into different categories based on their characteristics. Some
of those characteristics include:
• Task Model: whether tasks are represented using a set or a DAG,
• Platform: single- or multi-processor,
• Energy Awareness: whether processor(s) in the platform can operate at different energy levels, e.g.
dynamic voltage and frequency scaling (DVFS), and
• Input-Error Awareness: whether the effect of imprecise inputs to child tasks is taken into account.
Table 5.1 compares some of the key prior work with this work in terms of the said characteristics. Indeed,
our work presents a heuristic for scheduling DAGs on multiprocessor platforms where input error and
intra-task DVFS are taken into account. As a result, it is the most comprehensive work in this domain to
date because it is capable of solving problems that constitute a superset of problems prior work addresses.
5.3 Models and Problem Definition
5.3.1 Task Model and Imprecise Computation
Tasks to be scheduled are modeled as a directed acyclic graph represented by G(V, E, Td) in which V
denotes the set of n tasks, E denotes data dependencies among tasks, and Td denotes the period of the
task graph. Td acts as a hard deadline for scheduling and each repetition of the task graph should be
scheduled before the arrival of the next one.
Each task with the possibility of imprecise computation consists of two parts: a mandatory part and an
optional part. In order for a task to produce an acceptable result, its mandatory part must be completed.
The optional part refines the result produced by the mandatory part. If the optional part of a task is not
executed entirely, the result of the task is imprecise and the task has an output error. In a task graph, if
one or more parent tasks of a task u have an output error, task u will have an input error.
Similar to prior work [119, 57], we assume that only the mandatory part of task u is extended to compensate for the input error, while the optional part of task u remains the same. This is a valid
assumption for many applications such as weather forecasting systems [119], image and video processing,
Reference | Task Model (Set / DAG) | Platform (SP / MP) | Energy Awareness (Inter-Task DVFS / Intra-Task DVFS / Heterogeneous Processors) | Input-Error Awareness
This Work ✓ ✓ ✓ ✓
Yu et al. [141] ✓ ✓ ✓ -
Aydin et al. [7] ✓ ✓ - - - -
Rusu et al. [103] ✓ ✓ ✓ -
Stavrinides & Karatza [119] ✓ ✓ - - - ✓
Cortés et al. [26] ✓ ✓ ✓ -
Zhou et al. [151] ✓ ✓ ✓ -
Ravindran et al. [99] ✓ ✓ ✓ ✓
Mo et al. [87] ✓ ✓ ✓ -
Mo et al. [86] ✓ ✓ ✓ -
Stavrinides & Karatza [120] ✓ ✓ - - - ✓
Feng & Liu [37] ✓ ✓ - - - ✓
Table 5.1: Comparison of characteristics of prior work on imprecise scheduling compared to this work (SP
stands for single-processor while MP stands for multi-processor).
and Newton’s root finding method [37]. In other words, the mandatory part of a certain task can be thought of as the minimum number of processor cycles required for the task to produce a result with an acceptable quality, and the mandatory part grows when the quality of a task’s inputs decreases [57]. In order for a task
graph to be considered feasibly scheduled, at least the potentially extended mandatory workload of each
task must be completed before the deadline Td.
The number of processor cycles required to finish the mandatory part of task u when its inputs are
error-free is represented by Mu. For a task u with nonzero input error, its mandatory workload is extended
such that it is capable of producing correct results. The number of processor cycles required to process the
extension added to M_u, which depends on the quality of its inputs, is represented by M^x_u. Therefore, the total mandatory workload, represented by M'_u, is obtained as follows:

M'_u = M_u + M^x_u.   (5.1)
The total optional workload of task u, which can be executed partially, is represented by Ou. The
number of processor cycles actually assigned to the optional workload of task u is represented by ou
(ou ≤ Ou). According to [37], the general mandatory extension function of a task can be estimated by a
straight line, which provides an upper bound on the amount of required extension. Therefore, the slope of
this line, which is represented by m_u and referred to as the task-specific scaling factor [37, 119], quantifies the dependency between E^i_u and M^x_u as follows:

M^x_u = m_u × E^i_u,   (5.2)
in which E^i_u indicates the input error of task u. Similar to [119], E^i_u in a task graph is defined as follows:

E^i_u = min{1, Σ_{j ∈ par(u)} E^o_j},   (5.3)

where par(u) is the set of immediate parents of task u and E^o_j represents the output error of parent task j. E^o_j is defined as the portion of discarded optional workload of task j [37], and is thus obtained as follows:

E^o_j = (O_j − o_j)/O_j = 1 − o_j/O_j,   0 ≤ E^o_j ≤ 1.   (5.4)
Based on (5.3) and (5.4), we have 0 ≤ E^i_u ≤ 1. According to (5.2), when the input of task u is error-free (i.e., E^i_u = 0), M^x_u = 0 and thus M'_u = M_u. On the other hand, when task u has the maximum input error (i.e., E^i_u = 1), M^x_u = m_u and thus M'_u = M_u + m_u. In this case, the mandatory workload extension for task u reaches its maximum. It is worth mentioning that it is not always the case that dropping a number of optional cycles of a parent task is compensated by an extension with the same or fewer cycles in the mandatory workload of the child task. This is in fact due to the potentially different types of these tasks, and is captured by the task-specific scaling factor m_u.
Note that the assumption that workload extension can always compensate the input error is not true in general. However, based on [37], we can transform the given mandatory and optional portions of a task's workload such that in the worst case, when all transformed optional workloads of parent tasks are discarded, the extension amount obtained by (5.2) is able to compensate the input error. Therefore, the M_u and O_u used in our proposed method are transformed versions of the given mandatory and optional workloads of tasks.
The total number of processor cycles assigned to task u is represented by W_u and is obtained as follows:

W_u = M'_u + o_u.   (5.5)
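The error-propagation and workload-extension rules of (5.1)-(5.5) for a single task can be summarized by the short sketch below; all names and numeric workloads are illustrative.

def output_error(O_parent, o_parent):
    # E^o_j = 1 - o_j / O_j, Eq. (5.4); a task with no optional workload is treated as precise.
    return 0.0 if O_parent == 0 else 1.0 - o_parent / O_parent

def extended_mandatory(M_u, m_u, parent_errors):
    # E^i_u = min(1, sum of the parents' output errors), Eq. (5.3);
    # M'_u = M_u + m_u * E^i_u, Eqs. (5.1)-(5.2).
    e_in = min(1.0, sum(parent_errors))
    return M_u + m_u * e_in

# Task u with two parents: one executed precisely, one with its optional part fully
# discarded; workloads are in processor cycles.
errors = [output_error(4e5, 4e5), output_error(6e5, 0)]        # [0.0, 1.0]
M_prime = extended_mandatory(M_u=1e6, m_u=3e5, parent_errors=errors)
W_u = M_prime + 2e5                                            # Eq. (5.5) with o_u = 2e5 cycles
print(M_prime, W_u)                                            # 1300000.0 1500000.0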
5.3.2 Energy Model
Similar to [151], different tasks are assumed to exhibit different power consumption values on the same processor, even when executing at the same frequency. This is due to the fact that the power consumption of a task depends on circuit activities and usage patterns of different functional units [75]. Therefore, the activity factor of a task, denoted by µ ∈ (0, 1], is employed to capture how intensively functional units are being utilized by the task [53]. Taking the activity factor µ into account, we use the following equation, borrowed from [40], to model the power consumption of a processor when operating at clock frequency f for task u:

ρ(u) = µ_u α f^{β} + γ f + δ,   (5.6)

in which ρ(u) represents the total power consumption for task u, and α, β, γ, and δ are the power model coefficients. µ_u α f^{β} represents the dynamic power consumption (in which α depends on the average switched capacitance), and γ f + δ represents the static power consumption. Furthermore, β indicates the technology-dependent dynamic power exponent, which is usually ≈ 3. Therefore, the energy consumption in one clock cycle for task u, ϵ_{cycle}(u), when executing the task at clock frequency f, is obtained from the following equation:

ϵ_{cycle}(u) = µ_u α f^{β−1} + γ + δ/f.   (5.7)
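As a quick numeric illustration of (5.7), the sketch below computes the per-cycle energy for two hypothetical tasks with different activity factors; the coefficient values are placeholders and are not the ones used in the experiments.

def energy_per_cycle(mu_u, f, alpha, beta, gamma, delta):
    # eps_cycle(u) = mu_u * alpha * f^(beta - 1) + gamma + delta / f, Eq. (5.7).
    return mu_u * alpha * f**(beta - 1) + gamma + delta / f

# At the same frequency, a high-activity task costs more energy per cycle than a low-activity one.
print(energy_per_cycle(0.9, 2.0, alpha=0.02, beta=3.0, gamma=0.4, delta=0.3))
print(energy_per_cycle(0.3, 2.0, alpha=0.02, beta=3.0, gamma=0.4, delta=0.3))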
5.3.3 Problem Statement
We seek to schedule a task graph with the possibility of imprecise computations, represented by G(V, E, Td), on a platform comprising K homogeneous processors, each supporting a set of m distinct clock frequencies {f_1, f_2, ..., f_m}, in order to maximize QoS subject to a hard deadline and an energy bound.
QoS highly correlates with how many processor cycles are assigned to the execution of the optional workloads of exit tasks, which are the tasks in the task graph with no child tasks. The reason is that the discarded optional workloads of tasks other than exit tasks are compensated with extensions in the mandatory workloads of their child tasks. Consequently, QoS is quantitatively defined as follows:

QoS = (Σ_{u ∈ exit(G)} P_u) / |exit(G)|,   0 ≤ QoS ≤ 1,   (5.8)
where exit(G) represents the set of exit tasks of task graph G, and Pu represents the precision of task u.
Please note that discarding the optional part of a non-exit task does not affect the QoS defined in (5.8) directly; rather, it affects the mandatory workloads of its child tasks and causes their extension
as explained in detail in Section 5.3.1. This workload extension will compensate for the discarding of
optional workload of the parent task(s). Therefore, QoS would be a function of how many processor cycles
are assigned to the execution of optional workloads of exit tasks which do not have any child tasks. In this
way, optional workloads of non-exit tasks have an indirect effect on the QoS. The reason is, under fixed
deadline and energy budget constraints, they affect the number of processor cycles remaining for optional
workloads of exit tasks.
P_u is a non-decreasing function of the number of processor cycles assigned to the optional workload of task u. Similar to [119], P_u is defined as follows:

P_u = P^T_u + (1 − P^T_u)(o_u / O_u),   (5.9)

in which P^T_u indicates the minimum precision acceptable from task u, aka the precision threshold of task u. P^T_u assumes values between 0 and 1, and indicates the precision of task u when only its (extended) mandatory part is completed [119]. Based on (5.9), executing only the extended mandatory workload of task u (o_u = 0) results in P_u = P^T_u. On the other hand, executing the entire optional workload of task u (o_u = O_u) in addition to its extended mandatory workload leads to P_u = 1. For other values of o_u, P^T_u < P_u < 1.
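The precision and QoS definitions of (5.8) and (5.9) reduce to a few lines of code; the thresholds, workloads, and assigned cycles in this sketch are illustrative.

def precision(p_threshold, O_u, o_u):
    # P_u = P^T_u + (1 - P^T_u) * (o_u / O_u), Eq. (5.9).
    return p_threshold + (1.0 - p_threshold) * (o_u / O_u)

def qos(exit_tasks):
    # QoS = average precision over the exit tasks, Eq. (5.8).
    return sum(precision(*task) for task in exit_tasks) / len(exit_tasks)

# Two exit tasks given as (P^T_u, O_u, o_u): the first runs half of its optional
# workload, the second runs all of it.
print(qos([(0.6, 1e6, 5e5), (0.8, 4e5, 4e5)]))   # (0.8 + 1.0) / 2 = 0.9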
5.4 Proposed Method
The proposed heuristic comprises two main phases:
1. determining the number of processor cycles assigned to optional workloads of non-exit tasks, and
2. scheduling tasks on an MPSoC for maximizing QoS subject to energy and deadline constraints.
Table 5.2 summarizes key notation used in this section.
5.4.1 Determining the Number of Processor Cycles Assigned to Optional Workloads of
Non-Exit Tasks
The first step of the proposed heuristic tries to minimize the sum of the total workloads of non-exit tasks plus the total (extended) mandatory workloads of exit tasks. The intuition behind choosing such an objective function is that minimizing the total number of processor cycles associated with the aforementioned portions of tasks leads to having more processor cycles available for executing the optional workloads
Notation | Description
M_u | Mandatory workload of task u when its inputs are error-free
m_u | Mandatory scaling factor for task u
E^i_u | Input error of task u
M^x_u | The amount of mandatory workload extension when inputs are erroneous
M'_u | Extended mandatory workload of task u with erroneous inputs
O_u | Optional workload of task u
o_u | Number of CPU cycles assigned to the optional workload of task u
E^o_u | Output error of task u
W_u | Total workload of task u
ϵ_{task}(u) | Energy consumption for the execution of task u
ϵ_{max} | Energy bound
ϵ* | Minimum energy required for scheduling the task graph
D_u | Duration of task u
N_{u,i} | The number of processor cycles of task u processed at clock frequency f_i
S_u | Start time of task u
T_d | Deadline for scheduling of the task graph
n | Number of tasks
m | Number of available discrete frequencies
K | Number of processors
Π_{k,u} | Indicates whether task u is assigned to processor k
Table 5.2: Summary of Key Notation
of exit tasks as there are fixed deadline and energy budget constraints. This can result in increased QoS
according to (5.8). Therefore, we aim to minimize the following expression:

[ Σ_{u ∈ non-exit tasks} W_u ] + [ Σ_{v ∈ exit tasks} M'_v ].   (5.10)
We first explain our approach for minimizing (5.10) for two simple task graphs that constitute base
cases. Then, we explain our proposed algorithm for a general task graph.
Figure 5.1: Task graphs of (a) base case 1 and (b) base case 2.
Base Case 1: Consider the task graph demonstrated in Fig. 5.1a. It consists of a parent task p, alongside
b child tasks. The workload defined in (5.10) for this simple task graph can be written as follows:
[M'_p + o_p] + [Σ_{i=1}^{b} M_i + Σ_{i=1}^{b} m_i × (1 − o_p/O_p)],   (5.11)
in which subscripts p and i are used for referring to workload components of the parent task and child
tasks in Fig. 5.1a, respectively. Equation (5.11) can be rewritten as:
[M'_p + Σ_{i=1}^{b} (M_i + m_i)] + [o_p × (1 − (Σ_{i=1}^{b} m_i)/O_p)].   (5.12)
In (5.12), the first term in the summation does not depend on how many processor cycles are assigned to o_p (note that the actual workload M'_p depends on the input error of the parent task and not on o_p). However, the second term is a function of o_p, and minimizing this term leads to the minimization of (5.12). Two possible scenarios are postulated in this case:
1. If Σ_{i=1}^{b} m_i ≤ O_p, o_p should be minimized as much as possible, i.e., o_p = 0. This means the optional workload of the parent task must be discarded.
2. If Σ_{i=1}^{b} m_i > O_p, o_p should be maximized as much as possible, i.e., o_p = O_p. This means that the parent task should be executed precisely. A large number of child tasks and/or large values of their m_i lead to a higher chance of this scenario occurring.
Base Case 2: Consider the task graph demonstrated in Fig. 5.1b. It consists of a child task c, alongside
b parent tasks. The workload of (5.10) for this simple task graph can be written as follows:
[Σ_{i=1}^{b} (M'_i + o_i)] + [M_c + m_c × min(1, Σ_{i=1}^{b} (1 − o_i/O_i))],   (5.13)
in which subscripts c and i are used for referring to workload components of the child task and parent
tasks in Fig. 5.1b, respectively. (5.13) can be rewritten as:
[Σ_{i=1}^{b} M'_i + M_c] + [Σ_{i=1}^{b} o_i + m_c × min(1, Σ_{i=1}^{b} (1 − o_i/O_i))].   (5.14)
In (5.14), the first term in the summation does not depend on how many processor cycles are assigned to o_1, o_2, ..., o_b. However, the second term is a function of how many processor cycles are assigned to the optional workloads of the parent tasks, and therefore this term should be minimized for minimizing (5.14). Two possible scenarios are postulated in this case:
1. If Σ_{i=1}^{b} O_i ≤ m_c, in order to minimize (5.14), all optional workloads of the b parent tasks should be executed completely, i.e., Σ_{i=1}^{b} o_i = Σ_{i=1}^{b} O_i. The proof is included in Appendix A.
2. If Σ_{i=1}^{b} O_i > m_c, in order to minimize (5.14), all optional workloads of the b parent tasks should be discarded, i.e., Σ_{i=1}^{b} o_i = 0. A large number of parent tasks and/or large values of their O_i lead to a higher chance of this scenario occurring. The proof is included in Appendix A.
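The two base-case rules can be written as the small decision functions sketched below; the names and workload numbers are illustrative, and only the fully-execute / fully-discard decision is captured (general task graphs need the two-pass heuristic described next).

def base_case_1(O_p, child_scaling_factors):
    # One parent, b children: execute o_p = O_p precisely only when discarding it
    # would cost more in child extensions (sum of m_i) than it saves.
    return O_p if sum(child_scaling_factors) > O_p else 0.0

def base_case_2(parent_optional_workloads, m_c):
    # b parents, one child: execute all parent optional parts only when their total
    # is at most the child's maximum extension m_c; otherwise discard them all.
    total = sum(parent_optional_workloads)
    if total <= m_c:
        return list(parent_optional_workloads)
    return [0.0] * len(parent_optional_workloads)

print(base_case_1(O_p=5e5, child_scaling_factors=[2e5, 4e5]))   # 500000.0: execute precisely
print(base_case_2([1e5, 2e5], m_c=1e5))                         # [0.0, 0.0]: discard all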
General Task Graphs: While base cases 1 & 2 help determine the number of processor cycles assigned to the optional workloads of tasks in simple task graphs, similar conclusions cannot be drawn for complicated task graphs with interdependent tasks.
For instance, consider an example where two parent tasks share a few child tasks and the goal is to either fully discard or execute the optional workload of tasks within this task graph. Because a few child tasks are potentially shared between the two parent tasks, applying base case 1 or base case 2 without considering the interdependence of tasks may lead to conflicting decisions about the execution of optional workloads. As the number of such parent tasks increases, depending on the interdependencies among them and their shared child tasks, the number of possible permutations that should be explored in terms of fully executing or discarding the optional workloads of those parent tasks can grow exponentially. However, the presented base cases can guide us in developing a heuristic that determines the number of processor cycles assigned to the optional workloads of non-exit tasks.
Note that in the proposed heuristic, it is assumed that the input task graph has only one source task
(i.e. a task with in-degree of zero), but potentially many exit tasks. In task graphs where the number of
source tasks is larger than one, a dummy task with zero workload is introduced and connected to all source
tasks. The steps of the first phase of the proposed heuristic are as follows:
Step 1 (Forward Pass): This step starts traversing tasks in the task graph G from the source task and
labels each task as precise (fully executing its optional workload) or imprecise (fully discarding its optional
workload) based on the task’s optional workload and the total maximum extension of its child tasks if the
task is executed imprecisely.
This step of the proposed heuristic is similar to base case 1. The difference, though, is the fact that
if a child task is encountered more than once due to being a shared child of multiple parent tasks and its
mandatory part is extended because one of its parents is labeled as imprecise, it is not considered when
writing (5.12) for its other parent tasks.
After exploring all paths in the task graph, tasks with multiple parents and extended workloads are
marked. For these tasks, their parent tasks are evaluated again while their marked child tasks are removed
from (5.12). This may lead to an update for a parent task, labeled to be executed precisely before, to get
executed imprecisely due to the unavoidable extension of its child The same process is repeated until no
decisions are further updated. Note that each child task with multiple parents is visited only once during
this update pass. The algorithmic flowchart associated with step 1 is shown in Fig. 5.2a.
Step 2 (Backward Pass): This step starts traversing tasks in the task graph G in the reverse order
from exit tasks back to the source task. For a task with multiple parents, those which are labeled as precise
Figure 5.2: Algorithmic flow chart for determining the number of processor cycles assigned to optional workloads of non-exit tasks: (a) Step 1, and (b) Step 2.
114
are added to a list and sorted in increasing order of the number of child tasks with intact (not extended)
mandatory workloads. The resulting list is called sorted_precise_parents, which includes b tasks.
Next, a subset of tasks in sorted_precise_parents is chosen such that transforming those tasks into imprecise tasks and extending the mandatory workload of their child tasks leads to the highest reduction in (5.10). However, instead of exploring all 2^b possible subsets, we only explore b subsets: the subset containing the first task in the sorted list, the subset containing the first and second tasks in the sorted list, ..., and, for the b-th subset, the subset containing all tasks in the sorted list. The rationale behind this choice is that, according to base case 1, labeling a task with a smaller number of intact child tasks as imprecise is more likely to eventually increase QoS. Such tasks are explored more often in the proposed subsets due to the sorting strategy. The algorithmic flowchart associated with Step 2 is shown in Fig. 5.2b.
Step 2 (Backward Pass) is inspired by base case 2, where multiple parents with shared child tasks can be labeled as imprecise. In other words, the first step of the proposed heuristic looks at parent tasks independently, while the second step studies their combined effect on the overall QoS.
The presented heuristic determines which tasks in a given task graph should be executed imprecisely. Therefore, we refer to this heuristic as imp_label. The optional workload of each non-exit task u marked as imprecise is o_u^{imp_label} = 0, while the optional workload of a precise task is o_u^{imp_label} = O_u. Furthermore, if a non-exit task u has a parent which is labeled imprecise, M'_u^{imp_label} = M_u + m_u; otherwise, M'_u^{imp_label} = M_u. Therefore, the total workload of each non-exit task u determined by imp_label is represented by W_u^{imp_label} and is obtained as follows:

W_u^{imp_label} = M'_u^{imp_label} + o_u^{imp_label}.    (5.15)

Note that imp_label also determines whether the mandatory workload of an exit task v is extended (M'_v^{imp_label} = M_v + m_v) or not (M'_v^{imp_label} = M_v).
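To make the first phase concrete, the following is a minimal Python sketch of the forward-pass labeling of Step 1. The decision rule is our reading of base case 1 (discard a task's optional workload when it is at least as large as the extensions it would cause), the task/graph containers are hypothetical, and the re-evaluation pass for shared extended children as well as Step 2 are omitted.

```python
# Hedged sketch of Step 1 (forward pass) of imp_label under simplifying
# assumptions: a single pass over a topologically sorted DAG; the update
# pass for shared extended children and the backward pass are omitted.
def forward_pass_labels(topo_order, children, O, m, exit_tasks):
    """topo_order: task ids in topological order; children[u]: child ids of u;
    O[u]: optional workload of u; m[v]: extension forced on child v when one of
    its parents is imprecise; exit_tasks: set of exit task ids."""
    label = {}         # 'precise' or 'imprecise' for non-exit tasks
    extended = set()   # children whose mandatory workload is already extended
    for u in topo_order:
        if u in exit_tasks:
            continue   # optional workloads of exit tasks are set later by the LP
        intact = [v for v in children[u] if v not in extended]   # Y(u)
        total_extension = sum(m[v] for v in intact)
        # Base-case-1 style rule: discarding O_u pays off if it is at least as
        # large as the total extension it would impose on intact children.
        if O[u] >= total_extension:
            label[u] = 'imprecise'
            extended.update(intact)
        else:
            label[u] = 'precise'
    return label, extended
```

The returned extended set then indicates which tasks receive M_u + m_u as their (extended) mandatory workload when forming (5.15).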
5.4.2 Scheduling Tasks on an MPSoC for Maximizing QoS Subject to Energy and Deadline
Constraints.
In this section, we seek to schedule the task graph obtained from imp_label on an MPSoC platform for
maximizing QoS subject to energy and time constraints. For this purpose, we determine a proper processor assignment for each task alongside the ordering of tasks on each processor in order to minimize
the finish time while operating at the maximum clock frequency (we temporarily ignore the energy budget constraint). This is achieved by deploying a minimal delay list scheduling algorithm, which is a variant of
Heterogeneous Earliest Finish Time (HEFT) [126].
HEFT assigns a rank to each task in the task graph based on the length of the critical path from that
task to exit tasks. While HEFT is designed for heterogeneous platforms, it can be applied to a homogeneous platform as well. We provide workloads obtained from imp_label for non-exit tasks and (extended)
mandatory workloads for exit tasks plus their total optional workloads as inputs to HEFT. Next, we pick
tasks in decreasing order of their ranks and schedule each selected task on its “best” processor, which is
the processor that minimizes the finish time of the task under the maximum available frequency.
Note that HEFT is used only to obtain a processor assignment for each task and the ordering of tasks on each processor. The start times obtained from HEFT merely indicate the relative ordering of tasks on each processor. Furthermore, we used the maximum frequency in HEFT and included the total optional workloads of all exit tasks because we were temporarily ignoring the energy budget constraint.
Therefore, in the next step, the actual number of processor cycles assigned to optional workload of exit
tasks, the actual distribution of workload of each task among m available frequencies of the processors,
and the actual execution start time of each task should be obtained.
For this purpose, we demonstrate that maximizing QoS for a task graph obtained from imp_label, subject to the energy and time constraints and to the processor assignment and task ordering obtained from HEFT, reduces to a linear programming (LP) formulation. In the following formulation, u and v are used to refer to any of the tasks in the task graph.
The duration of task u, u = 1, 2, ..., n, is formulated as follows:

D_u = \sum_{i=1}^{m} \frac{N_{u,i}}{f_i},  N_{u,i} ≥ 0,    (5.16)

where N_{u,i} indicates the number of processor cycles of task u processed at clock frequency f_i (i = 1, 2, ..., m). If task u is a non-exit task, the following constraint is introduced:

\sum_{i=1}^{m} N_{u,i} = W_u^{imp_label}.    (5.17)

On the other hand, if task u is an exit task, we have:

M'_u^{imp_label} ≤ \sum_{i=1}^{m} N_{u,i} ≤ M'_u^{imp_label} + O_u.    (5.18)

According to (5.6) and (5.16), the energy consumption during the execution of task u can be formulated as follows:

ϵ_{task}(u) = \sum_{i=1}^{m} N_{u,i} \cdot \left( α f_i^{β−1} + γ + \frac{δ}{f_i} \right).    (5.19)

To ensure that the total energy consumption of the tasks is less than or equal to the given energy bound, represented by ϵmax, we have:

\sum_{u=1}^{n} ϵ_{task}(u) ≤ ϵ_{max}.    (5.20)

To ensure the time and precedence constraints, representing the start time of each task u by S_u, we should have:

S_u + D_u ≤ T_d,  u = 1, 2, ..., n,  S_u ≥ 0,    (5.21)

S_u + D_u + C_{u,v} ≤ S_v,  ∀ e(u, v) ∈ E.    (5.22)

In (5.22), C_{u,v} represents the average communication cost associated with e_{u,v} for sending the output of task u to the input of task v.

Finally, we need to ensure that tasks assigned to the same processor do not overlap:

S_u + D_u ≤ S_v,  for tasks u and v that are assigned to the same processor, where task v is the immediate task after task u based on HEFT.    (5.23)

Maximizing the objective function of (5.8), with the constraints introduced in (5.16) to (5.23), forms an LP over the positive real variables S_u, N_{u,i}, and the optional workloads of exit tasks (o_u for u ∈ exit tasks). Please note that the domain of the N_{u,i} variables is in fact the positive integers. However, when solving the formulated LP, we treat N_{u,i} as continuous real variables and then round the result. The impact of one cycle is negligible, as the tasks typically execute for more than hundreds of thousands of cycles [87].
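For concreteness, the following is a minimal sketch of how such an LP could be assembled with the open-source PuLP modeler (the dissertation itself uses CPLEX). The task containers, the HEFT ordering pairs, and the simplified objective used here are placeholders standing in for (5.8) and the data of (5.16)–(5.23).

```python
# Hedged sketch: assembling the LP of Section 5.4.2 with PuLP; the data
# containers and the simplified QoS objective stand in for (5.8)-(5.23).
import pulp

def build_lp(tasks, exits, edges, heft_pairs, freqs, W, Mp, O, C, Td, eps_max,
             alpha, beta, gamma, delta, qos_weight):
    prob = pulp.LpProblem("qos_max", pulp.LpMaximize)
    N = {(u, i): pulp.LpVariable(f"N_{u}_{i}", lowBound=0)
         for u in tasks for i in range(len(freqs))}
    S = {u: pulp.LpVariable(f"S_{u}", lowBound=0) for u in tasks}
    # Duration of each task, eq. (5.16)
    D = {u: pulp.lpSum(N[u, i] * (1.0 / f) for i, f in enumerate(freqs))
         for u in tasks}
    for u in tasks:
        cycles = pulp.lpSum(N[u, i] for i in range(len(freqs)))
        if u in exits:
            prob += cycles >= Mp[u]                  # eq. (5.18)
            prob += cycles <= Mp[u] + O[u]
        else:
            prob += cycles == W[u]                   # eq. (5.17)
        prob += S[u] + D[u] <= Td                    # eq. (5.21)
    # Energy bound, eqs. (5.19)-(5.20)
    prob += pulp.lpSum(N[u, i] * (alpha * f ** (beta - 1) + gamma + delta / f)
                       for u in tasks for i, f in enumerate(freqs)) <= eps_max
    for (u, v) in edges:                             # precedence, eq. (5.22)
        prob += S[u] + D[u] + C[u, v] <= S[v]
    for (u, v) in heft_pairs:                        # same-processor order, eq. (5.23)
        prob += S[u] + D[u] <= S[v]
    # Placeholder objective: reward optional cycles of exit tasks (stand-in for (5.8))
    prob += pulp.lpSum(qos_weight[u] *
                       (pulp.lpSum(N[u, i] for i in range(len(freqs))) - Mp[u])
                       for u in exits)
    return prob
```

Calling prob.solve() would then yield the S_u and N_{u,i} values, with the N_{u,i} rounded to integer cycle counts as discussed above.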
5.4.3 MILP Formulation
In order to evaluate the performance of our two-phase proposed heuristic in Sections 5.4.1 and 5.4.2 compared to the optimal solution, we present a comprehensive MILP formulation of the problem statement
in Section 5.3.3. By solving the MILP, we obtain the optimal values for the number of processor cycles
assigned to the optional workload of each task, processor assignment for each task alongside the ordering
of tasks on each processor, task execution start time, and distribution of the total number of processor
cycles associated with the execution of each task among m available frequencies. For this purpose, the
following variables are defined:
Denoting the number of processors with K, for the processor assignment of task u to processor k,
k = 1, 2, ..., K, we use the decision variable Πk,u, defined as follows:
Π_{k,u} = 1 if task u is assigned to processor k, and 0 otherwise.    (5.24)

Consequently, we have the following constraint for Π_{k,u}:

\sum_{k=1}^{K} Π_{k,u} = 1,  for u = 1, 2, ..., n.    (5.25)

In order to prevent the overlap of the execution of tasks assigned to the same processor with each other, we use the decision variable Y_{k,u,v} indicating the ordering of the tasks. For k = 1, 2, ..., K; u = 1, 2, ..., n; v = 1, 2, ..., n, v ≠ u; we define:

Y_{k,u,v} = 1 if task u is scheduled immediately before task v on processor k, and 0 otherwise.    (5.26)
In addition, if task v is the first task assigned to processor k, Yk,0,v is defined to be 1 (and is 0 otherwise).
On the other hand, if task u is the last task assigned to processor k, Yk,u,n+1 is defined to be 1 (and is
0 otherwise). Furthermore, if there is no task assigned to processor k, Y_{k,0,n+1} is defined to be 1 (and is 0 otherwise). Accordingly, using (5.26) and the definitions provided for Y_{k,0,v}, Y_{k,u,n+1}, and Y_{k,0,n+1}, we have the following constraints for k = 1, 2, ..., K:

\sum_{v=1, v≠u}^{n+1} Y_{k,u,v} = Π_{k,u},  for u = 0, 1, ..., n,    (5.27)

\sum_{u=0, u≠v}^{n} Y_{k,u,v} = Π_{k,v},  for v = 1, 2, ..., n + 1.    (5.28)
According to (5.27), if task u is assigned to processor k (Πk,u = 1), either there is one and only one
task scheduled immediately after task u on processor k or task u is the last task assigned to processor
k. Similarly, according to (5.28), if task v is assigned to processor k (Πk,v = 1), either there is one and
only one task scheduled immediately before task v on processor k or task v is the first task assigned to
processor k. In both (5.27) and (5.28), Πk,0 and Πk,n+1 are defined as 1 for all k = 1, 2, ..., K. Using Yk,u,v,
we rewrite the constraint in (5.23) as the following:
S_u + D_u − (1 − Y_{k,u,v}) × T_d ≤ S_v,  for u = 1, 2, ..., n;  v = 1, 2, ..., n, v ≠ u;  k = 1, 2, ..., K.    (5.29)
Finally, instead of using imp_label algorithm to determine the workload of non-exit and exit tasks in
(5.17) and (5.18), the following constraint is used for all the tasks:
M_u + m_u × E^i_u ≤ \sum_{i=1}^{m} N_{u,i} ≤ M_u + m_u × E^i_u + O_u,    (5.30)

where E^i_u is obtained by (5.3). In order to express the minimum operator in (5.3) as a linear constraint, we rewrite (5.3) using an auxiliary decision variable, represented by X_u, as follows:

E^i_u = X_u \cdot (1) + (1 − X_u) \cdot \left( \sum_{j ∈ par(u)} E^o_j \right),    (5.31)

in which X_u is a decision variable that is 1 when \sum_{j ∈ par(u)} E^o_j > 1 and is 0 otherwise. According to [35], the corresponding constraint for X_u can be written as follows:

\frac{\sum_{j ∈ par(u)} E^o_j − 1}{n} ≤ X_u ≤ \sum_{j ∈ par(u)} E^o_j,  X_u ∈ {0, 1},    (5.32)

in which n serves as an upper bound for \sum_{j ∈ par(u)} E^o_j. Furthermore, we use the lemma presented in [35] for linearizing the multiplication of a Boolean decision variable and a bounded real-valued variable in the second term of (5.31).
Consequently, maximizing the objective function of (5.8) with the constraints introduced in (5.16),
(5.19) to (5.22), (5.25), (5.27) to (5.32), and the lemma mentioned in [35] for linearization of the second term
of (5.31), forms an MILP yielding the optimal values for the desired variables mentioned in the beginning
of this section.
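As an illustration of how the binary assignment and ordering variables interact, here is a minimal PuLP sketch of constraints (5.25) and (5.27)–(5.29) only; the remaining constraints, the linearization of (5.31), and the objective are omitted, and the start-time variables and duration expressions are assumed to come from the LP part above.

```python
# Hedged sketch: adding the assignment/ordering core of the MILP, i.e.
# constraints (5.25) and (5.27)-(5.29), to an existing PuLP problem `prob`
# that already holds the start-time variables S[u] and duration terms D[u].
import pulp

def add_assignment_ordering(prob, n, K, Td, S, D):
    Pi = {(k, u): pulp.LpVariable(f"Pi_{k}_{u}", cat="Binary")
          for k in range(1, K + 1) for u in range(0, n + 2)}
    Y = {(k, u, v): pulp.LpVariable(f"Y_{k}_{u}_{v}", cat="Binary")
         for k in range(1, K + 1) for u in range(0, n + 1)
         for v in range(1, n + 2) if u != v}
    for k in range(1, K + 1):
        prob += Pi[k, 0] == 1           # dummy "start" slot of every processor
        prob += Pi[k, n + 1] == 1       # dummy "end" slot of every processor
    for u in range(1, n + 1):           # eq. (5.25): each task on one processor
        prob += pulp.lpSum(Pi[k, u] for k in range(1, K + 1)) == 1
    for k in range(1, K + 1):
        for u in range(0, n + 1):       # eq. (5.27): exactly one successor
            prob += pulp.lpSum(Y[k, u, v]
                               for v in range(1, n + 2) if v != u) == Pi[k, u]
        for v in range(1, n + 2):       # eq. (5.28): exactly one predecessor
            prob += pulp.lpSum(Y[k, u, v]
                               for u in range(0, n + 1) if u != v) == Pi[k, v]
        for u in range(1, n + 1):       # eq. (5.29): big-M no-overlap, M = Td
            for v in range(1, n + 1):
                if u != v:
                    prob += S[u] + D[u] - (1 - Y[k, u, v]) * Td <= S[v]
    return Pi, Y
```

The deadline Td plays the role of the big-M constant in (5.29): when Y_{k,u,v} = 0, the no-overlap constraint is effectively deactivated.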
5.4.4 Complexity Analysis
The time complexity of the proposed labeling heuristic described in Section 5.4.1 is O(|E| + |V|), where |E| denotes the number of edges in the task graph while |V| represents the number of vertices. Furthermore,
the time complexity of HEFT, which is used for obtaining the processor assignment of tasks in the labeled
graph and ordering of them on each processor for an MPSoC platform, is O(K × |E|) where K denotes
the number of processors.
5.5 Results
5.5.1 Simulation Setup
For solving the formulated MILP in Section 5.4.3 and the LP part of the proposed method in Section 5.4.2,
we use IBM ILOG CPLEX Optimization Studio [58]. The platform on which simulations are performed is
a computer with a 3.2 GHz Intel Core i7-8700 Processor and 16 GB RAM.
For obtaining energy model parameters, we employ [13] which uses a classical energy model of a 70nm
technology processor that supports 5 discrete frequencies. The frequency-independent component of processor power consumption, which is represented by δ in (5.6), is obtained as 276 mW. Each processor can
operate, independently of the other processors, at one of f1 = 1.01 GHz, f2 = 1.26 GHz, f3 = 1.53 GHz, f4 = 1.81 GHz, or f5 = 2.1 GHz. The frequency-dependent component of the power model at full switching activity is given by αf^β + γf from (5.6) and is reported as 430.9 mW, 556.8 mW, 710.7 mW, 896.5 mW, and 1118.2 mW at these frequencies, respectively. Using curve fitting, we obtain α = 23.8729, γ = 401.6654, and β = 3.2941 in (5.6).
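As a small illustration, the coefficients can be recovered from the five reported operating points with SciPy's curve_fit; the use of SciPy here is an assumption about tooling, since the dissertation does not specify the fitting procedure.

```python
# Hedged sketch: fitting alpha, gamma, beta of the frequency-dependent power
# term alpha*f**beta + gamma*f to the five reported operating points.
import numpy as np
from scipy.optimize import curve_fit

f = np.array([1.01, 1.26, 1.53, 1.81, 2.10])          # GHz
p = np.array([430.9, 556.8, 710.7, 896.5, 1118.2])    # mW at full activity

def dyn_power(f, alpha, gamma, beta):
    return alpha * f ** beta + gamma * f

(alpha, gamma, beta), _ = curve_fit(dyn_power, f, p, p0=(20.0, 400.0, 3.0))
print(alpha, gamma, beta)   # expected to be close to 23.87, 401.67, 3.29
```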
Simulations are performed for 20 task graphs randomly generated using TGFF [31], which is a randomized task graph generator widely used in the literature to evaluate the performance of scheduling
algorithms. These task graphs are named as TGFF0 to TGFF19. The number of tasks in studied random
task graphs ranges from 23 (in TGFF0) to 97 (in TGFF19). The maximum in-degree and out-degree for each
task in our randomly generated task graphs are set to 6. µu for each task is chosen uniformly from (0,1].
These tasks are scheduled on a platform with four homogeneous cores.
For each task u, the amount of workload required to produce precise results when the input is error-free is referred to as the initial workload of the task and is represented by W_u^{initial}. Therefore, W_u^{initial} = M_u + O_u. For the studied task graphs, the average value of W_u^{initial} for each task u is set to 2 × 10^6 cycles.
For each task u, based on what portion of W_u^{initial} is devoted to its base mandatory workload (M_u), we consider three cases:
1. man_low: M_u ∼ U(0.2, 0.4) × W_u^{initial} (a low portion of W_u^{initial} is the base mandatory workload).
2. man_med: M_u ∼ U(0.4, 0.6) × W_u^{initial} (a medium portion of W_u^{initial} is the base mandatory workload).
3. man_high: M_u ∼ U(0.6, 0.8) × W_u^{initial} (a high portion of W_u^{initial} is the base mandatory workload).
In each of these three cases, similar to [119], m_u is set as m_u ∼ U(0, 2 × M_u).
For a fair comparison among these three cases, each task graph uses the same random seed for all the above uniform distributions, where this random seed is different for each task graph. In all three cases, P^T_u for all tasks is chosen uniformly from [0, 1]. The average communication costs associated with the edges of the task graphs are chosen uniformly from 0.4 ms to 0.6 ms. Td of each task graph is set to twice the length of the longest path from its source task to an exit task (including communication costs), when executing the total workload along the path, including all optional workloads, at the maximum frequency.
5.5.2 Evaluating the Effect of Energy Budget on the Obtained QoS
In this section, for each of the studied task graphs, we evaluate the effect of the value of ϵmax on the obtained QoS, defined in (5.8), using the heuristic proposed in Sections 5.4.1 and 5.4.2. In order to obtain a proper value for ϵmax, we first derive the minimum energy required for scheduling the task graph within one Td without the possibility of imprecise computations. We refer to this energy value as ϵ∗. For obtaining ϵ∗, HEFT is again used to obtain the processor assignment for each task and the ordering of tasks on each processor. Then, we solve the LP which minimizes the objective function \sum_{u=1}^{n} ϵ_{task}(u), with the constraints described in (5.16), (5.19), (5.21) to (5.23), and the constraint imposing that the workload of each task u, whether a non-exit or an exit task, is executed precisely: \sum_{i=1}^{m} N_{u,i} = W_u^{initial}. By solving this LP, the ϵ∗ of each task graph is obtained. Optionally, one can use an MILP formulation to obtain ϵ∗.
For the case of imprecise computations, for each task graph, if its ϵ∗ is used as the value for ϵmax, the QoS is obtained at its maximum value (QoS = 100%, if the QoS in (5.8) is expressed as a percentage). Therefore, for each task graph, we reduce ϵmax gradually, starting from its ϵ∗ with a resolution of 0.05 × ϵ∗, and observe the QoS obtained using our proposed heuristic for each value of ϵmax.
For each task graph, the existence of a QoS ≥ 0 for a given ratio of its ϵ∗ used as ϵmax shows that our proposed heuristic can generate a feasible schedule for that task graph and ϵmax which produces that value of QoS. A feasible schedule means that at least the (extended) mandatory workloads of all tasks are completed before Td and the total energy consumption is below ϵmax. Table 5.3 presents the number of task graphs for which a feasible schedule was found, alongside the average of their obtained QoS, for each ϵmax value in the man_low, man_med, and man_high cases.
According to Table 5.3, by reducing ϵmax, we observe the sharpest drop in the obtained QoS in the man_high case, while the slowest drop in QoS is observed in the man_low case. This reflects the fact that when a lower portion of the initial task workloads is mandatory, feasible results can be achieved with lower values of ϵmax, compared to the case where a higher portion of the initial task workloads is mandatory.
For illustration, the QoS values obtained for TGFF8 for different values of ϵmax are shown in Fig. 5.3. As shown in this figure, in the man_low case, our proposed method can generate a feasible schedule for TGFF8 even when using 45% of its ϵ∗ as the value for ϵmax, while in the man_high case, it can only generate a feasible schedule for TGFF8 when ϵmax is reduced to at most 85% of its ϵ∗.
5.5.3 Evaluating the Performance of the Proposed Heuristic versus MILP
In this section, we compare the performance of the proposed heuristic in Sections 5.4.1 and 5.4.2 with the
MILP formulation presented in Section 5.4.3, in terms of their obtained QoS in different values of ϵmax.
Number of task graphs with a feasible schedule (+ the average of their obtained QoS %) for different values of ϵmax as a fraction of ϵ∗:

ϵmax/ϵ∗     1.0      0.95     0.90     0.85     0.80     0.75     0.70     0.65     0.60     0.55     0.50     0.45
man_low     20       20       20       20       20       20       20       20       20       18       9        1
            (100%)   (100%)   (99.98%) (99.89%) (99.51%) (97.92%) (93.78%) (86.64%) (76.45%) (65.50%) (55.55%) (53.34%)
man_med     20       20       20       20       19       17       6        1        0        0        0        0
            (100%)   (99.74%) (97.53%) (90.58%) (80.21%) (65.59%) (59.19%) (60.47%) -        -        -        -
man_high    20       20       19       4        0        0        0        0        0        0        0        0
            (100%)   (92.43%) (71.93%) (63.16%) -        -        -        -        -        -        -        -

Table 5.3: The number of task graphs for which a feasible schedule was found using the proposed heuristic, alongside the average of their obtained QoS, for each ϵmax value in the man_low, man_med, and man_high cases.
We perform our comparison for a case where M_u of the tasks in a task graph is chosen uniformly from 20% to 80% of W_u^{initial} (a mix of the three aforementioned cases in Section 5.5.1; we refer to this case as the
man_mixed case). For each task graph and ϵmax value, we impose a time limit of 60 minutes for MILP to
find the optimal scheduling solutions. For evaluating the performance of our proposed heuristic, we only
consider those task graphs for which MILP found the optimal solutions for each value of ϵmax within the
time limit (this comparison is actually in favor of MILP; we elaborate on this later). Using this
setup, the comparison between the proposed heuristic and MILP is shown in Table 5.4. According to this
table:
• MILP could find solutions for 8 of 20 studied task graphs within the time limit.
• QoS values found by our proposed heuristic are completely equal to those found by MILP for 5 task
graphs.
• For other task graphs, the average QoS difference found by the proposed method versus MILP for
different ϵmax values is 1.24% (up to 6.56%).
• As the number of tasks increases, MILP fails more often in finding feasible solutions.
Figure 5.3: QoS values obtained for different values of ϵmax for TGFF8 for the man_low case (solid line),
the man_med case (dashed line), and the man_high case (dotted line).
As an illustration, a comparison between the QoS values obtained for different ϵmax values is shown in Fig. 5.4 for TGFF2, the task graph for which MILP outperformed the proposed heuristic the most. As can be observed in this figure, the proposed heuristic still provides close results, with an average difference of 3.53%, compared to the optimal values obtained with MILP. Consequently, the proposed heuristic yields QoS values close to those of the optimal MILP formulation.
Comparing the runtime of the proposed heuristic and MILP, we see a clear advantage for the proposed heuristic. On the platform on which we performed the simulations, the average runtime of the proposed heuristic for each task graph and ϵmax value was around 100× lower than that of MILP.
Table 5.4: Comparison between the QoS values obtained with the proposed heuristic and MILP for each task graph from TGFF0 to TGFF19. Across the 20 task graphs, MILP matched the proposed heuristic on 5 graphs, outperformed it on 3 graphs, and timed out on the remaining 12 graphs.
Figure 5.4: Comparison between QoS values obtained for different values of ϵmax for TGFF2 using the
proposed heuristic versus MILP.
This is without even counting the cases in which MILP did not find the optimal solution within the time limit. For many real-world applications, task graphs can have a larger number of nodes and more complex interdependencies than the studied task graphs, and the runtime of MILP for those task graphs can grow exponentially. Therefore, employing the proposed heuristic, which provides close approximations to MILP, can be an efficient alternative.
In order to study the effect of the number of processors on the results achieved using MILP, we repeated the experiments presented in Table 5.4, this time with eight processors instead of four. Based on the obtained results, while MILP still could not find the optimal solution within the time limit in many cases, the number of timeouts was reduced by around 20% compared to the 4-processor case. This is due to the fact that as the number of processors increases, more scheduling opportunities are available for allocating tasks to processors. This in turn increases the chances for MILP to find the optimal solutions within the time limit. For example, for TGFF15, in contrast to the 4-processor case, MILP found the optimal solutions for all feasible ϵmax values in the 8-processor case. For this task graph, the QoS values found by the proposed heuristic are still close to those found by MILP (average QoS difference of 3.46%).
5.5.4 Evaluating the Effect of the imp_label Algorithm
In order to evaluate the effect of the imp_label algorithm presented in Section 5.4.1, we compare the results obtained from our proposed heuristic with a baseline approach in which we feed the non-exit tasks with their initial workloads (W^{initial}) to the scheduling method presented in Section 5.4.2 and assign as many processor cycles as possible to the exit tasks in order to maximize QoS. Therefore, in the baseline approach, we solve the same LP as the one formulated in Section 5.4.2; however, the constraint in (5.17) for a non-exit task u is transformed into the following constraint:

\sum_{i=1}^{m} N_{u,i} = W_u^{initial},    (5.33)

and the constraint in (5.18) for an exit task u is transformed into the following constraint:

M_u ≤ \sum_{i=1}^{m} N_{u,i} ≤ M_u + O_u.    (5.34)
Table 5.5 presents a comparison between the proposed heuristic and the baseline approach in terms of the number of task graphs for which a feasible schedule was found, alongside the average of their obtained QoS, for each ϵmax value. The mandatory portion of the initial workload of the tasks is set based on the man_mixed case, similar to Section 5.5.3. According to Table 5.5, using the baseline approach, the QoS drops more quickly than with the proposed heuristic when the energy budget is reduced. In particular, as observed for the studied task graphs, the QoS for all task graphs immediately drops from 100% as soon as ϵmax is reduced below ϵ∗. However, in the corresponding man_mixed case of our proposed heuristic, the QoS can be maintained at 100% even for values lower than ϵ∗. Furthermore, for each task graph, the minimum ϵmax with which our proposed heuristic can generate a feasible schedule for that task graph is lower in comparison to the baseline approach. For those ϵmax values for which both the proposed heuristic and the baseline approach can provide a feasible schedule, the QoS values obtained with our proposed heuristic are on average 13.54% (up to 47.34%) higher than the QoS values obtained with the baseline approach. As an example, the QoS values obtained using the proposed heuristic and the baseline approach for TGFF8 are shown in Fig. 5.5.
5.6 Conclusion
In this chapter, we have delved into the intriguing realm of imprecise computations, exploring their applications and the advantages they offer in real-time systems. Imprecise computations provide the flexibility
to trade off output quality with resource utilization, making them particularly relevant in scenarios where
timely delivery of approximate results takes precedence over exact results that may arrive late. Various
real-world applications, from video streaming to self-driving cars and iterative algorithms like Newton-Raphson root finding, benefit from imprecise computations to optimize resource allocation, energy consumption, and response times.
Imprecise computations are characterized by mandatory and optional workloads, with mandatory
workloads ensuring minimum quality within hard deadlines and optional workloads contributing to improved results. Quality of Service (QoS) is evaluated as a function of the processor cycles assigned to
optional workloads, creating a balance between resource allocation and output quality.
The core contribution of this work is a heuristic for scheduling task graphs with potentially imprecise computations, aiming to maximize QoS while adhering to hard deadlines and energy constraints. This heuristic accounts for interdependencies between tasks, recognizing that the imprecise output of one task affects the quality of its child tasks. Moreover, this work introduces a mixed-integer linear programming (MILP) formulation of the problem, enabling a comparison between the proposed heuristic and optimal solutions.
Our results demonstrate that the proposed heuristic effectively balances QoS and energy constraints in
scheduling, and in some cases, it matches the QoS levels achieved by MILP. Even when MILP outperforms
the heuristic, the QoS values obtained with the proposed approach are remarkably close to optimal, with
an average deviation of just 1.24%. Additionally, the proposed heuristic substantially reduces runtime,
making it a practical choice for real-time systems. In general, our results show the effectiveness of our proposed heuristic in terms of obtaining promising QoS values even with low energy budgets.

Number of task graphs with a feasible schedule (+ the average of their obtained QoS %) for different values of ϵmax as a fraction of ϵ∗:

ϵmax/ϵ∗                  1.0      0.95     0.90     0.85     0.80     0.75     0.70     0.65
The proposed heuristic   20       20       20       20       19       16       5        1
                         (100%)   (99.39%) (95.33%) (86.02%) (75.57%) (64.64%) (60.15%) (64.41%)
The baseline approach    20       20       20       20       11       5        0        0
                         (100%)   (93.16%) (81.79%) (68.15%) (61.98%) (56.96%) -        -

Table 5.5: The number of task graphs for which a feasible schedule was found using the proposed heuristic versus the baseline approach, alongside the average of their obtained QoS, for each ϵmax value in the man_mixed case.
Appendix A
The minimum value of the following expression:

A = \sum_{i=1}^{b} o_i + m_c × \min\left( 1, \sum_{i=1}^{b} \left( 1 − \frac{o_i}{O_i} \right) \right),

in which each o_i is a variable with 0 ≤ o_i ≤ O_i for all i, and the other parameters are constant values, is obtained as follows:
• If m_c ≤ \sum_{i=1}^{b} O_i, we set o_i = 0 for each i, and thus A∗ = m_c.
• Otherwise, we set o_i = O_i for each i, and thus A∗ = \sum_{i=1}^{b} O_i.
Proof:
The maximum value of the second term in A is m_c. Hence:
• If ∃i : O_i ≥ m_c, we set o_i = 0 for each i, and thus A∗ = m_c.
• Else:
1. Starting from o_i = 0 for all i, we have A = m_c.
2. By gradually increasing the o_i values from zero, while \sum_{i=1}^{b} (1 − o_i/O_i) > 1, the min in the second term of A still yields 1. Therefore, A increases as the o_i values increase, following A = \sum_{i=1}^{b} o_i + m_c.
3. When \sum_{i=1}^{b} (1 − o_i/O_i) reaches 1, the min in the second term of A now gives \sum_{i=1}^{b} (1 − o_i/O_i), and A can be rewritten as follows:

A = \sum_{i=1}^{b} o_i × \left( 1 − \frac{m_c}{O_i} \right) + b × m_c.

As there is no i for which O_i ≥ m_c, we have (1 − m_c/O_i) < 0 for all i, so A decreases as the o_i values increase, down to the point where o_i = O_i for all i, at which A = \sum_{i=1}^{b} O_i.
4. Therefore, based on (1) and (3), if m_c ≤ \sum_{i=1}^{b} O_i, we set o_i = 0 for all i (A∗ = m_c). Otherwise, we execute all the optional workloads, i.e., o_i = O_i for all i (A∗ = \sum_{i=1}^{b} O_i).

Figure 5.5: QoS values obtained for different values of ϵmax for TGFF8 using the proposed heuristic (solid line) and the baseline approach (dashed line) for the man_mixed case.
Chapter 6
Energy-aware scheduling of jobs in heterogeneous cluster systems using
deep reinforcement learning
6.1 Introduction
Energy efficiency in cluster systems is an important design factor, as it not only can reduce the operational
electricity cost, but also can increase system reliability. Furthermore, these platforms are becoming more
popular for many computing-intensive real-time applications such as image or signal processing, weather
forecasting, and so forth [39, 118, 156]. A major portion of this trend is due to rapid progress in computing
power of commodity hardware components and their relatively low cost [156]. Therefore, developing
scheduling strategies that achieve promising performance metrics for real-time workloads while yielding
low energy costs is of great necessity.
Traditionally, the majority of these scheduling problems are solved using carefully designed heuristics, as they are usually combinatorial NP-hard problems [79]. There are several works in the literature addressing the energy-aware scheduling problem for heterogeneous clusters [118, 156, 72, 147]. The authors in [147] propose energy-aware task scheduling solutions on DVS-enabled heterogeneous clusters based on an iterated local search method (DVS: dynamic voltage scaling). In [156], the authors present an adaptive energy-aware scheduling of jobs on heterogeneous clusters with the goal of making the best trade-offs between energy conservation and admissions of subsequently arriving tasks. Generally, the main approach in these studies is developing clever heuristics that have performance guarantees under certain conditions, which in some cases is followed by further testing and tuning to obtain better performance in practice.
Inspired by recent advances in employing reinforcement learning (RL) for addressing resource management problems, in this chapter we examine building intelligent systems which learn on their own to achieve energy-aware scheduling strategies, as an alternative to using manually-tuned heuristics. While a major portion of successful machine learning techniques falls into the category of supervised learning, in which a mapping from training inputs to outputs is learned, supervised learning is not applicable to most combinatorial optimization problems, such as nontrivial scheduling problems, as optimal labels are not available due to the inherent NP-hardness of most of these problems in nontrivial settings. However, one can evaluate the performance of a set of solutions using a verifier and provide feedback to a learning algorithm. Consequently, approaching a combinatorial optimization problem using an RL paradigm could be promising [10].
In general, an RL agent starts out knowing nothing about the task at hand and improves itself based on how well it is doing in the system. In particular, we approach the problem with the help of deep RL. A high-level view of how deep RL works is shown in Fig. 6.1. In each step i, the deep RL agent observes a state s_i and performs an action a_i. This action is sampled from a probability distribution over the action space, where this distribution is obtained by the underlying neural network with parameters θ given the state s_i as its input, and is referred to as the policy of the deep RL agent, denoted by π_θ(s, a), where π_θ(s, a) is the probability that action a is taken in state s. Therefore, π_θ(s, a) ∈ [0, 1]. θ is referred to as the policy parameters of the agent. Following the action a_i, the system state changes to s_{i+1} and a reward r_{i+1} is given to the agent. The agent only has control over which action it takes, and not over the obtained reward or the state transition. During training, by performing a series of interactions with the environment, the parameters of the underlying neural network are adjusted with the goal of improving the policy and maximizing the expected cumulative discounted reward E[\sum_{i=0}^{∞} γ^i r_i], in which γ ∈ (0, 1] is the discount factor representing how much the agent cares about future rewards. RL has recently been combined with deep neural networks to be effective in applications with a large space of state and action pairs. In those applications, storing the policy in tabular form is no longer feasible, and function approximators with tunable parameters, such as deep neural networks, are commonly used [81, 85, 84].
The main motivation for the proposed method compared to prior work in energy-aware scheduling
for heterogeneous clusters is that the proposed Deep-EAS agent starts from knowing nothing about the
scheduling task at hand, and learns nontrivial scheduling policies by modeling the different aspects of the
system such as the arrival rate, duration and resource-demand profile of incoming jobs, current occupation
state of servers and energy profile of using each one for scheduling any of the waiting jobs, and so forth.
The obtained scheduling strategy can be employed in an online scheduling environment and be efficient
under varying workload conditions as we see in Section 6.3.
The proposed method in this chapter uses notions similar to those used in [79], which, to our knowledge, is the first successful attempt that addresses the conventional problem of scheduling multi-resource-constrained jobs in clusters solely using deep RL. However, [79] does not consider the heterogeneity of computing machines in terms of their energy profiles in the cluster and thus does not examine energy awareness in its proposed scheduling solution. There are challenges associated with crafting the rewards function in the RL formulation so that the scheduling solution is energy-aware, which are explained in detail in Section 6.2.2. Furthermore, [79] assumes that the duration of incoming jobs is known upon arrival. However, in a realistic scenario, uncertainties can occur due to mispredictions of the workloads [72]. Therefore, the proposed method also takes into account the uncertainties associated with the workloads of arriving jobs.
Consequently, in this chapter, using the deep RL paradigm, we present Deep-EAS, an online energy-aware scheduler for cluster systems that have multiple machines with heterogeneous energy profiles. The detailed model of the underlying cluster system and the associated RL formulation are presented in Section 6.2.1 and Section 6.2.2, respectively. In Section 6.2.3, a detailed explanation of how Deep-EAS is trained is presented. In Section 6.3, we compare Deep-EAS with comparable heuristics under varying workload conditions and examine the situations where using Deep-EAS is advantageous compared to manual heuristics. Finally, Section 6.4 concludes the chapter.
6.2 Method
6.2.1 Cluster Model and the Objective Function
We consider a cluster with K heterogeneous machines in terms of different energy profiles. Each machine
is comprised of N processors that can serve the jobs requiring multiple processors for their execution.
Jobs arrive to the system in an online fashion in discrete timesteps. Energy profile of machine k for job j
is shown by ej,k, which represents the normalized energy consumption of one processor in machine k for
job j in one timestep, if that processor is invoked for execution of the job in that timestep. The number of
processors required for the execution of job j is represented by n_j. For determining the actual duration of job j on machine k, represented by d_{j,k}, similar to [72], we assume that the duration profile is known in advance only as a probability distribution such as a normal distribution, i.e., d_{j,k} ∼ N(µ_{j,k}, σ^2_{j,k}), where µ_{j,k} and σ^2_{j,k} represent the expected value and variance of d_{j,k}, respectively. We assume σ^2_{j,k} to be a ratio of µ_{j,k}, i.e., σ^2_{j,k} = µ_{j,k}/c, where the coefficient c reflects the accuracy of the workload estimator for incoming jobs. For each job j, µ_{j,k} on machines with higher performance (operating frequency), and correspondingly higher energy profiles, is lower than on machines with lower energy profiles. For instance, if for a job j and two machines in the cluster, such as machine 0 and machine 1, we have e_{j,0} > e_{j,1}, then we will have µ_{j,0} < µ_{j,1}.
Figure 6.1: A high-level view of the reinforcement learning with the policy represented by a deep neural network.
The scheduler, in each discrete timestep, selects and assigns a number of jobs to machines from a
queue of waiting jobs. Πj represents the machine that job j is assigned to by the Deep-EAS agent (0 ≤
Πj < K). A job j assigned to machine k is executed until the end of dj,k. Furthermore, nj processors are
allocated continuously for the entire execution span of job j. As we will see in Section 6.3.6, even with
these assumptions, the Deep-EAS agent provides nontrivial solutions that are advantageous compared to
manually-tuned heuristics, especially in heavy load conditions.
As both energy and performance should be addressed, we aim to optimize the average normalized energy-delay product of the arriving jobs. The normalized delay of job j is represented by D^{norm}_j and is calculated as D^{norm}_j = D_j / µ_{j,∗}, in which D_j represents the time from the arrival of job j until its execution completion and departure from the system (including the waiting time of the job in the waiting queue), and µ_{j,∗} represents the minimum µ_{j,k} among all machines. Normalizing D_j prevents biasing the solution towards longer jobs. The normalized energy consumption associated with the complete execution of job j is represented by

E_{j,Π_j} = n_j × e_{j,Π_j} × \frac{d_{j,Π_j}}{µ_{j,∗}}.    (6.1)

Therefore, our scheduling goal is minimizing E[E_j × D^{norm}_j], where the expectation is calculated over all jobs in the job arrival sequence.
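As a small illustration of this metric, the following is a minimal sketch of the per-job normalized energy-delay product; the helper and its field names are hypothetical, not part of the dissertation's code.

```python
# Hedged sketch: per-job normalized energy-delay product used as the
# scheduling objective; the arguments are hypothetical placeholders.
def normalized_energy_delay(n_j, e_jk, d_jk, delay_j, mu_j_star):
    """n_j: processors used; e_jk: per-processor, per-timestep energy on the
    chosen machine; d_jk: actual duration there; delay_j: arrival-to-departure
    time; mu_j_star: minimum expected duration of the job over all machines."""
    E_j = n_j * e_jk * (d_jk / mu_j_star)        # eq. (6.1)
    D_norm_j = delay_j / mu_j_star               # normalized delay
    return E_j * D_norm_j
```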
6.2.2 Deep RL Formulation for Deep-EAS Agent
6.2.2.1 State Space
For state representation of the system in each timestep, we represent the current occupation state of machines and the resource-demand and average duration profile of the jobs in the waiting queue as binary
matrices. Fig. 6.2 illustrates a sample state of a system with two machines and a waiting queue of size five.
The matrices corresponding to the machines, shown on the left side of Fig. 6.2, represent the occupation
state of these machines from the current timestep until H timesteps ahead in the future. For instance in
Fig. 6.2, two jobs are scheduled on machine 1. For the sake of argument, we refer to these jobs as job 0 and job 1. Job 0 uses two processors for the next d_{0,1} timesteps, where d_{0,1} ∼ N(4, 4/c), and job 1 uses two processors for the next d_{1,1} timesteps, where d_{1,1} ∼ N(2, 2/c).
Figure 6.2: An illustrative example of the state representation of the cluster system in the middle of the job arrival process (sample parameters: N = 10 processors per machine, K = 2 machines, time horizon H = 6, waiting queue size Q = 5, B = 6, L = 6).
Furthermore, the matrices corresponding to the jobs in the waiting queue, shown in the middle of Fig. 6.2, represent the resource-demand and average duration profiles of the jobs in the waiting queue on each of the machines. The average duration profile of each job differs across machines. For instance, in Fig. 6.2, where we have two heterogeneous machines, we see two different duration profiles for each job in the waiting queue. In this case, the duration profile of each job on machine 1 is higher compared to the duration profile of that job on machine 0, which conveys the fact that, in this sample system, machine 1 is the machine with the lower performance (operating frequency) and lower energy profile compared to machine 0. The size of the queue is represented by Q.
In order to have a finite state representation, we maintain the binary matrices corresponding to the resource-demand and average duration profiles of waiting jobs only for the first Q jobs that have arrived at the system and have not yet been scheduled on any of the machines. For any further jobs, we incorporate only their count in the state of the system. We use a binary vector b to represent these backlog jobs, in which the number of 1s represents the count of backlog jobs. Furthermore, to make the scheduler aware of the arrival rate of incoming jobs, we track the number of discrete timesteps since the last job arrived at the system. We use a binary vector l in which the number of 1s represents the count of timesteps since the arrival of the last new job. The lengths of the vectors b and l, shown on the right side of Fig. 6.2, are represented by B and L, respectively, and should be large enough that these vectors do not get exhausted. B and L are also chosen to be integer multiples of the time horizon H, as we want to tile the vectors b and l in chunks of size H so that we have a rectangular state matrix. Consequently, the height and width of the binary state matrix, which incorporates all the mentioned information, are obtained as H and N × (1 + Q) × K + B/H + L/H, respectively.
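To make the dimensions concrete, the following is a minimal sketch (with a hypothetical helper name) that computes the state-matrix shape from these parameters; with the values used later in Section 6.3 (N = 10, Q = 10, K = 2, H = 30, B = 90, L = 30) it yields 30 × 224.

```python
# Hedged sketch: shape of the binary state matrix described above.
def state_shape(N, Q, K, H, B, L):
    assert B % H == 0 and L % H == 0, "b and l are tiled in chunks of height H"
    height = H
    width = N * (1 + Q) * K + B // H + L // H
    return height, width

print(state_shape(N=10, Q=10, K=2, H=30, B=90, L=30))   # (30, 224)
print(state_shape(N=10, Q=5,  K=2, H=6,  B=6,  L=6))    # (6, 122) for the Fig. 6.2 example
```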
6.2.2.2 Action Space
In each timestep, the scheduler can potentially select one or more jobs from the waiting queue of size Q and assign them to any of the K machines. Therefore, the size of the combined action space would become exponentially large with respect to Q and K. In order to reduce the action space size, similar to [79], we decouple the decision steps of the Deep-EAS agent from real timesteps and allow the agent to take multiple actions in a single timestep. The new action space is associated with selecting one of the waiting jobs and assigning it to one of the K machines. Therefore, the size of the action space is reduced to Q × K. Specifically, we define the action k × Q + q as “assign the job in the q-th slot of the waiting queue to machine k”, where 0 ≤ q < Q and 0 ≤ k < K. We define the action K × Q as the “hold” action. Upon taking this action, the agent does not schedule any further jobs in the current timestep (by including the hold action, the actual action space size becomes Q × K + 1 instead of Q × K). In each timestep, the scheduler can take multiple actions until it chooses the hold action or an invalid action. The selected action is invalid if there is no job in the q-th slot of the waiting queue, or if the selected job does not fit on the selected machine from the current timestep looking ahead into the next H timesteps, based on the average estimates available for the jobs occupying the underlying machines.
By choosing each valid action, the corresponding job in the queue is assigned to the selected machine
starting from the earliest possible timestep on that machine, and a job from the backlog queue (if any)
is dequeued and replaces the job that has just been scheduled. However, by choosing an invalid action
or the hold action, the time actually goes on and the state matrix shifts up by one row and jobs which
have finished their execution, according to their actual duration sampled from the corresponding normal
distribution, depart from the system. Therefore, jobs may depart from the system sooner or later than
their average estimates. This will either advance or postpone the actual start time of other jobs waiting
for resources to get freed up. Furthermore, when the actual time proceeds, any new jobs may arrive to the
system, depending on the job arrival process. If any new job arrives, vector l resets to an all-zero vector.
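A minimal sketch of the action encoding just described (index = k × Q + q, with index K × Q reserved for the hold action); the helper name is illustrative.

```python
# Hedged sketch: decoding a flat action index into (machine, queue slot) or
# the hold action, following the encoding action = k * Q + q.
from typing import Optional, Tuple

def decode_action(action: int, Q: int, K: int) -> Optional[Tuple[int, int]]:
    """Returns (machine k, queue slot q), or None for the hold action."""
    if action == K * Q:          # the extra index is reserved for "hold"
        return None
    k, q = divmod(action, Q)
    assert 0 <= k < K and 0 <= q < Q
    return k, q

print(decode_action(13, Q=10, K=2))   # (1, 3): slot 3 -> machine 1
print(decode_action(20, Q=10, K=2))   # None: hold
```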
6.2.2.3 Rewards Function
One challenge in defining the rewards function for our problem is that, for the jobs in the waiting queue and the backlog, in the timesteps before they actually get assigned to one of the underlying machines, we know their contribution to the average delay of jobs (one for every timestep they are still in the system). However, we do not know their corresponding E_j before they get assigned to one of the machines. In order for the cumulative rewards function to correlate with our objective, the normalized energy-delay product, we need to weight each timestep that each job j is still in the system by E_j/µ_{j,∗} (see Section 6.2.1). To address this issue, we define the rewards function in each timestep as follows (no reward is given for intermediate actions of the scheduler agent during a timestep; a reward is only granted after the actual time proceeds):
−\left( \sum_{j ∈ J_p} \frac{E_j}{µ_{j,∗}} + \sum_{j ∉ J_p} \frac{E^∗_j}{µ_{j,∗}} + \sum_{j ∈ J_{new}} \frac{δ^{correct}_j}{µ_{j,∗}} \right).    (6.2)
The breakdown of the three terms of (6.2) is as follows:
First term: J_p represents the set of jobs currently scheduled on any of the machines. For each job j in this set, we know the energy consumption associated with the execution of job j, E_j.
Second term: For each job j which is not currently scheduled on any of the machines, we do not yet know its energy consumption. For such a job j, we temporarily assume that we will eventually assign it to the machine that yields the minimum energy consumption for its execution, and we represent this value by E^∗_j. We will correct this assumption using the third component of the rewards function, explained next, when we eventually assign the job to one of the underlying machines.
Third term: J_new represents the set of jobs that have “just” been scheduled on a machine in the current timestep. For each job j in J_new, we have used the second term of (6.2) during the previous timesteps since the job arrived at the system. In case the currently assigned machine of job j is not the machine that yields the lowest energy consumption for job j, which was our temporary assumption in the second component of (6.2), we correct for this and add the amount of the difference over the previous timesteps to the rewards function. This amount for such a job j is represented by δ^{correct}_j = (E_j − E^∗_j) × |∆t|, where |∆t| represents the number of timesteps from the time the job arrived at the system until the current timestep (|∆t| is just a count of timesteps and is itself unitless).
Consequently, using the discount factor γ = 1, the cumulative rewards function (6.2) over all timesteps yields the (negative) total of the normalized energy-delay product over all the jobs, and maximizing this cumulative reward results in minimizing the total, and thus the average, of the normalized energy-delay product over all the jobs.
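The following is a minimal sketch of this per-timestep reward under the sign convention of (6.2) as reconstructed above; the job bookkeeping (the sets J_p and J_new, and the per-job fields) is a hypothetical stand-in for the simulator's data structures.

```python
# Hedged sketch: per-timestep reward of eq. (6.2); every term penalizes time
# spent in the system, weighted by (actual or optimistic) normalized energy.
def timestep_reward(scheduled, waiting, just_scheduled):
    """scheduled: jobs currently placed on a machine (J_p), each with fields
    E (energy on its machine) and mu_star; waiting: jobs not yet placed, each
    with E_star (energy on its cheapest machine) and mu_star; just_scheduled:
    jobs placed this timestep (J_new), each with E, E_star, mu_star, and
    wait_steps (timesteps spent in the system before being placed)."""
    r = 0.0
    for j in scheduled:
        r -= j.E / j.mu_star
    for j in waiting:
        r -= j.E_star / j.mu_star           # optimistic placeholder energy
    for j in just_scheduled:
        delta_correct = (j.E - j.E_star) * j.wait_steps
        r -= delta_correct / j.mu_star      # retroactive correction
    return r
```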
6.2.3 Training Deep-EAS
For training the Deep-EAS agent, we need to adjust the policy parameters of its underlying deep neural
network (see Fig. 6.1). Similar to [81], we use policy gradients in which we learn by employing gradient
descent on the policy parameters. To use gradient descent, we need the gradient of the expected cumulative discounted reward, E[\sum_{i=0}^{∞} γ^i r_i], which is our objective function. This gradient is obtained using the REINFORCE equation [136]:
∇_θ E_{π_θ}\left[ \sum_{i=0}^{∞} γ^i r_i \right] = E_{π_θ}\left[ R_{π_θ}(s, a) \cdot ∇_θ \log π_θ(s, a) \right],    (6.3)

where R_{π_θ}(s, a) represents the expected cumulative discounted reward if we choose action a in state s and follow the policy π_θ afterwards. In policy gradients, the main idea is that, in each training iteration, we approximate the gradient in (6.3) by evaluating trajectories of executions obtained by following the policy of that iteration. Specifically, for training Deep-EAS using policy gradients, in each training iteration we draw a number of execution trajectories sampled from π_θ for a sample job arrival sequence. Each execution trajectory (episode) terminates when all the jobs in the sequence finish their execution (or a predefined maximum trajectory length is reached). To train a generalized policy, we use multiple sample job arrival sequences in each training iteration (S sequences), and we perform M trajectories of execution for each sequence until trajectory termination. Using these trajectories, we approximate (6.3) as follows:
∇_θ E_{π_θ}\left[ \sum_{i=0}^{∞} γ^i r_i \right] ≈ \frac{1}{S \cdot M} \sum_{s=1}^{S} \sum_{m=1}^{M} \sum_{t} ∇_θ \log π_θ(s^{s,m}_t, a^{s,m}_t) \, v^{s,m}_t,    (6.4)
in which v^{s,m}_t is the empirically computed cumulative discounted reward and serves as an unbiased estimate of R_{π_θ}(s^{s,m}_t, a^{s,m}_t) (the superscripts s and m refer to the m-th trajectory of the s-th sample job arrival sequence). Using this approximation, we update the policy parameters in each iteration via the following equation:
θ ← θ + \frac{α}{S \cdot M} \sum_{s=1}^{S} \sum_{m=1}^{M} \sum_{t} ∇_θ \log π_θ(s^{s,m}_t, a^{s,m}_t) \left( v^{s,m}_t − b^{s}_t \right).    (6.5)
In (6.5), α indicates the learning rate of the training algorithm. Furthermore, we subtract a baseline value b^{s}_t from v^{s,m}_t, which helps reduce the variance of the policy gradients. Without subtracting the baseline, the gradient estimates obtained using (6.4) can have high variance [106]. For calculating b^{s}_t, the average of v^{s,m}_t at the same timestep t over all trajectories (m = 1, 2, ..., M) of the job sequence s is used.
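As an illustration of the update in (6.5), the following is a minimal PyTorch-style sketch of one REINFORCE step with the time-based baseline; the dissertation does not show its training code, so the data layout and framework here are assumptions.

```python
# Hedged sketch: one REINFORCE update with a per-timestep baseline, eq. (6.5).
# episodes[s] is a list of (log_probs, rewards) pairs, one per trajectory m of
# job-arrival sequence s; log_probs are torch scalars carrying grad history.
import torch

def reinforce_step(optimizer, episodes, gamma=1.0):
    losses = []
    for trajectories in episodes:                       # one job sequence s
        # Discounted returns v_t for every trajectory of this sequence
        returns = []
        for log_probs, rewards in trajectories:
            v, G = [], 0.0
            for r in reversed(rewards):
                G = r + gamma * G
                v.append(G)
            returns.append(list(reversed(v)))
        # Baseline b_t^s: average return at timestep t across the M trajectories
        T = max(len(v) for v in returns)
        baseline = [sum(v[t] for v in returns if t < len(v)) /
                    sum(1 for v in returns if t < len(v)) for t in range(T)]
        for (log_probs, _), v in zip(trajectories, returns):
            for t, lp in enumerate(log_probs):
                losses.append(-lp * (v[t] - baseline[t]))  # ascent on (6.5)
    optimizer.zero_grad()
    torch.stack(losses).sum().backward()
    optimizer.step()
```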
6.3 Evaluation
6.3.1 Cluster Setup
We use an instance of the cluster system described in Section 6.2 and shown in Fig. 6.2 with the following
parameters: K = 2, N = 10, H = 30t (t represents the duration of one timestep), Q = 10, B = 90,
and L = 30. Jobs arrive to the system in an online fashion according to a Bernoulli process with the
arrival rate of λ. The length of each job arrival sequence is set to 60t (new jobs can arrive until timestep 60; however, the experiment goes on until all jobs remaining in the system beyond 60t finish their execution). The resource requirement of each arriving job is chosen uniformly between 1 and 10 processors. In our sample cluster model, we consider machine 0 as the higher-performance machine and machine 1 as the lower-performance machine. In particular, we consider the operating frequency of machine 0 to be twice the operating frequency of machine 1. Therefore, for each job j, we have µj,1 = 2µj,0. Furthermore,
we consider each job arrival sequence to be a combination of short-duration and long-duration jobs. The
probability that an arriving job is a short job is indicated with β. µj,0 for each short job j is chosen
uniformly between 1t and 3t, while µj,0 for each long job j is chosen uniformly between 10t and 15t.
The coefficient c, which was introduced in Section 6.2.1 and reflects the accuracy of the workload estimator for incoming jobs, is set to 4. We will examine the efficiency of Deep-EAS for different values of λ, β, and c in Sections 6.3.4 and 6.3.7.
6.3.2 Energy Model
We consider both machines to be always “on" during the experiment. This means that the energy consumption due to static power consumption of the system serves as an additive factor to the total energy
consumption of the system during the experiment. Therefore, the ratio between ej,0 and ej,1 for each job
j needs to reflect the ratio between the dynamic energy consumption of the processor on machine 0 and
machine 1 in one timestep. By employing the power model presented in [36] and [150], dynamic power
consumption of a processor can be modeled by x_j f^y, in which x_j is a coefficient depending on the average switched capacitance and the activity factor of job j, f is the processor operating frequency, and y is the technology-dependent dynamic power exponent. Therefore, for each job j we have e_{j,0}/e_{j,1} = (f_0/f_1)^y. Using a classical energy model of a 70nm technology processor that supports 5 discrete frequencies ranging from 1 GHz to 2 GHz, whose accuracy has been verified by SPICE simulation, [36] proposes the value of y as 3.2941. Therefore, by setting f_0 = 2 GHz and f_1 = 1 GHz (the operating frequencies of our machines) and using this value for y, for each job j we have e_{j,0}/e_{j,1} = 2^{3.2941} = 9.809. While the actual e_{j,k} values for the jobs differ from each other due to the different x_j of each job j, the ratio between e_{j,0} and e_{j,1} for each job j remains the same. Therefore, we use the normalized values e_{j,0} = 9.809 and e_{j,1} = 1 for each job j. It should be noted that while the energy model based on a 70nm technology is employed
for each job j. It should be noted that while the energy model based on a 70nm technology is employed
here, the proposed method is capable of dealing with a general, parameterized power model. Therefore, in
a smaller technology node, one can find the corresponding coefficients and exponents, and use them for
finding scheduling solutions.
6.3.3 Deep-EAS Training Setup
For the underlying neural network of the Deep-EAS agent, the size of the input is a 30 × 224 binary matrix, given the values used for the parameters of the cluster model in Section 6.3.1. We apply a convolutional layer to extract features from this matrix. We use eight 3 × 3 filters with a stride of 2 (in both the height and width directions), followed by the ReLU activation function. After this layer, we use a fully connected layer of size 21 followed by the softmax activation function (the action space size for our cluster model is 10 × 2 + 1). We train Deep-EAS as described in Section 6.2.3 using 150 different job arrival sequences for 1000 training iterations. In each training iteration, we evaluate 20 different execution trajectories for each job sequence. For updating the policy parameters, we use the Adam optimizer [67] with a learning rate of 0.001.
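A minimal PyTorch sketch of a policy network with this structure follows; the exact padding, initialization, and framework used in the dissertation are not specified, so those details are assumptions.

```python
# Hedged sketch: a policy network matching the described structure (one
# 8-filter 3x3 conv with stride 2, ReLU, then a fully connected softmax layer
# over the 21 actions); no padding and default initialization are assumed.
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    def __init__(self, height=30, width=224, n_actions=21):
        super().__init__()
        self.conv = nn.Conv2d(1, 8, kernel_size=3, stride=2)
        conv_h = (height - 3) // 2 + 1
        conv_w = (width - 3) // 2 + 1
        self.fc = nn.Linear(8 * conv_h * conv_w, n_actions)

    def forward(self, state):            # state: (batch, 1, height, width)
        x = torch.relu(self.conv(state))
        x = x.flatten(start_dim=1)
        return torch.softmax(self.fc(x), dim=-1)   # policy pi_theta(s, .)

policy = PolicyNet()
probs = policy(torch.zeros(1, 1, 30, 224))
action = torch.multinomial(probs, 1)     # sample an action from the policy
optimizer = torch.optim.Adam(policy.parameters(), lr=0.001)
```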
6.3.4 Results
As a standard manually-tuned heuristic to compare the proposed Deep-EAS agent with, we choose an
energy-aware shortest job first (ESJF) agent. ESJF, in each timestep, schedules the job that yields the
lowest normalized energy-delay product to its corresponding machine (according to the available average
estimates of the duration of jobs in the waiting queue). ESJF keeps doing this until no job is left in the waiting queue or no further jobs can be scheduled on any of the machines in that timestep (due to the occupancy state of the machines). In that case, time proceeds and ESJF repeats this procedure. This process continues until all jobs in the job sequence finish their execution.
Fig. 6.3 presents a comparison between Deep-EAS, evaluated on 150 new jobsets (not seen during training), and ESJF, for different job arrival rates when β = 0.5 (the probability that a new job is a short job is equal to the probability that it is a long job). As presented in Fig. 6.3, the average normalized energy-delay product values obtained by both Deep-EAS and ESJF generally increase with the job arrival rate. Deep-EAS is comparable with ESJF for low arrival rates (e.g., for λ = 0.1 and λ = 0.3). However, Deep-EAS proves to be considerably advantageous at higher arrival rates. For instance, for λ = 0.9, the average normalized energy-delay product obtained from Deep-EAS is 42.88% lower in comparison with ESJF.
6.3.5 Deep-EAS Training Curve and Overhead Analysis
The training curve of Deep-EAS over 1000 iterations and the average normalized energy-delay product achieved after each iteration are presented in Fig. 6.4, for the case where the job arrival rate is 0.7 and β = 0.5. The average normalized energy-delay product obtained using ESJF is also shown in Fig. 6.4 as a reference. As indicated in Fig. 6.4, Deep-EAS starts out acting poorly in the environment, but quickly improves itself over the training iterations, surpassing ESJF after the first 30 iterations and improving further beyond that. To reduce the training time, in each training iteration we executed the trajectories of each job sequence in parallel on a platform with four 3.2 GHz Intel Core i7-8700 CPUs and 64 GB RAM. On this platform, each training iteration took about 97 seconds on average.
Note that the training of the Deep-EAS agent is done before the deployment of the Deep-EAS agent in the system. In other words, we do not suffer from the training overhead during the actual scheduling. The time overhead associated with each scheduling decision instead corresponds to the inference latency of the trained policy network. In our experimental setup, this scheduling overhead was negligible compared to the duration of the timestep t (the time interval between scheduling decisions), where t is usually on the order of a few milliseconds.
Figure 6.3: Comparison of Deep-EAS and ESJF at different job arrival rates, when β = 0.5.
147
to the duration of timestep t (the time interval between scheduling decisions), where t is usually in the
order of a few milliseconds.
6.3.6 Analyzing Why Deep-EAS is Advantageous
The main advantage of Deep-EAS is that it can develop nontrivial scheduling solutions during training,
which are not necessarily energy-delay conserving for every job, or work conserving for every timestep.
A scheduling solution is energy-delay conserving for every job if, whenever it allocates a job to a machine
in a timestep, it allocates it to the machine yielding the minimum normalized energy-delay product for that
job. A scheduling solution is work conserving for every timestep if it keeps allocating jobs from the waiting
queue as long as resources are available in a timestep. ESJF is a scheduling solution that is both energy-delay
conserving and work conserving. In general, since manually tuned resource schedulers usually make decisions
in each timestep based on a predefined metric, they are mainly resource conserving [44]. Deep-EAS, however,
can potentially be neither energy-delay conserving nor work conserving, if such decisions eventually lead to
a lower average normalized energy-delay product over all the jobs. In particular, for the results shown in
Fig. 6.3, Deep-EAS is not energy-delay conserving for 13.11% of the jobs, and is not work conserving for
91.81% of the timesteps.
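These two properties can be checked directly on a recorded scheduling trace. The short sketch below is illustrative only; the trace representation and the estimate_nedp and can_fit helpers are assumed, not taken from our implementation:

def is_energy_delay_conserving(job, chosen_machine, machines, estimate_nedp):
    # True if the job was placed on a machine achieving the minimum estimated
    # normalized energy-delay product for that job.
    best = min(estimate_nedp(job, m) for m in machines)
    return estimate_nedp(job, chosen_machine) <= best

def is_work_conserving(waiting_queue, machines):
    # True if no waiting job could still have been placed in this timestep.
    return not any(m.can_fit(j) for j in waiting_queue for m in machines)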
To further analyze the jobs scheduled by Deep-EAS, for the case in Fig. 6.3 where β = 0.5 and λ = 0.9
(a high job arrival rate), we examine the cumulative distribution function (CDF) plots of µj,0 for the jobs
Deep-EAS was not energy-delay conserving for (denoted hold_e), alongside the jobs Deep-EAS was not work
conserving for (denoted hold_w). These CDF plots are shown in Fig. 6.5. As shown in this figure, although
β = 0.5 and the numbers of short and long jobs in a sequence are therefore almost equal, we observe that
when Deep-EAS does not act as an energy-delay conserving scheduler for a job in a timestep, that job is most
of the time a long job (see hold_e in Fig. 6.5). The intuition behind this could be that allocating a long
job to the machine with the lower energy profile (and thus the longer duration)
could occupy that machine for many timesteps, which can reduce the chance of allocating a number of
potentially arriving short jobs to that machine and can thereby increase the average normalized energy-delay
product over all the jobs. Hence, in a heavy-load condition, it can eventually be beneficial not to be
energy-delay conserving for some of the long jobs. Similarly, when Deep-EAS withholds a job in a timestep,
that job is most of the time a long job (see hold_w in Fig. 6.5). Almost the same intuition given for hold_e
applies to hold_w: in a job arrival process with a high arrival rate, withholding a long job can pave the
way for scheduling a number of yet-to-arrive short jobs and can eventually help reduce the average normalized
energy-delay product over all the jobs. Deep-EAS learns these solutions on its own.
6.3.7 Examining the Effect of β and c
To evaluate the effect of β, for the case where the job arrival rate is 0.7, we consider three cases: β = 0.8
(the majority of jobs are short jobs), β = 0.5 (the numbers of short and long jobs are almost equal), and
β = 0.2 (the majority of jobs are long jobs). Evaluating Deep-EAS and ESJF on 150 new jobsets (not seen
during the training of Deep-EAS) for these values of β, the average normalized energy-delay products obtained
by Deep-EAS are 45.29%, 35.10%, and 11.94% lower than those of ESJF, respectively. This indicates that
Deep-EAS is more advantageous for job sequences in which the majority of jobs are short jobs.
To evaluate the effect of the workload estimator accuracy, reflected by the coefficient c mentioned in
Section 6.2.1, we reproduce the results of Fig. 6.3 assuming a perfect workload estimator for incoming jobs.
In other words, for each job j we assume dj,k ∼ N (µj,k, 0), i.e., dj,k = µj,k. Using this perfect workload
estimator, for job arrival rates of 0.1, 0.3, 0.5, 0.7, and 0.9, the average normalized energy-delay product
values obtained via Deep-EAS are 5.37%, 13.84%, 29.45%, 37.64%, and 45.47% lower, respectively, than those
of ESJF. Furthermore, the average normalized energy-delay products obtained using the perfect workload
estimator are lower for both Deep-EAS and ESJF compared to the values in Fig. 6.3. Therefore, with a better
workload estimator, Deep-EAS is again observed to be considerably more advantageous at higher arrival rates;
however, the gap between the values obtained by Deep-EAS and ESJF now widens at all job arrival rates.
6.4 Conclusions and Future Work
Inspired by recent advances in employing reinforcement learning (RL) for resource management, we have
introduced an alternative paradigm to manual, heuristic-based approaches for task scheduling in cluster
systems. By harnessing the power of deep reinforcement learning, we set out to build intelligent systems
that learn, adapt, and autonomously develop energy-aware scheduling strategies. This approach departs from
traditional supervised learning, which is typically infeasible for solving combinatorial optimization
problems. Instead, we exploit the ability to evaluate the performance of solutions through a verifier,
providing valuable feedback to a learning algorithm.
We presented Deep-EAS, an online energy-aware scheduler designed with the aid of deep RL. One of
the key motivations behind Deep-EAS is its ability to adapt to varying workload conditions efficiently,
thanks to the RL agent’s ability to model various aspects of the system, such as job arrival rates, job
duration and resource demand profiles, server occupation states, and energy profiles. Our method builds
scheduling policies that are energy-aware and capable of handling uncertainties in job workloads, a critical
consideration in real-world scenarios.
During training, Deep-EAS starts out knowing nothing about the scheduling task at hand and develops
nontrivial scheduling solutions. We observe that these solutions outperform standard manually tuned
heuristics, especially under heavy-load conditions with high job arrival rates. For future work, Deep-EAS
can potentially be extended to learn more complex strategies such as job preemption, job migration, and
dynamic voltage and frequency scaling (DVFS), which would increase its adaptability to various situations.
Figure 6.5: CDF plots of µj,0 of the jobs Deep-EAS is not energy-delay conserving for (hold_e), alongside the jobs Deep-EAS is not work conserving for (hold_w), when λ = 0.9 and β = 0.5.
Chapter 7
Conclusions
In conclusion, this thesis addresses a wide range of critical challenges in the fields of deep learning, hardware acceleration, energy-efficient scheduling, and cluster system optimization. The contributions presented in this work represent substantial advancements in their respective domains, offering innovative
solutions to real-world problems and pushing the boundaries of technology.
The first major contribution revolves around the development of an FPGA-friendly inference accelerator, F2N2, designed specifically for deep neural networks. F2N2 leverages mixed-computation techniques
to optimize neural network inference on FPGA devices, achieving remarkable gains in efficiency and performance. By addressing data movement costs, optimizing memory resources, and supporting mixed-precision operations, F2N2 opens the door to efficient deployment of deep learning models in latency-critical applications.
The second contribution, SynergicLearning, introduces a novel learning framework that combines
the strengths of hyperdimensional and neural network models for on-chip, incremental learning. This
framework supports mixed-computation and streamlines the learning process, offering adaptability and
efficiency. By bridging the gap between these two approaches, it overcomes their individual shortcomings,
paving the way for more effective on-chip learning.
The third set of contributions focuses on energy optimization in embedded systems and cluster systems. In embedded systems, the thesis explores dynamic voltage and frequency scaling (DVFS) and dynamic power management (DPM) techniques to minimize energy consumption while meeting performance
requirements. By modeling idle intervals and proposing energy-efficient scheduling strategies, it provides
an integrated approach to energy optimization.
Furthermore, we focused on task graph scheduling with potentially imprecise computations, prioritizing Quality of Service (QoS) within the constraints of hard deadlines and energy efficiency. Our method offers a comprehensive solution to this complex problem by considering the input-quality-dependent workload extension and leveraging interdependent tasks in the context of multiprocessor system-on-chip (MPSoC) platforms. We also provided a Mixed Integer Linear Program (MILP) formulation of the same problem,
allowing for a direct comparison between the heuristic and optimal solutions.
In the realm of cluster systems, the thesis introduces Deep-EAS, an intelligent online energy-aware
scheduler designed for heterogeneous energy profiles. Deep-EAS learns scheduling policies based on various system characteristics and workload uncertainties, outperforming manual heuristics under varying
conditions.
Collectively, these contributions demonstrate significant advancements in the domains of deep learning, hardware acceleration, on-chip learning, energy efficiency, and cluster system optimization. They
not only provide valuable insights but also offer practical solutions to complex problems, contributing to
the ongoing efforts to make technology more efficient, adaptable, and accessible across a wide range of
applications. This work represents a substantial step forward in the pursuit of efficient and sustainable
computing solutions for the future.
Bibliography
[1] “7 Series DSP48E1 Slice User Guide (UG479).” https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf.
[2] “Convolutional Neural Network with INT4 Optimization on Xilinx Devices, White Paper.” https://www.xilinx.com/support/documentation/white_papers/wp521-4bit-optimization.pdf.
[3] “Deep Learning with INT8 Optimization on Xilinx Devices, White Paper (WP486).” https://www.xilinx.com/support/documentation/white_papers/wp486-deep-learning-int8.pdf.
[4] Davide Anguita, Alessandro Ghio, Luca Oneto, Xavier Parra, and Jorge Luis Reyes-Ortiz. “A
public domain dataset for human activity recognition using smartphones.” In: Esann. 2013.
[5] Anupreetham Anupreetham, Mohamed Ibrahim, Mathew Hall, Andrew Boutros, Ajay Kuzhively,
Abinash Mohanty, Eriko Nurvitadhi, Vaughn Betz, Yu Cao, and Jae-sun Seo. “End-to-End
FPGA-based Object Detection Using Pipelined CNN and Non-Maximum Suppression”. In: 31st
International Conference on Field-Programmable Logic and Applications, FPL 2021, Dresden,
Germany, August 30 - Sept. 3, 2021. IEEE, 2021, pp. 76–82. doi: 10.1109/FPL53798.2021.00021.
[6] Automatic Mixed Precision for Deep Learning.
https://developer.nvidia.com/automatic-mixed-precision/.
[7] Hakan Aydin, Rami Melhem, Daniel Mossé, and Pedro Mejía-Alvarez. “Optimal reward-based
scheduling for periodic real-time tasks”. In: IEEE Transactions on Computers (2001).
[8] Hakan Aydin, Rami Melhem, Daniel Mossé, and Pedro Mejía-Alvarez. “Determining optimal
processor speeds for periodic real-time tasks with different power characteristics”. In: Real-Time
Systems, 13th Euromicro Conference on, 2001. IEEE. 2001, pp. 225–232.
[9] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. “Neural Machine Translation by Jointly
Learning to Align and Translate”. In: International Conference on Learning Representations. 2015.
[10] Irwan Bello et al. “Neural combinatorial optimization with reinforcement learning”. In: arXiv
preprint arXiv:1611.09940 (2016).
[11] Gal Chechik, Uri Shalit, Varun Sharma, and Samy Bengio. “An online algorithm for large scale
image similarity learning”. In: Advances in Neural Information Processing Systems. 2009.
[12] Kumar Chellapilla, Sidd Puri, and Patrice Simard. “High performance convolutional neural
networks for document processing”. In: 2006.
[13] Gang Chen, Kai Huang, and Alois Knoll. “Energy optimization for real-time multiprocessor
system-on-chip with optimal DVFS and DPM combination”. In: ACM Transactions on Embedded
Computing Systems (TECS) 13.3s (2014), p. 111.
[14] Jia-Ming Chen, Wan-Chen Lu, Wei-Kuan Shih, and Ming-Chung Tang. “Imprecise Computations
with Deferred Optional Tasks”. In: Journal of Information Science & Engineering (2009).
[15] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Q. Yan, Haichen Shen,
Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and
Arvind Krishnamurthy. “TVM: An Automated End-to-End Optimizing Compiler for Deep
Learning”. In: USENIX Symposium on Operating Systems Design and Implementation. USENIX
Association, 2018, pp. 578–594. url:
https://www.usenix.org/conference/osdi18/presentation/chen.
[16] Tianshi Chen, Zidong Du, Ninghui Sun, Jia Wang, Chengyong Wu, Yunji Chen, and
Olivier Temam. “DianNao: a small-footprint high-throughput accelerator for ubiquitous
machine-learning”. In: Architectural Support for Programming Languages and Operating Systems,
ASPLOS ’14, Salt Lake City, UT, USA, March 1-5, 2014. Ed. by Rajeev Balasubramonian, Al Davis,
and Sarita V. Adve. ACM, 2014, pp. 269–284.
[17] Yao Chen, Jiong He, Xiaofan Zhang, Cong Hao, and Deming Chen. “Cloud-DNN: An Open
Framework for Mapping DNN Models to Cloud FPGAs”. In: Proceedings of the 2019 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, FPGA 2019, Seaside, CA, USA,
February 24-26, 2019. Ed. by Kia Bazargan and Stephen Neuendorffer. ACM, 2019, pp. 73–82.
[18] Yu-Hsin Chen, Joel S. Emer, and Vivienne Sze. “Eyeriss: A Spatial Architecture for
Energy-Efficient Dataflow for Convolutional Neural Networks”. In: International Symposium on
Computer Architecture. IEEE Computer Society, 2016, pp. 367–379.
[19] Ping Chi, Shuangchen Li, Cong Xu, Tao Zhang, Jishen Zhao, Yongpan Liu, Yu Wang, and
Yuan Xie. “PRIME: A Novel Processing-in-Memory Architecture for Neural Network
Computation in ReRAM-Based Main Memory”. In: International Symposium on Computer
Architecture. IEEE Computer Society, 2016, pp. 27–39. doi: 10.1109/ISCA.2016.13.
[20] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang,
Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. “PACT: Parameterized Clipping
Activation for Quantized Neural Networks”. In: CoRR abs/1805.06085 (2018). arXiv: 1805.06085.
url: http://arxiv.org/abs/1805.06085.
[21] Jungwook Choi, Zhuo Wang, Swagath Venkataramani, Pierce I-Jen Chuang,
Vijayalakshmi Srinivasan, and Kailash Gopalakrishnan. “PACT: Parameterized clipping activation
for quantized neural networks”. In: arXiv preprint arXiv:1805.06085 (2018).
[22] J-Y Chung, Jane W.-S. Liu, and K-J Lin. “Scheduling periodic jobs that allow imprecise results”. In:
Transactions on Computers (1990).
[23] Dan C. Ciresan, Ueli Meier, and Jürgen Schmidhuber. “Multi-column deep neural networks for
image classification”. In: Conference on Computer Vision and Pattern Recognition. IEEE Computer
Society, 2012, pp. 3642–3649. doi: 10.1109/CVPR.2012.6248110.
[24] Ron Cole, Yeshwant Muthusamy, and Mark Fanty. The ISOLET spoken letter database. Oregon
Graduate Institute of Science and Technology, Department of Computer . . ., 1990.
[25] Jason Cong, Peng Wei, and Cody Hao Yu. “From JVM to FPGA: Bridging Abstraction Hierarchy
via Optimized Deep Pipelining”. In: 10th USENIX Workshop on Hot Topics in Cloud Computing,
HotCloud 2018, Boston, MA, USA, July 9, 2018. Ed. by Ganesh Ananthanarayanan and
Indranil Gupta. USENIX Association, 2018.
[26] Luis Alejandro Cortés, Petru Eles, and Zebo Peng. “Quasi-static assignment of voltages and
optional cycles in imprecise-computation systems with energy considerations”. In: Transactions
on Very Large Scale Integration Systems (2006).
[27] Sohum Datta, Ryan AG Antonio, Aldrin RS Ison, and Jan M Rabaey. “A Programmable
Hyper-Dimensional Processor Architecture for Human-Centric IoT”. In: IEEE Journal on
Emerging and Selected Topics in Circuits and Systems (2019).
[28] Tim Dettmers and Luke Zettlemoyer. “Sparse Networks from Scratch: Faster Training without
Losing Performance”. In: CoRR abs/1907.04840 (2019). arXiv: 1907.04840. url:
http://arxiv.org/abs/1907.04840.
[29] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding”. In: Conference of the North American
Chapter of the Association for Computational Linguistics: Human Language Technologies.
Association for Computational Linguistics, 2019, pp. 4171–4186.
[30] Robert P Dick, David L Rhodes, and Wayne Wolf. “TGFF: task graphs for free”. In: Proceedings of
the 6th international workshop on Hardware/software codesign. IEEE Computer Society. 1998,
pp. 97–101.
[31] Robert P Dick, David L Rhodes, and Wayne Wolf. “TGFF: task graphs for free”. In: International
Workshop on Hardware/Software Codesign. IEEE. 1998.
[32] Xiaohan Ding, Guiguang Ding, Xiangxin Zhou, Yuchen Guo, Jungong Han, and Ji Liu. “Global
Sparse Momentum SGD for Pruning Very Deep Neural Networks”. In: Advances in Neural
Information Processing Systems. 2019, pp. 6379–6391. url: http://papers.nips.cc/paper/8867-
global-sparse-momentum-sgd-for-pruning-very-deep-neural-networks.
[33] Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. “HAWQ:
Hessian AWare Quantization of Neural Networks With Mixed-Precision”. In: 2019 IEEE/CVF
International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 -
November 2, 2019. IEEE, 2019, pp. 293–302. doi: 10.1109/ICCV.2019.00038.
[34] Zidong Du et al. “ShiDianNao: shifting vision processing closer to the sensor”. In: International
Symposium on Computer Architecture. ACM, 2015, pp. 92–104.
[35] Amirhossein Esmaili, Mahdi Nazemi, and Massoud Pedram. “Modeling processor idle times in
MPSoC platforms to enable integrated DPM, DVFS, and task scheduling subject to a hard
deadline”. In: Asia and South Pacific Design Automation Conference. ACM. 2019.
[36] Amirhossein Esmaili, Mahdi Nazemi, and Massoud Pedram. “Modeling processor idle times in
MPSoC platforms to enable integrated DPM, DVFS, and task scheduling subject to a hard
deadline”. In: ASP-DAC. 2019.
[37] Wu-chun Feng and JW-S Liu. “Algorithms for scheduling real-time tasks with input error and
end-to-end deadlines”. In: Transactions on Software Engineering (1997).
[38] Mingyu Gao, Jing Pu, Xuan Yang, Mark Horowitz, and Christos Kozyrakis. “TETRIS: Scalable and
Efficient Neural Network Acceleration with 3D Memory”. In: International Conference on
Architectural Support for Programming Languages and Operating Systems. Ed. by Yunji Chen,
Olivier Temam, and John Carter. ACM, 2017, pp. 751–764. doi: 10.1145/3037697.3037702.
[39] Ritu Garg, Mamta Mittal, and Le Hoang Son. “Reliability and energy efficient workflow
scheduling in cloud environment”. In: Cluster Comput. (2019).
[40] Marco ET Gerards, Johann L Hurink, and Jan Kuper. “On the interplay between global DVFS and
scheduling tasks with precedence constraints”. In: Transactions on Computers (2015).
[41] Marco ET Gerards, Johann L Hurink, and Jan Kuper. “On the interplay between global DVFS and
scheduling tasks with precedence constraints”. In: IEEE Transactions on Computers 64.6 (2015),
pp. 1742–1754.
[42] Marco ET Gerards and Jan Kuper. “Optimal DPM and DVFS for frame-based real-time systems”.
In: ACM Transactions on Architecture and Code Optimization (TACO) 9.4 (2013), p. 41.
[43] Vinayak Gokhale et al. “A 240 G-ops/s Mobile Coprocessor for Deep Neural Networks”. In:
Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2014, pp. 696–701.
[44] Robert Grandl et al. “Multi-resource packing for cluster schedulers”. In: ACM SIGCOMM CCR
(2015).
[45] Song Han, Jeff Pool, John Tran, and William J. Dally. “Learning both Weights and Connections for
Efficient Neural Networks”. In: CoRR abs/1506.02626 (2015). arXiv: 1506.02626. url:
http://arxiv.org/abs/1506.02626.
[46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image
Recognition”. In: Conference on Computer Vision and Pattern Recognition. IEEE Computer Society,
2016, pp. 770–778.
[47] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image
recognition”. In: IEEE conference on computer vision and pattern recognition (CVPR). 2016.
[48] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. “Distilling the knowledge in a neural network”. In:
arXiv preprint arXiv:1503.02531 (2015).
[49] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. “Distilling the Knowledge in a Neural
Network”. In: CoRR abs/1503.02531 (2015). arXiv: 1503.02531. url:
http://arxiv.org/abs/1503.02531.
[50] Sepp Hochreiter and Jürgen Schmidhuber. “Long Short-Term Memory”. In: Neural Computation
9.8 (1997), pp. 1735–1780.
[51] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang,
Tobias Weyand, Marco Andreetto, and Hartwig Adam. “Mobilenets: Efficient convolutional
neural networks for mobile vision applications”. In: arXiv preprint arXiv:1704.04861 (2017).
[52] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. “Densely Connected
Convolutional Networks”. In: Conference on Computer Vision and Pattern Recognition. IEEE
Computer Society, 2017, pp. 2261–2269.
[53] Huang Huang, Vivek Chaturvedi, Gang Quan, Jeffrey Fan, and Meikang Qiu. “Throughput
maximization for periodic real-time systems under the maximal temperature constraint”. In:
Transactions on Embedded Computing Systems (2014).
[54] Kai Huang, Luca Santinelli, Jian-Jia Chen, Lothar Thiele, and Giorgio C Buttazzo. “Applying
real-time interface and calculus for dynamic power management in hard real-time systems”. In:
Real-Time Systems 47.2 (2011), pp. 163–193.
[55] Xin Huang, KenLi Li, and RenFa Li. “A energy efficient scheduling base on dynamic voltage and
frequency scaling for multi-core embedded real-time system”. In: International Conference on
Algorithms and Architectures for Parallel Processing. Springer. 2009, pp. 137–145.
[56] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. “Binarized
Neural Networks”. In: Advances in Neural Information Processing Systems. 2016, pp. 4107–4115.
url: http://papers.nips.cc/paper/6573-binarized-neural-networks.
[57] David Hull, Arjun Shankar, Klara Nahrstedt, and Jane WS Liu. “An end-to-end QoS model and
management architecture”. In: Workshop on Middleware for Distributed Real-time Systems and
Services. Citeseer. 1997.
[58] IBM ILOG CPLEX Optimization Studio, Version 12.8. Available from:
https://www.ibm.com/products/ilog-cplex-optimization-studio.
[59] Andrey Ignatov. “Real-time human activity recognition from accelerometer data using
Convolutional Neural Networks”. In: Applied Soft Computing (2018).
[60] Mohsen Imani, Chenyu Huang, Deqian Kong, and Tajana Rosing. “Hierarchical hyperdimensional
computing for energy efficient classification”. In: ACM/ESDA/IEEE Design Automation Conference
(DAC). 2018.
[61] Mohsen Imani, Deqian Kong, Abbas Rahimi, and Tajana Rosing. “Voicehd: Hyperdimensional
computing for efficient speech recognition”. In: 2017 IEEE International Conference on Rebooting
Computing (ICRC). IEEE. 2017.
[62] Mohsen Imani, Abbas Rahimi, Deqian Kong, Tajana Rosing, and Jan M Rabaey. “Exploring
hyperdimensional associative memory”. In: 2017 IEEE International Symposium on High
Performance Computer Architecture (HPCA). IEEE. 2017.
[63] Mohsen Imani, Sahand Salamat, Saransh Gupta, Jiani Huang, and Tajana Rosing. “Fach:
Fpga-based acceleration of hyperdimensional computing by reducing computational complexity”.
In: Proceedings of the 24th Asia and South Pacific Design Automation Conference. 2019.
[64] Gordon Inggs, Shane T. Fleming, David B. Thomas, and Wayne Luk. “Is high level synthesis ready
for business? A computational finance case study”. In: 2014 International Conference on
Field-Programmable Technology, FPT 2014, Shanghai, China, December 10-12, 2014. Ed. by
Jialin Chen, Wenbo Yin, Yuichiro Shibata, Lingli Wang, Hayden Kwok-Hay So, and Yuchun Ma.
IEEE, 2014, pp. 12–19.
[65] Pentti Kanerva. “Hyperdimensional computing: An introduction to computing in distributed
representation with high-dimensional random vectors”. In: Cognitive computation (2009).
[66] Duckhwan Kim, Jaeha Kung, Sek M. Chai, Sudhakar Yalamanchili, and Saibal Mukhopadhyay.
“Neurocube: A Programmable Digital Neuromorphic Architecture with High-Density 3D
Memory”. In: International Symposium on Computer Architecture. IEEE Computer Society, 2016,
pp. 380–392. doi: 10.1109/ISCA.2016.41.
[67] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv
preprint arXiv:1412.6980 (2014).
[68] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet classification with deep
convolutional neural networks”. In: Advances in neural information processing systems. 2012.
[69] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep
Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems. 2012,
pp. 1106–1114.
[70] E. A. Lee and D. G. Messerschmitt. “Synchronous data flow”. In: Proceedings of the IEEE 75.9
(1987), pp. 1235–1245.
[71] Fengfu Li and Bin Liu. “Ternary Weight Networks”. In: CoRR abs/1605.04711 (2016). arXiv:
1605.04711. url: http://arxiv.org/abs/1605.04711.
[72] Kenli Li et al. “Energy-aware scheduling algorithm for task execution cycles with normal
distribution on heterogeneous computing systems”. In: ICPP. 2012.
[73] Jane W.-S. Liu, Kwei-Jay Lin, Wei Kuan Shih, Albert Chuang-shi Yu, Jen-Yao Chung, and
Wei Zhao. “Algorithms for scheduling imprecise computations”. In: Foundations of Real-Time
Computing: Scheduling and Resource Management. Springer, 1991.
[74] Jane WS Liu, Kwei-Jay Lin, Riccardo Bettati, David Hull, and Albert Yu. “Use of imprecise
computation to enhance dependability of real-time systems”. In: Foundations of Dependable
Computing. Springer, 1994.
[75] Yongpan Liu, Huazhong Yang, Robert P Dick, Hui Wang, and Li Shang. “Thermal vs energy
optimization for DVFS-enabled processors in embedded systems”. In: International Symposium on
Quality Electronic Design. IEEE. 2007.
[76] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. “On the computational efficiency of training
neural networks”. In: Advances in neural information processing systems. 2014.
[77] Viktor Losing, Barbara Hammer, and Heiko Wersing. “Incremental on-line learning: A review and
comparison of state of the art algorithms”. In: Neurocomputing (2018).
[78] Laurens van der Maaten and Geoffrey Hinton. “Visualizing data using t-SNE”. In: Journal of
machine learning research (2008).
[79] Hongzi Mao et al. “Resource management with deep reinforcement learning”. In: HotNets. ACM.
2016.
[80] Michael McCloskey and Neal J Cohen. “Catastrophic interference in connectionist networks: The
sequential learning problem”. In: Psychology of learning and motivation. Elsevier, 1989.
[81] Azalia Mirhoseini et al. “Device placement optimization with reinforcement learning”. In: ICML.
2017.
[82] Asit K. Mishra and Debbie Marr. “Apprentice: Using Knowledge Distillation Techniques To
Improve Low-Precision Network Accuracy”. In: International Conference on Learning
Representations. OpenReview.net, 2018. url: https://openreview.net/forum?id=B1ae1lZRb.
[83] Asit K. Mishra, Eriko Nurvitadhi, Jeffrey J. Cook, and Debbie Marr. “WRPN: Wide
Reduced-Precision Networks”. In: International Conference on Learning Representations.
OpenReview.net, 2018. url: https://openreview.net/forum?id=B1ZvaaeAZ.
[84] Volodymyr Mnih et al. “Human-level control through deep reinforcement learning”. In: Nature
(2015).
[85] Volodymyr Mnih et al. “Playing atari with deep reinforcement learning”. In: arXiv preprint
arXiv:1312.5602 (2013).
[86] Lei Mo, Angeliki Kritikakou, and Olivier Sentieys. “Approximation-aware Task Deployment on
Asymmetric Multicore Processors”. In: Design, Automation & Test in Europe Conference &
Exhibition. IEEE. 2019.
[87] Lei Mo, Angeliki Kritikakou, and Olivier Sentieys. “Controllable QoS for imprecise computation
tasks on DVFS multicores with time and energy constraints”. In: Journal on Emerging and Selected
Topics in Circuits and Systems (2018).
[88] Justin Morris, Mohsen Imani, Samuel Bosch, Anthony Thomas, Helen Shu, and Tajana Rosing.
“CompHD: Efficient hyperdimensional computing using model compression”. In: IEEE/ACM
International Symposium on Low Power Electronics and Design (ISLPED). 2019.
[89] Waqaas Munawar, Heba Khdr, Santiago Pagani, Muhammad Shafique, Jian-Jia Chen, and
Jörg Henkel. “Peak power management for scheduling real-time tasks on heterogeneous
many-core systems”. In: International Conference on Parallel and Distributed Systems. IEEE. 2014.
[90] Takashi Nakada, Hiroyuki Yanagihashi, Hiroshi Nakamura, Kunimaro Imai, Hiroshi Ueki,
Takashi Tsuchiya, and Masanori Hayashikoshi. “Energy-aware task scheduling for near real-time
periodic tasks on heterogeneous multicore processors”. In: Very Large Scale Integration
(VLSI-SoC), 2017 IFIP/IEEE International Conference on. IEEE. 2017, pp. 1–6.
[91] Sangyoung Park, Jaehyun Park, Donghwa Shin, Yanzhi Wang, Qing Xie, Massoud Pedram, and
Naehyuck Chang. “Accurate modeling of the delay and energy overhead of dynamic voltage and
frequency scaling in modern microprocessors”. In: IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems 32.5 (2013), pp. 695–708.
[92] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. “PyTorch: An imperative
style, high-performance deep learning library”. In: Advances in Neural Information Processing
Systems. 2019.
[93] Antonio Polino, Razvan Pascanu, and Dan Alistarh. “Model compression via distillation and
quantization”. In: International Conference on Learning Representations. OpenReview.net, 2018.
url: https://openreview.net/forum?id=S1XolQbRW.
[94] Louis-Noël Pouchet, Peng Zhang, P. Sadayappan, and Jason Cong. “Polyhedral-based data reuse
optimization for configurable computing”. In: The 2013 ACM/SIGDA International Symposium on
Field Programmable Gate Arrays, FPGA ’13, Monterey, CA, USA, February 11-13, 2013. Ed. by
Brad L. Hutchings and Vaughn Betz. ACM, 2013, pp. 29–38.
[95] Haotong Qin, Ruihao Gong, Xianglong Liu, Xiao Bai, Jingkuan Song, and Nicu Sebe. “Binary
neural networks: A survey”. In: Pattern Recognit. 105 (2020), p. 107281. doi:
10.1016/j.patcog.2020.107281.
[96] Abbas Rahimi, Pentti Kanerva, and Jan M Rabaey. “A robust and energy-efficient classifier using
brain-inspired hyperdimensional computing”. In: Proceedings of the 2016 International Symposium
on Low Power Electronics and Design (ISLPED). 2016.
[97] Rajat Raina, Anand Madhavan, and Andrew Y. Ng. “Large-scale deep unsupervised learning using
graphics processors”. In: International Conference on Machine Learning. Vol. 382. ACM
International Conference Proceeding Series. ACM, 2009, pp. 873–880. doi:
10.1145/1553374.1553486.
[98] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. “XNOR-Net: ImageNet
Classification Using Binary Convolutional Neural Networks”. In: European Conference on
Computer Vision. Vol. 9908. Lecture Notes in Computer Science. Springer, 2016, pp. 525–542. doi:
10.1007/978-3-319-46493-0\_32.
[99] RC Ravindran, C Mani Krishna, Israel Koren, and Zahava Koren. “Scheduling imprecise task
graphs for real-time applications”. In: International Journal of Embedded Systems (2014).
[100] Peng Rong and Massoud Pedram. “Power-aware scheduling and dynamic voltage setting for tasks
running on a hard real-time system”. In: Design Automation, 2006. Asia and South Pacific
Conference on. IEEE. 2006, 6–pp.
[101] Nadav Rotem, Jordan Fix, Saleem Abdulrasool, Summer Deng, Roman Dzhabarov,
James Hegeman, Roman Levenstein, Bert Maher, Nadathur Satish, Jakob Olesen, Jongsoo Park,
Artem Rakhov, and Misha Smelyanskiy. “Glow: Graph Lowering Compiler Techniques for Neural
Networks”. In: CoRR abs/1805.00907 (2018). arXiv: 1805.00907. url:
http://arxiv.org/abs/1805.00907.
[102] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,
Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. “ImageNet large scale
visual recognition challenge”. In: International journal of computer vision (2015).
[103] Cosmin Rusu, Rami Melhem, and Daniel Mossé. “Maximizing rewards for real-time applications
with energy constraints”. In: ACM Transactions on Embedded Computing Systems (TECS) (2003).
[104] Doyen Sahoo, Quang Pham, Jing Lu, and Steven CH Hoi. “Online deep learning: Learning deep
neural networks on the fly”. In: arXiv preprint arXiv:1711.03705 (2017).
[105] Manuel Schmuck, Luca Benini, and Abbas Rahimi. “Hardware optimizations of dense binary
hyperdimensional computing: Rematerialization of hypervectors, binarized bundling, and
combinational associative memory”. In: ACM Journal on Emerging Technologies in Computing
Systems (JETC) (2019).
[106] John Schulman et al. “Trust region policy optimization”. In: ICML. 2015.
[107] Ali Shafiee, Anirban Nag, Naveen Muralimanohar, Rajeev Balasubramonian, John Paul Strachan,
Miao Hu, R. Stanley Williams, and Vivek Srikumar. “ISAAC: A Convolutional Neural Network
Accelerator with In-Situ Analog Arithmetic in Crossbars”. In: International Symposium on
Computer Architecture. IEEE Computer Society, 2016, pp. 14–26. doi: 10.1109/ISCA.2016.12.
[108] Hardik Sharma et al. “From high-level deep neural models to FPGAs”. In: International
Symposium on Microarchitecture. IEEE Computer Society, 2016, 17:1–17:12.
[109] Runbin Shi et al. “FTDL: A tailored fpga-overlay for deep learning with high scalability”. In: In
Proceedings of ACM/IEEE Design Automation Conference (DAC). 2020.
[110] W-K Shih and Jane W-S Liu. “On-line scheduling of imprecise computations to minimize error”.
In: Real-Time Systems Symposium. IEEE. 1992.
[111] W-K Shih, Jane W-S Liu, and J-Y Chung. “Fast algorithms for scheduling imprecise
computations”. In: Real-Time Systems Symposium. IEEE. 1989.
[112] Karen Simonyan and Andrew Zisserman. “Very Deep Convolutional Networks for Large-Scale
Image Recognition”. In: International Conference on Learning Representations. 2015.
[113] Leslie N Smith and Nicholay Topin. “Super-convergence: Very fast training of neural networks
using large learning rates”. In: Artificial Intelligence and Machine Learning for Multi-Domain
Operations Applications. International Society for Optics and Photonics. 2019.
[114] Atefeh Sohrabizadeh, Jie Wang, and Jason Cong. “End-to-End Optimization of Deep Learning
Applications”. In: FPGA ’20: The 2020 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, Seaside, CA, USA, February 23-25, 2020. Ed. by
Stephen Neuendorffer and Lesley Shannon. ACM, 2020, pp. 133–139.
[115] Jan Sommer, M. Akif Özkan, Oliver Keszöcze, and Jürgen Teich. “DSP-Packing: Squeezing
Low-precision Arithmetic into FPGA DSP Blocks”. In: 32nd International Conference on
Field-Programmable Logic and Applications, FPL 2022, Belfast, United Kingdom, August 29 - Sept. 2,
2022. IEEE, 2022, pp. 160–166. doi: 10.1109/FPL57034.2022.00035.
[116] Krishnan Srinivasan and Karam S Chatha. “Integer linear programming and heuristic techniques
for system-level low power scheduling on multiprocessor architectures under throughput
constraints”. In: INTEGRATION, the VLSI journal 40.3 (2007), pp. 326–354.
[117] Georgios L Stavrinides, Francisco Rodrigo Duro, Helen D Karatza, Javier Garcia Blas, and
Jesus Carretero. “Different aspects of workflow scheduling in large-scale distributed systems”. In:
Simulation Modelling Practice and Theory (2017).
[118] Georgios L Stavrinides and Helen D Karatza. “Energy-aware scheduling of real-time workflow
applications in clouds utilizing DVFS and approximate computations”. In: FiCloud. IEEE. 2018.
[119] Georgios L Stavrinides and Helen D Karatza. “Scheduling multiple task graphs with end-to-end
deadlines in distributed real-time systems utilizing imprecise computations”. In: Journal of
Systems and Software (2010).
[120] Georgios L Stavrinides and Helen D Karatza. “Scheduling real-time DAGs in heterogeneous
clusters by combining imprecise computations and bin packing techniques for the exploitation of
schedule holes”. In: Future Generation Computer Systems (2012).
[121] Dave Steinkrau, Patrice Y. Simard, and Ian Buck. “Using GPUs for Machine Learning Algorithms”.
In: International Conference on Document Analysis and Recognition. IEEE Computer Society, 2005,
pp. 1115–1119. doi: 10.1109/ICDAR.2005.251.
[122] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S Emer. “Efficient processing of deep neural
networks: A tutorial and survey”. In: Proceedings of the IEEE (2017).
[123] Vivienne Sze, Yu-Hsin Chen, Tien-Ju Yang, and Joel S. Emer. Efficient Processing of Deep Neural
Networks. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, 2020.
doi: 10.2200/S01004ED1V01Y202004CAC050.
[124] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov,
Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. “Going deeper with convolutions”.
In: Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2015, pp. 1–9.
[125] Lucas Theis, Iryna Korshunova, Alykhan Tejani, and Ferenc Huszár. “Faster gaze prediction with
dense networks and Fisher pruning”. In: CoRR abs/1801.05787 (2018). arXiv: 1801.05787. url:
http://arxiv.org/abs/1801.05787.
[126] Haluk Topcuoglu, Salim Hariri, and Min-you Wu. “Performance-effective and low-complexity
task scheduling for heterogeneous computing”. In: Transactions on Parallel and Distributed
Systems (2002).
[127] Haluk Topcuoglu, Salim Hariri, and Min-you Wu. “Performance-effective and low-complexity
task scheduling for heterogeneous computing”. In: IEEE transactions on parallel and distributed
systems 13.3 (2002), pp. 260–274.
[128] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela Blott, Philip Heng Wai Leong,
Magnus Jahre, and Kees A. Vissers. “FINN: A Framework for Fast, Scalable Binarized Neural
Network Inference”. In: Proceedings of the 2017 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays, FPGA 2017, Monterey, CA, USA, February 22-24, 2017. Ed. by
Jonathan W. Greene and Jason Helge Anderson. ACM, 2017, pp. 65–74. url:
http://dl.acm.org/citation.cfm?id=3021744.
[129] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez,
Lukasz Kaiser, and Illia Polosukhin. “Attention is All you Need”. In: Advances in Neural
Information Processing Systems. 2017, pp. 5998–6008.
[130] Stylianos I. Venieris and Christos-Savvas Bouganis. “fpgaConvNet: Mapping Regular and
Irregular Convolutional Neural Networks on FPGAs”. In: IEEE Transaction on Neural Networks
and Learning Systems 30.2 (2019), pp. 326–342.
[131] Stylianos I. Venieris, Alexandros Kouris, and Christos-Savvas Bouganis. “Toolflows for Mapping
Convolutional Neural Networks on FPGAs: A Survey and Future Directions”. In: ACM Comput.
Surv. 51.3 (June 2018).
[132] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. “HAQ: Hardware-Aware Automated
Quantization With Mixed Precision”. In: IEEE Conference on Computer Vision and Pattern
Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019. Computer Vision Foundation /
IEEE, 2019, pp. 8612–8620. doi: 10.1109/CVPR.2019.00881.
[133] Zeke Wang, Hongjing Huang, Jie Zhang, and Gustavo Alonso. “Shuhai: Benchmarking high
bandwidth memory on fpgas”. In: 2020 IEEE 28th Annual International Symposium on
Field-Programmable Custom Computing Machines (FCCM). IEEE. 2020, pp. 111–119.
[134] Zhaokang Wang, Yunpan Wang, Chunfeng Yuan, Rong Gu, and Yihua Huang. “Empirical analysis
of performance bottlenecks in graph neural network training and inference with GPUs”. In:
Neurocomputing 446 (2021), pp. 165–191. doi: 10.1016/j.neucom.2021.03.015.
[135] Xuechao Wei, Yun Liang, Xiuhong Li, Cody Hao Yu, Peng Zhang, and Jason Cong. “TGPA:
tile-grained pipeline architecture for low latency CNN inference”. In: Proceedings of the
International Conference on Computer-Aided Design, ICCAD 2018, San Diego, CA, USA, November
05-08, 2018. Ed. by Iris Bahar. ACM, 2018, p. 58.
[136] Ronald J Williams. “Simple statistical gradient-following algorithms for connectionist
reinforcement learning”. In: Machine learning (1992).
[137] Qingcheng Xiao, Liqiang Lu, Jiaming Xie, and Yun Liang. “FCNNLib: An Efficient and Flexible
Convolution Algorithm Library on FPGAs”. In: 57th ACM/IEEE Design Automation Conference,
DAC 2020, San Francisco, CA, USA, July 20-24, 2020. IEEE, 2020, pp. 1–6. doi:
10.1109/DAC18072.2020.9218748.
[138] Xilinx Inc., “UltraScale Architecture DSP Slice User Guide (UG579).” https://www.xilinx.com/support/documentation/user_guides/ug579-ultrascale-dsp.pdf.
[139] Xuan Yang, Mingyu Gao, Jing Pu, Ankita Nayak, Qiaoyi Liu, Steven Bell, Jeff Setter, Kaidi Cao,
Heonjae Ha, Christos Kozyrakis, and Mark Horowitz. “DNN Dataflow Choice Is Overrated”. In:
CoRR abs/1809.04070 (2018). arXiv: 1809.04070. url: http://arxiv.org/abs/1809.04070.
[140] Hanchen Ye, Xiaofan Zhang, Zhize Huang, Gengsheng Chen, and Deming Chen. “HybridDNN: A
Framework for High-Performance Hybrid DNN Accelerator Design and Implementation”. In: 57th
ACM/IEEE Design Automation Conference, DAC 2020, San Francisco, CA, USA, July 20-24, 2020.
IEEE, 2020, pp. 1–6.
[141] Heng Yu, Bharadwaj Veeravalli, and Yajun Ha. “Dynamic scheduling of imprecise-computation
tasks in maximizing QoS under energy constraints for embedded systems”. In: Asia and South
Pacific Design Automation Conference. IEEE. 2008.
[142] Sergey Zagoruyko and Nikos Komodakis. “Wide Residual Networks”. In: British Machine Vision
Conference. BMVA Press, 2016.
[143] Chen Zhang et al. “Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural
Networks”. In: International Symposium on Field-Programmable Gate Arrays. ACM, 2015,
pp. 161–170.
[144] Tianyi Zhang, Zhiqiu Lin, Guandao Yang, and Christopher De Sa. “QPyTorch: A Low-Precision
Arithmetic Simulation Framework”. In: CoRR abs/1910.04540 (2019).
[145] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and
Yanzhi Wang. “A Systematic DNN Weight Pruning Framework Using Alternating Direction
Method of Multipliers”. In: European Conference on Computer Vision. Vol. 11212. Lecture Notes in
Computer Science. Springer, 2018, pp. 191–207. doi: 10.1007/978-3-030-01237-3\_12.
[146] Tianyun Zhang, Shaokai Ye, Kaiqi Zhang, Jian Tang, Wujie Wen, Makan Fardad, and
Yanzhi Wang. “A systematic DNN weight pruning framework using alternating direction method
of multipliers”. In: European Conference on Computer Vision (ECCV). 2018.
[147] Yujian Zhang, Yun Wang, and Xin Yuan. “Energy-aware Task Scheduling on DVS-enabled
Heterogeneous Clusters by Iterated Local Search”. In: CSCWD. 2018.
[148] Hengyu Zhao, Colin Weinshenker, Mohamed Ibrahim, Adwait Jog, and Jishen Zhao. “Layer-wise
Performance Bottleneck Analysis of Deep Neural Networks”. In: The 1st International Workshop
on Architectures for Intelligent Machine. 2017.
[149] Aojun Zhou, Anbang Yao, Yiwen Guo, Lin Xu, and Yurong Chen. “Incremental Network
Quantization: Towards Lossless CNNs with Low-precision Weights”. In: International Conference
on Learning Representations. OpenReview.net, 2017. url:
https://openreview.net/forum?id=HyQJ-mclg.
[150] Junlong Zhou et al. “Energy-adaptive scheduling of imprecise computation tasks for QoS
optimization in real-time MPSoC systems”. In: DATE. 2017.
[151] Junlong Zhou, Jianming Yan, Tongquan Wei, Mingsong Chen, and Xiaobo Sharon Hu.
“Energy-adaptive scheduling of imprecise computation tasks for QoS optimization in real-time
MPSoC systems”. In: Design, Automation & Test in Europe Conference & Exhibition. IEEE. 2017.
[152] Shuchang Zhou, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. “DoReFa-Net:
Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients”. In: CoRR
abs/1606.06160 (2016). arXiv: 1606.06160. url: http://arxiv.org/abs/1606.06160.
[153] Shuchang Zhou, Yuxin Wu, Zekun Ni, Xinyu Zhou, He Wen, and Yuheng Zou. “DoReFa-Net:
Training low bitwidth convolutional neural networks with low bitwidth gradients”. In: arXiv
preprint arXiv:1606.06160 (2016).
[154] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. “Trained Ternary Quantization”. In:
International Conference on Learning Representations. OpenReview.net, 2017. url:
https://openreview.net/forum?id=S1%5C_pAu9xl.
[155] Michael Zhu and Suyog Gupta. “To Prune, or Not to Prune: Exploring the Efficacy of Pruning for
Model Compression”. In: International Conference on Learning Representations. OpenReview.net,
2018. url: https://openreview.net/forum?id=Sy1iIDkPM.
[156] Xiaomin Zhu et al. “Adaptive energy-efficient scheduling for real-time tasks on DVS-enabled
heterogeneous clusters”. In: J. Parallel Distribut. Syst. (2012).
Abstract
Presently, machine learning (ML) models, including deep neural networks, find extensive utilization across diverse industries, such as banking, finance, analytics, security, drug design, the high-tech industry, IC design, visual tasks, language understanding, healthcare, and business. However, conducting inference, such as object recognition or language understanding, with state-of-the-art DNNs poses significant challenges due to the limited computational, memory, and energy resources available, and it necessitates substantial advancements beyond the current state of the art. Essentially, there is a need for a lightweight and highly energy-efficient inference accelerator that is capable of achieving inference accuracy comparable to using full precision for all the inference computations. Numerous studies have indicated that employing full-precision computations is unnecessary for many applications. Nevertheless, extremely low-precision models often lead to a noticeable decrease in accuracy, which can be as significant as 10-15% when compared to full-precision computation, such as 32-bit floating-point or 16-bit fixed-point models.
While many methods have been proposed to improve the accuracy of the low-precision models, so far, no one has found a solution to this intrinsic accuracy loss. Stepping back from uniformly ultra-low-precision models, mixed-precision models have been proposed to serve as a better trade-off. Effective ways have been found to train accurate models with some layers being processed in ultra-low precision while other layers are processed in high precision. Another important consideration (which has been the focus of many academic and industry efforts) is the cost of accessing pre-trained weights from external memory, the memory cost of storing the weights on-chip, and finally, the cost of weight data transfers on-chip, which is a substantial bottleneck based on benchmarks across various hardware platforms.
By examining the range of deep learning models concerning number systems and numerical precision, we can conclude that achieving a trade-off between hardware efficiency and inference accuracy is possible through the utilization of mixed-computation models. These models should be appropriately trained and deployed on a heterogeneous hardware platform, commonly referred to as a mixed-computation accelerator fabric. To accommodate various types of neural network models, it is crucial to have a specially designed computational fabric that supports multiple number systems, multiple precision computations, and seamless data conversions between different precision levels. Additionally, support for distillation, compilation, and runtime optimizations is essential to ensure optimal performance.
In this thesis, we focus on the energy-efficient and low-latency implementation of neural network inference and present F2N2, an end-to-end FPGA-friendly Framework for designing Neural Network (NN) accelerators that leverages the count and intrinsic arrangement of the computing and memory resources of the target FPGA. We apply optimizations to reduce the cost of data movement in cloud FPGAs, which are typically equipped with extensive on-chip memory resources. Furthermore, we employ a software/hardware co-optimization flow to achieve an efficient communication method between the host CPU and the FPGA accelerator and thereby maximize performance. Compared to the state-of-the-art work, F2N2 achieves a factor-of-three reduction in end-to-end inference latency under the same experimental setup while achieving a clock frequency of 342 MHz on a Xilinx VU9P FPGA device.
In addition, we provide an efficient streaming accelerator architecture for carrying out inference for mixed-precision deep neural networks. In this architecture, we pack the operations associated with low-precision weights so that multiple operations are performed using the same resources on the target FPGA. This technique, combined with the streaming architecture, can greatly enhance the throughput of neural network inference.
In addition to utilizing neural networks, we also propose employing brain-inspired hyperdimensional (HD) learning models for some cognitive tasks. Neural networks (NNs) are well known for their high accuracy owing to the quality of their automatic feature extraction, while brain-inspired HD learning models are known for their quick training, computational efficiency, and adaptability. This thesis presents a hybrid, synergic machine learning model that excels at all the said characteristics and is suitable for incremental, on-line learning on a chip. The proposed model comprises an NN and a classifier. The NN acts as a feature extractor and is specifically trained to work well with the classifier, which employs the HD computing framework. We use the proposed accelerator mentioned above and present a parameterized hardware implementation of the said feature extraction and classification components, while introducing a compiler that maps any arbitrary NN and/or classifier to the aforementioned hardware. The proposed hybrid machine learning model has the same level of accuracy as NNs while achieving at least a 10% improvement in accuracy compared to HD learning models.
Additionally, the end-to-end hardware realization of the hybrid model improves power efficiency by 1.60x compared to state-of-the-art, high-performance HD learning implementations while improving latency by 2.13x.
These results have profound implications for the application of such synergic models in challenging cognitive tasks.
Before shifting my focus to the acceleration of mixed-precision neural networks and brain-inspired hyperdimensional (HD) learning models on FPGAs, I was actively involved in developing energy-aware scheduling strategies for real-time, deadline-constrained tasks across different computing devices. These devices spanned from portable embedded systems to servers in data centers. To provide a comprehensive view of my research journey, I will incorporate my previous work on energy-aware scheduling strategies into the last three chapters of my thesis.