ULTRA-LOW-LATENCY DEEP NEURAL NETWORK INFERENCE
THROUGH CUSTOM COMBINATIONAL LOGIC

by

Mahdi Nazemi

A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)

December 2021

Copyright 2021 Mahdi Nazemi
to my family,
for their endless love and support.
ﻑﺮﻃ ﺮﻫ ﺎﻫﺭﺩ ﺖﺴﺑ ﺎﺨﯿﻟﺯ ﺮﮔ
ﻑﺮﺼﻨﻣ ﺶﺒﻨﺟ ﺯ ﻢﻫ ﻒﺳﻮﯾ ﺖﻓﺎﯾ
ﺪﯾﺪﭘ ﻩﺭ ﺪﺷ ﻭ ﺭﺩ ﻭ ﻞﻔﻗ ﺪﺷ ﺯﺎﺑ
ﺪﯿﻬﺟﺮﺑ ﻒﺳﻮﯾ ﺩﺮﮐ ﻞﮐ ﻮﺗ ﻥﻮﭼ
ﺪﯾﺪﭘ ﺍﺭ ﻢﻟﺎﻋ ﺖﺴﯿﻧ ﻪﻨﺧﺭ ﻪﭼ ﺮﮔ
ﺪﯾﻭﺩ ﺪﯾﺎﺑ ﻣ ﺭﺍﻭ ﻒﺳﻮﯾ ﻩﺮﯿﺧ
ﺩﻮﺷ ﺍﺪﯿﭘ ﺭﺩ ﻭ ﻞﻔﻗ ﺪﯾﺎﺸﮔ ﺎﺗ
∗
ﺩﻮﺷ ﺎﺟ ﺍﺭ ﺎﻤﺷ ﯽﯾﺎﺟ ﯽﺑ ﯼﻮﺳ
Though Zulaikha shut the doors on every side,
still Joseph reached safety by making an effort.
Lock and door opened, and the way out appeared;
when Joseph put trust in God, he escaped.
Though the world has no visible exit door,
still one must run recklessly, like Joseph,
so that the lock may open and the door might appear,
so that the place of placelessness might become your home.
∗ From the Masnavi by Jalal al-Din Muhammad Balkhi (Rumi), 1273.
† From The Rumi Daybook by Kabir Helminski and Camille Helminski, 2011.
Acknowledgments

I would like to express my gratitude to my advisor, Professor Massoud Pedram, for supporting my doctoral research and giving me the freedom to explore different research topics before finding the area I was passionate about. His motivation, passion, and grit indeed set an example for me.

I would like to thank my dissertation and qualifying exam committee members, Professors Peter Beerel, Sandeep Gupta, Pierluigi Nuzzo, and Meisam Razaviyayn, for their thought-provoking comments.

My thanks also go to my fellow team members and collaborators. In particular, I am indebted to my mentor, Doctor Mohammad Javad Dousti, and my seniors, Doctor Alireza Shafaei Bejestan and Professor Yanzhi Wang, for their enthusiastic support and for sharing their invaluable experience with me, paving the way for fruitful research.

I would also like to thank my friends, without whom my journey in the past few years would not have been as exciting. In particular, I would like to thank Mohammad Motie Share, Mahsa Moslehi, Mohammad Javad Dousti, and Alireza Shafaei Bejestan for creating sweet, enduring memories.

Last but not least, I would like to express my deepest gratitude to my beloved family. My parents, Masoud and Monireh, always went above and beyond to support me in every way possible and to provide quality education to me. They indeed experienced an excruciating few years, especially during the Trump Administration. I sincerely appreciate their patience in tolerating the long distance between us. I cannot thank my mother enough for always keeping in touch with me despite all the difficulties. I am grateful to my brother, Mohsen, for teaching me my first lessons in math and literature, encouraging me to pursue my dreams, and guiding me through the biggest challenges of my life. I would also like to thank my wife, Shabnam, for being an excellent friend for me in the past 11 years and for going the extra mile to keep our relationship strong despite the long distance between us during our doctoral studies. I always admire her compassion, devotion, independence, and wisdom and truly enjoy her companionship.
Contents

Dedication ii
Epigraph iii
Acknowledgments iv
List of Tables ix
List of Figures xi
Abstract xiii

1 Introduction 1

2 Preliminaries & Related Work 10
  2.1 Artificial Neural Networks 10
    2.1.1 Model Quantization 16
    2.1.2 Model Pruning 19
    2.1.3 Knowledge Distillation 21
  2.2 Logic Minimization 23
    2.2.1 Two-level Logic Minimization 23
    2.2.2 Multi-level Logic Minimization 27

3 DNN Design & Training 30
  3.1 Overview 30
  3.2 Quantization-aware Training 30
    3.2.1 Parameterized Hard Tanh Function 31
  3.3 Fan-in-constrained Pruning 33
  3.4 Training Data Sampling 37
  3.5 Vestigial Neural Networks 39
  3.6 Training Subtleties 41
    3.6.1 Dropout Layers 41
    3.6.2 Max Pooling Layers 42

4 Million-scale Two-level Logic Minimization 45
  4.1 Overview 45
  4.2 ESPRESSO-II's EXPAND Step 45
    4.2.1 ESPRESSO-II's Internal Data Representation 49
  4.3 ESPRESSO-GPU 50
    4.3.1 Review of CUDA 50
    4.3.2 Parallel Implementation of EXPAND 53
  4.4 Divide & Conquer-based TLM 61
    4.4.1 Decision Tree Construction 61
    4.4.2 ESPRESSO-GPU on Leaf Nodes 64
    4.4.3 Dominant Label Assignment 64
    4.4.4 SVM-based Sample Selection 65
    4.4.5 Error-budget-driven Leaf Node Elimination 66
    4.4.6 ESPRESSO-GPU on a Collection of Optimized Leaf Nodes 69
  4.5 Results & Discussion 70
    4.5.1 ESPRESSO-GPU 70
    4.5.2 Divide & Conquer-based TLM 71

5 Results & Discussions 77
  5.1 Experimental Setup 77
    5.1.1 Datasets 77
    5.1.2 Training Strategy 78
  5.2 Impact of Activation Quantization on Classification Accuracy 78
    5.2.1 Batch Normalization 78
    5.2.2 Bit-width & Choice of Activation Function 79
  5.3 Impact of Vestigial Layers on Classification Accuracy 80
  5.4 Impact of Context-aware Training Data Sampling on Classification Accuracy 81
  5.5 Processing DNN Layers with NullaNet 82
    5.5.1 JSC 82
    5.5.2 NID 82
    5.5.3 CIFAR-10 83

6 Conclusions & Possible Research Directions 86

Bibliography 89
List of Tables

2.1 Summary of notation (deep neural networks) 22
2.2 Summary of notation (logic minimization) 29
4.1 Comparison of the execution time of ESPRESSO-II and ESPRESSO-GPU 71
4.2 Comparison of different figures of merit when optimizing ISFs with 300,000 minterms 73
4.3 Comparison of different figures of merit when optimizing ISFs with 300,000 minterms 74
4.4 Breakdown of optimization time for different divide-and-conquer-based TLM heuristics 76
4.5 Comparison of different figures of merit when optimizing ISFs with 3,000,000 minterms 76
5.1 Impact of batch normalization on classification accuracy of a quantized neural network trained for the JSC task 79
5.2 Impact of bit-width and activation function on classification accuracy for the JSC task 80
5.3 Impact of adding a vestigial layer on classification accuracy of a neural network trained for the MNIST task 80
5.4 Impact of context-aware training data sampling on classification accuracy of a neural network trained for the MNIST task 81
5.5 Comparison between the hardware realization metrics of NullaNet with those of LogicNets on the JSC task 82
5.6 Comparison between the hardware realization metrics of NullaNet with those of LogicNets on the NID task 83
5.7 Comparison of different layers of the VGG16 architecture 85
5.8 Comparison of processing time of layers 8–13 of the VGG16 architecture using different implementations 85
List of Figures

1.1 An artificial neuron designed, optimized, and processed with NullaNet 3
1.2 A neural network layer designed, optimized, and processed with NullaNet 4
1.3 An artificial neuron designed, optimized, and processed with NullaNet while employing an incompletely specified function 5
1.4 High-level overview of NullaNet 7
2.1 An example of a multilayer perceptron 13
2.2 An example of a convolutional neural network 14
3.1 Replacing a linear layer, batch normalization, and activation function with a CCL 40
3.2 Replacing two consecutive linear layers with a CCL 41
3.3 Order of layers during training, NullaNet optimization, and NullaNet processing 43
4.1 Representation of cubes in ESPRESSO-II 50
4.2 GPU's hardware architecture 51
4.3 Processing kernels on a GPU 52
4.4 Parallel filtering of cubes 57
4.5 Parallel distance calculation 58
4.6 Parallel tree-based reduction 60
4.7 Decision tree construction for TLM 63
4.8 Average time per call to the distance calculation kernel 70
Abstract

Significant advancements in building both general-purpose and custom hardware have been among the critical enablers for shifting deep neural networks (DNNs) from rather theoretical concepts to practical solutions for a wide variety of problems. However, DNNs are growing in size and complexity to improve their output quality, demanding ever more compute cycles, memory footprint, and I/O bandwidth during their inference. To sustain the ubiquitous deployment of deep learning models and cope with their computational and memory complexities, this dissertation introduces NullaNet, a technique for the design, optimization, and processing of DNNs for applications with stringent latency and throughput requirements. NullaNet formulates low-latency, high-throughput processing of DNNs as a logic minimization problem where it replaces arithmetic operations of a DNN with low-cost logic operations and bakes the DNN's parameters into the realized logic. Fan-in-constrained pruning, context-aware training data sampling, and million-scale two-level logic minimization are among the contributions of this dissertation that make the NullaNet technique possible. Experimental results show the superiority of the NullaNet technique compared to state-of-the-art DNN processors, where end-to-end inference latency is slashed by a factor of five or more.
1 Introduction
Deep neural networks (DNNs) have surpassed the accuracy of conventional machine learning (ML) models in many challenging domains, including computer vision [1–6] and natural language processing [7–10]. Significant advancements in building both general-purpose and custom hardware have been among the critical enablers for shifting DNNs from rather theoretical concepts to practical solutions for a wide variety of problems [11–14]. Alarmingly, the success of DNNs comes at the cost of high latency and enormous hardware resources, which, in turn, prevent their deployment in latency-critical applications and resource-constrained platforms. The high latency and huge hardware cost are because practical, high-quality deep learning (DL) models entail billions of arithmetic operations and millions of parameters, which exert considerable pressure on both processing and memory subsystems.

To sustain the ubiquitous deployment of DL models and cope with their computational and memory complexities, numerous effective methods operating at different levels of the design hierarchy have been developed. At the algorithm level, methods such as model quantization [15–22], model pruning [23–27], and knowledge distillation [28–31] have gained more popularity. At the compiler level, domain-specific optimizations, memory-related optimizations (e.g., instruction scheduling, static memory allocation, and copy elimination), and device-specific code generation are employed [32–35]. At the architecture level, different dataflow architectures, which encourage data reuse, are utilized to reduce data movement [36–40]. Finally, at the circuit and device level, different energy-efficient digital and analog processing elements which contribute to vector-matrix multiplications are designed [41–44].

While the methods described above effectively optimize DNNs designed for applications with millisecond latency requirements, they are inadequate for applications with extremely high data rates and microsecond or sub-microsecond latency requirements. For example, consider the problem of network intrusion detection, which requires categorizing network packets as malicious or harmless for cybersecurity purposes. Processing a DNN designed for such a problem should not only support modern networks where data rates can go up to 100 Gbit/s, but it also needs to introduce negligible latency to ensure high communication quality.

This dissertation presents NullaNet, a technique for the design, optimization, and processing of DNNs for applications with stringent latency and throughput requirements. Fundamentally, NullaNet formulates low-latency, high-throughput processing of DNNs as a logic minimization problem where it replaces arithmetic operations of a DNN with low-cost logic operations and bakes the DNN's parameters into the realized logic.

Figure 1.1 illustrates the fundamental idea behind NullaNet for the design, optimization, and processing of an artificial neuron. While training the artificial neuron, NullaNet discretizes all three inputs and the output to binary values by applying the Heaviside step function to them. It then creates a truth table representing the artificial neuron, where different input rows of the truth table
correspond to different input combinations of the artificial neuron, and different output rows of the truth table correspond to output values the artificial neuron produces when the said inputs are applied to it. After that, NullaNet employs a two-level logic minimization (TLM) technique such as Karnaugh map optimization to simplify the Boolean function of the truth table, and, finally, it employs multi-level logic minimization to find a custom combinational logic (CCL) for the simplified Boolean function.

Figure 1.1. An artificial neuron designed, optimized, and processed with NullaNet.
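The enumeration step above is mechanical and easy to reproduce. The following minimal C++ sketch (not code from the dissertation) enumerates the truth table of a single binarized three-input neuron using the weights and threshold shown in the Figure 1.1 example, here assumed to be w = (1.4, -3.4, 2.8) with a firing threshold of 0.61; any other trained values can be substituted.

#include <cstdio>

// Enumerate the truth table of a binarized 3-input neuron. The weights and
// threshold follow the Figure 1.1 example and are assumptions of this sketch.
int main() {
    const double w[3] = {1.4, -3.4, 2.8};
    const double threshold = 0.61;

    printf("x1 x2 x3 | sum   | y\n");
    for (int code = 0; code < 8; ++code) {
        int x[3] = {(code >> 2) & 1, (code >> 1) & 1, code & 1};
        double sum = 0.0;
        for (int j = 0; j < 3; ++j)
            sum += w[j] * x[j];
        int y = (sum >= threshold) ? 1 : 0;   // Heaviside step at the threshold
        printf("%d  %d  %d | %+.1f | %d\n", x[0], x[1], x[2], sum, y);
    }
    return 0;
}

The rows with y = 1 form the on-set and the rows with y = 0 form the off-set, which is exactly the input a two-level minimizer such as a Karnaugh map or ESPRESSO consumes.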
Such design, optimization, and processing of the artificial neuron lead to considerable improvements in latency, energy consumption, and memory bandwidth requirements for the following reasons. First, processing the artificial neuron neither requires storing its parameters in permanent storage nor reading the parameters into the main memory because the parameters are baked into the CCL. Second, the operations required for calculating the output of the artificial neuron are carried out by a simplified CCL, which is much more efficient than its arithmetic counterpart.
la y er to further impro v e computational complexit y , as illustrated in Figure 1.2 .
T o this end, NullaNet first finds a CCL for eac h artificial neuron similar to the
example sho wn in Figure 1.1 . It then applies m ulti-lev el logic minimization to find
common logic expressions that can b e shared among m ultiple output neurons, im-
plemen ts suc h logic expressions once, and feeds their outputs to artificial neurons
that consume them. In the example sho wn in Figure 1.2 , pro cessing the opti-
mized neural net w ork la y er requires only sev en logic op erations, while pro cessing
individual neurons tak es 13 logic op erations.
Figure 1.2. A neural network layer designed, optimized, and processed with NullaNet.
To optimize neurons with tens or hundreds of inputs, where exhaustive input enumeration is infeasible, NullaNet records input combinations and outputs encountered when processing training data and creates an incompletely specified function (ISF) for artificial neurons using the recorded values. It then assumes the outputs of the remaining input combinations, which are not recorded, are don't-cares. This approach to constructing an ISF is equivalent to sampling the mathematical functions representing artificial neurons at parts of the input space that matter to the neurons. Figure 1.3 illustrates an example of optimizing the same artificial neuron shown in Figure 1.1 while assuming two input combinations are not encountered.
Figure 1.3. An artificial neuron designed, optimized, and processed with NullaNet while employing an incompletely specified function.

Despite its strengths in the efficient processing of DNNs, NullaNet, as described herein, has a few limitations. First, it is well known that discretizing inputs and outputs of artificial neurons to binary values hampers the performance of many DL models. Second, employing ISFs to approximate functions of artificial neurons with a large number of inputs may further decrease the performance of DL models which include those neurons. Third, using training data to form ISFs for artificial neurons may lead to tens of thousands to millions of minterms in the Boolean specification of the functions. However, none of the existing TLM techniques can optimize Boolean functions with more than 10,000 minterms in a reasonable amount of time. Fourth, DNNs may include other types of layers such as convolutional layers, which are widely used in convolutional neural networks (CNNs), or batch normalization layers, whose design, optimization, and processing were not discussed in this chapter. Finally, because central processing units (CPUs) and general-purpose graphics processing units (GPGPUs) are not explicitly designed to perform Boolean operations efficiently, processing DNNs optimized with NullaNet on these processors takes longer than processing the same DNN using arithmetic operations.

The remainder of this dissertation focuses on introducing an advanced version of NullaNet, which addresses all the shortcomings mentioned above.
Design, optimization, and processing of state-of-the-art DNNs and CNNs using NullaNet comprises four major components as shown in Figure 1.4: a training module, a two-level logic minimization module, a multi-level logic minimization module, and a hardware realization module. NullaNet utilizes the modules mentioned above to optimize a target DNN for a given dataset and map major parts of the computations performed in the DNN to extremely low-latency, low-cost, fixed-function, combinational logic blocks.
Figure 1.4. A high-level overview of NullaNet illustrating the training, two-level logic minimization, multi-level logic minimization, and hardware realization modules.

Although NullaNet divides the optimization process into logically separate components, it has a holistic approach to the efficient processing of deep neural networks. A brief description of the four main components of NullaNet, which follows shortly, demonstrates how upstream components take account of downstream components while performing various optimizations.
• The training module performs quantization-aware training on the provided model and dataset. Quantization-aware training is optionally followed by applying a newly-introduced fan-in-constrained pruning to significantly reduce the computational complexity of the required TLM problem and the hardware cost. At the end of the training, this module may pass a subset of critical samples of the dataset to the TLM module to help construct truth tables for neurons with tens or hundreds of inputs.

• The two-level logic minimization module creates truth tables that represent the functions of different neurons by enumerating all their possible input combinations and finding the corresponding outputs, or by examining the inputs and outputs of different neurons when a subset of training data is applied to the trained model. Then, it passes the truth tables to a suite of exact or approximate TLM algorithms that harden the functions of neurons into CCL.

• The multi-level logic minimization module optimizes groups of neurons within a layer or groups of consecutive layers by applying logic restructuring techniques such as decomposition and common expression extraction. This step is optionally followed by target-specific technology mapping.

• Finally, the hardware realization module implements the mapped multi-level circuit and other operations defined in the DNN on a target platform.

Such an end-to-end solution enables unprecedented levels of energy efficiency and low latency while maintaining acceptable levels of classification accuracy.
The rest of this dissertation is organized as follows. Chapter 2 explains preliminaries and reviews related work. Next, Chapter 3 and Chapter 4 detail different modules of NullaNet. After that, Chapter 5 presents experimental results and discussions. Finally, Chapter 6 concludes this dissertation.
2 Preliminaries & Related Work
2.1 Artificial Neural Networks
Artificial neural networks (ANNs) constitute a class of machine learning models which are inspired by biological neural networks. An ANN is comprised of artificial neurons and synaptic connections. Each artificial neuron (neuron, for short) receives information from its input synaptic connections, processes the information, and produces an output which is consumed by neurons connected to its output synaptic connections. On the other hand, each synaptic connection (called an edge) determines the strength of the connection between its producer and consumer neurons using a weight value.

The first mathematical model of an artificial neuron was presented by Warren S. McCulloch and Walter Pitts in 1943 [45]. A McCulloch-Pitts neuron (a.k.a. the threshold logic unit) takes a number of binary excitatory inputs and a binary inhibitory input, compares the sum of excitatory inputs with a threshold, and produces a binary output of one if the sum exceeds the threshold and the inhibitory input is not set. More formally,
y = \begin{cases} 1 & \text{if } \sum_{i=1}^{n-1} x_i \ge b \text{ and } x_0 = 0, \\ 0 & \text{otherwise,} \end{cases}

where each x_i represents one of the n binary inputs (x_0 is the inhibitory input while the remaining inputs are excitatory), b is the threshold (a.k.a. bias), and y is the binary output of the neuron.
It is evident that a McCulloch-Pitts neuron can easily implement various logical operations such as the logical conjunction (AND), the logical disjunction (OR), and the logical negation (NOT) by setting appropriate thresholds and inhibitory inputs. As a result, any arbitrary Boolean function can be mapped to an ANN that is comprised of McCulloch-Pitts neurons.
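As a concrete illustration (a sketch, not part of the original text), the following C++ snippet encodes a McCulloch-Pitts unit and shows one possible choice of thresholds that realizes AND, OR, and NOT; the function name and encoding are illustrative assumptions.

#include <cstdio>

// McCulloch-Pitts unit: x0 is the inhibitory input, the remaining n inputs are
// excitatory, and b is the firing threshold.
int mcculloch_pitts(int x0, const int *x, int n, int b) {
    int sum = 0;
    for (int i = 0; i < n; ++i) sum += x[i];
    return (sum >= b && x0 == 0) ? 1 : 0;
}

int main() {
    for (int a = 0; a <= 1; ++a) {
        for (int c = 0; c <= 1; ++c) {
            int x[2] = {a, c};
            int and_out = mcculloch_pitts(0, x, 2, 2);  // AND: threshold equals the fan-in
            int or_out  = mcculloch_pitts(0, x, 2, 1);  // OR: threshold of one
            printf("%d %d -> AND=%d OR=%d\n", a, c, and_out, or_out);
        }
    }
    // NOT: no excitatory inputs, threshold zero; the unit fires unless x0 is set.
    int none[1] = {0};
    printf("NOT 0 = %d, NOT 1 = %d\n",
           mcculloch_pitts(0, none, 0, 0), mcculloch_pitts(1, none, 0, 0));
    return 0;
}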
One of the main shortcomings of McCulloch-Pitts neurons is the absence of weights which determine the strength of synaptic connections between neurons. A perceptron, which was first proposed by Frank Rosenblatt in 1958 [46], addresses some of the shortcomings of McCulloch-Pitts neurons by introducing tunable weights and allowing real-valued inputs. The output of a perceptron is found by

y = \begin{cases} 0 & \text{if } \sum_{i=0}^{n-1} w_i x_i \le 0, \\ 1 & \text{otherwise.} \end{cases}
Algorithm 4.1 (continued)
31: while there exist any ACTIVE cubes of F do // We may still drop literals from c in order to make it PRIME
32:   Apply the MINI strategy of raising c // This strategy raises the part of c that is common to the largest number of cubes of F
33: end while
34: Mark c as PRIME and INACTIVE
35: Add c to F_p
36: end for
37: return F_p
Algorithm 4.2. distill_cubes
Input:
  C    // cover of a function
  flag // flag used for filtering cubes
Output:
  C_f  // cover of the function only containing cubes that have the specified flag
1: C_f = ∅
2: for each cube c ∈ C do
3:   if c has flag then
4:     Add c to C_f
5:   end if
6: end for
7: return C_f
Algorithm 4.3. feasibly_covered
Input:
  c // a cube being expanded
  p // the cube to be covered if possible
  R // cover of the off-set
Output:
  is_feasible // can c be expanded to cover p without intersecting R?
1: c+ = expanded version of c to minimally cover p
2: for each cube r in R do // LOOP-4.1
3:   if c+ intersects with r then
4:     return false
5:   end if
6: end for
7: return true
complexity of O(n|R|).

At this point, the algorithm has identified every uncovered cube of F that can be feasibly covered along with its corresponding lowering_set. LOOP-3.2 iterates over each feasibly covered cube p to get count[p], which is the number of feasibly covered cubes in F that are disjoint from lowering_set[p]. This loop has a total run time complexity of O(|F|).

LOOP-4.2 iterates over each feasibly covered cube q of F and checks whether it is disjoint from lowering_set[p]. Disjoint checks have a run time complexity of O(n). Therefore, LOOP-4.2 has a total run time complexity of O(n|F|).
Algorithm 4.4. calc_disjoint_cnt
Input:
  lp // lowering set of a cube p
  F  // cover of the on-set
Output:
  count // number of FEASIBLE cubes in F that are disjoint from lp
1: count = 0
2: for each cube q in F do // LOOP-4.2
3:   if q is FEASIBLE and q is disjoint from lp then
4:     count += 1
5:   end if
6: end for
7: return count
Considering all the aforesaid loops, the overall time complexity of the EXPAND step is O(n^2 |F|^2 |R|).
4.2.1 ESPRESSO-I I’s In ternal Data Represen tation
The software implementation of ESPRESSO-II represents each input/output variable using two bits: 01 if the variable has a value of zero, 10 if the variable has a value of one, and 11 if the variable has a value of 2 (don't-care). It then packs every 16 variables into an unsigned integer to minimize memory consumption. Additionally, it stores the meta-data corresponding to each cube in a separate unsigned integer. The meta-data includes information about whether the cube is prime, covered, etc. The software implementation of ESPRESSO-II stores all cubes that constitute F, R, or D in an array of unsigned integers where each contiguous sub-array of size k represents an individual cube (the value of k is determined by the number of input variables and output variables). Therefore, F will be represented with an array of size k|F| (R and D are represented with similar arrays). Figure 4.1 illustrates an example of such an array.
Figure 4.1. An array of unsigned integers representing two cubes. Each cube consists of four unsigned integers where the first one stores the meta-data and the rest store values of input and output variables.
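A minimal C++ sketch of this packed representation is shown below. The exact word layout and flag constants are assumptions for illustration and do not reproduce ESPRESSO-II's actual definitions; only the two-bits-per-variable encoding (01 = 0, 10 = 1, 11 = don't-care) and the 16-variables-per-word packing follow the description above.

#include <cstdint>
#include <cstddef>
#include <vector>

// Assumed layout per cube: word 0 holds the meta-data flags, the remaining
// words pack 16 variables each at two bits per variable.
constexpr uint32_t VAL_ZERO = 0x1, VAL_ONE = 0x2, VAL_DC = 0x3;
constexpr uint32_t FLAG_ACTIVE = 0x1, FLAG_PRIME = 0x2, FLAG_COVERED = 0x4;  // illustrative flags

struct CubeArray {
    int num_vars;                // input + output variables per cube
    int words_per_cube;          // k = 1 (meta-data) + ceil(num_vars / 16)
    std::vector<uint32_t> data;  // consecutive cubes of k words each

    CubeArray(int nv, int ncubes)
        : num_vars(nv),
          words_per_cube(1 + (nv + 15) / 16),
          data(static_cast<size_t>(words_per_cube) * ncubes, 0) {}

    uint32_t *cube(int i) { return data.data() + static_cast<size_t>(i) * words_per_cube; }

    void set_var(int cube_idx, int var, uint32_t val) {
        uint32_t *c = cube(cube_idx);
        int word = 1 + var / 16, shift = 2 * (var % 16);
        c[word] = (c[word] & ~(VAL_DC << shift)) | (val << shift);
    }

    uint32_t get_var(int cube_idx, int var) {
        int word = 1 + var / 16, shift = 2 * (var % 16);
        return (cube(cube_idx)[word] >> shift) & VAL_DC;
    }
};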
4.3 ESPRESSO-GPU
Single-core execution of EXPAND for large, sparse ISFs may take hours or days to complete, which effectively makes it impossible to iterate over different designs. This necessitates the development of heuristics that are amenable to parallelization, in addition to employing multi-core processors for implementing such heuristics, to achieve substantially lower optimization times. The focus of this dissertation is on parallelizing the EXPAND step, which takes most of the optimization time in ESPRESSO-II.
4.3.1 Review of CUDA
A GPU is implemented as a set of multiprocessors as illustrated in Figure 4.2. Each multiprocessor has a single instruction, multiple thread (SIMT) architecture where, at any given clock cycle, different processors of a multiprocessor execute the same instruction on different data. A portion of an application that is executed many times, but independently on different data, can be cast as a function, which is executed on different threads on a GPU. To that effect, such a function is mapped to a set of instructions chosen from the instruction set of the GPU and the resulting program, called a kernel, is downloaded to the GPU. A batch of threads that realize a kernel is organized as a grid of thread blocks (a.k.a. blocks) as illustrated in Figure 4.3. Each thread block is processed by only one multiprocessor so that the shared memory space resides in the on-chip shared memory of that multiprocessor. This, in turn, leads to very fast memory accesses.

Figure 4.2. A high-level view of a GPU's hardware architecture.
The thread blocks that are processed by one multiprocessor are referred to as active blocks. Each active block is split into SIMT groups of threads called warps where each warp contains the same number of threads (i.e., the warp size). Note that all threads in a warp execute the same instruction and run in a lock-step manner. Active warps, i.e., all warps from all active blocks, are time-sliced. Once a thread block is launched on a multiprocessor, all of its warps are resident until their execution finishes. A thread scheduler periodically switches from one warp to another to maximize the use of the multiprocessor's computational resources.
Figure 4.3. The organization of threads that realize a kernel into grid blocks.

CUDA is a hardware-software architecture which exposes the parallel data processing capabilities of GPUs. CUDA allows users to view the GPU as a highly multi-threaded co-processor that offloads the CPU when executing compute-intensive applications. The CUDA programming model provides an application programming interface (API) for non-graphics applications. CUDA provides general DRAM memory addressing on GPUs for more programming flexibility and supports both scatter and gather memory operations. As a result, from a programming perspective, this translates into the ability to read/write data from/to any location in DRAM similar to a CPU. CUDA also provides access to a parallel data cache, or on-chip shared memory, with very fast read and write accesses. Different threads of a computer program written in CUDA can benefit from such fast memories to reduce round trips to DRAM which, in turn, makes the program less dependent on the DRAM bandwidth.
CUDA also provides a synchronization function, __syncthreads(), which acts as a barrier for all threads running in the same thread block. This means that the code after a __syncthreads() call in some thread executes only after all threads in the thread block have reached the same __syncthreads() call. One application of this function is where a number of threads transfer data from the DRAM to the on-chip shared memory of their corresponding multiprocessor. By using __syncthreads(), it is guaranteed that all data transfers are completed before computations on the data begin. Additionally, CUDA provides atomic operations like atomicAdd() and atomicExch() to enable developers to avoid race conditions while accessing/updating shared data structures. These operations return the previous value stored in the memory location being accessed. It is worth mentioning that, unlike __syncthreads(), atomic operations serialize the access to a data structure for all threads in a kernel (as opposed to doing it only for the threads running in the same thread block).
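The kernel below is a small, self-contained illustration (not taken from ESPRESSO-GPU) of the two primitives just described: __syncthreads() guarantees that a shared-memory tile is fully populated before it is read, and atomicAdd() combines per-block results into a global counter without races.

#include <cuda_runtime.h>

// Each block stages a tile of the input in shared memory, synchronizes, then
// counts how many staged values are non-zero, combining block-local counts
// into a global counter atomically.
// Launch example: count_nonzero<<<blocks, threads, threads * sizeof(int)>>>(d_in, n, d_count);
__global__ void count_nonzero(const int *in, int n, int *global_count) {
    extern __shared__ int tile[];
    __shared__ int block_count;

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0;
    if (threadIdx.x == 0) block_count = 0;
    __syncthreads();                      // all loads and the reset finish before any thread proceeds

    if (tile[threadIdx.x] != 0)
        atomicAdd(&block_count, 1);       // block-local accumulation in shared memory
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(global_count, block_count);  // one global update per block
}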
4.3.2 Parallel Implementation of EXPAND
As detailed in Section 4.2, the bulk of the computations in the EXPAND step happens in four nested loops. This section explains which loops are parallelized on GPUs and why, in addition to the details of parallelization.
The first loop iterates over uncovered cubes of F and expands each uncovered cube c into a prime implicant. Parallelizing this loop by expanding multiple uncovered cubes at the same time has three main disadvantages. First and foremost, it is likely that some of the cubes under expansion can cover each other. In other words, if those cubes were expanded serially, some of them would have been covered by previous expansions and never considered independently for expansion. As a result, the chances of performing wasteful computations are increased. Second, because different cubes are expanded independently of each other, they may cover the same cubes of F multiple times. Therefore, the number of prime implicants found at the end of the EXPAND step is likely to increase. While this phenomenon may also occur in the serial version of the EXPAND step, the heuristic behind the EXPAND step is designed to favor covering uncovered cubes. Consequently, the serial version of the EXPAND step is expected to produce fewer prime implicants. Third, because the expansion of each cube requires modifying the meta-data corresponding to both F and R, these covers must be replicated for each cube under expansion and later merged properly into a consistent state. Such replication of F and R increases memory requirements and reduces efficient use of the DRAM bandwidth.
The second loop, which performs an iterative expansion, cannot be parallelized due to loop-carried dependencies. As a result, this dissertation implements the second loop serially. The remaining loops of the EXPAND step, which implement a single iteration of the iterative EXPAND step, are ones that can be parallelized effectively. In terms of memory accesses, it is required to transfer F and R from the CPU to the GPU before the iterative EXPAND procedure starts (the GPU implementation uses the exact same data representation as ESPRESSO-II). Additionally, the updated F and R, in addition to the expanded cube, need to be transferred from the GPU to the CPU when the expansion ends. Because the number of computations performed on F and R in the innermost loops is very large, the amortized cost of these data transfers is negligible.
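A host-side sketch of this transfer scheme, under the assumption that covers are stored as flat arrays of packed words, might look as follows (function and variable names are illustrative, not ESPRESSO-GPU's).

#include <cuda_runtime.h>

// Copy F and R to the GPU once before the iterative EXPAND procedure, run the
// kernels, and copy the updated covers back once the expansion ends.
void expand_on_gpu(unsigned *hF, size_t fWords, unsigned *hR, size_t rWords) {
    unsigned *dF = nullptr, *dR = nullptr;
    cudaMalloc(&dF, fWords * sizeof(unsigned));
    cudaMalloc(&dR, rWords * sizeof(unsigned));
    cudaMemcpy(dF, hF, fWords * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(dR, hR, rWords * sizeof(unsigned), cudaMemcpyHostToDevice);

    // ... launch the filtering, distance, counting, and reduction kernels here;
    // the serial outer loops stay on the CPU, reusing dF and dR across launches.

    cudaMemcpy(hF, dF, fWords * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaMemcpy(hR, dR, rWords * sizeof(unsigned), cudaMemcpyDeviceToHost);
    cudaFree(dF);
    cudaFree(dR);
}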
It is important to note that parallelization is performed for different reasons for different blocks of code in the two innermost loops. Some blocks of code or functions, like distance calculation, are inherently parallelizable while some others are serial in nature or include conditional statements. The blocks of code which are inherently parallelizable are processed by GPUs to boost performance. However, the blocks of code which are serial in nature should preferably be mapped to CPUs due to their advanced capabilities such as branch prediction, out-of-order execution, and multi-level caching. Mapping those blocks to GPUs will lead to performance degradation unless a very large number of threads is launched; in that case, the GPU's performance will be on a par with that of a CPU. To achieve performance gains in processing the EXPAND step and to avoid multiple data transfers between the CPU and the GPU, different functions of the EXPAND step need to be restructured carefully to enable efficient parallelization for both inherently serial and parallel blocks of code. Parallelizing computations of the said loops requires defining multiple CUDA kernels, some of which can be reused for different parts of the computations.
Filtering cubes: As explained in Section 4.2, the iterative EXPAND procedure only deals with active cubes of F to reduce the number of computations. Extracting active cubes requires two steps. First, one needs to look at the meta-data corresponding to each cube and check the status of the active flag. Next, one should store the indices of active cubes in an array for future use.

The CUDA kernel which extracts active cubes proceeds as follows (similar CUDA kernels can be designed to find feasible or covered cubes). It first creates one or more thread blocks where each thread block deals with a subset of all cubes by examining a contiguous sub-array in the array that represents F. Next, the threads inside each thread block examine the active flag of their corresponding cube. At this point, all active cubes are identified. However, the locations in the output array in which each thread has to write its active cubes are yet to be determined. There are two pieces of information that are required for finding the locations (a.k.a. indices) of each active cube in the output array. The first one is the number of active cubes in each thread block while the second one is the index of an active cube within each thread block. The total number of active cubes in thread blocks that precede a certain thread block determines an index from which the thread block has to write its active cubes. Similarly, the index of each active cube inside that thread block determines the offset that should be added to the start index to find the actual index corresponding to the active cube. These indices have to be calculated dynamically because the number of active cubes in each thread block may change from one iteration to another.

To determine the number of active cubes in each thread block, each thread atomically increments a counter local to the thread block when its corresponding cube is active. Concurrently, it reads the previous value which was stored in the local counter and uses that value as the offset (all counters are initialized to zero). Next, the thread with an index of zero in each thread block (the leader thread) atomically adds the value of the local counter corresponding to its thread block to a global counter while reading the value which was previously stored in the global counter. The read value determines the start index of that thread block. Each thread corresponding to an active cube adds the value read by the leader thread to its offset to find its resulting index. Finally, all threads write their active cubes into the output array in parallel. Figure 4.4 illustrates an example execution of a CUDA kernel which implements parallel filtering.
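A possible CUDA realization of this filtering scheme is sketched below. It assumes the meta-data words are stored contiguously (one per cube) and that the ACTIVE bit can be tested with a mask; both assumptions are for illustration only.

#include <cuda_runtime.h>

// Compact the indices of cubes whose meta-data carries the given flag into `out`.
// Each thread inspects one cube; `global_count` must be zero-initialized.
__global__ void filter_active_cubes(const unsigned *meta, int num_cubes,
                                    unsigned active_flag,
                                    int *out, int *global_count) {
    __shared__ int local_count;   // number of flagged cubes found by this block
    __shared__ int block_start;   // where this block writes in the output array

    if (threadIdx.x == 0) local_count = 0;
    __syncthreads();

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    bool active = (gid < num_cubes) && (meta[gid] & active_flag);
    int offset = -1;
    if (active)
        offset = atomicAdd(&local_count, 1);   // block-local index of this cube
    __syncthreads();

    if (threadIdx.x == 0)                       // leader thread claims a global range
        block_start = atomicAdd(global_count, local_count);
    __syncthreads();

    if (active)
        out[block_start + offset] = gid;        // final index = start + offset
}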
Figure 4.4. Parallel distillation of cubes. Check marks indicate active cubes.

Finding feasibly covered cubes (LOOP 3.1 & 4.1): By definition, feasibly covered cubes are cubes of F that can be covered by raising some literals of the cube under expansion without intersecting R. ESPRESSO-II iterates over all uncovered, non-prime cubes of F and temporarily expands the cube under expansion to cover that cube. It then calculates the distance between the temporarily expanded cube and all cubes of R to find possible intersections (in this context, distance reflects the number of variables where one cube has a value of zero while the other has a value of one). If no such intersections exist, the uncovered, non-prime cube is marked as feasible. For each feasible cube, a lowering set is defined as the set of variables of the cube under expansion which cannot be raised if it is expanded to cover the uncovered, non-prime cube.

Since the temporary expansion of the cube under expansion for each uncovered, non-prime cube is independent of that of other cubes, this step can be parallelized on GPUs. However, naïve parallelization would lead to reading the whole R from the DRAM for each uncovered, non-prime cube which, in turn, defeats the purpose of parallelization. ESPRESSO-GPU takes advantage of tiling to maximize data reuse in shared, on-chip memory and reduce the number of DRAM accesses.
Assume F, R, and a distance matrix that keeps track of pairwise distances between temporarily expanded cubes and cubes of R, as illustrated in Figure 4.5. The proposed implementation partitions the distance matrix and assigns the distance calculations of each part to a grid of thread blocks. Next, it assigns each thread block in the grid to a subset of the part that corresponds to the grid as shown in Figure 4.5 (each of these subsets is referred to as a tile). After that, the leftmost threads of a thread block read the cubes of the on-set that are required for distance calculation while the topmost threads read the required cubes of the off-set and store them in the shared memory. Finally, different threads of a thread block quickly calculate distances using the shared data. Tiling results in significant savings in memory bandwidth because of its inherent sharing. For example, designing 32×32 tiles leads to a 32× reduction in DRAM accesses for reading the cubes of the on-set and off-set, i.e., each fetched cube of the on-set/off-set will be used in 32 distance calculations with 32 cubes of the off-set/on-set.

Figure 4.5. Parallel distance calculation using tiling. In this example, |F| = |R| = 8 and, therefore, the distance matrix is 8×8. This matrix is partitioned into four parts (a.k.a. grids), each of which is 4×4. Each grid consists of four thread blocks where each thread block processes a 2×2 sub-matrix. Each thread processes one distance value in each grid as shown by the color-coded threads. This amortizes the cost of launching a thread across four computations.

ESPRESSO-GPU assigns multiple distance calculations to each thread to amortize the cost of launching threads across multiple computations. We refer to the number of distance calculations a thread performs as the thread reuse factor (TRF).
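The following sketch shows one way such a tiled distance kernel could be written. It assumes the on-set and off-set are passed as flat arrays of packed variable words (meta-data excluded) with unused positions in the last word padded with the don't-care code 11, and it uses a thread reuse factor of one for clarity; the names and tile size are illustrative, not ESPRESSO-GPU's.

#include <cuda_runtime.h>

#define TILE 16   // a TILE x TILE thread block computes TILE*TILE distances

// Distance between two cubes: the number of variables whose 2-bit fields have
// an empty intersection (one cube holds 01 and the other 10 in that position).
__device__ int cube_distance(const unsigned *a, const unsigned *b, int words) {
    int dist = 0;
    for (int w = 0; w < words; ++w) {
        unsigned x = a[w] & b[w];          // per-variable intersection
        for (int s = 0; s < 32; s += 2)
            if (((x >> s) & 0x3u) == 0u)   // empty intersection in this variable
                ++dist;
    }
    return dist;
}

// Launch example: dim3 block(TILE, TILE);
//                 dim3 grid((nR + TILE - 1) / TILE, (nF + TILE - 1) / TILE);
//                 tiled_distance<<<grid, block, 2 * TILE * words * sizeof(unsigned)>>>(...);
// In practice the matrix is partitioned across several such grids, as described above.
__global__ void tiled_distance(const unsigned *F, int nF,
                               const unsigned *R, int nR,
                               int words, int *dist) {
    extern __shared__ unsigned smem[];
    unsigned *fTile = smem;                   // TILE cubes of the on-set
    unsigned *rTile = smem + TILE * words;    // TILE cubes of the off-set

    int row = blockIdx.y * TILE + threadIdx.y;   // index into F
    int col = blockIdx.x * TILE + threadIdx.x;   // index into R

    // Leftmost threads load the on-set cubes for this tile; topmost threads
    // load the off-set cubes, so every cube is fetched from DRAM only once per tile.
    if (threadIdx.x == 0 && row < nF)
        for (int w = 0; w < words; ++w)
            fTile[threadIdx.y * words + w] = F[row * words + w];
    if (threadIdx.y == 0 && col < nR)
        for (int w = 0; w < words; ++w)
            rTile[threadIdx.x * words + w] = R[col * words + w];
    __syncthreads();

    if (row < nF && col < nR)
        dist[row * nR + col] = cube_distance(&fTile[threadIdx.y * words],
                                             &rTile[threadIdx.x * words], words);
}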
Counting feasibly covered cubes that are disjoint from the lowering set of a given feasibly covered cube (LOOP-3.2 & 4.2): As described in Algorithm 4.1, for each feasible cube, it counts the number of feasible cubes that are disjoint from the lowering set of that feasible cube. Next, by picking the feasible cube with the highest count, it enables a large number of uncovered, non-prime cubes to remain feasible and contribute to further expansions. Similar to the CUDA kernel that finds feasible cubes (see Figure 4.5), the CUDA kernel for finding the number of disjoint feasible cubes formulates this problem as a two-dimensional count matrix where both dimensions are associated with feasible cubes. It then launches a two-dimensional grid of thread blocks, which is responsible for counting plus data transfer, while adopting tiling to minimize data transfer between the GPU and the off-chip memory.
Doing argmax over disjoint counts: The CUDA kernel that finds the feasible cube with the highest number of disjoint cubes partitions the count matrix and assigns the computations for finding the feasible cube with the highest number of disjoint cubes in each part to a thread block. The threads inside each thread block perform a tree-based reduction to find such a feasible cube as shown in Figure 4.6.

Assuming a thread block with t threads has to find a feasible cube in an array of size 2t, each thread i loads the data at indices i and i+t and returns the cube with the higher number of disjoint cubes. This halves the size of the array in each iteration until the feasible cube with the maximum number of disjoint cubes is found inside each thread block. Finally, different thread blocks write their output feasible cubes into a secondary array which is passed to a tree-based reduction thread block to find the target feasible cube across all thread blocks.

Figure 4.6. Parallel tree-based reduction for finding the maximum value (or the index thereof) in an array.
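A block-level argmax reduction of this kind can be sketched as follows (assuming the block size is a power of two; the final cross-block pass can reuse the same kernel on the per-block results).

#include <cuda_runtime.h>

// Each block of t threads reduces a window of 2*t disjoint counts to the index
// of the maximum, writing one candidate per block into block_best.
// Launch example: block_argmax<<<blocks, t, 2 * t * sizeof(int)>>>(counts, n, block_best);
__global__ void block_argmax(const int *counts, int n, int *block_best) {
    extern __shared__ int sh[];          // sh[0..t) holds values, sh[t..2t) holds indices
    int t = blockDim.x;
    int *val = sh, *idx = sh + t;

    int base = blockIdx.x * 2 * t + threadIdx.x;
    int i0 = base, i1 = base + t;
    int v0 = (i0 < n) ? counts[i0] : -1;
    int v1 = (i1 < n) ? counts[i1] : -1;

    // First comparison happens while loading: each thread keeps the better of
    // its two assigned elements, halving the problem before the tree starts.
    val[threadIdx.x] = (v0 >= v1) ? v0 : v1;
    idx[threadIdx.x] = (v0 >= v1) ? i0 : i1;
    __syncthreads();

    for (int stride = t / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride && val[threadIdx.x + stride] > val[threadIdx.x]) {
            val[threadIdx.x] = val[threadIdx.x + stride];
            idx[threadIdx.x] = idx[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0)
        block_best[blockIdx.x] = idx[0];  // index of the best cube in this window
}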
It is important to note that this dissertation parallelizes the EXPAND step such that its output is exactly the same as the output of ESPRESSO-II's EXPAND step. This facilitates debugging and end-to-end testing.
4.4 Divide & Conquer-based TLM
One of the major shortcomings of existing TLM heuristics is that they do not scale well for designs where the on-set or the off-set has hundreds of thousands or millions of terms. To ameliorate the scalability issue, this section presents a technique based on divide and conquer that is comprised of three steps. The presented technique first employs a decision tree to partition the large on-set and off-set into small enough on-sets and off-sets which can be easily dealt with in a reasonable time (each leaf node in the decision tree will have manageable on-set and off-set sizes). Next, it applies ESPRESSO-GPU or one of the other presented techniques to optimize each leaf node and, finally, combines the optimized leaf nodes at the root of the tree and runs ESPRESSO-GPU on them to further optimize the design.

This section first describes how the decision tree is constructed. Next, it details the optimizations performed on leaf nodes and finally explains how the optimized leaf nodes are merged and optimized one last time.
4.4.1 Decision Tree Construction
The construction of the decision tree starts by taking the on-set and off-set that describe an ISF as inputs and merging them into a single training set where each sample in this training set is assigned a label of one or zero based on whether it belongs to the on-set or the off-set. Using the training set, the decision tree is structured in such a way that, at each of its nodes, it will have a discriminative (splitting) variable such that when the Boolean function associated with the node is co-factored with respect to the positive and negative polarities of the said splitting variable, an impurity measure for the co-factored left child c_l and right child c_r nodes is minimized.
Choosing the right splitting variable is achieved by ranking variables according to an information-theoretic measure called the Gini gain, which measures the weighted change in the Gini impurity of the child nodes and is defined as follows:

g_G(x) = -\sum_{l \in \{c_l, c_r\}} w_l(x)\, I_G^l(x),

where w_l(x) denotes the number of samples in child node l divided by the number of samples in the parent node while I_G^l(x) denotes the Gini impurity of child l. The Gini impurity is defined as follows:

I_G(x) = \sum_{i \in \{0,1\}} p(x_i)\,(1 - p(x_i)),

where p(x_i) denotes the probability of a sample being i ∈ {0, 1}. The variable chosen is the one that results in the maximum Gini gain among all possible splitting variables. Note that the Gini impurity reaches its minimum value of zero when a child node is pure (meaning that all minterms for the child belong to one class only).
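The following C++ helper functions (illustrative, not the dissertation's implementation) spell out these two quantities; because the parent's impurity is the same for every candidate split, maximizing the Gini gain is equivalent to minimizing the weighted impurity of the two children computed here.

// Counts of 0- and 1-labeled samples reaching a node.
struct Counts { double n0, n1; };

// Gini impurity of a node: zero when the node is pure.
double gini_impurity(Counts c) {
    double n = c.n0 + c.n1;
    if (n == 0.0) return 0.0;
    double p0 = c.n0 / n, p1 = c.n1 / n;
    return p0 * (1.0 - p0) + p1 * (1.0 - p1);
}

// Weighted impurity of a candidate split's two children; the best splitting
// variable minimizes this quantity (equivalently, maximizes the Gini gain).
double weighted_child_impurity(Counts left, Counts right) {
    double n = left.n0 + left.n1 + right.n0 + right.n1;
    double wl = (left.n0 + left.n1) / n;
    double wr = (right.n0 + right.n1) / n;
    return wl * gini_impurity(left) + wr * gini_impurity(right);
}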
Splitting nodes continues until a leaf becomes pure or splitting would result in fewer than t_SL training samples in at least one of its branches (i.e., t_SL determines the threshold for the number of training samples per leaf). t_SL is a hyperparameter that needs to be set prior to the construction of the decision tree.
Figure 4.7 illustrates an example where seven on-set minterms and 19 off-set minterms representing a five-input ISF are merged to create training samples for the construction of the decision tree. In this example, t_SL is set to three. The final tree closely resembles a free binary decision diagram.
Figure 4.7. A decision tree that partitions seven on-set and 19 off-set terms into five leaf nodes when t_SL is set to three. The tuple inside each node reflects the number of off-set and on-set terms in that node while the variables and values on the edges determine the splitting variables and their values in each branch.

Notice that the support set of terms included in each leaf node comprises only a subset of all input variables. More specifically, variables that are used as splitting variables during the construction of the decision tree and are on the path from the root of the tree to a leaf node must be removed from the support set of that leaf node, resulting in a co-factored training sample set for the leaf node.
After employing a decision tree for partitioning the training samples, the presented approach converts the co-factored training samples of each leaf node to an on-set and an off-set for the leaf node based on the corresponding labels of the individual samples. It then applies one of the following exact or approximate minimization techniques to simplify the on-set and the off-set that define each leaf node. Among the different optimization techniques defined for leaf nodes, ESPRESSO-GPU is the only exact technique while the remaining techniques are all approximate. The approximate heuristics presented in this dissertation are particularly beneficial in applications that are inherently tolerant to noise and approximation, e.g., image processing and DNN computations.
4.4.2 ESPRESSO-GPU on Leaf Nodes
Employing a decision tree not only reduces the sizes of the on-set and off-set in each leaf node (in terms of the sample count for each set), it also reduces the cardinality of the variable support of each set. This results in an important efficiency gain because the run time of ESPRESSO-GPU is a function of both the minterm count and the variable support cardinality of the on-set and off-set. As a result, it enables the application of ESPRESSO-GPU to each leaf node to find a cover for its ISF in a reasonable amount of time. Evidently, although each leaf node is initially an ISF, the leaf node post-ESPRESSO-GPU optimization will be a completely specified function.
4.4.3 Dominant Label Assignment
This approximate optimization approach replaces each leaf node with its dominant label. The degree of error introduced in each leaf node due to approximation is a function of the distribution of labeled samples in the leaf node: the less impure the leaf node is (i.e., the more dominant any one label is), the more accurate the approximation becomes.

This technique has two major drawbacks. First, it is difficult to predict the error in each leaf node as a function of t_SL. Second, the overall approximation error cannot be calculated as a function of t_SL or the error in each leaf node. Despite these shortcomings, this technique can greatly optimize a given Boolean function realization, resulting in a hardware cost that is substantially lower than other techniques, especially for large values of t_SL.
4.4.4 SVM-based Sample Selection
The optimization techniques presented so far perform an aggressive local optimization in each leaf node followed by a final round of optimization using ESPRESSO-GPU on the collection of optimized leaf nodes. While the final round of optimization typically yields some hardware cost savings, the savings tend to be small due to the leaf nodes already being highly optimized. The objective of the SVM-based sample selection approach is to initially perform a less aggressive optimization on leaf nodes so that the final round of optimization, which has a global view of the problem, can do a more extensive optimization. Since most of the optimization is done in the latter stage of synthesis with a global view of all leaf nodes, the accuracy of the realized Boolean function is expected to be higher.
The SVM-based sample selection technique achieves this goal by training a support vector machine in each leaf node and picking a subset of the leaf node's samples (a.k.a. support vectors) as a condensed representation of the leaf node. The SVM implicitly maps the samples that belong to the leaf node to a higher-dimensional space and constructs a hyperplane that segregates 0- and 1-labeled samples such that the distance between the hyperplane and the closest sample to it is maximized (this distance is known as the margin). It is well known that the equation describing the hyperplane is dominated by nearby samples compared to distant ones (see Section 7.1.5 of [90]) and, in the limit, the hyperplane becomes independent of the distant samples. The samples that determine the hyperplane are referred to as support vectors.
The presented technique proceeds by constructing a new on-set and off-set in each leaf based on its support vectors and marks the remaining samples, which are far from the decision boundary (i.e., the hyperplane), as don't-cares. In other words, the 1-labeled
support vectors constitute the new on-set while the 0-labeled support vectors constitute the off-set. One rationale for this approximation is that, during the EXPAND step of the final call to ESPRESSO-GPU at the root of the tree, samples close to the decision boundary determine the direction in which a product term can be expanded, while samples farther away have a negligible impact on the choice of that direction.
As a result of this approximation, each leaf node is represented with an ISF both before and after the optimization. The difference, however, is that the ISF found after the SVM-based sample selection will have a smaller on-set and off-set and a larger don't-care-set. Shrinking the sizes of the on-set and off-set enables running ESPRESSO-GPU on the collection of optimized leaf nodes in a reasonable amount of time.
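The sketch below illustrates this selection on a single leaf. scikit-learn's SVC is used purely for illustration (the dissertation's implementation relies on ThunderSVM, which exposes a similar interface), and the kernel and regularization values are placeholders.

# Minimal sketch of SVM-based sample selection on one leaf: keep only the
# support vectors as the condensed on-set/off-set; everything else becomes a
# don't-care.
import numpy as np
from sklearn.svm import SVC

def condense_leaf(on_set, off_set, C=1.0, kernel="rbf"):
    if len(on_set) == 0 or len(off_set) == 0:   # pure leaf: nothing to condense
        return on_set, off_set
    X = np.vstack([on_set, off_set])
    y = np.concatenate([np.ones(len(on_set)), np.zeros(len(off_set))])
    svm = SVC(C=C, kernel=kernel).fit(X, y)
    sv_X, sv_y = X[svm.support_], y[svm.support_]   # samples near the boundary
    new_on_set = sv_X[sv_y == 1]                # 1-labeled support vectors
    new_off_set = sv_X[sv_y == 0]               # 0-labeled support vectors
    return new_on_set, new_off_set              # dropped samples become don't-cares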
4.4.5 Error-budget-driven Leaf Node Elimination
This approximate optimization technique combines a top-down and a bottom-up procedure to ensure that an error budget is not exceeded after the approximation. It takes an on-set, an off-set, and an error budget as its inputs and, similar to the previous techniques, utilizes a decision tree technique to partition the on-set and the off-set in its top-down phase. However, in contrast to previous approaches, it only terminates recursion when a leaf node is pure and no longer uses t_SL for early termination. In this case, replacing each leaf node with its dominant label as explained in Section 4.4.3 does not introduce any error. The bottom-up phase seeks to leverage the error budget to simplify the tree by eliminating different pairs of leaf nodes that share a parent and replacing the parent, which is certainly not pure, with its dominant label. The details of the bottom-up phase are described next.
The bottom-up phase starts with calculating the error incurred if pairs of leaf nodes that share a parent were to be eliminated from the tree and their parent were to be replaced with its dominant label∗. For example, assume a pure leaf node with six 1-labeled samples and its pure sibling with two 0-labeled samples. Replacing any of these leaf nodes with its dominant label does not introduce any error because the leaf nodes are pure. However, replacing their parent, which has six 1-labeled samples and two 0-labeled samples, with its dominant label (i.e., 1) will lead to an error of two out of eight samples. Therefore, the total elimination error for this pair of sibling leaf nodes is equal to 2 − (0 + 0) = 2.
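The elimination-error computation from this example can be written as a small helper, where each node is summarized by its label counts (the function names are illustrative):

# Minimal sketch of the elimination-error computation: the parent's
# majority-label error minus the errors already present in its two children.
def leaf_error(n_on, n_off):
    return min(n_on, n_off)                      # majority-label error of a node

def elimination_error(left, right):
    """left/right: (n_on, n_off) label counts of two sibling leaves."""
    parent = (left[0] + right[0], left[1] + right[1])
    return leaf_error(*parent) - (leaf_error(*left) + leaf_error(*right))

# Example from the text: a pure leaf with six 1-labeled samples and a pure
# sibling with two 0-labeled samples.
assert elimination_error((6, 0), (0, 2)) == 2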
The presented technique then sorts pairs of sibling leaf nodes based on their calculated elimination error, from the smallest to the largest. If the smallest error is lower than the remaining error budget, it eliminates the corresponding sibling leaf nodes and reduces the error budget to reflect the change†. It then marks their parent node as a new leaf node, which can possibly be eliminated if its sibling is also a leaf node. In that case, the proposed technique calculates their corresponding elimination error and places them in the right position in the sorted list of leaf nodes and errors. It then repeats this process until the smallest elimination error is larger than the remaining error budget or all nodes except the root are eliminated.
At that point, it replaces all leaf nodes with their dominant labels. Such an optimization strategy ensures the decision tree never has an in-sample error higher than what is imposed by the error budget. Algorithm 4.5 summarizes the different steps of the error-budget-driven leaf node elimination technique (note that the algorithm uses a min-heap instead of a sorted list to keep track of the sibling leaf nodes with the smallest elimination error).
∗ No parent node will be replaced with its dominant label until the elimination phase is completed.
† If the error budget is provided as a percentage, the percentage is multiplied by the number of input samples to find the absolute error budget.
Algorithm 4.5. Error-budget-driven leaf node elimination
Input:
  T  // the decision tree that partitions an on-set and off-set
  b  // the error budget
Output:
  T_opt  // the optimized tree
 1: h = new_min_heap()
 2: for each pair of sibling leaf nodes l_i, sibling(l_i) do
 3:   e_i = calculate_elim_error(l_i, sibling(l_i))
 4:   h.insert(key = e_i, value = (l_i, sibling(l_i)))
 5: end for
 6: while h is not empty do
 7:   e_i, l_i, sibling(l_i) = h.extract_min()
 8:   if e_i ≤ b then
 9:     l_p = parent(l_i)
10:     mark l_p as a leaf node
11:     if sibling(l_p) is a leaf node then
12:       e_p = calculate_elim_error(l_p, sibling(l_p))
13:       h.insert(key = e_p, value = (l_p, sibling(l_p)))
14:     end if
15:     b = b − e_i
16:     T.remove(sibling(l_i))
17:     T.remove(l_i)
18:   else
19:     break
20:   end if
21: end while
22: for each leaf node l_i in T do
23:   Approximate l_i with its dominant label
24: end for
25: T_opt = T
26: return T_opt
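As an illustration of the heap mechanics, the following is a minimal Python rendering of the bottom-up phase of Algorithm 4.5. It is a sketch, not the dissertation's implementation: nodes are assumed to expose .left, .right, .parent, .is_leaf, and a root.leaves() iterator, and elim_error is the pairwise elimination-error function sketched earlier.

# Minimal sketch of the bottom-up phase using a min-heap (heapq).
import heapq

def eliminate_leaves(root, budget, elim_error):
    heap, tick = [], 0                           # tick breaks ties between equal keys

    def sibling(node):
        return node.parent.right if node is node.parent.left else node.parent.left

    def push_pair(leaf):
        nonlocal tick
        sib = sibling(leaf)
        if sib is not None and sib.is_leaf:
            heapq.heappush(heap, (elim_error(leaf, sib), tick, leaf, sib))
            tick += 1

    for leaf in root.leaves():                   # seed with all sibling leaf pairs
        if leaf.parent is not None and leaf is leaf.parent.left:
            push_pair(leaf)                      # push each pair only once

    while heap:
        e, _, leaf, sib = heapq.heappop(heap)
        if e > budget:                           # cheapest elimination exceeds budget
            break
        budget -= e
        parent = leaf.parent
        parent.left = parent.right = None        # remove the two sibling leaves
        parent.is_leaf = True                    # their parent becomes a leaf
        if parent.parent is not None:
            push_pair(parent)                    # it may be eliminated in a later step
    return root                                  # remaining leaves are then replaced
                                                 # by their dominant labels (Sec. 4.4.3)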
4.4.6 ESPRESSO-GPU on a Collection of Optimized Leaf Nodes
After optimizing all leaf nodes with one of the aforementioned algorithms, the last step is to combine the optimized leaf nodes properly and run a final round of ESPRESSO-GPU on the combined leaf nodes. The details of how this combination is done are explained next.
Regardless of whether the optimized leaf nodes are completely specified or incompletely specified functions, they are represented with an on-set and an off-set. The proposed technique takes the on-set and the off-set of each leaf node and inserts the proper values of the variables that were used for splitting during the construction of the decision tree into their product terms. In other words, it walks up from each leaf node to the root and inserts the values of the encountered splitting variables into the product terms of both the on-set and off-set representing that leaf node. This ensures the support set of each optimized on-set/off-set includes all input variables and not just the ones that were used during the optimization of each leaf node. Then, it safely combines these newly created on-sets and off-sets by taking the union of the individual on-sets and the individual off-sets to create one large on-set and one large off-set, which are optimized by ESPRESSO-GPU. Note that the combination of the individual on-sets and off-sets that represent the leaf nodes does not introduce any intersection between the large on-set and off-set.
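A minimal sketch of this lifting step is shown below, assuming cubes are represented as strings over {'0', '1', '-'} and the root-to-leaf path is summarized as a dictionary from global variable indices to the literal values fixed by the splits. The helper name and representation are illustrative.

# Minimal sketch of the merging step: lift each leaf's cubes back to the full
# variable space by inserting the splitting-variable values fixed on the path.
def lift_cube(local_cube, local_vars, path_assignment, n_vars):
    """local_cube: cube over the leaf's local support (e.g., '10');
    local_vars: global indices of the leaf's support variables;
    path_assignment: {global_var_index: '0' or '1'} fixed by the tree path."""
    full = ['-'] * n_vars                        # unmentioned variables: don't-care
    for var, lit in zip(local_vars, local_cube):
        full[var] = lit
    for var, lit in path_assignment.items():
        full[var] = lit                          # value forced by the split
    return ''.join(full)

# Example: a 6-variable function, a leaf whose local support is variables
# {1, 4} and whose path fixed x0 = 1 and x3 = 0.
print(lift_cube('10', [1, 4], {0: '1', 3: '0'}, 6))   # prints '11-00-'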
4.5 Results & Discussion

4.5.1 ESPRESSO-GPU
This section explains how the hyperparameters of ESPRESSO-GPU's kernels are set to yield the lowest execution time and details the speedup of ESPRESSO-GPU over ESPRESSO-II for ISFs of different sizes.
We evaluate the efficacy of ESPRESSO-GPU on 30 ISFs which are equally divided into three classes: a small class where each ISF has 10,000 minterms, a medium class where each ISF has 60,000 minterms, and a large class where each ISF has 100,000 minterms. Figure 4.8 compares the average time per call to the distance calculation kernel for different values of tile width and thread reuse factor. We observe that the proper value of the TRF balances between amortization of the cost of launching threads and the degree of parallelization. If a thread is reused too many times, parallelization will be hampered. In our experiments, we find a TRF value of 16 to yield the lowest execution time. We also observe that increasing the tile width above eight typically leads to an increase in the execution time. Therefore, we set the value of TRF to 16 and the value of tile width to eight for later experiments. Please note that we have run similar experiments for the other kernels of ESPRESSO-GPU to find the set of hyperparameters that minimizes their execution time.

Figure 4.8. Average time per call to the distance calculation kernel for different values of tile width and thread reuse factor (panels: Small, Medium, and Large ISF classes; curves: TRF = 4, 8, 16, 32).
Table 4.1 compares the mean and standard deviation of the execution time of the EXPAND step of ESPRESSO-II and ESPRESSO-GPU. We observe that as the sizes of the ISFs increase, the speedup of ESPRESSO-GPU over ESPRESSO-II becomes more pronounced. This is because the sheer number of computations in large ISFs leverages the massively parallel architecture of GPUs to a greater extent. With the gained speedup, ISFs that take more than a day to run a single EXPAND step on ESPRESSO-II can be given to ESPRESSO-GPU to generate the same output in 10 to 15 minutes.
Table 4.1. Comparison of the mean and standard deviation of the time taken by ESPRESSO-II and ESPRESSO-GPU to optimize ISFs of different sizes and the observed speedup.

ISF Class | ESPRESSO-II Runtime (s) | ESPRESSO-GPU Runtime (s) | Speedup
Small     | 92.7 ± 37.7             | 3.4 ± 1.07               | 25.7
Medium    | 8,177.8 ± 1,702.5       | 72.7 ± 27.6              | 126.4
Large     | 107,056 ± 21,512.03     | 757.6 ± 129.86           | 140.4
4.5.2 Divide & Conquer-based TLM
We evaluate the efficacy of the presented divide and conquer-based TLM heuristics by performing two sets of experiments. The first set of experiments performs different optimizations on 10 different ISFs, each of which includes 300,000 minterms. The minterms correspond to a sampled subset of activations of a filter in the VGG16 [2] architecture trained on the CIFAR-10 [86] dataset. For every ISF, we also generate a test set of 30,000 minterms from the same filter of VGG16 while making sure that no minterm in the test set is present in the original ISF of 300,000 minterms. This helps us in evaluating the generalization capabilities of the optimized ISF on unseen data. For this set of experiments, we report the average optimization time, compression factor (i.e., the number of input minterms divided by the number of cubes in the optimized design), accuracy on the minterms of the original ISF (a.k.a. in-sample accuracy), and accuracy on the test set minterms (a.k.a. out-of-sample accuracy). Please note that the optimization time includes the time it takes to partition the input minterms using a decision tree in addition to the optimization time of each leaf node and the final ESPRESSO-GPU call performed at the root.
The second set of experiments uses a subset of the presented techniques which have a lower optimization time to optimize an ISF with about 3,000,000 minterms. Note that there is no test set in this experiment since the minterms represent all encountered activations of a filter (and not a subset of them). Similar to the first experiment, we report the optimization time, compression factor, and accuracy.
We implement all of the divide and conquer-based TLM heuristics presented in this chapter in the Python programming language while making calls to existing libraries that facilitate the implementation of some of the presented techniques. For implementing decision trees, we extend the implementation available in scikit-learn [91], and for training support vector machines, we employ the ThunderSVM [92] library.
It is important to note that none of the existing two-level logic minimization techniques (including those reviewed in the related work chapter) were able to optimize such large sparse ISFs even after running for a few days. As a result, we only compare the results of the different techniques presented in this chapter.
Table 4.2 compares the average optimization time, compression factor, and accuracy for the first set of experiments and for different values of t_SL. This table only includes the results for techniques where t_SL is a hyperparameter that needs to be set prior to the optimization (i.e., ESPRESSO-GPU on leaf nodes, dominant label assignment, and SVM-based sample selection). Similarly, Table 4.3 compares the same figures of merit for the error-budget-driven leaf node elimination technique and for different error budgets.

Table 4.2. Comparison of the average optimization time, compression factor, and accuracy for ISFs with 300,000 minterms and for different values of t_SL (columns 4.4.2–4.4.4 refer to the optimization techniques of the corresponding sections).

t_SL   | Figure of Merit       | 4.4.2  | 4.4.3     | 4.4.4
10     | Optimization Time (s) | 4,256  | 31        | 4,220
       | Compression Factor    | 17.94  | 95.76     | 129.03
       | Accuracy (%)          | 100.00 | 90.58     | 93.55
       | Test Accuracy (%)     | 83.57  | 83.93     | 83.33
100    | Optimization Time (s) | 1,113  | 9         | 7,432
       | Compression Factor    | 17.00  | 1,088.93  | 99.97
       | Accuracy (%)          | 100.00 | 84.99     | 97.72
       | Test Accuracy (%)     | 84.77  | 83.67     | 84.79
1,000  | Optimization Time (s) | 758    | 6         | 3,668
       | Compression Factor    | 15.26  | 9,615.38  | 107.79
       | Accuracy (%)          | 100.00 | 81.31     | 95.00
       | Test Accuracy (%)     | 83.80  | 81.06     | 82.61
10,000 | Optimization Time (s) | 764    | 4         | 869
       | Compression Factor    | 14.90  | 63,829.79 | 156.24
       | Accuracy (%)          | 100.00 | 78.56     | 84.74
       | Test Accuracy (%)     | 83.10  | 76.43     | 76.00
Table 4.3. Comparison of the average optimization time, compression factor, and accuracy for Boolean functions with 300,000 minterms and for different error budgets.

Error Budget (%) | Figure of Merit       | 4.4.5
0.5              | Optimization Time (s) | 23
                 | Compression Factor    | 20.36
                 | Accuracy (%)          | 99.50
                 | Test Accuracy (%)     | 82.46
1.0              | Optimization Time (s) | 25
                 | Compression Factor    | 21.74
                 | Accuracy (%)          | 99.00
                 | Test Accuracy (%)     | 82.42
5.0              | Optimization Time (s) | 11
                 | Compression Factor    | 45.81
                 | Accuracy (%)          | 95.00
                 | Test Accuracy (%)     | 83.79
10.0             | Optimization Time (s) | 7
                 | Compression Factor    | 137.96
                 | Accuracy (%)          | 90.00
                 | Test Accuracy (%)     | 84.47

We make a few observations from the results presented in Table 4.2 and Table 4.3. First, the technique based on running ESPRESSO-GPU on leaf nodes (described in Section 4.4.2) can optimize ISFs that ESPRESSO-GPU fails to optimize. This is a direct consequence of the divide and conquer approach presented in this chapter. It is worth mentioning that if ESPRESSO-GPU were able to optimize the ISFs under study, it would have achieved a higher compression rate compared to 4.4.2. The reason is that ESPRESSO-GPU solves the optimization problem globally, while 4.4.2 solves a few local optimization problems followed by a global optimization on a part of the space specified by the solutions to the local problems.
Second, the dominant label assignment technique (described in Section 4.4.3) and the error-budget-driven leaf node elimination technique (described in Section 4.4.5) have considerably lower optimization times compared to the other techniques. This enables them to handle even larger ISFs with millions of minterms in reasonably short periods of time. Additionally, 4.4.3 can quickly provide a lower bound on accuracy for techniques that require setting the value of t_SL, and therefore, can be used as a preprocessing step for quickly exploring the hyperparameter space.
Third, the SVM-based sample selection technique (described in Section 4.4.4) achieves a high compression rate at relatively high accuracy. At about the same level of accuracy, 4.4.4 has a much better compression rate compared to 4.4.5 at the cost of substantially higher optimization time. This makes 4.4.4 a good fit for optimizing ISFs with a relatively smaller number of minterms, e.g., fewer than 500,000 minterms, while it makes 4.4.5 the preferred approach for optimizing ISFs with a larger number of minterms (> 500,000).
Fourth, we observe another interesting trade-off between 4.4.3 and 4.4.4 when t_SL = 100. The compression factor increases by 10 times in 4.4.3 at the cost of a 1% degradation in test accuracy. So, when generalization and cost of implementation are more important than the accuracy on the originally specified function, 4.4.3 could be used over 4.4.4.
Lastly, we note that as the error budget in 4.4.5 increases, the in-sample accuracy decreases as expected, but the out-of-sample accuracy increases. This shows that 4.4.5 overfits at very low values of the error budget and should be used carefully in applications where generalization is essential.
Table 4.4 shows the breakdown of the execution time of the different steps of optimization for the divide and conquer-based TLM heuristics (t_SL = 1,000 is used for each technique here). It is clear from the table that the majority of the time spent by each optimizer is in the last global optimization step, which runs ESPRESSO-GPU.
Table 4.5 compares different figures of merit for the error-budget-driven leaf node elimination technique and for different error budgets when optimizing an ISF with about 3,000,000 minterms. We observe that the presented technique can optimize such a large sparse ISF at a considerably low optimization time (i.e., less than six minutes) while it achieves high accuracy and compression factors.
Table 4.4. Breakdown of optimization time (in seconds) for the Partition, Local Optimization, Merging, and Global Optimization steps.

Technique | Partition | Local Opt. | Merging | Global Opt. | Total Time
4.4.2     | 70        | 369        | 32      | 288         | 758
4.4.3     | 5.60      | 0.00       | 0.04    | 0.36        | 6
4.4.4     | 6         | 11         | 40      | 3,611       | 3,668
4.4.5     | 5.6       | 0.4        | 2.9     | 9.2         | 18.1
Table 4.5. Comparison of the average optimization time, compression factor, and accuracy for Boolean functions with 3,000,000 minterms and for different error budgets.

Error Budget (%) | Figure of Merit       | 4.4.5
0.5              | Optimization Time (s) | 332
                 | Compression Factor    | 58.54
                 | Accuracy (%)          | 99.50
1.0              | Optimization Time (s) | 241
                 | Compression Factor    | 71.68
                 | Accuracy (%)          | 99.00
5.0              | Optimization Time (s) | 67
                 | Compression Factor    | 1,273.70
                 | Accuracy (%)          | 95.00
5 Results & Discussions

5.1 Experimental Setup

5.1.1 Datasets
This dissertation evaluates different aspects of the presented techniques on the following tasks: jet substructure classification (JSC), network intrusion detection (NID), handwritten digit recognition (MNIST), and color image classification (CIFAR-10). The details of the tasks are described below.
JSC: the JSC dataset [93] consists of 789,444 training samples and 197,362 test samples, where each sample has 16 input features and an output label corresponding to one of five possible classes. We use 197,362 samples of the training set for validation. Collisions in hadron colliders result in color-neutral hadrons formed by a combination of quarks and gluons. These are observed as collimated sprays of hadrons which are referred to as jets. Jet substructure classification is the task of finding interesting jets from large jet substructures.
NID: the NID task uses the UNSW-NB15 dataset [94] as preprocessed by [95]. The dataset consists of 25,767 samples, where each preprocessed sample has 593 binary features corresponding to 49 original features and an output label indicating whether the sample (i.e., the network packet) is malicious or not.
MNIST: the MNIST dataset of handwritten digits includes 60,000 samples for training and 10,000 samples for testing, where each sample is a 28 × 28 grayscale image. The last 10,000 samples of the training set are used as the validation set for model selection. The objective is to classify each image into one of ten classes, 0–9.
CIFAR-10: the CIFAR-10 dataset includes 50,000 samples for training and 10,000 samples for testing, where each sample is a 32 × 32 color image. The objective is to classify each image into one of ten classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck.
5.1.2 Training Strategy
We use the PyTorch machine learning library for training our neural networks. We train all models for 90 epochs while varying the learning rate according to a cosine scheduler [96] and with weight decay.
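A minimal PyTorch sketch of this recipe is shown below; the model, data loader, and the specific learning-rate and weight-decay values are placeholders rather than the exact settings used in the experiments.

# Minimal sketch: 90 epochs, cosine learning-rate schedule, and weight decay.
import torch

def train(model, train_loader, epochs=90, lr=0.1, weight_decay=1e-4):
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                                momentum=0.9, weight_decay=weight_decay)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for _ in range(epochs):
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()                         # anneal the learning rate per epoch
    return model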
5.2 Impact of Activation Quantization on Classification Accuracy

5.2.1 Batch Normalization
Table 5.1. Impact of batch normalization on the classification accuracy of a quantized neural network trained for the JSC task.

Activation Function | Number of Bits | Accuracy Without BN (%) | Accuracy With BN (%)
ReLU                | 32             | 74.27                   | 75.34
Sign                | 1              | 57.40                   | 59.13
PACT                | 1              | 56.00                   | 62.27
PHT                 | 1              | 57.13                   | 62.95
Table 5.1 shows the impact of batch normalization on the classification accuracy of an MLP trained for the JSC task (the MLP has the same architecture as the JSC-S model shown in Table 5.5). We observe that while using batch normalization leads to improvements in classification accuracy for both quantized and full-precision models, it has a higher impact on neural networks with binary activations. In particular, using batch normalization leads to about a six percent improvement in classification accuracy in models that employ the PACT or PHT activation function. Another interesting observation is that replacing the sign activation function with an activation function with a trainable clipping value or threshold can significantly improve the classification accuracy. We also observe that binary quantization of the JSC-S architecture hampers the classification accuracy so much that it may make the model infeasible for use in real-world applications. This necessitates support for multi-bit quantization of the model to achieve a classification accuracy close to that of the full-precision model.
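To make the role of the trainable clipping value concrete, the following is a minimal sketch of a k-bit PACT-style activation in PyTorch; it is an illustrative re-implementation (not the dissertation's exact code), with the initial alpha chosen arbitrarily and a straight-through estimator for the rounding step.

# Minimal sketch of a k-bit PACT-style activation with a trainable clip alpha.
import torch
import torch.nn as nn

class PACT(nn.Module):
    def __init__(self, bits=1, alpha=10.0):
        super().__init__()
        self.bits = bits
        self.alpha = nn.Parameter(torch.tensor(alpha))   # trainable clipping value

    def forward(self, x):
        # Clip to [0, alpha] in a form that is differentiable w.r.t. alpha.
        y = 0.5 * (x.abs() - (x - self.alpha).abs() + self.alpha)
        levels = 2 ** self.bits - 1
        scale = levels / self.alpha
        y_q = torch.round(y * scale) / scale     # quantize to 2^bits levels in [0, alpha]
        # Straight-through estimator: forward uses y_q, backward sees y.
        return y + (y_q - y).detach()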
5.2.2 Bit-width & Choice of Activation Function
Table 5.2 shows how increasing the number of bits to two or three can significantly improve the classification accuracy compared to the binary models shown in Table 5.1. Additionally, it shows how employing different activation functions for different layers of the neural network improves classification accuracy compared to a neural network that only uses one type of activation function. In particular, simply employing both the PACT and PHT activation functions improves the accuracy by about 2% compared to models that employ only one of the two functions.
Table 5.2. Impact of bit-width and activation function on classification accuracy for the JSC task.

Activation Function | Number of Bits | Classification Accuracy (%)
PACT                | 2              | 69.94
PHT                 | 2              | 68.32
PHT & PACT          | 2              | 71.10
PHT & PACT          | 3              | 72.73
5.3 Impact of Vestigial Layers on Classification Accuracy
Table 5.3 demonstrates how introducing a vestigial layer in an MLP trained for the MNIST dataset improves the classification accuracy by about 1%. This improvement in accuracy comes for free during inference because both the shallow and deep models have about the same resource utilization and latency. It is worth mentioning that the deeper model uses the ReLU activation function for the intermediate layers with 100 neurons to avoid any accuracy degradation due to activation quantization.
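One plausible reading of the "50, (100, 100), 50" model in Table 5.3 is sketched below in PyTorch; this is an illustration only, assuming flattened 28 × 28 inputs, the PACT module from the earlier sketch for the quantized layers, and ReLU on the 100-neuron vestigial layers.

# Illustrative definition of a "50, (100, 100), 50" MNIST MLP with vestigial
# 100-neuron ReLU layers between the 1-bit quantized 50-neuron layers.
import torch.nn as nn

def vestigial_mlp(n_in=784, n_classes=10):      # n_in assumes flattened 28x28 images
    return nn.Sequential(
        nn.Linear(n_in, 50), PACT(bits=1),
        nn.Linear(50, 100),  nn.ReLU(),          # vestigial layer
        nn.Linear(100, 100), nn.ReLU(),          # vestigial layer
        nn.Linear(100, 50),  PACT(bits=1),
        nn.Linear(50, n_classes),
    )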
Table 5.3. Impact of adding a vestigial layer on the classification accuracy of a neural network trained for the MNIST task.

Neurons per Layer  | Activation Function | Number of Bits | MAC-based Accuracy (%) | NullaNet Accuracy (%)
50, 50             | ReLU                | 32             | 98.09                  | –
50, 50             | PACT                | 1              | 96.76                  | 96.28
50, (100, 100), 50 | PACT/ReLU           | 1              | 97.41                  | 97.09
5.4 Impact of Context-aware Training Data Sampling on Classification Accuracy
Table 5.4 compares the classification accuracy of different vestigial neural networks trained for the MNIST dataset when context-aware training data sampling is employed during the optimization of the ISFs corresponding to the neurons of the neural network described in Table 5.3. We observe that classification accuracy decreases by only about 1.5% even if 5% of the training data is used for constructing the ISFs. Additionally, we observe that at a 20% sampling rate, the optimized vestigial neural network achieves about the same level of accuracy as the original binarized MLP with two hidden layers. Interestingly, this same level of accuracy is achieved not only with a shorter TLM time, thanks to the smaller ISFs, but also with about a four-times reduction in resource utilization.
Table 5.4. Impact of context-aware training data sampling on the classification accuracy of a neural network trained for the MNIST task.

Neurons per Layer  | Activation Function | Number of Bits | Sampling Rate (%) | MAC-based Accuracy (%) | NullaNet Accuracy (%)
50, 50             | ReLU                | 32             | –                 | 98.09                  | –
50, 50             | PACT                | 1              | 100               | 96.76                  | 96.28
50, (100, 100), 50 | PACT/ReLU           | 1              | 100               | 97.41                  | 97.09
50, (100, 100), 50 | PACT/ReLU           | 1              | 20                | 97.41                  | 96.39
50, (100, 100), 50 | PACT/ReLU           | 1              | 10                | 97.41                  | 95.81
50, (100, 100), 50 | PACT/ReLU           | 1              | 5                 | 97.41                  | 95.58
5.5 Processing DNN Layers with NullaNet

5.5.1 JSC
Table 5.5 compares the classification accuracy and hardware realization metrics of different MLPs trained for the JSC task. We observe that compared to the LogicNets paper [97], which is a derivative of NullaNet, this dissertation achieves at least a 1.5% improvement in classification accuracy, while it consumes about three to 10 times fewer LUTs and has a higher theoretically achievable clock frequency. We also observe that employing a deeper MLP and increasing the bit-width used for activation quantization in addition to the fan-in improves accuracy by about 1%. The classification accuracy of the deep MLP is very close to that of the full-precision model.
Table 5.5. Comparison between the hardware realization metrics of NullaNet with those of LogicNets on the JSC task.

Arch.  | Neurons per Layer    | q_a | r  | Accuracy (% Inc.) | Look-up Tables (Dec. ratio) | Flip Flops (Dec. ratio) | f_max (Inc. ratio)
JSC-S  | 64, 32, 32, 32       | 2   | 3  | 69.65% (+1.85%)   | 39 (5.50×)                  | 75 (3.30×)              | 2079 MHz (1.30×)
JSC-M  | 64, 32, 32, 32       | 3   | 4  | 72.33% (+1.73%)   | 1,553 (9.30×)               | 151 (2.90×)             | 841 MHz (1.40×)
JSC-L  | 32, 64, 192, 192, 16 | 3*  | 4* | 73.35% (+1.55%)   | 11,752 (3.20×)              | 565 (1.40×)             | 436 MHz (1.02×)
* The bit-widths of activations for the first and last layers are four and seven, respectively, and the last layer's fan-in is five.

5.5.2 NID
Table 5.6 compares the classification accuracy and hardware realization metrics of different MLPs trained for the NID task. We observe that compared to LogicNets, this dissertation achieves up to about a 10% improvement in classification accuracy, while it consumes up to about 120 times fewer LUTs and has up to three times higher theoretically achievable clock frequency.

Table 5.6. Comparison between the hardware realization metrics of NullaNet with those of LogicNets on the NID task.

Arch.  | Neurons per Layer    | q_a | r  | Accuracy (% Inc.) | Look-up Tables (Dec. ratio) | Flip Flops (Dec. ratio) | f_max (Inc. ratio)
NID-S  | 593, 100             | 2   | 7  | 93.14% (+9.26%)   | 95 (37.75×)                 | 153 (8.63×)             | 1560 MHz (1.92×)
NID-M  | 593, 256, 128, 128   | 2   | 7  | 93.43% (+2.13%)   | 671 (23.77×)                | 480 (2.65×)             | 1099 MHz (2.33×)
NID-L  | 593, 100, 100, 100   | 3†  | 5† | 93.28% (+4.60%)   | 205 (122.20×)               | 373 (3.81×)             | 1319 MHz (3.16×)
† The bit-width of activations for the first layer is two, and the fan-in of the first layer is seven.
5.5.3 CIFAR-10
Table 5.7 compares the different layers of the VGG16 CNN in terms of the size of the weight tensor, processing time, and impact on output quality. We find the impact on output quality by initially pruning 10% of the weights in all layers. We then train the CNN for a few epochs and use a gradient-based pruning technique [26] to eliminate some weights from layers with a low impact on classification accuracy and redistribute the eliminated weights to layers with a high impact on classification accuracy. The final sparsity values of the layers determine which layers are prone to approximation, and therefore, are good candidates for optimization and processing with NullaNet.
According to Table 5.7, layers eight to 13 have a large number of weights which need to be stored in permanent storage and read during inference, a high execution time when processed on an optimized systolic array of multiply-and-accumulators, and a lower impact on classification accuracy. As a result, these layers are the best candidates to be optimized and processed with NullaNet.
Table 5.8 compares the processing time of layers eight to 13 of VGG16 when using different implementations. We observe that NullaNet achieves about a 750 times improvement in processing time compared to an optimized DNN accelerator while it only reduces classification accuracy by less than 1%. Interestingly, NullaNet implements each filter of these layers by only consuming about 20 LUTs on average.
Table 5.7. Comparison of the different layers of the VGG16 architecture in terms of the size of the weight tensor, processing time, and impact on output quality (initial and final sparsity).

Layer Index | Number of Weights (thousands) | Processing Time (µs) | Initial Sparsity (%) | Final Sparsity (%)
1           | 1.73                          | 69                   | 10.00                | –
2           | 36.86                         | 31                   | 10.00                | 0.00
3           | 73.73                         | 41                   | 10.00                | 0.00
4           | 147.46                        | 64                   | 10.00                | 0.00
5           | 294.91                        | 107                  | 10.00                | 0.00
6           | 589.82                        | 207                  | 10.00                | 0.00
7           | 589.82                        | 204                  | 10.00                | 0.00
8           | 1,179.65                      | 384                  | 10.00                | 0.00
9           | 2,359.30                      | 772                  | 10.00                | 0.00
10          | 2,359.30                      | 769                  | 10.00                | 0.06
11          | 2,359.30                      | 753                  | 10.00                | 0.15
12          | 2,359.30                      | 760                  | 10.00                | 0.22
13          | 2,359.30                      | 756                  | 10.00                | 0.31
14          | 262.14                        | 86                   | 10.00                | 0.07
15          | 262.14                        | 86                   | 10.00                | 0.08
16          | 5.12                          | 6                    | 10.00                | –
Table 5.8. Comparison of the processing time of layers 8–13 of the VGG16 architecture using different implementations.

Layer Index | MAC-based Processing Time (µs) | NullaNet Processing Time (µs) | Speedup
8           | 384.0                          | 1.5                           | 256
9           | 772.0                          | 1.5                           | 515
10          | 769.0                          | 1.5                           | 513
11          | 753.0                          | 0.2                           | 3,765
12          | 760.0                          | 0.4                           | 1,900
13          | 756.0                          | 0.4                           | 1,890
8–13        | 4,194                          | 5.5                           | 763‡§
‡ Less than 1% drop in classification accuracy.
§ An average of 20 LUTs per filter.
6 Conclusions & Possible Research Directions
This dissertation presented NullaNet, a technique for the design, optimization, and processing of deep neural networks (DNNs) for applications with stringent latency and throughput requirements. NullaNet formulates low-latency, high-throughput processing of DNNs as a logic minimization problem where the arithmetic operations of a DNN are replaced with low-cost logic operations and the DNN's parameters are baked into the realized logic. Fan-in-constrained pruning, context-aware training data sampling, and million-scale two-level logic minimization are among the contributions of this dissertation that made the NullaNet technique possible.
NullaNet is capable of improving end-to-end inference latency compared to state-of-the-art DNN processors. However, because it employs a radically different approach for processing DNNs, it opens the door to a slew of research directions which can improve its scalability, classification accuracy, and latency, among other things.
One of the main challenges of applying NullaNet to more sophisticated DNNs trained on huge datasets is scalability. While fan-in-constrained pruning, context-aware training data sampling, ESPRESSO-GPU, and million-scale two-level logic minimization all make NullaNet more scalable, they may not be enough for optimizing DNNs, which continue to grow in size and complexity. Processing such DNNs requires more innovation across different domains, from training to logic minimization to hardware realization.
One of the characteristics of NullaNet is that it allocates as many resources as required for processing different filters and layers. This allows NullaNet to achieve record latency values while it typically consumes fewer computing and memory resources compared to other DNN processors. However, for platforms where computing or memory resources are scarce, such an implementation may be infeasible. Therefore, a possible research direction is to explore the design of a Boolean processor which is capable of processing Boolean operations while reusing the same resources across different time steps. This also requires the design of an instruction set architecture, a compiler for generating appropriate instructions, and additional circuitry for the execution of the generated instructions.
Another possible research direction is to explore the use of a hybrid computing fabric which is capable of processing not only layers optimized with NullaNet but also layers that require multiply-and-accumulate or XNOR operations. Such an architecture will probably be very efficient on devices such as FPGAs, which are comprised of look-up tables (LUTs) in addition to digital signal processors (DSPs). It allows balancing the usage of different resources by mapping layers optimized with NullaNet to LUTs and those that require multiply-and-accumulate operations to DSPs. Layers that require XNOR operations can be mapped to both LUTs and DSPs (many DSPs have support for bit-wise operations).
Another possible research direction, which was not discussed in this dissertation, is to study the resiliency of DNNs optimized with NullaNet to different types of machine learning attacks carried out by adversaries. This includes studying the impact of existing attacks on DNNs optimized with NullaNet in addition to finding new attacks which can be specifically designed to target DNNs employing custom combinational logic for their processing. Fortifying NullaNet against effective adversarial attacks through modifying the training loop or logic optimization would be a very challenging yet interesting problem.
Bibliography
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1106–1114. [Online]. Available: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks
[2] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations, 2015. [Online]. Available: http://arxiv.org/abs/1409.1556
[3] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2015, pp. 1–9. [Online]. Available: https://doi.org/10.1109/CVPR.2015.7298594
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2016, pp. 770–778. [Online]. Available: https://doi.org/10.1109/CVPR.2016.90
[5] S. Zagoruyko and N. Komodakis, "Wide residual networks," in British Machine Vision Conference. BMVA Press, 2016. [Online]. Available: http://www.bmva.org/bmvc/2016/papers/paper087/index.html
[6] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, "Densely connected convolutional networks," in Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2017, pp. 2261–2269. [Online]. Available: https://doi.org/10.1109/CVPR.2017.243
[7] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: https://doi.org/10.1162/neco.1997.9.8.1735
[8] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in International Conference on Learning Representations, 2015. [Online]. Available: http://arxiv.org/abs/1409.0473
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008. [Online]. Available: http://papers.nips.cc/paper/7181-attention-is-all-you-need
[10] J. Devlin, M. Chang, K. Lee, and K. Toutanova, "BERT: pre-training of deep bidirectional transformers for language understanding," in Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 2019, pp. 4171–4186. [Online]. Available: https://doi.org/10.18653/v1/n19-1423
[11] D. Steinkrau, P. Y. Simard, and I. Buck, "Using GPUs for machine learning algorithms," in International Conference on Document Analysis and Recognition. IEEE Computer Society, 2005, pp. 1115–1119. [Online]. Available: https://doi.org/10.1109/ICDAR.2005.251
[12] K. Chellapilla, S. Puri, and P. Simard, "High performance convolutional neural networks for document processing," 2006.
[13] R. Raina, A. Madhavan, and A. Y. Ng, "Large-scale deep unsupervised learning using graphics processors," in International Conference on Machine Learning, ser. ACM International Conference Proceeding Series, vol. 382. ACM, 2009, pp. 873–880. [Online]. Available: https://doi.org/10.1145/1553374.1553486
[14] D. C. Ciresan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2012, pp. 3642–3649. [Online]. Available: https://doi.org/10.1109/CVPR.2012.6248110
[15] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi, "XNOR-Net: ImageNet classification using binary convolutional neural networks," in European Conference on Computer Vision, ser. Lecture Notes in Computer Science, vol. 9908. Springer, 2016, pp. 525–542. [Online]. Available: https://doi.org/10.1007/978-3-319-46493-0_32
[16] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 4107–4115. [Online]. Available: http://papers.nips.cc/paper/6573-binarized-neural-networks
[17] S. Zhou, Z. Ni, X. Zhou, H. Wen, Y. Wu, and Y. Zou, "DoReFa-Net: Training low bit width convolutional neural networks with low bit width gradients," CoRR, vol. abs/1606.06160, 2016. [Online]. Available: http://arxiv.org/abs/1606.06160
[18] F. Li and B. Liu, "Ternary weight networks," CoRR, vol. abs/1605.04711, 2016. [Online]. Available: http://arxiv.org/abs/1605.04711
[19] A. Zhou, A. Yao, Y. Guo, L. Xu, and Y. Chen, "Incremental network quantization: Towards lossless CNNs with low-precision weights," in International Conference on Learning Representations. OpenReview.net, 2017. [Online]. Available: https://openreview.net/forum?id=HyQJ-mclg
[20] C. Zhu, S. Han, H. Mao, and W. J. Dally, "Trained ternary quantization," in International Conference on Learning Representations. OpenReview.net, 2017. [Online]. Available: https://openreview.net/forum?id=S1_pAu9xl
[21] A. K. Mishra, E. Nurvitadhi, J. J. Cook, and D. Marr, "WRPN: wide reduced-precision networks," in International Conference on Learning Representations. OpenReview.net, 2018. [Online]. Available: https://openreview.net/forum?id=B1ZvaaeAZ
[22] J. Choi, Z. Wang, S. Venkataramani, P. I. Chuang, V. Srinivasan, and K. Gopalakrishnan, "PACT: parameterized clipping activation for quantized neural networks," CoRR, vol. abs/1805.06085, 2018. [Online]. Available: http://arxiv.org/abs/1805.06085
[23] S. Han, J. Pool, J. Tran, and W. J. Dally, "Learning both weights and connections for efficient neural networks," CoRR, vol. abs/1506.02626, 2015. [Online]. Available: http://arxiv.org/abs/1506.02626
[24] M. Zhu and S. Gupta, "To prune, or not to prune: Exploring the efficacy of pruning for model compression," in International Conference on Learning Representations. OpenReview.net, 2018. [Online]. Available: https://openreview.net/forum?id=Sy1iIDkPM
[25] T. Zhang, S. Ye, K. Zhang, J. Tang, W. Wen, M. Fardad, and Y. Wang, "A systematic DNN weight pruning framework using alternating direction method of multipliers," in European Conference on Computer Vision, ser. Lecture Notes in Computer Science, vol. 11212. Springer, 2018, pp. 191–207. [Online]. Available: https://doi.org/10.1007/978-3-030-01237-3_12
[26] T. Dettmers and L. Zettlemoyer, "Sparse networks from scratch: Faster training without losing performance," CoRR, vol. abs/1907.04840, 2019. [Online]. Available: http://arxiv.org/abs/1907.04840
[27] X. Ding, G. Ding, X. Zhou, Y. Guo, J. Han, and J. Liu, "Global sparse momentum SGD for pruning very deep neural networks," in Advances in Neural Information Processing Systems, 2019, pp. 6379–6391. [Online]. Available: http://papers.nips.cc/paper/8867-global-sparse-momentum-sgd-for-pruning-very-deep-neural-networks
[28] G. E. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," CoRR, vol. abs/1503.02531, 2015. [Online]. Available: http://arxiv.org/abs/1503.02531
[29] A. K. Mishra and D. Marr, "Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy," in International Conference on Learning Representations. OpenReview.net, 2018. [Online]. Available: https://openreview.net/forum?id=B1ae1lZRb
[30] A. Polino, R. Pascanu, and D. Alistarh, "Model compression via distillation and quantization," in International Conference on Learning Representations. OpenReview.net, 2018. [Online]. Available: https://openreview.net/forum?id=S1XolQbRW
[31] L. Theis, I. Korshunova, A. Tejani, and F. Huszár, "Faster gaze prediction with dense networks and fisher pruning," CoRR, vol. abs/1801.05787, 2018. [Online]. Available: http://arxiv.org/abs/1801.05787
[32] N. Rotem, J. Fix, S. Abdulrasool, S. Deng, R. Dzhabarov, J. Hegeman, R. Levenstein, B. Maher, N. Satish, J. Olesen, J. Park, A. Rakhov, and M. Smelyanskiy, "Glow: Graph lowering compiler techniques for neural networks," CoRR, vol. abs/1805.00907, 2018. [Online]. Available: http://arxiv.org/abs/1805.00907
[33] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Q. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and A. Krishnamurthy, "TVM: an automated end-to-end optimizing compiler for deep learning," in USENIX Symposium on Operating Systems Design and Implementation. USENIX Association, 2018, pp. 578–594. [Online]. Available: https://www.usenix.org/conference/osdi18/presentation/chen
[34] H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, and H. Esmaeilzadeh, "From high-level deep neural models to FPGAs," in International Symposium on Microarchitecture. IEEE Computer Society, 2016, pp. 17:1–17:12. [Online]. Available: https://doi.org/10.1109/MICRO.2016.7783720
[35] S. I. Venieris and C. Bouganis, "fpgaConvNet: Mapping regular and irregular convolutional neural networks on FPGAs," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 2, pp. 326–342, 2019. [Online]. Available: https://doi.org/10.1109/TNNLS.2018.2844093
[36] V. Gokhale, J. Jin, A. Dundar, B. Martini, and E. Culurciello, "A 240 G-ops/s mobile coprocessor for deep neural networks," in Conference on Computer Vision and Pattern Recognition. IEEE Computer Society, 2014, pp. 696–701. [Online]. Available: https://doi.org/10.1109/CVPRW.2014.106
[37] Z. Du, R. Fasthuber, T. Chen, P. Ienne, L. Li, T. Luo, X. Feng, Y. Chen, and O. Temam, "ShiDianNao: shifting vision processing closer to the sensor," in International Symposium on Computer Architecture. ACM, 2015, pp. 92–104. [Online]. Available: https://doi.org/10.1145/2749469.2750389
[38] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, "Optimizing FPGA-based accelerator design for deep convolutional neural networks," in International Symposium on Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170. [Online]. Available: https://doi.org/10.1145/2684746.2689060
[39] Y. Chen, J. S. Emer, and V. Sze, "Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks," in International Symposium on Computer Architecture. IEEE Computer Society, 2016, pp. 367–379. [Online]. Available: https://doi.org/10.1109/ISCA.2016.40
[40] X. Yang, M. Gao, J. Pu, A. Nayak, Q. Liu, S. Bell, J. Setter, K. Cao, H. Ha, C. Kozyrakis, and M. Horowitz, "DNN dataflow choice is overrated," CoRR, vol. abs/1809.04070, 2018. [Online]. Available: http://arxiv.org/abs/1809.04070
[41] D. Kim, J. Kung, S. M. Chai, S. Yalamanchili, and S. Mukhopadhyay, "Neurocube: A programmable digital neuromorphic architecture with high-density 3D memory," in International Symposium on Computer Architecture. IEEE Computer Society, 2016, pp. 380–392. [Online]. Available: https://doi.org/10.1109/ISCA.2016.41
[42] M. Gao, J. Pu, X. Yang, M. Horowitz, and C. Kozyrakis, "TETRIS: scalable and efficient neural network acceleration with 3D memory," in International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2017, pp. 751–764. [Online]. Available: https://doi.org/10.1145/3037697.3037702
[43] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. Srikumar, "ISAAC: A convolutional neural network accelerator with in-situ analog arithmetic in crossbars," in International Symposium on Computer Architecture. IEEE Computer Society, 2016, pp. 14–26. [Online]. Available: https://doi.org/10.1109/ISCA.2016.12
[44] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, "PRIME: A novel processing-in-memory architecture for neural network computation in ReRAM-based main memory," in International Symposium on Computer Architecture. IEEE Computer Society, 2016, pp. 27–39. [Online]. Available: https://doi.org/10.1109/ISCA.2016.13
[45] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," The Bulletin of Mathematical Biophysics, vol. 5, no. 4, pp. 115–133, 1943.
[46] F. Rosenblatt, "The perceptron: a probabilistic model for information storage and organization in the brain," Psychological Review, vol. 65, no. 6, p. 386, 1958.
[47] R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung, "Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit," Nature, vol. 405, no. 6789, pp. 947–951, 2000.
[48] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, ser. JMLR Workshop and Conference Proceedings, vol. 37. JMLR.org, 2015, pp. 448–456. [Online]. Available: http://proceedings.mlr.press/v37/ioffe15.html
[49] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015. [Online]. Available: https://doi.org/10.1007/s11263-015-0816-y
[50] D. Soudry, I. Hubara, and R. Meir, "Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights," in Advances in Neural Information Processing Systems, 2014, pp. 963–971. [Online]. Available: http://papers.nips.cc/paper/5269-expectation-backpropagation-parameter-free-training-of-multilayer-neural-networks-with-continuous-or-discrete-weights
[51] Z. Cheng, D. Soudry, Z. Mao, and Z. Lan, "Training binary multilayer neural networks for image classification using expectation backpropagation," CoRR, vol. abs/1503.03562, 2015. [Online]. Available: http://arxiv.org/abs/1503.03562
[52] K. Hwang and W. Sung, "Fixed-point feedforward deep neural network design using weights +1, 0, and -1," in Workshop on Signal Processing Systems. IEEE, 2014, pp. 174–179. [Online]. Available: https://doi.org/10.1109/SiPS.2014.6986082
[53] J. Kim, K. Hwang, and W. Sung, "X1000 real-time phoneme recognition VLSI using feed-forward deep neural networks," in International Conference on Acoustics, Speech and Signal Processing. IEEE, 2014, pp. 7510–7514. [Online]. Available: https://doi.org/10.1109/ICASSP.2014.6855060
[54] M. Courbariaux, Y. Bengio, and J. David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Advances in Neural Information Processing Systems, 2015, pp. 3123–3131. [Online]. Available: http://papers.nips.cc/paper/5647-binaryconnect-training-deep-neural-networks-with-binary-weights-during-propagations
[55] Z. Liu, B. Wu, W. Luo, X. Yang, W. Liu, and K. Cheng, "Bi-Real Net: Enhancing the performance of 1-bit CNNs with improved representational capability and advanced training algorithm," in European Conference on Computer Vision, ser. Lecture Notes in Computer Science, vol. 11219. Springer, 2018, pp. 747–763. [Online]. Available: https://doi.org/10.1007/978-3-030-01267-0_44
[56] Y. Bengio, N. Léonard, and A. C. Courville, "Estimating or propagating gradients through stochastic neurons for conditional computation," CoRR, vol. abs/1308.3432, 2013. [Online]. Available: http://arxiv.org/abs/1308.3432
[57] P. Ramachandran, B. Zoph, and Q. V. Le, "Searching for activation functions," in International Conference on Learning Representations. OpenReview.net, 2018. [Online]. Available: https://openreview.net/forum?id=Hkuq2EkPf
[58] S. Darabi, M. Belbahri, M. Courbariaux, and V. P. Nia, "BNN+: improved binary network training," CoRR, vol. abs/1812.11800, 2018. [Online]. Available: http://arxiv.org/abs/1812.11800
[59] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in International Conference on Computer Vision. IEEE Computer Society, 2015, pp. 1026–1034. [Online]. Available: https://doi.org/10.1109/ICCV.2015.123
[60] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in International Conference on Artificial Intelligence and Statistics, ser. JMLR Proceedings, vol. 9. JMLR.org, 2010, pp. 249–256. [Online]. Available: http://proceedings.mlr.press/v9/glorot10a.html
[61] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," in Conference on Artificial Intelligence. AAAI Press, 2017, pp. 4278–4284. [Online]. Available: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14806
[62] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, "Pruning filters for efficient convnets," in International Conference on Learning Representations. OpenReview.net, 2017. [Online]. Available: https://openreview.net/forum?id=rJqFGTslg
[63] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, "Learning structured sparsity in deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082. [Online]. Available: http://papers.nips.cc/paper/6504-learning-structured-sparsity-in-deep-neural-networks
[64] Y. He, X. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," in International Conference on Computer Vision. IEEE Computer Society, 2017, pp. 1398–1406. [Online]. Available: https://doi.org/10.1109/ICCV.2017.155
[65] S. P. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011. [Online]. Available: https://doi.org/10.1561/2200000016
[66] D. W. Blalock, J. J. G. Ortiz, J. Frankle, and J. V. Guttag, "What is the state of neural network pruning?" in Machine Learning and Systems. mlsys.org, 2020. [Online]. Available: https://proceedings.mlsys.org/book/296.pdf
[67] R. K. Brayton, G. D. Hachtel, C. T. McMullen, and A. L. Sangiovanni-Vincentelli, Logic Minimization Algorithms for VLSI Synthesis, ser. The Kluwer International Series in Engineering and Computer Science. Springer, 1984, vol. 2. [Online]. Available: https://doi.org/10.1007/978-1-4613-2821-6
[68] Wikipedia contributors, "Logic optimization — Wikipedia, the free encyclopedia," https://en.wikipedia.org/w/index.php?title=Logic_optimization&oldid=962421898, 2020, [Online; accessed 15-July-2020].
[69] R. L. Rudell, "Logic synthesis for VLSI design," Ph.D. dissertation, EECS Department, University of California, Berkeley, April 1989. [Online]. Available: http://www2.eecs.berkeley.edu/Pubs/TechRpts/1989/1223.html
[70] S. Sapra, M. Theobald, and E. M. Clarke, "SAT-based algorithms for logic minimization," in International Conference on Computer Design. IEEE Computer Society, 2003, p. 510. [Online]. Available: https://doi.org/10.1109/ICCD.2003.1240948
[71] J. Hlavicka and P. Fišer, "BOOM - A heuristic Boolean minimizer," in International Conference on Computer-Aided Design. IEEE Computer Society, 2001, pp. 439–442. [Online]. Available: https://doi.org/10.1109/ICCAD.2001.968667
[72] P. Fišer and H. Kubátová, "Flexible two-level Boolean minimizer BOOM-II and its applications," in Euromicro Conference on Digital System Design: Architectures, Methods and Tools. IEEE Computer Society, 2006, pp. 369–376. [Online]. Available: https://doi.org/10.1109/DSD.2006.53
[73] D. Toman and P. Fišer, "A SOP minimizer for logic functions described by many product terms based on ternary trees," in International Workshop on Boolean Problems, 2010.
[74] R. Brayton and A. Mishchenko, "Recursive decomposition of sparse incompletely-specified functions."
[75] J.-H. R. Jiang and S. Devadas, "Chapter 6 - logic synthesis in a nutshell," in Electronic Design Automation. Morgan Kaufmann, 2009, pp. 299–404. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780123743640500138
[76] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. K. Brayton, and A. Sangiovanni-Vincentelli, "SIS: A system for sequential circuit synthesis," 1992.
[77] M. Gao, J.-H. Jiang, Y. Jiang, Y. Li, S. Sinha, and R. Brayton, "MVSIS," in Proc. of the Intl. Workshop on Logic Synthesis, 2001.
[78] R. K. Brayton and A. Mishchenko, "ABC: an academic industrial-strength verification tool," in International Conference on Computer Aided Verification, ser. Lecture Notes in Computer Science, vol. 6174. Springer, 2010, pp. 24–40. [Online]. Available: https://doi.org/10.1007/978-3-642-14295-6_5
[79] A. Mishc henk o, S. Chatterjee, R. Jiang, and R. K. Bra yton, “FRAIGs: A
unifying represen tation for logic syn thesis and v erification,” ERL T ec hnical
Rep ort, T ec h. Rep., 2005.
[80] A. Mishc henk o, S. Chatterjee, and R. K. Bra yton, “D A G-a w are AIG
rewriting a fresh lo ok at com binational logic syn thesis,” in Design
A utomation Confer enc e . A CM, 2006, pp. 532–535. [Online]. A v ailable:
https://doi. org/10. 1145/1146909. 1147048
[81] R. K. Bra yton, “The decomp osition and factorization of b o olean expressions,”
in International Symp osium on Cir cuits and Systems , 1982.
[82] R. Gong, X. Liu, S. Jiang, T. Li, P . Hu, J. Lin, F. Y u, and J. Y an,
“Differen tiable soft quan tization: Bridging full-precision and lo w-bit neural
net w orks,” in International Confer enc e on Computer V ision . IEEE, 2019, pp.
4851–4860. [Online]. A v ailable: https://doi. org/10. 1109/ICCV. 2019. 00495
[83] Y. Bengio, P . Y. Simard, and P . F rasconi, “Learning long-term
dep endencies with gradien t descen t is difficult,” IEEE T r ansactions on
Neur al Networks , v ol. 5, no. 2, pp. 157–166, 1994. [Online]. A v ailable:
https://doi. org/10. 1109/72. 279181
[84] M. R. Hestenes, “Multiplier and gradien t metho ds,” Journal of optimization
the ory and applic ations , v ol. 4, no. 5, pp. 303–320, 1969.
[85] M. J. P o w ell, “A metho d for nonlinear constrain ts in minimization problems,”
Optimization , pp. 283–298, 1 969.
[86] A. Krizhevsky , G. Hin ton et al. , “Learning m ultiple la y ers of features from
tin y images,” 2009.
[87] Y. Li, Z. Li, L. Ding, P . Y ang, Y. Hu, W. Chen, and X. Gao,
“Supp ortNet: solving catastrophic forgetting in class incremen tal learning
with supp ort data,” CoRR , v ol. abs/1806.02942, 2018. [Online]. A v ailable:
http://arxiv. org/abs/1806. 02942
[88] S. Rebuffi, A. K olesnik o v, G. Sp erl, and C. H. Lamp ert, “iCaRL: Incremen tal
classifier and represen tation learning,” in Confer enc e on Computer V ision
and Pattern R e c o gnition . IEEE Computer So ciet y , 2017, pp. 5533–5542.
[Online]. A v ailable: https://doi. org/10. 1109/CVPR. 2017. 587
98
[89] N. Sriv asta v a, G. E. Hin ton, A. Krizhevsky , I. Sutsk ev er, and R. Salakh utdi-
no v, “Drop out: a simple w a y to prev en t neural net w orks from o v erfitting,”
Journal of Machine L e arning R ese ar ch , v ol. 15, no. 1, pp. 1929–1958, 2014.
[Online]. A v ailable: http://dl. acm. org/citation. cfm?id=2670313
[90] C. M. Bishop, Pattern r e c o gnition and machine le arning , ser. Information
science and statistics. Springer, 2007. [Online]. A v ailable: https:
//www. worldcat. org/oclc/71008143
[91] F. P edregosa, G. V aro quaux, A. Gramfort, V. Mic hel, B. Thirion, O. Grisel,
M. Blondel, P . Prettenhofer, R. W eiss, V. Dub ourg, J. V anderplas, A. P assos,
D. Cournap eau, M. Bruc her, M. P errot, and E. Duc hesna y , “Scikit-learn:
Mac hine learning in Python,” Journal of Machine L e arning R ese ar ch , v ol. 12,
pp. 2825–2830, 2011.
[92] Z. W en, J. Shi, Q. Li, B. He, and J. Chen, “Th underSVM: A fast SVM
library on GPUs and CPUs,” Journal of Machine L e arning R ese ar ch , v ol. 19,
pp. 797–801, 2018.
[93] J. M. Duarte, S. Han, P . C. Harris, S. Jindariani, E. Kreinar, B. Kreis, J. Nga-
diuba, M. Pierini, R. Riv era, N. T ran, and Z. W u, “F ast inference of deep
neural net w orks in FPGAs for particle ph ysics,” CoRR , v ol. abs/1804.06913,
2018.
[94] N. Moustafa and J. Sla y , “UNSW-NB15: a comprehensiv e data set for
net w ork in trusion detection systems (UNSW-NB15 net w ork data set),” in
Military Communic ations and Information Systems Confer enc e . IEEE, 2015,
pp. 1–6. [Online]. A v ailable: https://doi. org/10. 1109/MilCIS. 2015. 7348942
[95] T. Muro vic and A. T rost, “Massiv ely parallel com binational binary neural
net w orks for edge pro cessing,” Elektr otehniski V estnik , v ol. 86, no. 1/2, pp.
47–53, 2019.
[96] L. N. Smith and N. T opin, “Sup er-con v ergence: V ery fast training of neu-
ral net w orks using large learning rates,” in A rtificial Intel ligenc e and Ma-
chine L e arning for Multi-Domain Op er ations A pplic ations , v ol. 11006, 2019,
p. 1100612.
[97] Y. Um uroglu, Y. Akhauri, N. J. F raser, and M. Blott, “LogicNets: Co-
designed neural net w orks and circuits for extreme-throughput applications,”
in International Confer enc e on Field-Pr o gr ammable L o gic and A pplic ations .
IEEE, 2020, pp. 291–297.
99
Abstract
Significant advancements in building both general-purpose and custom hardware have been among the critical enablers for shifting deep neural networks (DNNs) from rather theoretical concepts to practical solutions for a wide variety of problems. However, DNNs keep growing in size and complexity to improve their output quality, demanding ever more compute cycles, memory capacity, and I/O bandwidth during inference. To sustain the ubiquitous deployment of deep learning models and cope with their computational and memory complexities, this dissertation introduces NullaNet, a technique for the design, optimization, and processing of DNNs for applications with stringent latency and throughput requirements. NullaNet formulates low-latency, high-throughput processing of DNNs as a logic minimization problem in which the arithmetic operations of a DNN are replaced with low-cost logic operations and the DNN's parameters are baked into the realized logic. Fan-in-constrained pruning, context-aware training data sampling, and million-scale two-level logic minimization are among the contributions of this dissertation that make the NullaNet technique possible. Experimental results show the superiority of the NullaNet technique over state-of-the-art DNN processors, cutting end-to-end inference latency by a factor of five or more.
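To make the core idea of the abstract concrete, the short Python sketch below is a minimal illustration, not code from the dissertation: the function names, the four-input neuron, and its weights and threshold are all hypothetical. It shows how a small, fan-in-constrained binarized neuron can be turned into custom combinational logic by enumerating its truth table offline, so that the weighted-sum-and-threshold arithmetic disappears and the parameters are baked into a sum-of-products expression that a two-level logic minimizer such as ESPRESSO could subsequently simplify.

    from itertools import product

    def neuron_to_minterms(weights, threshold):
        """Enumerate the truth table of a binarized neuron.

        Inputs and weights are in {-1, +1}; the neuron outputs 1 when the
        weighted sum meets the threshold.  Returns the list of 0/1 input
        patterns (minterms) for which the output is 1.
        """
        n = len(weights)
        minterms = []
        for bits in product((0, 1), repeat=n):
            x = [2 * b - 1 for b in bits]          # map {0, 1} -> {-1, +1}
            s = sum(wi * xi for wi, xi in zip(weights, x))
            if s >= threshold:
                minterms.append(bits)
        return minterms

    def minterms_to_sop(minterms, var_names):
        """Render the ON-set as an (unminimized) sum-of-products expression."""
        terms = []
        for bits in minterms:
            lits = [v if b else f"~{v}" for v, b in zip(var_names, bits)]
            terms.append(" & ".join(lits))
        return " | ".join(terms) if terms else "0"

    if __name__ == "__main__":
        # Hypothetical 4-input neuron; its weights and threshold are baked
        # into the resulting Boolean expression, so no arithmetic remains
        # at inference time.
        weights, threshold = [+1, -1, +1, +1], 1
        on_set = neuron_to_minterms(weights, threshold)
        print(minterms_to_sop(on_set, ["a", "b", "c", "d"]))

Because exhaustive enumeration grows as 2^n in the fan-in n, a toy version like this is only tractable for the small, pruned fan-ins the abstract alludes to; in the full NullaNet flow, context-aware sampling of the training data replaces exhaustive enumeration, presumably leaving unobserved input patterns as don't-cares for the two-level minimizer to exploit.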