Energy consumption and lifetime/reliability improvement of computing systems using voltage overscaling (VOS) approximation technique
(USC Thesis Other)
ENERGY CONSUMPTION AND LIFETIME/RELIABILITY IMPROVEMENT OF
COMPUTING SYSTEMS USING VOLTAGE OVERSCALING (VOS)
APPROXIMATION TECHNIQUE
BY
Hassan Afzalikusha
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2021
Copyright 2021 Hassan Afzalikusha
Acknowledgment
First and foremost, I would like to express my special appreciation and thanks to my advisor, Professor
Massoud Pedram, for his tremendous mentorship and continuous support of my Ph.D. studies. I truly
appreciate the ideas and guidance he provided, which significantly helped my Ph.D. research. He has given
me the opportunity to work towards my research in many ways.
I would also like to thank my dissertation and qualifying exam committees, made up of Professor Sandeep
Gupta, Professor Stephan Haas, Professor Peter Beerel, and Professor Pierluigi Nuzzo, for their insightful
comments and consistent encouragement.
I would like to thank all the members of the System Power Optimization and Regulation Technology
(SPORT) Lab for making a great environment to work together.
Last but not least, I deeply thank my parents and sister for their continuing encouragement, and unending
patience. I wish to thank my wife who has stood by me through this journey. They all gave me unconditional
help and support.
TABLE OF CONTENTS
Acknowledgment
LIST OF TABLES
LIST OF FIGURES
Abstract
CHAPTER 1. Introduction
1.1 Voltage Overscaling (VOS) Approximation Technique
1.2 Aging Mechanisms Suppression Using the VOS Technique
1.3 Approximate Coarse-Grained Reconfigurable Arrays (CGRA)
1.4 Approximate Arithmetic Units based on the VOS Technique
1.5 Approximate Deep Learning Accelerators based on the VOS Technique
1.6 Contributions and Organization of The Dissertation
CHAPTER 2. Backgrounds & Related Works
2.1 Voltage Dependences of Power/Energy Components
2.2 Effect of VOS on Delay of Circuit
2.3 Power Reduction Techniques
2.3.1 Based on VOS
2.3.2 Based on Operation Level Approximation
2.4 Bias Temperature Instability
2.5 Carry Look Ahead Adders
2.6 Approximate Adders
2.7 Dadda Multiplier
2.8 Approximate Multipliers
2.8.1 Approximate Multipliers Based on Circuit Simplification/Pruning
2.8.2 Approximate Multipliers Based on Voltage Overscaling
2.8.3 Accuracy-Configurable Approximate Multipliers
2.9 Typical CGRA Architecture
2.10 Approximated CGRA
2.11 CGRA Lifetime Improvement
2.12 NVDLA: NVIDIA Deep Learning Accelerator
2.12.1 NVDLA Hardware
2.12.2 NVDLA Software
2.12.3 NVDLA Virtual Platform
2.13 Approximate Implementation of NN Accelerators
2.13.1 Approximate Multipliers in NN Accelerators
2.13.2 Applying VOS to On-Chip Memories in NN Accelerators
2.13.3 Simultaneous Application of the VOS Technique to both Computation and Memory Units of NN Accelerators
2.13.4 Runtime Accuracy Reconfigurable NN Accelerators
CHAPTER 3. Energy Consumption and Lifetime Improvement of Coarse-Grained Reconfigurable Architectures Targeting Low-Power Error-Tolerant Applications
3.1 Key Idea and Its Potentials for CGRA Optimization
3.2 Proposed Architectures
3.2.1 Proposed CGRA Architecture without Voltage Island
3.2.2 Proposed CGRA Architecture with Voltage Island
3.3 Results and Discussion
3.3.1 Simulation Setup
3.3.2 Benchmarks
3.3.3 Results
3.4 Conclusion
CHAPTER 4. Energy-Efficient Accuracy-Configurable Adder and Multiplier With Improved Lifetime Based on Voltage Overscaling
4.1 Low-power Accuracy-configurable Carry Look Ahead Adder
4.1.1 Proposed Approximate AC-CLA
4.1.2 Results and Discussion
4.2 The X-Dadda Structures
4.2.1 Design Based on Bi-VOS
4.2.2 Simulation Setup
4.2.3 Results
4.3 Conclusion
CHAPTER 5. X-NVDLA: Runtime Accuracy Configurable NVDLA based on Employing Voltage Overscaling Approach
5.1 X-NVDLA
5.1.1 Approximate MAC Array (AxC)
5.1.2 Approximate Convolution Memory Buffer (AxM)
5.1.3 Simultaneous Use of Approximate Multiplier and Buffer
5.2 Simulation Setup
5.2.1 Simulation Platform and Considered Neural Networks
5.2.2 Modeling the Approximate Multiplier in the MAC Array
5.2.3 Modeling the Approximate Convolution Buffer
5.2.4 Lifetime Improvement Modeling
5.3 Results and Discussion
5.3.1 Results for Energy-Accuracy Characteristic
5.3.2 Results for Lifetime Improvement
5.3.3 Comparing X-NVDLA with Some Prior Related Works
5.4 Conclusion
CHAPTER 6. Conclusion
6.1 Dissertation Conclusion
6.2 Future Work
REFERENCES
LIST OF TABLES
Table 2.1. The values of parameters used in (2-4) for calculating the BTI threshold voltage drift [47].
Table 3.1. The numbers of total nodes and each operation type for each benchmark.
Table 3.2. The model and simulated MSE of FIR and PoE.
Table 3.3. Energy reduction (%) of different benchmarks under different minimum output qualities, voltage island sizes and operating voltage resolutions.
Table 3.4. Aging rate reduction (%) of different benchmarks for different minimum output qualities, voltage island sizes and operating voltage resolutions.
Table 3.5. Aging rate reduction improvement (%) using the folding technique for different benchmarks, different minimum output qualities, voltage island sizes, and operating voltage resolutions (Imp: Improvement, Dir: Folding Direction).
Table 3.6. The improvements of the proposed method and the one suggested in [38] compared to those of the conventional exact CGRA.
Table 3.7. A comparison between the proposed work and some others in the areas of CGRA, approximate computing, and voltage overscaling.
Table 4.1. Accuracy loss for the NN implemented with the selected designs of AC-CLA.
Table 4.2. Target performances for different designs without and with truncation at the supply voltage of 0.80V.
Table 4.3. The accuracy variation of the error parameters of the proposed approximate multiplier (without truncation) under the process variation for voltage levels of 0.40V and 0.50V for the approximate part.
Table 4.4. Features of the studied multipliers.
Table 4.5. The PSNR of the sharpening application for different images and approximate multipliers with two VOS levels.
Table 4.6. The PSNR of the smoothing application for different images and approximate multipliers with two VOS levels.
Table 4.7. The PSNR of the DCT-IDCT application for different images and approximate multipliers with two VOS levels.
Table 5.1. The time needed for each image inference on the NN model.
Table 5.2. The CACTI energy consumption report for a 32KB SRAM array.
Table 5.3. The fps for LeNet-5 and ResNet-50.
Table 5.4. Comparison of different parameters of X-NVDLA with prior related works.
LIST OF FIGURES
Figure 2.1. The structure of the 32-bit Carry Look Ahead adder with block size of 8 bits.
Figure 2.2. The structure of an 8-bit Dadda multiplier [112].
Figure 2.3. A typical 4 × 4 CGRA architecture [40].
Figure 2.4. The hardware diagram of a) the core block of NVDLA, redrawn from [85], and b) the convolutional core in the NVDLA core block, redrawn from [84].
Figure 2.5. The detailed schematic of a) the MAC array, redrawn from [86], and b) the convolution buffer, redrawn from [84].
Figure 2.6. The block diagram of the virtual platform, redrawn from [87].
Figure 3.1. a) A simple DFG and b) DFG with multiple choices of over-scaled voltage levels.
Figure 3.2. The proposed CGRA architecture to implement the Voltage Over-Scaling technique.
Figure 3.3. The proposed architecture of each process element (PE).
Figure 3.4. Mapping of an eight-tap FIR filter DFG on a 6 × 6 CGRA for a) 90% b) 80% c) 70% d) 60% e) 50% quality.
Figure 3.5. The overall architecture of the proposed CGRA.
Figure 3.6. The DFG of the a) FIR [65], and b) PoE benchmarks.
Figure 3.7. a) Overscaled voltage level b) Aging rate reduction c) Energy reduction for different arithmetic operations of the FIR filter benchmark (A: Adder; M: Multiplier).
Figure 3.8. a) Overscaled voltage level b) Aging rate reduction c) Energy reduction for different arithmetic operations of the PoE (A: Adder; M: Multiplier).
Figure 3.9. The energy reduction of different benchmarks under different quality constraints mapped on CGRA with (a) 2×1 (b) 1×3 voltage islands when five voltage levels are considered.
Figure 3.10. Mapping of a) MMM b) SMT benchmarks on a 6 × 6 CGRA under the minimum quality of 90%.
Figure 3.11. The decrease of the energy reduction in the case of 1×3 voltage island when the number of operating voltage levels is lowered from five to two.
Figure 3.12. The threshold voltage change rate reduction for different benchmarks for different minimum acceptable qualities and five operating voltage levels in the case of (a) 2×1 (b) 1×3 voltage islands.
Figure 3.13. Decrease of the aging rate improvement in the case of 1×3 voltage island when the number of operating voltage levels is lowered from five to two.
Figure 3.14. Folding a 4×4 array of PEs with respect to (a) vertical, (b) horizontal, and (c) diagonal folding lines passing through the middle of the array.
Figure 3.15. The increase in aging rate reduction (%) for all the benchmarks under different minimum output qualities in the case of five voltage levels and (a) 2×1, (b) 1×3 voltage islands.
Figure 4.1. The structure of the 32-bit AC-CLA-B8 design.
Figure 4.2. MED, MRED, and MNED of the 32-bit AC-CLA adder with 4-bit block ((a)-(c)) and 8-bit block ((d)-(f)) versus NAP_B for different VDD_AP.
Figure 4.3. Energy of the AC-CLA structures versus NAP_B for different VDD_AP (a) 4-bit block width (b) 8-bit block width.
Figure 4.4. Figure of Merit of the 32-bit AC-CLA adder structures versus NAP_B for different VDD_AP (a) 4-bit block width (b) 8-bit block width.
Figure 4.5. Comparing the energy consumption, area, NMED, and FoM of different accuracy configurable approximate adders.
Figure 4.6. BTI induced delay degradation of the explored structures for different VDD_AP and NAP_B (a) 4-bit block width (b) 8-bit block width.
Figure 4.7. The structure of the 8-bit X-Dadda design with width_region2 of 9 and 5, illustrating the cases without and with truncation.
Figure 4.8. The delay versus VDD_AP of the level shifter.
Figure 4.9. The circuit diagram of the level shifter employed in this work [125].
Figure 4.10. MED, MRED, and NED of the 8-bit X-Dadda multiplier without ((a)-(c)) and with ((d)-(f)) 4-bit truncation under different VDD_AP and width_region2.
Figure 4.11. Energy consumption of the 8-bit X-Dadda structures versus width_region2 for (a) without and (b) with truncation designs for different VOS levels.
Figure 4.12. (a) MED, (b) MRED, and (c) NED of the 16-bit X-Dadda multiplier under different VDD_AP and width_region2.
Figure 4.13. Energy consumption of the 16-bit X-Dadda structures versus width_region2 for different VOS levels.
Figure 4.14. The increase in the threshold voltage magnitude for NMOS and PMOS versus the supply voltage after ten years.
Figure 4.15. BTI induced delay degradation of the explored structures for different VDD_AP and width_region2 values for (a) without and (b) with truncation designs for different VOS levels.
Figure 4.16. MRED reduction of the X-Dadda when working in the approximate mode, with the clocks of the designs set to the delay of the X-Dadda in exact mode after 10 years of operation, for (a) without and (b) with truncation.
Figure 4.17. The MRED vs energy and delay for different approximate multipliers.
Figure 4.18. Energy vs MRED for different approximate multipliers.
Figure 4.19. Delay vs MRED for different approximate multipliers.
Figure 4.20. The accuracy loss of the considered neural network by using different X-Dadda structures as well as VDD_AP values compared to the case of the exact implementation.
Figure 5.1. An example for a) Fine-Grain type design of the MAC array b) Coarse-Grain type design of the MAC array.
Figure 5.2. The schematic of the X-SRAM design (each box is a cell).
Figure 5.3. The energy reduction for different VDD,AxC and WAxC values in the case of 8-bit X-Dadda multiplier [132].
Figure 5.4. The BER vs. VOS level for 6T SRAM cell, redrawn from [143].
Figure 5.5. Utilization numbers of different VOS levels for a) two modes of FG (weight interval of 7.5) and CG (weight interval of 7.5) in the case of LeNet-5 and b) two modes of FG (weight interval of 10) and CG (weight interval of 7.5) in the case of ResNet-50.
Figure 5.6. The energy reduction of the MAC array versus accuracy reduction of AxC for a) LeNet-5 b) ResNet-50.
Figure 5.7. The energy breakdown of SRAM banks for LeNet-5 and ResNet-50.
Figure 5.8. The convolution buffer energy reduction versus accuracy reduction of AxM for a) LeNet-5 b) ResNet-50.
Figure 5.9. The energy consumption contributions of the MAC array and the SRAM buffer in the total energy consumption of the convolution core for both LeNet-5 and ResNet-50.
Figure 5.10. The energy reduction versus accuracy degradation when both AxC and AxM were used for a) LeNet-5 and b) ResNet-50.
Figure 5.11. Delay degradation improvement of the MAC array after 10 years.
ABSTRACT
Current and future applications of computing systems demand ever more computing power, while the
lifetime/reliability of the system remains a critical parameter. Higher computing power requires less delay
per function as well as lower energy consumption, so that the system does not exceed its energy/power
constraint. One design approach for reducing the energy consumption of the system is approximate
computing (AxC), where some errors in the computations are tolerated when the application is error
resilient. Different approaches exist at different design abstraction levels for implementing approximate
computing, as well as for using approximation in the memory of computing systems. In this dissertation,
we investigate the use of the voltage overscaling (VOS) approximation technique for reducing the energy
consumption and improving the lifetime/reliability
of the computation and memory units of computing systems. The VOS technique reduces both the dynamic
and leakage power by reducing the supply voltage and improves the lifetime/reliability of the circuit, thanks
to reduced electric field in the devices. Since the voltage level(s) may be changed during the runtime, the
VOS technique provides online output quality (error level) reconfigurability. While prior works have
studied the VOS approximation technique, this dissertation first investigates its use in coarse-grained
reconfigurable arrays (CGRAs), which are reconfigurable fabrics with paths optimized for word-level
operations. The CGRAs consist of processing elements (PEs), each comprising adder and multiplier
modules, and a memory unit, whose accuracies can be varied by applying different VOS levels.
In this dissertation, the CGRAs are implemented in a 15nm FinFET technology whose nominal supply
voltage is 0.8V. Lower voltage levels, such as 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45 and 0.4 V are considered
as VOS levels to be applied to each individual PE. The approach provides considerable lifetime/reliability
and energy consumption improvements over those of the conventional exact and approximate computation
approaches. In the second step of studying the efficacy of VOS in CGRAs, in order to make the hardware
implementation of our scheme more efficient, PEs are clustered into voltage islands (e.g., of size 3×1 and
2×1). To assess the efficacy of this added constraint, different combinations of the minimum output
quality constraints, voltage levels, and cluster sizes for several benchmarks are studied. Simulation results
indicate considerable reductions in energy consumption and aging rate when compared to the conventional
exact CGRA.
In the third step of our study, we invoke the VOS technique to propose an approximate adder, called
AC-CLA, and an approximate multiplier based on the Dadda structure, called X-Dadda. In both
computation modules, to limit the accuracy degradation while improving the lifetime and reducing the
energy consumption, the VOS technique is only applied to the least significant bits (LSB) while the
nominal voltage is applied to the most significant bits (MSB).
In the last step, to move further in evaluating the efficacy of VOS, the technique is applied to NVDLA
(NVIDIA Deep Learning Accelerator) as a computing system used for running deep learning neural
network (NN) models. The resulting design, in which the VOS technique is applied to the computation
(multiply-and-accumulate) and memory units of the NVDLA to enhance its energy and lifetime/reliability
characteristics, is called X-NVDLA. The multipliers in the multiply-and-accumulate (MAC) array are
implemented with the X-Dadda multiplier, while the VOS technique is also applied to the LSB units of the
SRAM banks of the convolution core of the NVDLA. The energy saving and lifetime improvement of the
X-NVDLA are assessed using LeNet-5 (small) and ResNet-50 (large) neural network models. The study
includes the energy saving versus accuracy reduction as well as lifetime characteristics of the neural
networks for different approximation configurations.
CHAPTER 1. INTRODUCTION
In this chapter, first, voltage overscaling (VOS) is discussed as an approximation technique for reducing the
energy consumption and suppressing the aging rate. Then, the application of the VOS technique to
Coarse-Grained Reconfigurable Arrays (CGRAs), which consist of registers and processing elements (PEs),
is motivated. Next, VOS-based approximation of the Dadda multiplier (and carry look-ahead adders),
along with its combination with an approximation technique based on circuit simplification, is presented
as an instance of approximate units. The application of the VOS technique to deep learning accelerators
consisting of multiply-and-accumulate (MAC) arrays and SRAM buffers, as a more complete digital
computing system, is then briefly described. Finally, the organization of this dissertation is presented.
1.1 Voltage Overscaling (VOS) Approximation Technique
Lowering the energy/power consumption of digital computing systems is of critical concern in almost all
types of applications, especially for embedded processors in battery-operated systems. Most computing
systems are power limited, whether it is a cell phone with a 1W power budget or a server with a 100W
power budget [9]. For example, embedded processors should perform the computations required for
applications without exceeding a given power budget constraint.
There are different approaches for lowering the energy consumption of the computation at different
design levels spanning from the algorithm level to the physical design level. Examples of circuit-level
power reduction techniques include dual V_DD [24], dual V_TH [25], dynamic voltage (frequency) scaling
[26], and power gating [27]. Most of the potential energy benefit of (dynamic) voltage scaling has already
been realized since its inception, and not much room for further voltage reduction is expected [28].
Similarly, at higher design abstraction levels, parallelism and manycore processing are also running out of
steam in the large portion of systems-on-chip (SoCs) whose workloads are not naturally parallelized [29].
From these considerations, keeping the historical ~100× energy reduction per decade is a hard challenge,
given that technology scaling is expected to contribute only a ~4× benefit [29]. In the next decade,
innovations at the device, circuit, and system levels, as well as new design paradigms and trade-offs that
favor energy reduction, are needed to provide the remaining ~25× energy reduction.
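The remaining factor quoted above follows from treating the two figures as multiplicative energy-reduction factors (an interpretation assumed here, not spelled out in the text):

\[
\frac{\text{target reduction per decade}}{\text{reduction from technology scaling}} \approx \frac{100\times}{4\times} = 25\times .
\]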
One class of techniques is based on the approximate computing (AxC) paradigm, where lowering the energy
consumption is achieved at the price of some output accuracy loss. Fortunately, there are applications which
do not require computational exactness for several reasons [30]. Examples of these applications include
video processing, image recognition, text and data mining, and machine learning [30]. Therefore, this
paradigm may be invoked for applications where the impact of computation errors on the output quality
degradation is tolerable. AxC may be employed at different levels of the design abstraction.
At the hardware level, AxC may be broadly divided into two categories: functional approximation and
overscaling approximation [33]. The first category includes simplifying the target hardware, reducing the
width of the data path, and ignoring the less significant bits of operands. It modifies the circuit structure
of the function F (simplifying it or reducing its gate count), which turns it into another function G (≈ F)
(see, e.g., [34]). More specifically, at the circuit level, this technique is based on circuit
simplification/pruning, where some of the gates are replaced with simpler ones or omitted to improve the
energy efficiency of the circuit. In addition to lower energy consumption, a lower circuit area is achieved.
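As a minimal sketch of one form of functional approximation named above (ignoring the less significant bits of operands), the Python snippet below zeroes the k lowest bits of both operands of an unsigned multiplication and measures the resulting mean relative error exhaustively. The function names and the choice of error metric are illustrative assumptions, not designs taken from this dissertation.

```python
from itertools import product


def truncate_lsbs(x: int, k: int) -> int:
    """Zero the k least significant bits of a non-negative integer."""
    return (x >> k) << k


def mean_relative_error(width: int, k: int) -> float:
    """Mean relative error of an unsigned width-bit multiply whose operands
    both have their k least significant bits ignored (hypothetical model)."""
    total = 0.0
    count = 0
    # Skip zero operands so the exact product is never zero.
    for a, b in product(range(1, 1 << width), repeat=2):
        exact = a * b
        approx = truncate_lsbs(a, k) * truncate_lsbs(b, k)
        total += (exact - approx) / exact
        count += 1
    return total / count


if __name__ == "__main__":
    for k in range(4):
        print(f"k={k}: mean relative error = {mean_relative_error(8, k):.4f}")
```

Dropping more bits removes more logic but raises the error monotonically, and the error level is fixed at design time. This differs from the voltage-based approach described next, where the error level can be changed during operation.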
The second category is based on lowering the circuit supply voltage below the nominal, error-free level,
which is called voltage overscaling (see [19]). The supply voltage reduction provides the opportunity to
reduce both the dynamic and static power components of the circuit. Since in the VOS technique the
operating voltage is scaled down without lowering the corresponding operating frequency, the critical path
delay of the circuit at this overscaled voltage becomes larger than the operating clock period (including its
guard-band), so setup time violations may cause errors at the output of the circuit. An erroneous output
may be perceived as an approximate result.
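The mechanism described above can be sketched with two widely used first-order models, both of which are assumptions here rather than the models used in this dissertation: dynamic energy per operation scaling with the square of the supply voltage, and gate delay following the alpha-power law, delay ∝ V / (V − V_TH)^α. The 0.8 V nominal supply is the value quoted in the abstract. `V_TH` and `ALPHA` are hypothetical placeholder values chosen only for illustration.

```python
V_NOM = 0.8   # nominal supply voltage quoted in the abstract (V)
V_TH = 0.3    # hypothetical threshold voltage (V), illustration only
ALPHA = 1.3   # hypothetical alpha-power-law exponent, illustration only


def relative_delay(v: float, v_nom: float = V_NOM,
                   v_th: float = V_TH, alpha: float = ALPHA) -> float:
    """Path delay at supply v, normalized to the delay at v_nom."""
    def delay(x: float) -> float:
        return x / (x - v_th) ** alpha
    return delay(v) / delay(v_nom)


def relative_dynamic_energy(v: float, v_nom: float = V_NOM) -> float:
    """Dynamic energy per operation at supply v, normalized to v_nom (C*V^2)."""
    return (v / v_nom) ** 2


def violates_timing(v: float, path_delay_frac: float) -> bool:
    """True if a path whose delay at nominal voltage occupies
    path_delay_frac of the (unchanged) clock period misses timing at v."""
    return path_delay_frac * relative_delay(v) > 1.0


if __name__ == "__main__":
    for v in (0.8, 0.7, 0.6, 0.5):
        print(f"V={v:.2f}  energy={relative_dynamic_energy(v):.2f}  "
              f"delay={relative_delay(v):.2f}  "
              f"long path fails={violates_timing(v, 0.9)}  "
              f"short path fails={violates_timing(v, 0.3)}")
```

Under these assumed parameters, a path that uses 90% of the clock period at nominal voltage fails at 0.5 V while a path using 30% still meets timing. This is why only part of the output is expected to be erroneous at a given VOS level.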
The number of output errors is a function of the difference between the overscaled voltage and the
nominal voltage (assumed to be a safe voltage level for the desired computational speed). The lower the
supply voltage is, the larger is the number of paths whose timing requirements are not satisfied and the
higher is the output error rate. Therefore, there is a trade-off between the amount of the energy reduction
and the output accuracy. The characteristic of this trade-off is specified using a Pareto-optimal curve in the
energy versus accuracy space.
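The trade-off described above can be illustrated with a minimal sketch that extracts the Pareto-optimal operating points from a set of (energy, error) pairs. The numeric pairs are hypothetical placeholders, one per candidate VOS level, and are not measurements from this work.

```python
def pareto_front(points):
    """Return the (energy, error) points that no other point dominates.

    Lower is better for both coordinates. A point dominates another if it is
    no worse in both coordinates and strictly better in at least one.
    """
    front = []
    for e, err in points:
        dominated = any(
            (e2 <= e and err2 <= err) and (e2 < e or err2 < err)
            for e2, err2 in points
        )
        if not dominated:
            front.append((e, err))
    return sorted(front)


# Hypothetical (normalized energy, output error rate) pairs, one per VOS level.
operating_points = [
    (1.00, 0.00),  # nominal voltage: full energy, no error
    (0.85, 0.01),
    (0.80, 0.05),
    (0.82, 0.08),  # worse than (0.80, 0.05) in both coordinates
    (0.60, 0.20),
]
front = pareto_front(operating_points)
print(front)
```

Only the points on the front are worth offering as runtime accuracy levels, since any other point can be replaced by one that is at least as good in both energy and error.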
This technique for approximate computing has the additional advantage of not requiring the redesign of
the circuit hardware. Additionally, the level of accuracy (alternatively, the approximation level) may be
dynamically adapted at runtime, providing the runtime accuracy reconfigurability thanks to the ability of
changing the VOS level during the runtime [35]. Together, these two points mean that by simply applying the
specified nominal voltage level, exact computation (i.e., 100% output quality, an errorless output) may be
achieved. Furthermore, designs which support different output accuracy levels are
in high demand, because the available energy for an application may vary over
time and different applications have different minimum output quality constraints. The resolution of
the accuracy reconfigurability, which is set by the voltage levels that the dynamic voltage scaling unit
can provide, is normally much finer than what would be obtained through
architectural techniques. This technique, however, requires a dynamic voltage scaling unit capable of providing the
required voltage levels for the circuit employing VOS. In this dissertation, we investigate the use of this
technique for lowering the energy consumption of digital computing systems.
1.2 Aging Mechanisms Suppression Using the VOS Technique
While decelerating transistor aging and improving device reliability have always been design challenges,
the problem has become worse as technology scaling has continued. As a result, there is now
renewed interest in device aging and reliability [15]. With continuous scaling down of the process
technology, digital circuits are facing significant reliability issues due to increasingly serious aging effects,
such as negative bias temperature instability (NBTI), positive bias temperature instability (PBTI), hot
carrier injection (HCI), electromigration (EM), and time-dependent dielectric breakdown (TDDB) [16][17].
These effects can change the threshold voltage ($V_{th}$), increase the delay, decrease the mean time
to failure (MTTF) of the system, and degrade the reliability [18].
Among these circuit aging mechanisms, bias temperature instability (BTI) and hot carrier injection (HCI)
are directly related to the transistor lifetime [19]. NBTI and HCI are the two most significant aging
mechanisms influencing device and circuit characteristics [17][20][21]. Both mechanisms cause an
increase in the threshold voltage of the affected transistors, causing the on-current ($I_{on}$) to decrease. In [22],
it is shown that the NBTI-induced increase in the switching delay of a PMOS transistor is up to 20% in 10
years. Also, [21] shows that the average delay degradation of the critical path increases by 10.1% for NBTI
and 3.3% for HCI. A criterion of 20% (10% for automotive applications) performance degradation is a
common threshold used to signal the end of the circuit life [23].
The rate of $V_{th}$ change (aging rate) depends on the supply voltage ($V_{DD}$), die temperature, clock
frequency, and activity factor in the circuit. One reason for the worsening of the aging effect is that the reduction
in device physical dimensions has not been accompanied by the same reduction in supply voltage levels.
The connection is that the aging effects are related to the induced electric field (voltage divided by spacing)
in the device active region.
In addition to reducing the energy consumption, the use of the VOS technique for realizing the AxC
paradigm has the advantage of simultaneously increasing the lifetime and reliability, owing to the
lower electric field in the devices. This is especially important for highly scaled state-of-the-art
technologies, where the device dimensions have been considerably reduced, making lifetime/reliability
a serious concern and a challenging task in the design of digital systems [19]. The improvement in
the lifetime of digital computing systems when the VOS technique is employed is also studied in this
research.
1.3 Approximate Coarse-Grained Reconfigurable Arrays (CGRA)
Among different platforms that exist for the execution of digital computations required by algorithms, one
may employ field-programmable gate arrays (FPGAs, which consist of configurable logic blocks, DSP
blocks, memory banks, connection blocks, switch boxes, and I/Os [4]) and coarse-grained reconfigurable
arrays (CGRAs, which consist of processing elements (PEs), a host controller, and context and data memories) [3]. In fact, these
platforms may be configured to play the role of accelerators, which are used along with the main processors
in digital systems to improve the speed and energy consumption of the computations. The use of these
reconfigurable architectures provides the flexibility of optimizing the accelerator for each individual
application. Since the bit-level granularity and long wires of conventional FPGAs usually incur
high energy and area overheads, coarse-grained reconfigurable arrays, which have optimized paths for
word-level operations, become the next most promising option for reconfigurable fabrics [5]. A CGRA is a hardware
platform whose performance may be optimized (reconfigured) for different applications. Using careful
scheduling and mapping, CGRA can achieve a computational efficiency close to that of a custom hardware
architecture [6]. While lacking the fine reconfigurability of FPGAs, their incorporation of arithmetic units
enables CGRAs to run compute-intensive applications efficiently both in terms of computation speed and
energy consumption. CGRAs have short reconfiguration times, high performance, and low power
consumption since they admit standard cell implementations [7].
In addition to the flexibility and high-performance features of CGRAs, lower power consumption is also
crucial for the CGRAs to be used as a competitive processing core in embedded systems [8]. The energy
efficiency of the CGRA fabrics may be further improved by using approximate computing (AxC), in
general, and VOS in particular. Using approximation techniques, trade-offs between energy/power
consumption and computational accuracy are achievable [38] [39]. Here, when the exact computation is
required, the nominal supply voltage may be applied to the PEs and when some approximation may be
tolerated, depending on the required accuracy level, overscaled voltages (i.e., those lower than a nominal
voltage level) are applied to the PEs. More explicitly, overscaled voltage levels can be determined and
applied to PEs such that the circuit lifetime/reliability is maximized. The unique feature of this approach is
the concurrent use of different overscaled voltage levels for different operations of a data flow graph (DFG)
while maintaining a required output quality without needing error detection and correction units and without
performance penalty. In this research, the use of the VOS technique in enhancing the energy efficiency and
lifetime characteristics of CGRAs as a flexible computing fabric is investigated.
1.4 Approximate Arithmetic Units based on the VOS Technique
Adders and multipliers are the two main arithmetic units of digital computing systems. When the VOS
technique is applied to these systems, these two units should operate at lower supply voltages. The lower
supply voltage may be applied to the whole unit or selectively to some part of it. In addition, the VOS
technique may be combined with other approximation techniques to further improve energy or area
efficiency of the circuit. There are different adder and multiplier types. In this dissertation, we study
applying the VOS technique to the Carry Look Ahead (CLA) adder, one of the most widely used adders owing to its
high speed. The CLA's energy consumption is rather high, and by properly applying the VOS technique, its energy
efficiency may be improved. In addition, the Dadda multiplier structure, which is among the fast multipliers,
is considered for applying the VOS technique. To make the Dadda design area-efficient, in addition
to applying the VOS technique, the structure with 4-bit truncation is also considered.
1.5 Approximate Deep Learning Accelerators (DLAs) based on the VOS Technique
The use of artificial intelligence (AI), in general, and machine learning (ML) using deep learning
algorithms, more specifically, in many application domains, such as machine vision, voice recognition,
and natural language processing, has surged in recent years. AI- and ML-based systems, which
are realized using neural networks (NNs), may be implemented using general purpose and embedded
computing platforms, where high quality and sophisticated services to the end user are provided [103].
Since embedded devices normally have limited computation capability and energy budget, generally, the
amounts of computation and energy consumption for a given task should be lowered as much as possible.
Neural network workloads are handled efficiently by a neural network accelerator, which
performs the tasks faster while consuming less energy. To improve the energy efficiency further, one
may apply approximation techniques, in general, and the VOS technique, in particular, to the accelerators.
Obviously, more approximation, while providing lower energy consumption, may degrade the accuracy. If
the approximation level can be modified during the runtime, one can optimize the operating point of the
accelerator based on the trade-off specified by the energy-accuracy characteristic. The main arithmetic
operation in NNs during inference is the multiply-and-accumulate (MAC) operation [103]. In particular,
millions of multiplications and additions are performed in the convolution and fully connected layers [103].
Additionally, the energy consumed by the memory accesses in a convolutional neural network (CNN)
accelerator is significant [94]. Since accessing DRAM is slower and consumes more energy than
accessing the on-chip buffer memory, CNN architectures are designed so that the highest possible
hit rate for the on-chip buffer memories is achieved [94]. The on-chip buffers are usually SRAMs [94].
Based on the above explanation, to reduce the energy consumption, one may adopt approximate computing
(AxC) and approximate memory (AxM) [146].
In this research, we study the use of the VOS technique in improving the energy efficiency and lifetime
of deep learning accelerators as a more complete computing fabric. This is performed
using the NVIDIA Deep Learning Accelerator (NVDLA), which features scalability, configurability,
and a modular hardware design. More importantly, while most other accelerator designs are not open
source, the NVDLA hardware and software code is open source, providing the opportunity to explore the
accelerator design parameters to optimize the energy efficiency, accuracy, and lifetime.
1.6 Contributions and Organization of The Dissertation
In Chapter 2, background material and related works are briefly discussed. Chapter 3 deals with efficiently
applying VOS levels to PEs in CGRAs to improve their energy consumption and lifetime/reliability
characteristics. In Chapter 4, the application of the VOS technique to Carry Look Ahead adders and Dadda
multipliers, along with some optimizations, is presented. In Chapter 5, the results of employing VOS for
the computation and memory units of a deep learning accelerator are presented. Finally, Chapter 6 contains
a summary of the studies and some suggestions for future research.
CHAPTER 2. BACKGROUNDS & RELATED WORKS
In this chapter, the works related to the voltage overscaling technique, which is the approximation
technique investigated in this dissertation, are reviewed in the context of the research performed in this
work. The chapter begins by briefly reviewing background material on the voltage dependence of the
power/energy consumption, and then on bias temperature instability as one of the main aging mechanisms
in state-of-the-art digital CMOS technologies. Then, we briefly discuss the structures of two arithmetic
units, the Carry Look Ahead adder and the Dadda multiplier, and explain some works relevant to the
approximation of these units. The
application of the voltage overscaling technique to these units will be discussed in the following chapters.
Next, the CGRA structure and the works dealing with its approximation are reviewed. Finally, the
architecture and some other details about NVDLA are given.
2.1 Voltage Dependences of Power/Energy Components
To recall how the power (energy) consumption depends on the operating voltage, we present analytical expressions for the
power consumption of the circuit. The expression containing the dynamic (switching) and leakage power
components is given as [44]
$$P = P_{switching} + P_{leak} = \alpha C_{eff} v^{2} f + \eta \theta^{2} e^{-\frac{q V_{th}}{n k \theta}} \tag{2-1}$$
where the 𝛼 is the activity of the system, 𝑓 is the operating frequency, 𝑣 is the operating voltage, 𝜃 is the
temperature, 𝐶 𝑒𝑓𝑓 , 𝜂 , 𝑛 , 𝑞 , and 𝑘 𝜃 are technology- and circuit-dependent parameters. To be complete, the
short circuit power also should be added to the above expression. The expression is obtained from [45]
$$P_{shortcircuit} = \frac{\beta}{12} (v - 2V_{th})^{3} \frac{\tau}{T} \tag{2-2}$$
where 𝛽 denotes the conductivity of the transistor per power voltage in the linear region, T is the input
rise/fall time, and 𝜏 is the gate delay. These equations show the strong dependence of the power (energy)
consumption on the operating voltage level.
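To make the voltage dependence in (2-1) and (2-2) concrete, the following minimal sketch evaluates each power component as a function of the operating voltage. All parameter values are hypothetical placeholders. Treating q and k as the elementary charge and the Boltzmann constant is an assumption, as is clamping the short-circuit term at zero when v is at or below 2·V_th.

```python
import math

Q = 1.602176634e-19  # elementary charge in coulombs (assumed meaning of q)
K = 1.380649e-23     # Boltzmann constant in J/K (assumed meaning of k)


def switching_power(v, f, alpha, c_eff):
    """Dynamic term of (2-1): alpha * C_eff * v^2 * f."""
    return alpha * c_eff * v ** 2 * f


def leakage_power(theta, v_th, eta, n):
    """Leakage term of (2-1): eta * theta^2 * exp(-q * V_th / (n * k * theta))."""
    return eta * theta ** 2 * math.exp(-Q * v_th / (n * K * theta))


def short_circuit_power(v, v_th, beta, tau, t):
    """Short-circuit term of (2-2): (beta / 12) * (v - 2 * V_th)^3 * tau / T."""
    # Clamp at zero (added assumption): the term is not modeled for v <= 2 * V_th.
    return beta / 12.0 * max(v - 2.0 * v_th, 0.0) ** 3 * tau / t


# Hypothetical comparison: nominal 0.8 V versus an overscaled 0.6 V at the same clock.
params = dict(f=1e9, alpha=0.1, c_eff=1e-9)
ratio = switching_power(0.6, **params) / switching_power(0.8, **params)
print(f"switching power at 0.6 V relative to 0.8 V: {ratio:.4f}")
```

Because the clock frequency is unchanged under VOS, the switching term scales with the square of the voltage ratio, independent of the other parameters.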
2.2 Effect of VOS on Delay of Circuit
Applying the VOS technique to the circuit increases its delay. A simple first-order circuit
delay model based on the alpha-power law is given by
$$TaskDelay \propto \frac{V_{DD}}{(V_{DD} - V_{th})^{\alpha}} \tag{2-3}$$
where $TaskDelay$ is the delay of the circuit and $\alpha$ is a technology-dependent parameter considered to be
1.3 for sub-20nm technologies [122]. This suggests that the use of the VOS technique for the combinational
circuits may cause the violation of the setup time (of the destination flip-flops operating at nominal voltage).
This violation may occur for the outputs corresponding to the longest timing paths of the circuit [53].
The number of timing paths with failed timing requirement increases as the supply voltage is decreased to
lower VOS levels where the output error and quality degradation increase significantly [53].
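As a minimal sketch of how (2-3) translates into timing violations, the code below computes the delay at an overscaled voltage relative to the nominal one. The exponent 1.3 is the value quoted above. The threshold voltage of 0.25 V is taken from Table 2.1, while the 0.8 V nominal voltage, the 10% guardband, and the assumption that the clock period equals the nominal critical path delay plus the guardband are illustrative choices only.

```python
def delay_factor(v_dd, v_th=0.25, alpha=1.3):
    """Right-hand side of (2-3), valid only for v_dd above v_th."""
    if v_dd <= v_th:
        raise ValueError("the model only applies above the threshold voltage")
    return v_dd / (v_dd - v_th) ** alpha


def relative_delay(v_vos, v_nominal, v_th=0.25, alpha=1.3):
    """Delay at the overscaled voltage divided by the delay at nominal voltage."""
    return delay_factor(v_vos, v_th, alpha) / delay_factor(v_nominal, v_th, alpha)


def violates_setup(v_vos, v_nominal, guardband=0.10):
    """True if the slowed critical path no longer fits the assumed clock period.

    Assumption: the clock period is the nominal critical path delay times
    (1 + guardband), and the frequency is not lowered under VOS.
    """
    return relative_delay(v_vos, v_nominal) > 1.0 + guardband


slowdown = relative_delay(0.6, 0.8)
print(f"relative critical path delay at 0.6 V: {slowdown:.3f}")
print("setup violation possible:", violates_setup(0.6, 0.8))
```

Shorter paths have more slack than the critical path, so in this simplified view they start failing only at lower voltages, which matches the growing number of failing paths described above.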
2.3 Power Reduction Techniques
2.3.1 Based on VOS
In [36], by shaping the quality-energy trade-off, the VOS technique was used to significantly improve the
energy-efficiency. The efficiency of the technique was evaluated by applying it to the adder component in
the architecture used for running the inverse discrete cosine transform (IDCT) benchmark.
In the instruction set architecture (ISA)-extended design of [35], a processor had a dual-voltage register
file and two arithmetic logic units (ALUs), one was supplied by a high supply voltage level for exact
computations and the other with a low supply voltage for approximate computations. In [55], the authors
proposed a new design methodology called scalable effort hardware design. The notion of
scalable effort was embedded into the design process at different levels of abstraction. The scalable effort
hardware design involved identifying mechanisms at each level of abstraction that can be used to change
the computational effort expended to generate accurate results and using these mechanisms as the control
knobs to trade-off between the accuracy and energy-efficiency.
A micro-architecture named Lazy Pipeline was suggested in [37]. In this architecture, the VOS
approximate technique was utilized to improve the power efficiency. To reduce the timing errors of the
approximated Functional Units (FUs), this microarchitecture employed vacant cycles in a VOS functional
unit to extend execution and reduce the error rate. In [56], the authors proposed approximated operators
based on the VOS technique for error-resilient applications. The energy efficiency and accuracy of the
approximated operators were characterized using three knobs of the supply voltage, body biasing, and clock
frequency to create models for the approximate operators (different adder designs). The statistical behaviors
of these adders were modeled in a formulated framework such that they could be used at the algorithmic
level to achieve an optimum trade-off between the energy efficiency and error margin.
In [57], techniques for designing the kernel of the error-resilient application, which can tolerate more
scaling under the VOS technique, were discussed. All the identified computational kernels (L1 norm, L2
norm, and dot product) in the studied applications used the accumulator, which was an integral part of the three
computational kernels and the first block to experience time starvation under VOS. The authors used dynamic
segmentation with multi-cycle error compensation and delay budgeting of the chained units to make the
accumulator more VOS friendly.
There are several works on detecting and correcting the timing errors originating from VOS FUs. As an
example, in [58], Razor was suggested as a technique for detecting and correcting timing errors in a VOS
circuit. Lazy Pipeline micro-architecture is orthogonal to Razor and could be combined with it to reduce
the number of cases where an operation needs to be repeated due to a timing error.
Recently, the use of an accuracy-aware operating voltage management unit for improving the lifetime
using aggressive voltage scaling during the runtime of error-resilient applications was suggested [19]. The
unit determines the operating voltage of the processor based on the type of the running application and the
predefined minimum acceptable quality resulting in lifetime/reliability and power improvement at the cost
of tolerable accuracy loss. There are several other works which have used the VOS technique together with
similar error detection/correction or error reduction hardware solutions to reduce the quality degradation
of the output. The review of these works is omitted for the sake of brevity.
2.3.2 Based on Operation Level Approximation
A different approximation-based approach works on simplifying the computations that need to be performed
for an operation, which shortens the critical path delay of the function. In [60], the use of approximate
computing techniques at different abstraction levels was discussed. At the software level, pruning the source
code based on the profiling was suggested. At the register transfer level, the authors proposed internal signal
substitution (variable to variable (V2V) and variable to constant (V2C)) as well as bit-level optimization.
At the high-level synthesis (HLS)-level, employing different approximated adders and multipliers was
considered. This operation level approximation provides the possibility of voltage downscaling without
violating the simplified circuit critical path delay. The downscaled voltage would have violated the critical
path delay of the original circuit for the exact operation. For example, in [59], the authors achieved this
type of operation-level approximation by bit rounding and more aggressive operation elimination. These
reductions are used for applying VOS at the operation level [59]. In [61], the authors suggested a similar
idea of using different approximation types for each operation of the DFG while making sure that the output quality
constraint of the application is met. Downscaled voltages were applied to the operations for approximate
operations without causing any timing error [61].
The use of voltage islands for improving the efficiency of multi-level voltages in the approximate
computing also has been considered in the literature. For example, in [62], the authors introduced an
approximate accelerator synthesis framework that enabled the use of approximation techniques at multiple
levels: the algorithm, circuit, and logic levels.
2.4 Bias Temperature Instability
Bias temperature instability affects both NMOS and PMOS transistors by generating interface traps
at the Si/SiO2 interface [46]. Based on wafer-level extended Measure-Stress-Measure (eMSM)
measurements on sub-20nm FinFET technology nodes [47], the long-term aging-induced shift of the
threshold voltage, incorporating both stress and relaxation phases, is fitted by a power law given by [48]
$$\Delta V_{th,NBTI} \cong A \, e^{-\kappa/\theta} \, t^{\alpha} \, E_{OX}^{\gamma} \, \mathit{df}^{\beta} \tag{2-4}$$
Table 2.1. The values of parameters used in (2-4) for calculating the BTI threshold voltage drift [47].
| Parameter | Description | NMOS Value | PMOS Value |
|---|---|---|---|
| γ | Power-law exponent | 5.2 | 3 |
| α | Power-law exponent | 0.158 | 0.173 |
| A | Fitting constant | 3.12e-2 | 2.02e-2 |
| β | Fitting constant | 1/6 | 1/6 |
| κ | Fitting constant | 50 | 50 |
| Vth | High performance Vth | 0.25 V | 0.25 V |
| Tinv (nm) | Inversion layer | 1.4 | 1.4 |
where A, κ, α, β, and γ are technology and power-law fitting parameters (whose values are given in Table
2.1), t is the total stress time (s), θ is the temperature in kelvin (K), df is the duty factor of the stress
signal, and $E_{OX} = (V_{DD} - V_{th}) / T_{INV}$ is the electric field across the gate oxide (where $T_{INV}$ is the thickness
of the gate inversion layer [47]). The power-law fitting is consistent with the existing Bias Temperature
Instability (BTI) aging models, such as the Reaction-Diffusion model [47]. This relation demonstrates the strong
dependence of the threshold voltage shift (aging) on the supply voltage.
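The voltage dependence in (2-4) can be sketched numerically with the parameter values of Table 2.1. This is an illustration only: the units of the result depend on the units of the fitting constant A, which are not stated here, and the supply voltages, temperature, stress time, and duty factor below are hypothetical.

```python
import math

# Fitting parameters as listed in Table 2.1.
BTI_PARAMS = {
    "nmos": dict(A=3.12e-2, alpha=0.158, beta=1 / 6, gamma=5.2, kappa=50.0),
    "pmos": dict(A=2.02e-2, alpha=0.173, beta=1 / 6, gamma=3.0, kappa=50.0),
}
V_TH = 0.25   # threshold voltage (V), from Table 2.1
T_INV = 1.4   # inversion layer thickness (nm), from Table 2.1


def delta_vth(v_dd, theta, t, df, device="pmos"):
    """Threshold voltage shift of (2-4) for a given supply voltage.

    theta is the temperature in kelvin, t the stress time in seconds,
    and df the duty factor of the stress signal.
    """
    p = BTI_PARAMS[device]
    e_ox = (v_dd - V_TH) / T_INV  # electric field across the gate oxide
    return (p["A"] * math.exp(-p["kappa"] / theta) * t ** p["alpha"]
            * e_ox ** p["gamma"] * df ** p["beta"])


# Hypothetical stress condition: 350 K, roughly ten years, 50% duty factor.
stress = dict(theta=350.0, t=3.15e8, df=0.5)
pmos_ratio = delta_vth(0.6, **stress) / delta_vth(0.8, **stress)
nmos_ratio = delta_vth(0.6, device="nmos", **stress) / delta_vth(0.8, device="nmos", **stress)
print(f"PMOS shift at 0.6 V relative to 0.8 V: {pmos_ratio:.3f}")
print(f"NMOS shift at 0.6 V relative to 0.8 V: {nmos_ratio:.3f}")
```

For a fixed stress condition, the ratio of the shifts at two supply voltages reduces to the ratio of the oxide fields raised to γ, so it does not depend on A, κ, α, or β.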
2.5 Carry Look Ahead Adders
The structure of a 32-bit Carry Look Ahead (CLA) adder is shown in Figure 2.1 [147]. The basic idea of
this adder is to compute several carries simultaneously to reduce the critical path delay of the adder [147].
In the best-case scenario, all carries may be computed at the same time [158]. However, the
implementation of this scenario is impractical due to the large number of gates with a large number of inputs
[158]. Hence, the input vector is divided into groups so that the carries inside each group are computed
simultaneously [158]. The inputs are divided into blocks (groups) of length 8 in the CLA shown in
Figure 2.1. These blocks are connected as in a ripple carry adder [158]. Each block generates the output
carry and the sum bits based on the inputs fed to the block [158].
Figure 2.1. The structure of the 32-bit Carry Look Ahead adder with block size of 8-bits.
[Figure: four 8-bit blocks B1 to B4 with operand slices A[7:0]/B[7:0] through A[31:24]/B[31:24], sum outputs S[7:0] through S[31:24], and carries C7, C15, C23, and Cout.]
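The blocked organization described in this section can be modeled in software as follows. This is a behavioral sketch only: the generate/propagate recurrence is evaluated in a loop, whereas the hardware expands it so that the carries inside a block are available in parallel. The function names are illustrative.

```python
def cla_block(a, b, c_in, width=8):
    """Add two width-bit operands using generate/propagate signals.

    Returns the width-bit sum and the carry out of the block.
    """
    g = a & b  # generate: this bit position produces a carry by itself
    p = a ^ b  # propagate: this bit position passes an incoming carry on
    carries = [c_in]
    for i in range(width):
        g_i = (g >> i) & 1
        p_i = (p >> i) & 1
        carries.append(g_i | (p_i & carries[i]))  # c[i+1] = g[i] + p[i] * c[i]
    total = 0
    for i in range(width):
        total |= (((p >> i) & 1) ^ carries[i]) << i  # s[i] = p[i] xor c[i]
    return total, carries[width]


def cla_adder(a, b, c_in=0, n=32, block=8):
    """n-bit adder built from look-ahead blocks chained like a ripple carry adder."""
    mask = (1 << block) - 1
    result, carry = 0, c_in
    for k in range(0, n, block):
        s, carry = cla_block((a >> k) & mask, (b >> k) & mask, carry, block)
        result |= s << k
    return result, carry


total, c_out = cla_adder(0x12345678, 0x9ABCDEF0)
print(hex(total), c_out)
```

With n=32 and block=8, the carries handed from one block to the next correspond to C7, C15, and C23, and the carry of the last block corresponds to Cout.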
2.6 Approximate Adders
For the sake of brevity, we limit the review to two works on the design of approximate adders
based on biased VOS (Bi-VOS), discussed in [78] and [79], the reconfigurable-accuracy adders
RAP-CLA [119] and GeAr [121], and one recently proposed approximate adder, BCSA [120].
In [78], the idea of Bi-VOS, where the circuit for each bit position could have a different supply voltage, was
presented. The optimization methodology for finding the optimum supply voltages was discussed in [79].
In this work, the optimization framework was based on minimizing the error subject to a given energy
budget while allowing the voltage level to take certain predefined values. In both of these works, the
Bi-VOS technique was applied to the ripple carry adder (RCA). Also, no level shifters were used for connecting
the signals from different voltage domains, making the design vulnerable to noise. In [119],
a reconfigurable approximate Carry Look Ahead adder (RAP-CLA) based on the exact carry look ahead adder was
proposed. This adder has the ability to switch between exact and approximate operating modes during
runtime. The RAP-CLA uses extra multiplexers to provide accuracy reconfigurability, which results in
higher area, delay, and power for the design. In [121], a low-latency generic accuracy
configurable adder supporting variable approximation modes was presented. An error correction unit is integrated into the
adder to increase its accuracy where there is a higher requirement for accuracy. In [120], a low energy
consumption block-based carry speculative approximate adder was proposed. The structure of the adder is based on
partitioning the adder into non-overlapping summation blocks whose structure may be selected from both carry
propagate and parallel-prefix adders.
2.7 Dadda Multiplier
The structure of an 8-bit Dadda multiplier is shown in Figure 2.2 [112]. For most of this work, including
the comparative study part, the focus is on the 8-bit X-Dadda multiplier because of the high usage of
low-bit-width multipliers in image processing and machine learning applications. This multiplier is constructed from
three stages. In the partial product generation (PPG) stage, "AND" operations of the bits of the multiplicand
with the bits of the multiplier are performed. In the partial product reduction (PPR) stage, 8 partial product
terms are compressed to two terms by adding them in some way. Finally, in the merge stage, the final two
terms are added to produce the output. Among these stages, PPR has the highest delay, area, and power
consumption [124]. To reduce the delay of the PPR stage, 4:2 and 5:2 compressors are widely used [124].
For the X-Dadda design, the exact 4:2 compressor, which is comprised of two full adders connected serially,
is used [34].
Figure 2.2. The structure of an 8-bit Dadda multiplier [112].
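As a small behavioral sketch of two of the building blocks named above, the code below models the partial product generation by bitwise AND and an exact 4:2 compressor built from two serially connected full adders. It does not reproduce the Dadda reduction schedule itself, and the function names are illustrative.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, c_in):
    """Exact 4:2 compressor made of two full adders connected serially.

    Returns (sum, carry, c_out); carry and c_out both have twice the
    weight of sum.
    """
    s1, c_out = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, c_in)
    return total, carry, c_out


def partial_products(a, b, n=8):
    """PPG stage: pp[i][j] is bit j of the multiplicand AND bit i of the multiplier."""
    return [[((a >> j) & 1) & ((b >> i) & 1) for j in range(n)] for i in range(n)]


pp = partial_products(200, 123)
product = sum(pp[i][j] << (i + j) for i in range(8) for j in range(8))
print(product, compressor_4_2(1, 1, 1, 1, 1))
```

The compressor preserves the arithmetic value of its five inputs, which is what allows the PPR stage to shrink the partial product rows without changing the final product.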
2.8 Approximate Multipliers
In this section, we briefly review a few works that are most relevant to the proposed X-Dadda
structure (Chapter 4) and/or are included in the comparative study performed in Chapter 4.
2.8.1 Approximate Multipliers Based on Circuit Simplification/Pruning
In [104], to build approximate multipliers, approximate compressors implemented using AND-OR gates
were proposed. In addition, a simple algorithm for an efficient use of compressors in the first steps of the
partial product reduction tree of the multipliers was suggested. In [105], first, an approximate compressor
with a large output error was proposed. Then, by encoding the inputs of the compressor using generate and
propagate signals, a more accurate compressor was suggested. By employing these improved approximate
compressors, two 4×4 approximate multipliers used as the building blocks for the 16×16 and 32×32
multipliers were designed. In [106], non-iterative and iterative approximate logarithmic multipliers (ALMs)
were suggested. In the non-iterative ALMs, three different approximate adders for the mantissa adder were
used. The proposed iterative ALMs used a set-one adder (an n-bit set-one adder consists of two parts of an
m-bit inexact adder and an (n-m)-bit exact adder where the sum of the m-bit adder is set to the logic one)
for the mantissa adders during an iteration. They also utilized lower-part-OR adders and approximate mirror
adders for the final addition. The work described in [107] suggested an approximate Booth multiplier which
was based on approximate radix-4 modified Booth encoding (MBE) algorithms and a regular partial product array
using an approximate Wallace tree. The authors also described two approximate Booth encoders for error-
tolerant computing applications.
In [108], simplifications of 4:2 and 5:2 approximate compressors using the K-map were proposed.
Furthermore, for more simplification, the carry in and carry out ($C_{in}$ and $C_{out}$) of the compressors were
removed. Finally, by using the proposed compressors, some approximate 8-bit Dadda multipliers were
suggested. Using the approximate computing for simplifying the tree stage of the Dadda multiplier was
discussed in [109]. The authors suggested altering the generated partial products to other partial products
with some properties useful for the approximation. Based on the properties of the altered partial products,
approximate half adder, full adder, and 4:2 compressor were suggested. By utilizing these cells as well as
exact OR gates, two approximate multipliers were proposed. In the first one, the approximation was applied
to all of the columns of the partial products of the multiplier while for the second one, the approximation
circuits were used in the 𝑛 −1 least significant columns of the multiplier.
An approximate multiplier (called BAM) where the carry-save adders (CSAs) were utilized for its partial
product reduction stage (an add and shift structure) was proposed in [110]. To improve the power and delay
parameters, some of the CSAs were omitted horizontally and vertically. More specifically, some CSAs in
the first rows of the partial product array and some from the least significant columns of the array were
pruned. In [111], by using Cartesian Genetic Programming (CGP), a library of 8-bit approximate multipliers
was generated. The structure of the multiplier was represented by integer chromosomes where the mutation
operation was employed to generate new populations in the employed genetic algorithm. The fitness
function was defined based on the mean relative error distance (MRED) of the circuits represented by the
chromosomes. In addition, to limit the search space, some constraints on the error, power, and delay were defined. Based on this approach,
a library containing 471 8-bit approximate multipliers was generated.
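The error metrics used to compare approximate multipliers, namely the mean error distance (MED), mean relative error distance (MRED), and mean normalized error distance (NMED), can be computed exhaustively for an 8-bit design as sketched below. The definitions follow common usage in the approximate arithmetic literature. Skipping zero-valued exact products in the MRED is one possible convention, and `low_bits_dropped` is a hypothetical stand-in multiplier used only to exercise the code.

```python
def error_metrics(approx_mul, n=8):
    """Exhaustively compute (MED, MRED, NMED) for an n-bit unsigned multiplier."""
    max_out = (2 ** n - 1) ** 2  # largest exact product, used to normalize MED
    total_ed = 0
    total_red = 0.0
    nonzero = 0
    count = 0
    for a in range(2 ** n):
        for b in range(2 ** n):
            exact = a * b
            ed = abs(approx_mul(a, b) - exact)  # error distance for this input pair
            total_ed += ed
            count += 1
            if exact != 0:  # assumed convention: relative error undefined for exact == 0
                total_red += ed / exact
                nonzero += 1
    med = total_ed / count
    mred = total_red / nonzero
    nmed = med / max_out
    return med, mred, nmed


def low_bits_dropped(a, b, k=4):
    """Hypothetical approximate multiplier: zero the k lowest bits of the product."""
    return ((a * b) >> k) << k


med, mred, nmed = error_metrics(low_bits_dropped)
print(f"MED={med:.3f}  MRED={mred:.5f}  NMED={nmed:.7f}")
```

Any callable with the signature `(a, b) -> int` can be passed in, so the same routine could score a gate-level simulation wrapper as well as a behavioral model.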
In [112], two 4:2 approximate compressors for realizing approximate Dadda multipliers were
suggested. Four approximate multipliers, based on the two suggested approximate compressor structures,
were designed. In [80], an inaccurate 4:2 counter for reducing the partial products levels in the Wallace
multiplier, without any modification to the other parts of the structure, was proposed. This inaccurate 4:2
counter was used to build a basic power efficient inaccurate 4×4 Wallace multiplier. Arbitrarily large
multipliers have been built based on this multiplier. The authors also proposed an efficient error detection and
correction circuit.
In [113], a partial product perforation technique for designing approximate multipliers was introduced.
The authors mathematically proved that the error was bounded and predictable based on the input
distribution.
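A behavioral sketch of partial product perforation is given below, under the assumption that perforation means omitting k successive partial products starting from the j-th one. The parameter names j and k are illustrative and are not taken from the text above.

```python
def perforated_multiply(a, b, j, k, n=8):
    """Multiply n-bit unsigned operands while skipping partial products j .. j+k-1.

    Partial product i is the multiplicand a shifted left by i, included only
    when bit i of the multiplier b is set.
    """
    total = 0
    for i in range(n):
        if j <= i < j + k:
            continue  # perforated: this partial product is never generated
        if (b >> i) & 1:
            total += a << i
    return total


def perforation_error(a, b, j, k):
    """Closed-form error: a times the multiplier bits that were skipped."""
    skipped_mask = ((1 << k) - 1) << j
    return a * (b & skipped_mask)


a, b = 200, 123
approx = perforated_multiply(a, b, j=0, k=2)
print(a * b, approx, a * b - approx)
```

Under this assumption the error is never negative and is at most the multiplicand times the value of the skipped bit positions, which is consistent with the error being bounded and dependent on the input distribution.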
2.8.2 Approximate Multipliers Based on Voltage Overscaling
Applying the overscaled voltage selectively to some parts of the arithmetic hardware unit has been
introduced in prior works (see [81] and [114]), where the term biased VOS (Bi-VOS) was used to
differentiate it from the case of applying it to the whole unit. This scheme offers less accuracy loss for a
given energy reduction. In [81], the idea of Bi-VOS+SLEEP in a 32-bit floating point multiplier was
presented. Since the multiplier hardware was not implemented, the error and energy were not accurately
extracted. In [114], the usage of Bi-VOS for minimizing the error with a specified energy budget for an
array multiplier was investigated. The authors derived a few theoretical results for assigning the VOS levels when
designing an approximate multiplier with minimum error, given an energy budget constraint [114]. Also, it
was stated that the full adders (FAs) of the same column should have the same voltage [114].
In Chapter 4, we focus on investigating an energy-efficient, accuracy-configurable Dadda multiplier
(X-Dadda) based on VOS. In our X-Dadda structure, we also apply the VOS technique to some
columns of the partial product reduction and merge stages of the Dadda multiplier (see Figure 4.7).
There are, however, several differences between our work in chapter 4 and the ones suggested in [81] and
[114] which focused on the floating point multipliers and array multipliers, respectively. Both of these
designs did not use voltage level shifter circuits (for the transition from the low voltage to the high voltage)
in their structures. The use of the level shifter reduces the output error as well as its uncertainty. In addition,
none of these works suggested runtime accuracy configurability for the proposed multipliers. Both works
of [81] and [114] did not implement the full hardware realizations of the multipliers with the Bi-VOS
technique. Hence, a complete and accurate analysis of the output error and energy determinations through
the whole circuit simulation are not possible and hence they will not be included in the comparative study
of Section V. More specifically, in [81], for calculating the output error of the 32-bit floating point
multiplier, a simulator (based on the C programming language) was developed. The simulator used the error
of a 1-bit full adder calculated using HSPICE simulation. Also, for calculating the energy, they used
HSPICE to calculate the energy consumed when the sum and carry (i.e., outputs) of a 1-bit full-adder was
toggled [81]. Then, the number of toggles that occurred in a multiplier was obtained by simulating the
Verilog HDL of the circuit, and next, based on the amount of the toggles, the total energy consumed in a
multiplier was determined [81]. The authors assumed that the energy was consumed only when any output
changed [81]. Moreover, the static energy consumption was not considered in this work. In [114], the
authors used a high-level analytical expression for measuring only the output error of the array multiplier.
In addition, for extracting the energy, short-circuit, leakage, and DC current drawn from the supply voltage
were not considered. Finally, none of these works reported the standard error metrics including mean error
distance, mean relative error distance, and mean normalized error distance for evaluating the accuracy of
the Bi-VOS based approximate multipliers.
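As an illustration of these three metrics, the following Python sketch evaluates them exhaustively for a small unsigned multiplier. It assumes the commonly used definitions: the error distance (ED) is the absolute difference between the approximate and exact products, MED is the mean ED, MRED is the mean of ED divided by the exact product (taken over non-zero products), and NMED is MED divided by the maximum exact product. The function `approx_mul`, which simply clears the low-order product bits, is a hypothetical stand-in and is not one of the Bi-VOS multipliers discussed above.

```python
from itertools import product


def approx_mul(a: int, b: int, dropped_bits: int = 2) -> int:
    """Hypothetical approximate multiplier: clear the low bits of the exact product."""
    return ((a * b) >> dropped_bits) << dropped_bits


def error_metrics(n_bits: int = 4, dropped_bits: int = 2):
    """Exhaustively compute (MED, MRED, NMED) for n_bits x n_bits unsigned inputs."""
    max_product = (2**n_bits - 1) ** 2
    eds, reds = [], []
    for a, b in product(range(2**n_bits), repeat=2):
        exact = a * b
        ed = abs(approx_mul(a, b, dropped_bits) - exact)  # error distance
        eds.append(ed)
        if exact != 0:  # relative error is undefined for a zero product
            reds.append(ed / exact)
    med = sum(eds) / len(eds)
    mred = sum(reds) / len(reds)
    nmed = med / max_product
    return med, mred, nmed


med, mred, nmed = error_metrics()
print(f"MED={med:.4f}  MRED={mred:.4f}  NMED={nmed:.6f}")
```

For wider multipliers the exhaustive loop would normally be replaced by random sampling of the input space.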
2.8.3 Accuracy-Configurable Approximate Multipliers
Supporting output accuracy configurability is a feature which has received renewed attention for designing
approximate multipliers in dynamically adjustable quality-of-service (QoS) applications. In [34], four
configurable approximate compressors for implementing approximate multipliers were suggested. The
multiplier structures had the flexibility of switching between the exact and approximate operation modes.
In [115], methods for designing quality configurable circuits were developed. The method was based on
CGP, in which the exact and approximate modes of the circuit were realized at the same time. The method
was experimentally evaluated by designing a configurable approximate multiplier which had both exact and
approximate modes. This multiplier only has the two modes of exact and approximate computation and hence
was not considered in our comparative study. In [116], an accuracy-configurable multiplier was proposed.
The final product of the multiplier is generated by a carry-maskable adder. The accuracy of the proposed
multiplier can be dynamically configured by changing the length of the carry propagation. Furthermore, the
partial product tree of the multiplier is approximated by the proposed tree compressor. An extension of this
design, which had smaller area and power consumption, was described in [117]. The authors used a simpler
approximate tree compressor. In [118], an approximate multiplier based on an approximate adder design
which limits its carry propagation to the nearest neighbors was suggested. The adder was used for fast
partial product accumulation [118]. Different levels of accuracy were provided through their
configurable error recovery circuit, realized using OR gates or the proposed approximate adders [118]. Since
the proposed structure in chapter 4 does not use an error recovery unit, this multiplier was not included in the
comparative study of chapter 4.
2.9 Typical CGRA Architecture
A typical CGRA architecture which mainly consists of a 2-D mesh array of PEs, host controller, context
and data memories is shown in Figure 2.3 [40].
Figure 2.3. A typical 4 × 4 CGRA architecture [40].
The CGRA is connected to the host CPU through the host controller, which is connected to the data memory
and context memory (used for storing the configuration information). The host controller executes control
intensive irregular code segments [41]. The host controller, whose responsibility is to execute non-loop or
outer loop code, may be a VLIW processor (e.g., ADRES), a DSP processor (e.g., Montium), or a general-purpose
microprocessor (e.g., MOLEN) [42]. This host also controls the reconfiguration of the array [42]. Each PE
has an arithmetic logic unit (ALU), a local register file and an output register [40]. The data memory
provides operand data to the PE array through a high-bandwidth data bus and the context memory stores the
context words used for configuring the PE array elements [8]. One of the important properties of CGRAs
is the notion of multiple contexts [43]. A single configuration of the CGRA's PE functionality and routing
connectivity is called a context [43]. A CGRA with two contexts would contain two copies of the PE
configuration and routing connectivity: configuration 0 and configuration 1 [43]. The CGRA is set to cycle
between the two contexts on a cycle-by-cycle basis [43]. Therefore, the functionality of the PEs and the
routing connectivity can change each cycle [43]. This means that the resources of the CGRA are time
multiplexed, where the PEs and the routing can be used for different purposes in each context [43]. The
architecture is symmetric, where each PE is connected to its neighbors. Also, the register files are employed
to hold temporary data. The ALU of each PE can provide multiple logic and arithmetic operations determined
by the configuration context data. Here, for the sake of simplicity, we have assumed that each ALU
component only consists of an adder and a multiplier. For each specific application, the algorithm
(computations) changes, and hence, the context data configures the CGRA accordingly. The configuration
may be optimized for power, speed, and/or lifetime. Schedulers assign a PE and a time to every operation
in the program DFG, where the operand values should be routed between producing and consuming PEs.
Since dedicated routing resources are not provided, a PE either serves as a compute resource or as a routing
resource at a given time. A compiler scheduler manages the computation and flow of operands across the
array to effectively map applications onto CGRAs [7].
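The cycle-by-cycle context switching described above can be pictured with a small behavioral sketch in Python. It models a single PE whose ALU offers only an adder and a multiplier, as assumed in the text; the opcode names, the `run_pe` function, and the operand values are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical opcode table: the text assumes each ALU has only an adder and a multiplier.
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": lambda a, b: a + b,
    "mul": lambda a, b: a * b,
}


def run_pe(contexts: List[str], operands: List[Tuple[int, int]]) -> List[int]:
    """Run one PE for len(operands) cycles, cycling through its stored contexts."""
    results = []
    for cycle, (a, b) in enumerate(operands):
        opcode = contexts[cycle % len(contexts)]  # context selected anew every cycle
        results.append(OPS[opcode](a, b))
    return results


# Two-context PE: configuration 0 multiplies, configuration 1 adds.
print(run_pe(["mul", "add"], [(2, 3), (2, 3), (4, 5), (4, 5)]))  # [6, 5, 20, 9]
```

A real context word would also encode the routing (multiplexer) selections; these are omitted here to keep the sketch short.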
There are quite a few research efforts which have focused on different aspects of CGRAs (such as [49],
[50], and [51]). In [49], an architecture-agnostic integer linear programming (ILP) approach for CGRA
mapping has been proposed. This approach has been integrated within an open-source CGRA evaluation
framework. In [50], the authors provide a data placement optimization approach for CGRAs to
simultaneously optimize the performance of CGRA execution and data transformation between main
memory and multi-bank memory. Also, in [51], a dual-V_DD CGRA has been proposed to reduce both static
and dynamic power. In this work, the authors assign the high V_DD to the PEs which are going to execute long
operations (such as multiplication), while the low V_DD is assigned to the PEs which execute short operations
(such as addition).
2.10 Approximated CGRA
Here, we briefly review the works which are related to approximated CGRAs. In [52], the authors perform a
design-space exploration of state-of-the-art approximation designs, propose a flow for designing
approximate CGRAs, and discuss the compilation and runtime reconfiguration issues. Similarly, in [38], the
concept of the Polymorphic Approximate CGRA (PX-CGRA), which employs heterogeneous tiles of
Polymorphic-approximated ALU Clusters (PACs) connected in a 2-D mesh style, is introduced.
Moreover, per the runtime output quality requirements of the application, the PACs can have different
approximate modes as well as accurate modes based on their selected configuration.
2.11 CGRA Lifetime Improvement
The only two works which have focused on improving the lifetime of CGRAs are those of [53] and [40].
In [53], the initial idea of using VOS in CGRAs considering five voltage levels was proposed by our group.
Only two benchmarks, a 4th-order polynomial evaluation and an 8-tap FIR filter, were studied. In [40], a joint
stress-aware loop mapping method that helps designers select the optimized mapping with the minimal
stress on the PEs was suggested. This is performed in the early phase of the CGRA mapping. Their stress
optimization had two objectives: reducing the maximum accumulated stress on the PEs and
providing a more balanced stress distribution over the PEs of the CGRA. For the first objective, they introduced
a stress-aware Force-Directed scheduling method (sFDS) to schedule operations at different time slots. This
helped to prevent operations (especially those of the same type) from being mapped onto the same PE. A rapid
MCC (Maximal Compatibility Classes) search method [54] was used to find the optimal maps, which had the lowest
maximum stresses and distributed the operations over more PEs. The MCC search helped achieve both objectives.
Also, they created a set of ordered maps based on the first objective. They used a multi-map
scheduling technique, which used the dynamic reconfiguration feature of the CGRAs, for employing
these ordered maps. The method helped reach a more balanced PE usage distribution and reduce the
maximum stress on each PE.
2.12 NVDLA: NVIDIA Deep Learning Accelerator
NVDLA is an accelerator platform for implementing the neural network models used in deep learning [83].
It is a configurable accelerator from NVIDIA for inference in deep learning applications. This accelerator
provides acceleration for CNNs by speeding up the computations of various building blocks of each CNN
layer operations (e.g., convolution, deconvolution, fully-connected, activation, pooling, local response
normalization) [83]. This modular architecture supports typical deep learning frameworks, such as
TensorFlow and Caffe, enabling a highly configurable solution that is easily scalable to meet specific needs
[83].
2.12.1 NVDLA Hardware
Each block in the NVDLA architecture supports a specific inference operation of deep learning [83]. Most
of the computational work of deep learning inference operations relies on mathematical operations, which
can be divided into four parts of convolution, activation, pooling, and normalization [83]. The common
feature between these mathematical operations makes them well suited for dedicated hardware
implementations [83]. The predictability feature of their memory access patterns makes it easy for
parallelizing the operations [83]. The main operations in NVDLA are convolution operations, single data
point operations, planar data operations, data memory and reshape operations [83]. By configuring
hardware parameters of NVDLA, different sizes of NVDLA can be implemented [83]. The core block of
NVDLA is schematically shown in Figure 2.4. Next, we describe the convolution core of the NVDLA core
block.
2.12.1.1 Convolution Core
The convolution operation is applied to two sets of data [83]. The first data set consists of the weights, which
are obtained through offline training and remain constant between any two consecutive inferences, and
the other set is the input feature data, which varies with the network input [83].
The convolution engine of NVDLA has configurable parameters which enable four different modes of
operation: direct convolution mode, image-input convolution mode, Winograd convolution mode, and
batching convolution mode [83]. Each of these modes provides a specific performance improvement for the
convolution operation [83]. As shown in Figure 2.4 (a), the convolution block has five stages including
Convolution DMA (CDMA), Convolution Buffer (CBUF), Convolution Sequence Controller (CSC),
Convolution MAC (CMAC), and Convolution Accumulator (CACC) [84]. The host CPU (see Figure 2.4
(b)) provides configuration data through the CSB slave port of each stage [84]. All the stages use a single
synchronization method. The convolution pipeline has 1024 MACs for 16-bit signed integer (i.e., Int16) or
16-bit floating point (i.e., FP16) precisions with a 32-element accumulator array for storing the partial sums
[84]. The MAC array can also be configured to provide 2048 MACs for 8-bit signed integer (i.e., Int8)
precision [84]. In this work, the Int8 version of the MAC array has been used. To better optimize the
physical design of the MAC array, it is divided into two parts, CMAC_A and CMAC_B [84]. The detailed
schematic diagram of the CMAC stage is presented in Figure 2.5 (a), where there are 8 rows, each of which
has 128 Int8 multipliers.
Figure 2.4. The hardware diagram of a) the core block of NVDLA, redrawn from [85], and b) the convolutional core in the
NVDLA core block, redrawn from [84].
Figure 2.5. The detailed schematic of a) The MAC array redrawn from [86]. b) The convolution buffer redrawn from [84].
The convolution buffer (CBUF) is the second stage in the convolution pipeline [84]. The buffer contains a
total of 512KB (16 32KB banks) of SRAM. The SRAMs store input pixel data, input feature data, weight
data, and weight mask bit (WMB) data from the CDMA module, and are read by the convolution sequence
generator module. The buffer has three read ports and two write ports. Each bank has two 512-bit-wide,
256-entry 2-port SRAMs [84]. The banks act as three logical circular buffers named the input data
buffer, the weight buffer, and the WMB buffer. The schematic diagram of the convolution buffer is depicted in
Figure 2.5 (b).
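The capacity figures quoted above are mutually consistent, which can be checked with a few lines of Python arithmetic. The sketch only uses the numbers given in the text (16 banks, two SRAMs per bank, 512-bit width, 256 entries); the variable names are ours.

```python
BANKS = 16
SRAMS_PER_BANK = 2
WIDTH_BITS = 512
ENTRIES = 256

bytes_per_sram = WIDTH_BITS * ENTRIES // 8        # 16384 B = 16 KB
bytes_per_bank = SRAMS_PER_BANK * bytes_per_sram  # 32768 B = 32 KB
total_bytes = BANKS * bytes_per_bank              # 524288 B = 512 KB

print(bytes_per_bank // 1024, "KB per bank,", total_bytes // 1024, "KB in total")
```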
2.12.2 NVDLA Software
NVDLA is supported by a complete software ecosystem [82]. Part of the ecosystem is the
software stack on the device, which is included in the open source version of the NVDLA. The software is
divided into two parts: the compilation tool and the runtime environment. The compilation tool, which is used for
model analysis, compiles models into a software-usable format called a Loadable file. The software in the runtime
environment loads and executes the compiled neural network (the Loadable file of the specific neural network
model) on NVDLA.
2.12.3 NVDLA Virtual Platform
NVIDIA also provides a virtual platform for rapid development and debugging of software for NVDLA
[83]. The virtual platform is based on GreenSocs QBOX, which is a solution for co-simulation with QEMU
and SystemC. The virtual platform has a simple CPU which can load the hardware information of NVDLA.
Its block diagram is shown in [83].
Figure 2.6. The block diagram of the virtual platform redrawn from [87].
2.13 Approximate Implementation of NN Accelerators
In this section, prior works focused on approximate multipliers, VOS-based memories, and the use of VOS
in both multipliers and memories for reducing the energy consumption of neural network accelerators are
briefly reviewed. In addition, works on accuracy configurable accelerators are included in this review.
2.13.1 Approximate Multipliers in NN Accelerators
In [88], the authors used approximate multipliers in a neural network accelerator to reduce the energy
consumption. In this work, an error correction module was employed to compensate for the approximation
error propagated from one stage/part of the design to the subsequent stage/part. In [89], the voltage
overscaling was applied to the multipliers of the neural network accelerator considering the sensitivity of
the weights to error with respect to the layer, filters, and channels. They also presented a heterogeneous
MAC unit design approach where some of the MACs were designed larger for aggressive voltage scaling
[89]. A methodology for designing power-efficient neural networks was suggested in [90]. The desired
trade-off between accuracy and implementation cost was obtained through the use of CGP for designing
approximate multipliers in [90].
In [91], the authors introduced an approximate computing method in which deep neural networks
(DNN) computations were performed block wise. In this work, the reconfigurability was supported at the
block granularity level. The results of the block wise computations were combined approximately to enable
efficient reconfigurability (here reconfigurability meant changing the number and indices of the weight and
activation blocks for the MAC operation) [91]. In [92], to provide accuracy configurable approximate
multipliers, gate-level pruning and wire-by-switch replacing techniques were invoked using a simulated
annealing framework.
A framework that enables aggressive voltage overscaling of MACs in DNN accelerators without
compromising performance (speed) and considerable impact on the classification accuracy is described in
[93]. In this work, a timing error recovery technique for DNN accelerators, called TE-Drop, was presented.
The technique detected timing errors by using Razor flip-flops. The performance impact of re-execution
was avoided by dropping the MAC operation subsequent to an erroneous MAC operation and using the
extra clock cycle to correctly compute the erroneous MAC result.
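The dropping behavior summarized above can be mimicked with a short behavioral model in Python. This is only a sketch of the idea as described in this paragraph, not the circuit of [93]: the function name `te_drop_accumulate` and the per-operation `error_flags` list (standing in for the Razor flip-flop error signals) are hypothetical.

```python
def te_drop_accumulate(weights, activations, error_flags):
    """Accumulate w*x products; after a flagged MAC, the next MAC is dropped.

    error_flags[i] is True when a timing error is detected on MAC i. The cycle
    of MAC i+1 is then borrowed to finish MAC i correctly, so MAC i+1
    contributes nothing to the sum and the total cycle count is unchanged.
    """
    acc = 0
    drop_next = False
    for w, x, timing_error in zip(weights, activations, error_flags):
        if drop_next:
            drop_next = False  # this cycle was used to recover the previous MAC
            continue
        acc += w * x           # correct contribution (after recovery if flagged)
        drop_next = timing_error
    return acc


# MAC 1 is flagged, so MAC 2 (3*1) is dropped: 1 + 2 + 4 = 7 instead of 10.
print(te_drop_accumulate([1, 2, 3, 4], [1, 1, 1, 1], [False, True, False, False]))
```

The model makes the accuracy cost visible: every detected timing error removes one product from the accumulated sum.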
2.13.2 Applying VOS to On-Chip Memories in NN Accelerators
The application of the voltage overscaling technique to the buffer of each layer was suggested in [94]. The
technique was to reduce the energy consumption of the buffer access. In [95], the authors investigated the
efficacy of SRAM voltage scaling in a 9-layer CIFAR-10 binarized ConvNet processor. They achieved
memory energy savings with minimal accuracy degradation. Also, in this work, the effect of the bit error
accumulation in a multilayer network was quantified indicating that further energy savings were possible
by separating the weight and activation voltages [95]. To improve the energy efficiency of DNN
accelerators, in [96], a methodology that enabled aggressive voltage scaling of accelerator weight memories
was proposed. The technique, which was called MATIC (memory adaptive training with in-situ
canaries), provided accurate operation under voltage overscaling. This was achieved by injecting profiled
SRAM bit errors (due to VOS) throughout the offline training process, letting the neural network
compensate via adaptation (the inherent error resilience of neural networks is used here).
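The fault-injection step can be pictured with the following Python sketch, which flips a given set of bit positions in 8-bit two's-complement weights. It is a simplified illustration under our own assumptions, not the procedure of [96]: the function `inject_bit_errors` and the list of profiled positions are hypothetical, and the training loop that would consume the faulty weights is omitted.

```python
def inject_bit_errors(weights, flip_positions, n_bits=8):
    """Flip the listed (weight_index, bit_index) cells of two's-complement weights."""
    mask = (1 << n_bits) - 1
    words = [w & mask for w in weights]  # raw n_bits-wide storage words
    for idx, bit in flip_positions:
        words[idx] ^= 1 << bit           # the profiled faulty cell inverts its bit
    sign = 1 << (n_bits - 1)
    # Reinterpret the stored words as signed integers.
    return [v - (1 << n_bits) if v & sign else v for v in words]


# Hypothetical profile: bit 0 of weight 0, bit 7 of weight 1, bit 6 of weight 2.
print(inject_bit_errors([3, -1, 64], [(0, 0), (1, 7), (2, 6)]))  # [2, 127, 0]
```

The example shows why the bit position matters: a flip in the sign bit turns -1 into 127, whereas a flip in the least significant bit changes the weight by only one.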
The work in [97] presents a technique of low-voltage neural network acceleration, where the embedded
SRAM architecture was equipped with a novel application aware supply voltage boosting capability. This
technique enabled very low voltage operation during most of the application run while mitigating the low-
voltage induced failures [97].
In [98], the authors described one method to reduce the memory power by exploiting the error resilience
feature of neural networks and tolerating bit errors under reduced supply voltages. They extensively studied
the effectiveness of this idea and showed that further savings were possible by injecting error during the
neural network training [98]. In [99], the impact of voltage scaling on the precision and error resilience of
edge devices relying on CNNs was explored. The study showed that the CNN used in their industrial case
was quite resilient against errors of the model, while the impact of the errors in the intermediate buffers,
which hold the activations, was more critical [99].
2.13.3 Simultaneous Application of the VOS Technique to both Computation and Memory Units of NN Accelerators
A fully programmable IoT end-node system-on-chip (SoC) capable of executing software-defined,
hardware accelerated binarized neural networks (BNNs) operating at ultra-low voltage was introduced in
[100]. Using the 22nm FDX technology, it was demonstrated that both the logic and SRAM operating voltages
could be dropped to 0.5V without any accuracy penalty on a BNN trained for the CIFAR-10 dataset. In [101],
the authors empirically studied the efficiency of the VOS approximation technique to improve the
power-efficiency of convolutional neural network accelerators mapped to Field Programmable Gate Arrays
(FPGAs). They experimentally studied the reduced-voltage operation of multiple components of FPGAs
[101]. Then, the corresponding accuracy behavior of CNN accelerators when combining the VOS technique
with architectural CNN optimization (quantization and pruning) was characterized [101].
2.13.4 Runtime Accuracy Reconfigurable NN Accelerators
Approximate computing principles and neural network inference were brought together in [103] by designing
neural networks with specific approximate multipliers providing multiple accuracy levels at runtime. In
this work, an automated framework for mapping the neural network weights to the accuracy levels of
the approximate reconfigurable accelerator was also proposed. To provide accuracy reconfigurable neural
network inference, a heterogeneous structure using approximate units with fixed accuracy levels (static
approximation levels) was suggested in [102]. During runtime, while layer-wise approximation was
invoked, the multipliers which were not used were power gated. The authors in [92] proposed two specific
accuracy configurable multipliers for implementing the neural network model. The accuracy level of the
multipliers of each layer was assigned at runtime.
CHAPTER 3. ENERGY CONSUMPTION AND LIFETIME IMPROVEMENT OF COARSE-GRAINED RECONFIGURABLE ARCHITECTURES TARGETING LOW-POWER ERROR-TOLERANT APPLICATIONS
In this chapter, an energy-quality scalable coarse-grained reconfigurable architecture based on the voltage
overscaling technique is presented. The proposed technique, which may be applied to CGRAs used as
accelerators for low-power and error-tolerant applications, reduces the (strongly voltage-dependent)
wearout effects and the energy consumption of PEs whenever the error impact on the output quality
degradation can be tolerated. This provides us with the ability to lessen the wearout and reduce energy
consumption of PEs when accuracy requirement for the results is rather low.
Multiple degrees of computational accuracy can be achieved by using different overscaled voltage levels
for the PEs. By employing the technique, the architecture may be configured for accurate or approximate
modes of computation depending on a user specified output quality of service target for a given application.
More precisely, operating voltages used for performing various operations in the application data flow
graph are minimized subject to the output quality constraint by using an energy-quality trade-off algorithm.
Two CGRA architectures are proposed in this chapter. In the first architecture, each PE can be assigned to
any of the VOS levels, while in the second architecture, PEs are clustered into groups of (e.g., 3×1 and 2×1)
voltage islands to improve efficiency. The efficacies of the proposed techniques are studied by considering
the bias temperature instability.
The rest of the chapter is organized as follows. We present the key idea and its potentials in Section 3.1.
Section 3.2 presents the proposed architectures. The results are discussed in Section 3.3, while the chapter
is concluded in Section 3.4.
3.1 Key Idea and Its Potentials for CGRA Optimization
In this chapter, for the first time, to the best of our knowledge, we propose the use of the VOS technique
for CGRAs utilized for low-power error-tolerant applications. Our focus is on improving the lifetime and
reliability and lowering the power consumption (as a byproduct) of the accelerators based on the CGRAs.
Using this technique, the voltage assignment could be selectively applied to the PEs (assigned to the nodes
of the DFG) whose error impacts on the output do not cause degradation beyond the given constraint. To
illustrate the idea, consider the simple DFG shown in Figure 3.1 (a), where each operation can run
using one of five voltage levels. For this simple case with three operations, there are 5^3 = 125 different
combinations, a number which increases exponentially with the number of PEs. The voltage of each PE should
be optimally assigned according to the design objective (e.g., lifetime or power consumption improvement)
considering the given constraint (e.g., output quality).
Figure 3.1 a) A simple DFG and b) DFG with multiple choices of over-scaled voltage levels.
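The size of this search space can be reproduced with a brief Python sketch. The node names M1, M2, and A1 follow the labels of Figure 3.1; the voltage levels are represented only by indices 0 to 4, since their actual values do not matter for the count, and the larger array sizes in the loop are arbitrary examples.

```python
from itertools import product

NODES = ["M1", "M2", "A1"]  # the three operations of the DFG in Figure 3.1
LEVELS = range(5)           # five voltage levels, identified by index only

assignments = list(product(LEVELS, repeat=len(NODES)))
print(len(assignments))     # 125 = 5**3

# The number of candidate assignments grows exponentially with the node count.
for n in (3, 9, 16):
    print(n, "nodes:", 5**n, "assignments")
```

Exhaustive enumeration is therefore only practical for very small DFGs, which motivates an optimization formulation for the voltage assignment.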
Since the CGRA may consist of several similar ALUs, the roles of the PEs with respect to the corresponding
DFG nodes could be altered. This means that we could exchange the assignment of the PEs with the higher supply
voltages (which have higher accuracies) with those with lower supply voltages. This way, the time that a
node is under high voltage is reduced, yielding lower average stresses for all the PEs. Additionally, since
we are using switches that connect the corresponding voltages to PEs, one can set one of these voltages to
zero and activate this switch if the PE is going to be idle for a long enough time. Finally, since the accuracy
is a function of the PE supply voltage, a variable-accuracy accelerator may be realized using the VOS
technique (not discussed in this chapter). As was mentioned above, however, the focus of this work is only
on the application of this technique for the lifetime and reliability enhancement of the CGRAs. In the next
section, we explain the proposed architectures in more detail.
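The effect of exchanging PE roles can be illustrated with a toy Python model in which the assignment of supply voltages is rotated among the PEs from one round to the next. The round-robin policy, the function name, and the voltage values are hypothetical, and the time-averaged supply voltage is used here only as a crude proxy for stress; it is not the aging model used in this work.

```python
def average_voltage_per_pe(voltages_mv, n_rounds):
    """Rotate the voltage assignment each round; return each PE's average voltage."""
    n = len(voltages_mv)
    totals = [0.0] * n
    for r in range(n_rounds):
        for pe in range(n):
            totals[pe] += voltages_mv[(pe + r) % n]  # PE takes its neighbor's role
    return [t / n_rounds for t in totals]


# Without rotation, one PE always sees the highest voltage.
print(average_voltage_per_pe([800, 700, 600], 1))  # [800.0, 700.0, 600.0]
# After a full rotation, every PE sees the same average voltage.
print(average_voltage_per_pe([800, 700, 600], 3))  # [700.0, 700.0, 700.0]
```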
[Figure 3.1 labels: nodes M1 and M2 (multiplications) and A1 (addition); in panel (b), each node is annotated with the over-scaled voltage levels 750 mV, 700 mV, 650 mV, and 600 mV.]
3.2 Proposed Architectures
In the conventional CGRA architecture, the full supply voltage (V_DD) is used for all the PEs (as well as other
blocks), and hardware modules perform exact computations. As discussed in chapter two, one may use
approximate FUs to make the CGRA an approximate one (see, e.g., [38]) or even use accuracy-configurable
PEs to provide both approximate and exact computing modes (see, e.g., [38]). We have proposed two
architectures for the CGRA: (1) a CGRA without voltage islands and (2) a CGRA with voltage islands. The general
architectures of the proposed CGRAs, which have two major differences from the conventional CGRAs, are
depicted in Figure 3.2 and Figure 3.5. The first major difference is the availability of two separate operating
voltages, one for the I/O and the other one for the core of the PEs. The former could be the nominal voltage
of the technology. The inputs of the power switch boxes, realized by MOSFET switches, are the different VOS
levels, and the outputs of these power switch boxes are the core voltages of the
PEs. Using a single power switch box for each PE (CGRA without voltage island) provides us with the
flexibility of assigning any voltage level to any PE. This increases the efficiency of the proposed
CGRA core architecture in terms of energy consumption and lifetime/reliability improvements. It, however,
has the disadvantage of increasing the area and power overheads and the voltage level routing complexity. To
reduce this overhead, we propose the CGRA with voltage island in Section 3.2.2. In this section, first, details
of the proposed CGRA architectures are provided. Next, the accuracy level determination process and
the mapping of the application onto the CGRAs will be discussed.
3.2.1 Proposed CGRA Architecture without Voltage Island
In this section, we suggest an accuracy-configurable CGRA architecture, which makes use of exact
hardware modules while selectively using the VOS technique to switch to approximate computations. The
overall architecture of the proposed CGRA is depicted in Figure 3.2.
Figure 3.2 The proposed CGRA Architecture to implement the Voltage Over-Scaling technique.
Compared to the conventional CGRA, the proposed one contains “Power Switch Box” units, and the PEs
have two input voltages, the Core Voltage (V_C) and the I/O Voltage (V_IO). For each row of the CGRA, we suggest
using a Power Switch Box unit.
The inputs of this unit are the different considered VOS (operating) levels and its outputs are the core
voltages of the PEs. This unit consists of large MOSFET switches which connect the input voltage of each
PE to one of the voltage rails based on the considered core voltage level for that PE. Note that we can
consider one Power Switch Box unit for all PEs, which leads to increasing the size of this unit and the
routing complexity of the voltage rails of the PEs. On the other hand, one may consider one small Power
Switch Box unit for each PE and place it near each PE, which leads to increasing the routing complexity of
the considered voltage rails (V_1 to V_n). Hence, we suggest considering one of these boxes for each row. In
addition to the core voltage, the I/O voltage of all PEs is connected to the nominal voltage of the system.
In Figure 3.3, the internal structure of each PE and the voltage domains are shown.
Figure 3.3. The proposed architecture of each processing element (PE).
In this chapter, without loss of generality, the ALU part only consists of a 32-bit Carry Look Ahead
adder and a 16-bit Dadda multiplier. The operating voltage of the ALU unit is determined by V_C. The output
of the ALU is connected to the 32-bit output level-converter register (consisting of level-converter flip-flops
[63]). The inputs of this register are in the V_C voltage domain while its outputs are in the V_IO voltage domain.
For connecting the PE to its neighboring PEs, there are two 5-to-1 multiplexers for the inputs of the ALU
of the PE and one 1-to-5 demultiplexer for sending the output of the PE to its four (up, down, left, right)
neighbors or the output bus. These multiplexers are in the V_IO voltage domain. Based on this architecture,
all connections between PEs are operated at the nominal voltage while the cores of the PEs may have
different operating voltage levels. The core voltages are applied by the host controller to the Power Switch
Box units, where the values of these voltages are determined by the approach presented in the next
sub-section.
3.2.1.1 Accuracy Determination and Mapping
To map an application onto the proposed CGRA, the accuracy of the operations should first be determined.
As mentioned before, in the proposed architecture, the degree of inaccuracy of each PE is set by its
applied operating voltage. In the proposed formulation, a mapping between the output accuracy constraint
and the operating voltage levels of the operations is established.
The output error of a DFG is a linear combination of the operation errors [64]. Hence, the output error
is determined by the amount of error generated by each approximate operation that reaches the
output of the DFG. The error distribution of each operation in a DFG is independent of those of the other
operations unless both operations have the same input data and the same hardware implementation.
Accuracy levels of nodes: The error propagation shows strong structural correlation, which should be used
to model the output error [64]. In [64], an error sensitivity parameter (ES) was introduced for obtaining
the output error. The error sensitivity shows the impact of the accuracy of a node on the
output when the other nodes are precise. Hence, the error sensitivity of the w-th node (i.e., $ES_{w,o}$) is defined
by $ES_{w,o} = \epsilon_{w,o} / \epsilon_w$, where $\epsilon_{w,o}$ and $\epsilon_w$ are the error distances of the DFG output and the
w-th node, respectively, when only the w-th node is in the approximate operating mode. Based on the error
sensitivity, the variance of the output (i.e., $V(\epsilon_o)$) is obtained from the following expression [64]:
$$V(\epsilon_o) = \sum_{\forall w \in OperationSet} ES_{w,o}^{2} \cdot v(\epsilon_w) \tag{3-1}$$
where $v(\epsilon_w)$ is the output variance of the w-th node.
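As a minimal sketch of how (3-1) is evaluated, the snippet below accumulates the squared error sensitivity times the error variance over all nodes. The function name and the numerical values are hypothetical and serve only as an illustration; they are not taken from [64] or from the experiments of this chapter.

```python
def output_variance(error_sensitivity, node_variance):
    """Evaluate (3-1): V(eps_o) = sum over nodes w of ES_{w,o}^2 * v(eps_w)."""
    return sum(error_sensitivity[w] ** 2 * node_variance[w] for w in node_variance)


# Hypothetical sensitivities and per-node error variances (illustration only).
es = {"M1": 1.0, "M2": 1.0, "A1": 2.0}
var = {"M1": 4.0, "M2": 1.0, "A1": 0.5}

total = output_variance(es, var)  # 1*4 + 1*1 + 4*0.5
print(total)
```

Because the sensitivity enters squared, a node with a large ES dominates the output variance even when its own error variance is small.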
Objective function: Now by employing (3-1), we can formulate the problem of determining the operating
voltage level of the DFG nodes under the predefined expected quality (Q EXP). We propose an ILP
formulation which determines the operating voltage levels of the add and multiply operations. The objective
of this formulation is reducing the summation of the operating voltage levels of these operations. Hence,
this ILP formulation is expressed by
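$$
\begin{aligned}
&\text{Objective:}\\
&\quad \text{Minimize}\ \left( \sum_{i=1}^{n} \sum_{k=1}^{L} \beta\, V_{DD,k}\, X_{i,k} + \alpha \sum_{i=1}^{m} \sum_{k=1}^{L} V_{DD,k}\, Y_{i,k} \right)\\
&\text{Subject to:}\\
&\quad \sum_{\forall w \in ADDOperationSet}\ \sum_{k=1}^{L} ES_{w,k,o}^{2} \cdot v(\epsilon_{w,k}) \cdot X_{w,k} + \sum_{\forall w \in MultiplyOperationSet}\ \sum_{k=1}^{L} ES_{w,k,o}^{2} \cdot v(\epsilon_{w,k}) \cdot Y_{w,k} < Q_{EXP}\\
&\quad \forall w \in ADDOperationSet:\ \sum_{k=1}^{L} X_{w,k} = 1\\
&\quad \forall w \in MultiplyOperationSet:\ \sum_{k=1}^{L} Y_{w,k} = 1
\end{aligned}
\tag{3-2}
$$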
In this formulation, $X_{i,k}$ and $Y_{i,k}$ are binary variables which indicate whether the operating voltage index k
is assigned to the i-th add and multiply operation nodes, respectively. Also, $V_{DD,k}$ is the k-th operating voltage
level, where in this work, the 1st, 2nd, …, and 5th operating voltage levels are 800mV, 750mV, …, and 600mV,
respectively.
Due to the higher aging effect on the multipliers, we provide the parameters $\alpha$ ($0.5 < \alpha \le 1$) and $\beta$
($0 < \beta \le 0.5$) to be set by the designer. Using larger values for $\alpha$ leads to more focus on reducing the
operating voltage of the multipliers. In (3-2), the first inequality constraint is used to limit the output
quality loss, while the last two constraints force the solver to assign one and only one operating voltage
level to each add and multiply operation. $v(\epsilon_{w,k})$ is the error variance of the w-th node at the k-th
operating voltage level. Finally, after obtaining the operating voltage levels, the design is mapped onto the
CGRA by using list scheduling and clique partitioning.
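The structure of (3-2) can be illustrated on a toy instance. The sketch below replaces the ILP solver with exhaustive enumeration, which is only practical for a handful of nodes. The function name, the two-node DFG, and the error table `err` (standing for $ES_{w,k,o}^{2} \cdot v(\epsilon_{w,k})$ at each level) are hypothetical assumptions, not data from this work; only the five voltage levels are taken from the text.

```python
from itertools import product

V_LEVELS_MV = [800, 750, 700, 650, 600]  # the five levels used in this work


def assign_voltages(ops, err, q_exp, alpha=1.0, beta=0.5):
    """Exhaustively solve a tiny instance of (3-2).

    ops   : list of (name, kind) pairs, kind is "add" or "mul"
    err   : err[name][k] = ES^2 * v(eps) of the node at level k (hypothetical data)
    q_exp : bound on the accumulated output error variance
    Returns (cost, {name: voltage in mV}) or None if no assignment is feasible.
    """
    best = None
    # Picking exactly one level index per node enforces the two equality constraints.
    for choice in product(range(len(V_LEVELS_MV)), repeat=len(ops)):
        total_err = sum(err[name][k] for (name, _), k in zip(ops, choice))
        if total_err >= q_exp:  # quality constraint violated
            continue
        cost = sum((alpha if kind == "mul" else beta) * V_LEVELS_MV[k]
                   for (_, kind), k in zip(ops, choice))
        if best is None or cost < best[0]:
            best = (cost, {name: V_LEVELS_MV[k] for (name, _), k in zip(ops, choice)})
    return best


ops = [("M1", "mul"), ("A1", "add")]
err = {"M1": [0, 1, 2, 4, 8], "A1": [0, 2, 5, 9, 20]}  # illustration only
best = assign_voltages(ops, err, q_exp=6)
print(best)
```

With $\alpha > \beta$, the search prefers to spend the error budget on the multiplier, which mirrors the stated intent of weighting the multipliers more heavily.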
3.2.2 Proposed CGRA Architecture with Voltage Island
In the previous CGRA without voltage islands, each PE has a separate operating voltage, which may increase
the overhead of dynamically scaling the PE voltages (dynamic accuracy configuration). To minimize
this overhead, we suggest clustering the PEs into voltage islands. In the rest of this section, we first
discuss a motivational example which demonstrates the use of the VOS technique under the voltage island
constraint for reducing the energy consumption and lowering the aging rate (improving the lifetime) of the
CGRA when running an application. We then present the details of the proposed CGRA architecture,
followed by the mapping formulation.
3.2.2.1 Motivational Example
Consider an 8-tap finite impulse response (FIR) filter benchmark, whose DFG is shown in Figure
3.6 (a). The results of mapping this DFG on a 6×6 CGRA using the proposed algorithm (which will be
discussed in a later section) for different output qualities are depicted in Figure 3.4. For the perfect output
quality, all PEs are connected to full V_DD. Let us suppose that an output quality of 50% may be tolerated by
the application using this filter. For this level of quality reduction, one may apply the five voltage levels of
600, 650, 700, 750, and 800 mV to the PEs as shown in Figure 3.4 (e). Also, one voltage switch box is used
for each 2×1 voltage island. For the accurate calculation, where the full operating voltage of 800mV is applied,
the average V_DD of the PEs is 800mV. In the case of a tolerable reduced quality of 50%, the average V_DD of the
PEs becomes 680mV, which is 15% lower than that of the exact case. Now, given the fact that the dynamic
and leakage power components as well as the lifetime/reliability degradation mechanisms depend
super-linearly (even exponentially in some cases) on the operating voltage, this reduction leads to a
significant improvement in the lifetime/reliability. Also, with this voltage reduction, the CGRA dissipates
27% less energy.
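The arithmetic of this example can be checked with a short calculation. The second figure below is only a first-order estimate under the assumption that energy scales with the square of the supply voltage and that the average voltage can stand in for the per-PE voltages; the 27% reported above comes from the synthesis-based flow, not from this formula.

```python
def percent_drop(nominal, scaled):
    """Relative reduction of `scaled` with respect to `nominal`, in percent."""
    return 100 * (nominal - scaled) / nominal


v_nominal_mv = 800
v_average_mv = 680  # average V_DD of the PEs at 50% quality (from the text)

voltage_drop = percent_drop(v_nominal_mv, v_average_mv)

# Assumption: dynamic energy ~ V^2, applied to the average voltage (a simplification).
energy_drop_estimate = 100 * (1 - (v_average_mv / v_nominal_mv) ** 2)

print(voltage_drop, round(energy_drop_estimate, 2))
```

The quadratic estimate lands close to the reported energy saving, which is consistent with dynamic energy being the dominant component in this example.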
Figure 3.4 Mapping of an eight-tap FIR filter DFG on a 6×6 CGRA for a) 90% b) 80% c) 70% d) 60% e) 50% quality.
3.2.2.2 Proposed CGRA Architecture
In our suggested technique, by clustering the PEs into islands of, e.g., 2×1 and 1×3 PEs, the overhead and
complexity of providing a separate VOS level for each PE are reduced. While, in this work, we considered the
same size for all the islands, in general the sizes could be different. The importance of using as few voltage
levels as possible in multiple supply voltage systems for having a less complex (lower overhead) power network
has been emphasized in some other works (e.g., [66][67][68]). Specifically, in [67], it is stated that placing
blocks with the same voltage together saves power routing resources, simplifies power planning, and reduces
IR drop. Also, in [68], a layout plan where modules with the same voltage are placed together for reducing the
complexity of the power network is recommended. The power supply rails are routed using the top
interconnect layers, where the thickness, width, and pitch of the wires are larger than those of the wires in the
lower layers. These rails are connected to the supply contacts of the transistors through several vias,
which bring the supply voltage from the higher layers down to the contacts. The
vias, however, consume area and lower the chip wiring efficiency, which may cause some chip area increase.
As an example, in [69], it has been shown that employing a dual supply voltage leads to a 15% area increase.
Obviously, the larger the number of vias, the higher the via blockage. Reference [70] states that the
via blockage can be up to 50% on the metal 1 layer. The model in [70] has been used in the literature for
calculating the impact of the via blockage on lowering the wiring efficiency of the chip (see, e.g., [71]).
Figure 3.5 The overall architecture of proposed CGRA.
Since in the proposed structure, each island requires only one supply voltage level, instead of one set of
vias for each PE, one set is required for connecting the proper voltage level to all the PEs in the same island.
Therefore, it may be concluded that the number of via sets and, hence, via blockage are inversely
proportional to the sizes of the islands.
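To make this relation concrete, the following sketch counts the voltage islands (and hence, under the assumption of one via set per island, the via sets) of a 6×6 CGRA for the island sizes considered in this work. The function name is hypothetical, and the count assumes that the islands tile the array exactly.

```python
def island_count(rows, cols, island_rows, island_cols):
    """Number of islands when equally sized islands tile a rows x cols PE array."""
    if rows % island_rows or cols % island_cols:
        raise ValueError("island size does not tile the array")
    return (rows // island_rows) * (cols // island_cols)


# A 1x1 "island" corresponds to the CGRA without voltage islands (one set per PE).
sizes = [(1, 1), (2, 1), (1, 3), (2, 2), (3, 2)]
counts = {size: island_count(6, 6, *size) for size in sizes}
print(counts)
```

Doubling the number of PEs per island halves the number of islands, which is the inverse proportionality stated above.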
The proposed CGRA structure works based on determining the minimum required accuracy of each
DFG node for satisfying the output quality constraint. Then, the node is mapped on a PE whose voltage
level is determined by the accuracy of the node. In mapping the nodes on the PEs, the clustering of the nodes
based on voltage islands is performed such that the objective function is optimized. In this work, four
voltage island sizes of 2×1 (two rows by one column), 1×3 (one row by three columns), 2×2 (two rows by
two columns), and 3×2 (three rows by two columns) are considered.
The internal structure of each PE and the ALU and I/O voltage domains (V_C and V_IO, respectively) are
shown and explained in Section 3.2.1. The core voltage level for each voltage island is determined by the
host controller from a LUT which holds the mapping between the DFG operations and the corresponding voltage
levels for each island. Due to the possibility of long idle times (e.g., no operation mapped to the PE), power
gating switches on V_C and V_IO of each PE are considered to reduce the energy dissipation.
3.2.2.3 Mapping Formulation
For the optimum voltage level setting and mapping of an application to the accuracy-configurable CGRA, we
formulate the determination of the accuracy level of each DFG node (and the voltage level of the PE that
will be performing that operation) and the physical mapping of the DFG nodes to specific PEs on the CGRA
fabric as an optimization problem. In the proposed formulation, all nodes mapped on a set of PEs in an
island have the same operating voltage level. Scheduling and binding are two NP-complete problems
[72]. For this purpose, a set of linear constraints, which are explained in the following, is
included in the optimization framework.
Accuracy levels of nodes: As mentioned before, in the proposed architecture, the degree of inaccuracy of
each PE is set by its applied operating voltage. In the proposed formulation, a mapping between the
output accuracy constraint and the operating voltage levels of the operations is established.
Based on the explanation in Section 3.2.1.1, the error sensitivity of the i-th node (denoted by $ES_{i,o}$) was
defined as $ES_{i,o} = \epsilon_{i,o} / \epsilon_i$, where $\epsilon_{i,o}$ and $\epsilon_i$ were the error distances of the DFG output and the
i-th node, respectively, when only the i-th node was in the approximate operating mode. Based on this error
sensitivity, the variance of the output (denoted by $v(\epsilon_o)$) was obtained from [64]
$$v(\epsilon_o) = \sum_{i \in \{DFG\ Nodes\}} ES_{i,o}^{2} \cdot v(\epsilon_i) \tag{3-3}$$
where $v(\epsilon_i)$ was the output variance of the i-th node. Now, by considering the variance as the error metric
and employing (3-3), one may formulate the problem of determining the operating voltage levels of the DFG
nodes under the predefined expected variance (minimum quality) denoted by $v_{EXP}$.
Based on the above definitions, for formulating the accuracy level determination, we consider a binary
variable $x_{i,j}$. When the j-th operating voltage level (where $1 \le j \le L$; L is the number of the considered
voltage levels) is selected for the i-th DFG node, $x_{i,j} = 1$. Hence, the output expected quality constraint is
defined by
$$\sum_{\forall i \in \{DFG\ Nodes\}} \ \sum_{j=1}^{L} ES_{i,j,o}^{2} \cdot v(\epsilon_{i,j}) \cdot x_{i,j} < v_{EXP} \tag{3-4}$$
where $ES_{i,j,o}$ ($\epsilon_{i,j}$) shows the error sensitivity (error distance) of the i-th node at the j-th operating voltage
level. The values of $\epsilon_{i,j}$ and $ES_{i,j,o}$ are obtained before the mapping process.
Mapping of the Nodes on PEs: In the considered CGRA, each PE is connected to all of its four neighbors (except
for the PEs placed on the borders). Therefore, the mapping constraint should guarantee that every two adjacent
nodes in the DFG are neighbors in the CGRA. By considering a binary variable $b_{i,r}$ for showing the mapping
of the i-th DFG node on the r-th PE, the mapping process may be formulated as
$$\forall i \in \{DFG\ Nodes\},\ r \in \{CGRA\ PEs\}: \quad \sum_{r' \in \{adjacents\ of\ r^{th}\ PE\}} \ \sum_{i' \in \{adjacents\ of\ i^{th}\ node\}} b_{i',r'} \ \ge\ b_{i,r} \times E_i \tag{3-5}$$
where $E_i$ is the degree of the i-th DFG node. In (3-5), when $b_{i,r}$ is one, all the adjacent nodes of the i-th DFG
node must be mapped onto $E_i$ neighbors of the r-th PE. In addition, to map each node on only one PE, the
following constraint should be used:
$$\forall i \in \{DFG\ Nodes\}: \quad \sum_{r \in \{CGRA\ PEs\}} b_{i,r} = 1 \tag{3-6}$$
Note that since the output of each PE in the considered CGRA structure is connected to only four neighbors,
when the fanout of a DFG node is larger than four, the fanout could be reduced by inserting NOP (no
operation) nodes (other approaches may be taken as well).
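The adjacency requirement of (3-5) can be checked for a given placement as in the sketch below. It is an illustrative checker, not the ILP itself: the function names, the row-major PE numbering, and the small DFG fragment are assumptions made for the example. Constraint (3-6) holds by construction because the placement dictionary assigns exactly one PE to each node.

```python
def grid_neighbors(pe, rows, cols):
    """Indices of the up/down/left/right neighbors of a PE (row-major numbering)."""
    r, c = divmod(pe, cols)
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return {rr * cols + cc for rr, cc in candidates if 0 <= rr < rows and 0 <= cc < cols}


def satisfies_adjacency(dfg_edges, placement, rows, cols):
    """Check (3-5) for a placement {node: PE index}."""
    adjacent = {node: set() for node in placement}
    for u, v in dfg_edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    for node, pe in placement.items():
        near = grid_neighbors(pe, rows, cols)
        mapped_nearby = sum(1 for other in adjacent[node] if placement[other] in near)
        if mapped_nearby < len(adjacent[node]):  # fewer than E_i adjacent nodes on neighbors
            return False
    return True


# Hypothetical fragment of an adder chain fed by multipliers, on a 6x6 array.
edges = [("M1", "A1"), ("M2", "A1"), ("A1", "A2"), ("M3", "A2")]
good = {"M1": 1, "M2": 6, "A1": 7, "A2": 8, "M3": 2}
bad = dict(good, M3=35)  # M3 moved to the far corner, away from A2
print(satisfies_adjacency(edges, good, 6, 6), satisfies_adjacency(edges, bad, 6, 6))
```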
Guaranteeing the same voltage level for the nodes in an island: In the optimization formulation, for
each island, every possible mapping of Q nodes (Q is the number of the PEs in each island) on the Q PEs should
satisfy the condition of the same voltage level. Therefore, for each possible mapping for an island, under
each operating voltage level (e.g., the j-th voltage level), we propose to employ the following set of
inequalities:
$$\forall (i,r) \in \{possible\ mapping\ of\ Q\ nodes\ on\ Q\ PEs\}: \quad w_j \le 1 - b_{i,r} + x_{i,j} \tag{3-7}$$
Here, $w_j$ is a binary variable that is 1 if the operating voltage level index chosen for all the nodes mapped
on the PEs of the island is j. The number of possible mappings of $I$ nodes on $Q$ PEs is equal to the number of
$Q$-permutations of $I$. Now, (3-7) should be defined for all the voltage levels, and $w_j$ should be one for at
least one operating voltage level. Hence, the following inequality should be employed to meet this constraint:
$$\sum_{j=1}^{L} w_j \ge 1 \tag{3-8}$$
Note that the set of (3-7) and (3-8) should be described for each voltage island of the CGRA.
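The effect of (3-7) and (3-8) on a single island can be sketched as follows. For a fixed placement, the code computes the largest value each $w_j$ may take and then tests (3-8). The function name and the example placement are hypothetical; a node outside the island has $b_{i,r} = 0$ for every PE of the island and therefore never restricts $w_j$.

```python
def island_level_flags(island_pes, placement, level_of, num_levels):
    """Largest w_j allowed by (3-7) for one island, for every level index j."""
    flags = []
    for j in range(num_levels):
        w_j = 1
        for node, pe in placement.items():
            b = 1 if pe in island_pes else 0     # node is mapped on a PE of this island
            x = 1 if level_of[node] == j else 0  # node uses voltage level j
            w_j = min(w_j, 1 - b + x)            # right-hand side of (3-7)
        flags.append(w_j)
    return flags


island = {0, 6}  # a 2x1 island in a 6x6 array (row-major PE indices, assumed)
placement = {"M1": 0, "A1": 6, "M2": 1}
same = island_level_flags(island, placement, {"M1": 2, "A1": 2, "M2": 0}, 5)
mixed = island_level_flags(island, placement, {"M1": 2, "A1": 1, "M2": 0}, 5)
print(same, sum(same) >= 1)    # (3-8) satisfied
print(mixed, sum(mixed) >= 1)  # (3-8) violated
```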
Objective function: The goal of this optimization problem is to reduce the operating voltage levels of the
nodes, thereby improving the energy consumption and lifetime/reliability of the CGRA. Here, we consider
minimizing the weighted summation of the voltage levels of the DFG nodes as the objective function:
$$\sum_{i \in \{DFG\ Nodes\}} \ \sum_{j \in \{Voltage\ Levels\}} VDD_j \times \alpha \times x_{i,j} \tag{3-9}$$
where $VDD_j$ indicates the operating voltage level corresponding to the j-th operating voltage level index and
$\alpha$ is the weight coefficient. In this work, we consider the weight value of 0.8 (1) for the case of the adder
(multiplier). The reason for considering a larger value for the multiplier is that it suffers more from the
aging mechanisms than the adder [40].
Now, by employing (3-9) as the objective function and (3-4) to (3-8) as the constraints,
the mapping process on the proposed CGRA can be formulated. Note that in the mapping of an
application on the CGRA, we may end up with idle PEs or PEs operating at lower voltages, so that the PEs
experience different threshold voltage drifts. This provides us with the possibility of
remapping the DFG on the CGRA during the lifetime to distribute the stress more uniformly and
prolong the lifetime even further.
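The role of the weight in (3-9) can be seen by evaluating the objective for two candidate assignments. In this sketch the weights 0.8 and 1 are those stated above, while the function name and the two-node assignments are hypothetical.

```python
WEIGHT = {"add": 0.8, "mul": 1.0}  # weights used in this work


def weighted_voltage_sum(assignment):
    """Evaluate (3-9) for an assignment {node: (kind, VDD in mV)}."""
    return sum(WEIGHT[kind] * vdd for kind, vdd in assignment.values())


lower_multiplier = {"M1": ("mul", 650), "A1": ("add", 800)}
lower_adder = {"M1": ("mul", 800), "A1": ("add", 650)}
print(weighted_voltage_sum(lower_multiplier), weighted_voltage_sum(lower_adder))
```

For the same total voltage, the assignment that lowers the multiplier has the smaller objective value, so the solver favors it whenever the quality constraint permits.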
3.3 Results and Discussion
3.3.1 Simulation Setup
All the studies have been performed by employing the 15nm FinFET-based Open Cell Library (OCL)
technology [73]. To assess the effect of different operating voltage levels on the proposed CGRA
without voltage islands, five voltage levels have been considered: 800 (nominal voltage), 750, 700, 650,
and 600 mV. To assess the efficacy of the proposed CGRA structure with voltage islands, 6×6 CGRAs with
2×1, 1×3, 2×2, and 3×2 voltage islands have been considered. For each CGRA, we have evaluated the VOS
method under two (600mV and 800mV), three (600mV, 700mV, and 800mV), and five (600mV, 650mV,
700mV, 750mV, and 800mV) different operating voltage levels.
For extracting the design parameters of the PEs, the components of the PEs were described in
Verilog HDL and synthesized using Synopsys Design Compiler with a 15nm technology file [73]. For
extracting the impact of the voltage scaling on the design parameters, we have characterized the 15nm
technology file by employing the Cadence Encounter Library Characterization (ELC) tool under the
considered operating voltage levels. The design parameters have then been extracted by Synopsys
Design Compiler under these characterized technology libraries.
In this chapter, without loss of generality, it has been assumed that the arithmetic unit of each PE
contains one adder (32-bit CLA) and one multiplier (16-bit Dadda). For extracting the output quality
degradation due to the VOS, the post-synthesis gate-level netlists of these two components were simulated by
the ModelSim HDL simulator. The timing information of the gates in these simulations for each operating
voltage level was extracted by Synopsys Design Compiler. The accuracy of the arithmetic operations was
obtained by injecting many randomly generated input operands. Also, in the studied CGRA architectures,
one and three cycles were considered for the add and multiply operations, respectively, with the
same clock frequency used at all the considered voltage levels. The clock frequency was chosen such that the
PEs of the CGRA did not have any output error at the nominal (800mV) operating voltage level. The aging
rate is calculated by (2-4), where we have assumed that all the parameters except for E_OX (which is voltage
dependent) are the same for both the conventional and the proposed CGRAs.
3.3.2 Benchmarks
3.3.2.1 CGRA without voltage island
To explore the effectiveness of the proposed method, two benchmarks, an 8-tap FIR filter and a 4th-order
Polynomial Evaluation (PoE), which have different orders of computation (different sequences of add and
multiply operations), were considered. For the FIR benchmark, first, eight parallel multiplications are
performed and then the results of the multiplications are added serially. The PoE was chosen because
its summation is performed immediately after the multiplications. As will be seen next,
these two different benchmarks give rise to different characteristics for the VOS levels versus the output
quality constraint. The DFGs of these benchmarks are depicted in Figure 3.6.
3.3.2.2 CGRA with voltage island
To explore the effectiveness of the proposed method, four benchmarks from different application domains
were considered. The benchmark set included an 8-tap FIR filter, 8×8 Matrix Multiplication (MMM), a
Smoothing filter (with a 3×3 filter) (SMT), and a Sharpening filter (with a 3×3 filter) (SHP). These applications
contain only add and multiply operations. The numbers of the operations in each application
are reported in Table 3.1.
Figure 3.6 The DFG of the a) FIR [65], and b) PoE benchmarks.
Table 3.1. The numbers of total nodes and each operation type for each benchmark.
Benchmark Multiply Add Total
FIR 8 7 15
MMM 8 7 15
SMT 9 8 17
SHP 9 8 17
For mapping these applications on the CGRAs, we employed the mapping approach proposed in Section
3.2.2.3, where the ILP formulation was described in Python and solved by the Gurobi solver
[74]. In these studies, five different output quality (approximation) levels of 90%, 80%, 70%, 60%, and 50%
have been studied.
3.3.3 Results
3.3.3.1 CGRA without Voltage Island
3.3.3.1.1 8-Tap FIR Filter
The overscaled voltage levels assigned to each PE for performing the nodes of the DFG for the quality
levels of 50%, 60%, 70%, 80%, and 90% are presented in Figure 3.7 (a). The parameter Q_EXP, which was
mentioned previously, is related to the quality levels and the DFG outputs. For this benchmark, for the quality
constraint of 90% (50%), the minimum and maximum voltages were 700mV and 800mV (600mV and 750mV).
As expected, the more quality degradation can be tolerated, the lower the voltages that can be assigned to the
PEs. For some of the operations, allowing quality degradation does not lead to a considerable decrease in the
operating voltage level due to their specific features. For instance, for the add operation A6 (see Figure
3.7 (a)), all the errors that occurred in the previous operations converge to this adder. Therefore, the voltage
levels assigned to this adder are the same (750mV), which is only one level lower than the full supply voltage.
It should be noted that for the quality of 90%, the operating voltage of some nodes cannot be lowered below
800mV due to their criticality from the error perspective. This happens more for the nodes in the middle
and at the end of the DFG structure or for the add operations collecting the errors from the previous nodes. For
instance, the A3 operation was assigned 800mV for the three consecutive qualities of 90%, 80%, and 70%. The
general trend for the FIR benchmark is that the add operations are more critical from the error perspective
than the multiply operations. For this benchmark, assigning lower voltages to the
multiply operations affects the output quality less than doing so for the add operations. This allows for more
VOS, improving both the energy efficiency and lifetime/reliability of these PEs.
Figure 3.7 a) Overscaled voltage level b) Aging rate Reduction c) Energy Reduction for different arithmetic operations of the FIR
filter benchmark (A: Adder; M: Multiplier)
[Plots: (a) over-scaled voltages (mV), (b) ∆Vth reduction (%), and (c) energy reduction (%) for the operations M1, M2, A1, M3, A2,
M4, A3, M5, A4, M6, A5, M7, A6, M8, A7 at the quality levels of 90%, 80%, 70%, 60%, and 50%.]
Generally, voltage reduction provides us with a decrease in the aging rates and energy consumption of the PEs
compared to the case of applying the nominal voltage to all PEs. The results for these reductions are
presented in Figure 3.7 (b) and (c), respectively. As expected, at lower output qualities, there is more
opportunity for voltage, energy consumption (up to 44%), and aging rate (up to 74%) reductions. For
some qualities, exact operations are required for performing the algorithm, where the PE must work at the
nominal voltage of 800 mV. Obviously, for these PEs, no decrease in the aging rate or energy
consumption is achieved.
3.3.3.1.2 4th Order Polynomial Evaluation
The overscaled voltage levels assigned to each PE used for each DFG node under different quality levels
of 50%, 60%, 70%, 80%, and 90% are presented in Figure 3.8 (a).
Figure 3.8 a) Overscaled voltage level b) Aging rate Reduction c) Energy Reduction for different arithmetic operations of the
PoE (A: Adder; M: Multiplier)
[Plots: (a) over-scaled voltages (mV), (b) ∆Vth reduction (%), and (c) energy reduction (%) for the operations M1, A1, M2, A2, M3,
A3, M4, A4 at the quality levels of 90%, 80%, 70%, 60%, and 50%.]
Again, for some operations, allowing quality degradation does not lead to a considerable decrease in the
operating voltage level due to the specific features of the benchmark. For instance, for the add operation A4
(see Figure 3.8 (a)), all the errors that occurred in the previous operations add up in this adder.
Therefore, this adder may not tolerate any further error, and hence the nominal voltage should be used for
this adder. The general trend for the PoE benchmark is that the multiply operations are more critical from
the error perspective than the add operations, except for the last add operation (A4). The decreases
in the aging rate and energy consumption are presented in Figure 3.8 (b) and (c). By assigning these voltages to
the PEs, the VOS provided us with an average aging rate reduction of 29% (47%) for the quality
constraint of 90% (50%). Also, we achieved an energy reduction of 15% (27%) for the quality constraint
of 90% (50%).
3.3.3.1.3 Verifying Extracted Overscaled Voltages
To confirm the trustworthiness of the variance-based model [64] which has been used in this work, the
MSE values obtained from both the model and simulation are compared in Table 3.2. For the
implementation, we implemented the ILP solutions in Verilog HDL for the two considered benchmarks. By injecting
many random stimuli and performing gate-level simulation (the delays of the gates were based on the chosen
operating voltage levels), the values were extracted. Since the MSE values of the simulation were smaller
than those of the model, which satisfied the quality constraint, one may conclude that the variance-based model
is credible.
Table 3.2. The model and simulated MSE of FIR and PoE
Quality   FIR MSE (Model)   FIR MSE (Simulation)   PoE MSE (Model)   PoE MSE (Simulation)
90%       3.27E+07          8.35E+06               3.50E+05          2.47E+04
80%       1.62E+08          5.89E+07               1.59E+06          5.63E+04
70%       3.58E+08          1.26E+08               3.57E+06          1.01E+06
60%       6.83E+08          1.91E+08               3.69E+06          1.09E+06
50%       1.01E+09          2.53E+08               9.30E+06          1.23E+06
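The comparison in Table 3.2 can be reproduced with a few lines. The sketch below uses the table values as given; the function name and the ratio summary are additions made only for illustration.

```python
# (model MSE, simulated MSE) per quality level, copied from Table 3.2.
TABLE_3_2 = {
    "FIR": {90: (3.27e7, 8.35e6), 80: (1.62e8, 5.89e7), 70: (3.58e8, 1.26e8),
            60: (6.83e8, 1.91e8), 50: (1.01e9, 2.53e8)},
    "PoE": {90: (3.50e5, 2.47e4), 80: (1.59e6, 5.63e4), 70: (3.57e6, 1.01e6),
            60: (3.69e6, 1.09e6), 50: (9.30e6, 1.23e6)},
}


def model_is_conservative(table):
    """True if the simulated MSE never exceeds the model MSE."""
    return all(sim <= model for rows in table.values() for model, sim in rows.values())


# Smallest model-to-simulation ratio, i.e., the tightest margin in the table.
min_ratio = min(model / sim for rows in TABLE_3_2.values() for model, sim in rows.values())
print(model_is_conservative(TABLE_3_2), round(min_ratio, 2))
```

In every row the model overestimates the simulated MSE, so a voltage assignment that satisfies the model-based constraint also satisfies it in simulation for these benchmarks.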
3.3.3.2 CGRA with Voltage Island
3.3.3.2.1 Energy Reduction
The results for the energy reduction of the CGRA under the two cases of 2×1 and 1×3 voltage islands for
different benchmarks and quality constraints, compared to that of the CGRA operating at the nominal
operating voltage level (no quality loss), are presented in Figure 3.9. The CGRA architecture
used in this study is a common basic CGRA architecture utilized in some other recent works (see, e.g.,
[75][76][40]). Since the main idea of this chapter is proposing voltage overscaling and voltage
islanding schemes for improving the reliability/lifetime and energy consumption of CGRAs, we
considered the basic CGRA architecture for the study.
Without any limitation, the idea may be applied to other CGRA architectures (of which a few
exist), where the amounts of improvement may be different for different CGRA architectures.
As the results show, lowering the minimum acceptable output quality causes more energy reduction. In the
case of the 2×1 voltage island (Figure 3.9 (a)), the FIR and MMM benchmarks enjoyed larger increases in the
energy reduction (~1.9× and 2×, respectively) when the minimum acceptable quality was reduced from 90% to 50%
compared to the SMT and SHP benchmarks (~1.2× and ~1.3×, respectively). From these
results, a lower sensitivity of these two image processing applications to imprecise computing may be
concluded. Hence, for a given minimum output quality, these applications may be assigned lower
operating voltages for their PEs.
Figure 3.9 The energy reduction of different benchmarks under different quality constraints mapped on the CGRA with (a) 2×1 and
(b) 1×3 voltage islands when five voltage levels are considered.
The reductions in the energies originate from being able to assign lower operating voltages to some PEs,
as determined by their acceptable levels of inaccuracy. Obviously, the operating voltage reduction could
be larger in the case of the considered image processing applications. To demonstrate this, as an example,
the operating voltage levels of the PEs for the minimum acceptable quality of 90% are indicated for
the two cases of the MMM and SMT applications in Figure 3.10.
As the reported energy gains in Figure 3.9 (a) reveal, the maximum gains belonged to the SMT
benchmark (from 37% to 43%), while the FIR benchmark had the lowest gains (from 14% to 27%). For
the studied benchmarks, the average of the energy gains went from 24% to 35% when the acceptable
minimum output quality was reduced from 90% to 50%. The results in Figure 3.9 (b) reveal that the energy
reductions in the case of the 1×3 voltage island are almost the same as those in the case of the 2×1 voltage island.
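The averages and growth factors quoted above can be recomputed from the per-benchmark values. The sketch below assumes the energy reductions as read from the data labels of Figure 3.9 (a), ordered by quality from 90% to 50%; the variable names are hypothetical.

```python
QUALITIES = (90, 80, 70, 60, 50)

# Energy reduction (%) for 2x1 islands and five voltage levels,
# as read from the data labels of Figure 3.9 (a).
FIG_3_9A = {
    "FIR": (14, 18, 22, 25, 27),
    "MMM": (15, 19, 23, 27, 30),
    "SMT": (37, 38, 41, 42, 43),
    "SHP": (31, 34, 37, 38, 40),
}

# Average over the four benchmarks at each quality level.
average = {q: sum(v[i] for v in FIG_3_9A.values()) / len(FIG_3_9A)
           for i, q in enumerate(QUALITIES)}

# Growth of the energy reduction when the quality drops from 90% to 50%.
growth = {name: v[-1] / v[0] for name, v in FIG_3_9A.items()}

print(average[90], average[50])
print({name: round(g, 1) for name, g in growth.items()})
```

The image processing benchmarks start from a much higher reduction at 90% quality, which is why their growth factors are smaller.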
[Figure 3.9 (a) data labels, energy reduction (%) at the quality levels of 90%, 80%, 70%, 60%, 50%: FIR 14, 18, 22, 25, 27;
MMM 15, 19, 23, 27, 30; SMT 37, 38, 41, 42, 43; SHP 31, 34, 37, 38, 40.]
Figure 3.10 Mapping of a) MMM b) SMT benchmarks on a 6×6 CGRA under the minimum quality of 90%.
Since both voltage island sizes provide about the same energy reductions, the 1×3 islands are preferred
due to the reduced overhead (see Section 3.2.2.2). It should be mentioned that both the power and area
overheads of the power switch boxes themselves are negligibly small. The fact that the same power
reduction gains were achieved was due to using a 6×6 CGRA platform which had more PEs than the number
of operations required for the application DFGs. Of course, unused PEs should be power gated to prevent
any leakage power consumption.
For the results presented in Figure 3.9, five operating voltage levels for the PEs were considered. To reduce
the overhead of generating and distributing the operating voltage levels, fewer levels may be
used. Reducing the number of voltage levels diminishes the flexibility of using different approximation levels
for the PEs, limiting the opportunity to lower the energy consumption for a given output quality. To study this
impact, we evaluated the energy reduction gain of the proposed approach by considering the cases of two and
three voltage levels as well. All the energy reductions are compared to the reference case of applying the
nominal voltage to all the PEs (100% output quality). The results, which are reported in Table 3.3, show
that in the two cases of the FIR and MMM benchmarks, the energy reduction gains have been affected
more.
Figure 3.11 The decrease of the energy reduction in the case of 1×3 voltage island when the number of operating voltage levels is
lowered from five to two.
Figure 3.11 illustrates this further by showing the loss of energy reduction when the number of operating
voltage levels decreases from five to two. For FIR and MMM at the 90% quality constraint, the loss is 100%:
with only two levels, setting the voltage of even one island below the nominal voltage drops the output
quality below 90%. For lower acceptable minimum output qualities, however, islands with lower operating
voltages can be used. As the results show, for all benchmarks, relaxing the minimum output quality below
90% provides some energy reduction, and, as expected, the gain grows with the amount of quality reduction.
In addition, the results indicate that increasing the island size reduces the energy improvement: for most
benchmarks and qualities, the 2×2 and 3×2 cases achieve less than the 1×3 and 2×1 cases. For example,
increasing the island size from 2×1 to 3×2 reduces the energy improvement by 4% on average (15% at most).
Finally, we note that decreasing the number of voltage levels from five to three does not cause a
considerable loss in the energy reduction of the VOS technique for the considered benchmarks, while it
reduces the associated overheads (e.g., two fewer switches).
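The loss values discussed for Figure 3.11 can be recomputed from Table 3.3. The following is a minimal sketch, assuming the loss is defined as the relative drop in energy reduction when moving from five to two voltage levels; the dictionary and function names are hypothetical, and the numbers are the 1×3 rows of Table 3.3.

```python
# Energy reduction (%) for the 1x3 voltage island, copied from Table 3.3.
# Each list is ordered by minimum output quality: 90%, 80%, 70%, 60%, 50%.
FIVE_LEVELS = {
    "FIR": [14, 18, 22, 25, 27],
    "MMM": [15, 19, 23, 27, 30],
    "SMT": [36, 38, 41, 42, 43],
    "SHP": [31, 33, 37, 38, 40],
}
TWO_LEVELS = {
    "FIR": [0, 6, 12, 15, 18],
    "MMM": [0, 6, 12, 23, 26],
    "SMT": [26, 31, 33, 36, 39],
    "SHP": [15, 26, 31, 33, 33],
}


def relative_loss(five, two):
    """Relative decrease (%) of a gain when going from five to two levels.

    Assumed definition (not stated explicitly in the text):
    100 * (gain_5 - gain_2) / gain_5.
    """
    return 100.0 * (five - two) / five


loss = {
    bench: [relative_loss(f, t) for f, t in zip(FIVE_LEVELS[bench], TWO_LEVELS[bench])]
    for bench in FIVE_LEVELS
}

for bench, values in loss.items():
    print(bench, [f"{v:.1f}" for v in values])
```

Under this assumed definition, FIR and MMM lose all of their energy reduction at the 90% quality constraint, because their two-level entries in Table 3.3 are zero.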
3.3.3.3 Lifetime/Reliability Improvement
First, it should be mentioned that the reliability degradation of a circuit, in a general sense, is a strong
function of the operating voltage for most of the associated mechanisms (see, e.g., [19]). Here, we
concentrate only on the NBTI-induced threshold voltage change as a measure of the CGRA aging.
[Figure 3.11 data, decrease in energy reduction (%) at minimum output qualities of 90%, 80%, 70%, 60%, and
50%: FIR 100, 67, 45, 40, 33; MMM 100, 68, 48, 15, 13; SMT 28, 18, 20, 14, 9; SHP 52, 21, 16, 13, 18.]
Table 3.3. Energy reduction (%) of different benchmarks under different minimum output qualities, voltage
island sizes, and operating voltage resolutions. Each group of three values corresponds to 5, 3, and 2
voltage levels, respectively.
Island | Benchmark | 90% (5 3 2) | 80% (5 3 2) | 70% (5 3 2) | 60% (5 3 2) | 50% (5 3 2)
3×2 | FIR | 13 11 0 | 16 15 6 | 22 18 12 | 25 21 15 | 26 22 18
3×2 | MMM | 14 11 0 | 17 15 6 | 22 18 12 | 26 24 20 | 30 26 26
3×2 | SMT | 35 30 23 | 38 37 31 | 40 39 33 | 40 40 36 | 43 41 39
3×2 | SHP | 30 21 15 | 33 28 26 | 34 34 31 | 38 38 33 | 39 39 33
2×2 | FIR | 13 13 0 | 17 15 6 | 21 19 12 | 25 21 15 | 27 22 18
2×2 | MMM | 15 13 0 | 18 15 6 | 22 18 12 | 25 23 23 | 29 26 26
2×2 | SMT | 35 31 26 | 38 37 31 | 41 39 33 | 41 40 36 | 43 41 39
2×2 | SHP | 31 24 15 | 33 28 26 | 36 34 31 | 38 38 33 | 39 39 33
1×3 | FIR | 14 13 0 | 18 17 6 | 22 19 12 | 25 21 15 | 27 23 18
1×3 | MMM | 15 13 0 | 19 15 6 | 23 18 12 | 27 24 23 | 30 28 26
1×3 | SMT | 36 33 26 | 38 37 31 | 41 39 33 | 42 40 36 | 43 41 39
1×3 | SHP | 31 24 15 | 33 29 26 | 37 35 31 | 38 38 33 | 40 39 33
2×1 | FIR | 14 13 0 | 18 17 6 | 22 19 12 | 25 21 15 | 27 23 18
2×1 | MMM | 15 13 0 | 19 15 6 | 23 18 12 | 27 24 23 | 30 28 26
2×1 | SMT | 37 33 26 | 38 37 31 | 41 39 33 | 42 40 36 | 43 41 39
2×1 | SHP | 31 24 15 | 34 29 26 | 37 35 31 | 38 38 33 | 40 39 33
The results for the 2×1 and 1×3 voltage islands under different minimum output qualities are plotted in
Figure 3.12.
Figure 3.12 The threshold voltage change rate reduction for different benchmarks for different minimum
acceptable qualities and five operating voltage levels in the case of (a) 2×1 (b) 1×3 voltage islands.
[Figure 3.12 data, ∆Vth rate reduction (%) at minimum output qualities of 90%, 80%, 70%, 60%, and 50%.
(a) 2×1: FIR 28, 35, 42, 46, 49; MMM 29, 36, 43, 49, 54; SMT 64, 67, 70, 72, 73; SHP 55, 60, 64, 66, 70.
(b) 1×3: FIR 27, 34, 42, 46, 49; MMM 29, 36, 43, 49, 54; SMT 63, 67, 70, 72, 73; SHP 55, 59, 64, 67, 70.]
To obtain the threshold voltage change ($\Delta V_{th,NBTI}$), (2-4) was used. As the results reveal, the
VOS approximate computing technique reduces the aging effect when some output quality degradation is
acceptable. The lower the tolerable output quality, the lower the threshold voltage change rate. Among the
considered benchmarks, the highest improvement belonged to the SMT benchmark, for which the Vth change
rate reduction ranged from 63% (at 90% quality) to 73% (at 50% quality) in the case of the 1×3 voltage
island. This is a direct consequence of using lower operating voltage levels for the PEs in the case of
this benchmark. The aging rate reductions for FIR, MMM, and SHP ranged from 27% to 49%, 29% to 54%, and 55%
to 70%, respectively. The average aging rate reduction was about 62% (44%) when the minimum output quality
constraint was 50% (90%). This is one of the key advantages of using the proposed VOS approximate
computing technique.
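As a quick consistency check, the quoted averages can be recomputed from the per-benchmark values. The sketch below assumes the averages are plain arithmetic means over the four benchmarks for the 1×3 island with five voltage levels; the dictionary and function names are hypothetical.

```python
from statistics import mean

# Aging (Vth change) rate reduction (%) for the 1x3 island with five voltage
# levels, at the 90% and 50% minimum output quality constraints.
REDUCTION_1X3 = {
    "FIR": {90: 27, 50: 49},
    "MMM": {90: 29, 50: 54},
    "SMT": {90: 63, 50: 73},
    "SHP": {90: 55, 50: 70},
}


def average_reduction(quality):
    """Arithmetic mean over the benchmarks (assumed averaging method)."""
    return mean(values[quality] for values in REDUCTION_1X3.values())


print("average at 90% quality:", average_reduction(90))
print("average at 50% quality:", average_reduction(50))
```

The two means are 43.5% and 61.5%, which agree with the reported figures of about 44% and 62% after rounding.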
As observed in the previous subsection, having fewer operating voltage levels lowers the opportunity for
using overscaled operating voltages for the PEs under a given minimum output quality. As with the energy
reduction gain, this lowers the opportunity for aging rate improvement. As an example, Figure 3.13 shows
the loss in aging rate reduction versus the minimum output quality when the number of operating voltage
levels changes from five to two. Table 3.4 shows the aging rate reduction of different benchmarks for
different minimum output qualities, voltage island sizes, and operating voltage resolutions.
Figure 3.13. Decrease of the aging rate improvement in the case of 1×3 voltage island when the number of operating voltage
levels is lowered from five to two.
[Figure 3.13 data, decrease in aging rate reduction (%) at minimum output qualities of 90%, 80%, 70%, and
60%: FIR 100, 71, 52, 46; MMM 100, 72, 53, 18; SMT 30, 22, 19, 15; SHP 53, 25, 19, 15.]
Table 3.4. Aging rate reduction (%) of different benchmarks for different minimum output qualities, voltage
island sizes, and operating voltage resolutions. Each group of three values corresponds to 5, 3, and 2
voltage levels, respectively.
Island | Benchmark | 90% (5 3 2) | 80% (5 3 2) | 70% (5 3 2) | 60% (5 3 2) | 50% (5 3 2)
3×2 | FIR | 26 21 0 | 30 29 10 | 41 33 20 | 46 38 25 | 47 40 30
3×2 | MMM | 28 21 0 | 31 29 10 | 43 33 20 | 48 41 35 | 54 45 45
3×2 | SMT | 61 53 39 | 66 64 52 | 70 67 57 | 70 69 61 | 73 71 65
3×2 | SHP | 54 38 26 | 58 49 44 | 60 59 52 | 66 66 57 | 68 67 57
2×2 | FIR | 26 24 0 | 32 28 10 | 40 32 20 | 46 40 25 | 49 45 30
2×2 | MMM | 29 24 0 | 35 28 10 | 39 32 20 | 44 40 40 | 51 45 45
2×2 | SMT | 62 55 44 | 67 64 52 | 70 67 57 | 71 69 61 | 73 71 65
2×2 | SHP | 55 42 26 | 58 49 44 | 62 59 52 | 67 66 57 | 67 67 57
1×3 | FIR | 27 24 0 | 34 31 10 | 42 35 20 | 46 38 25 | 49 41 30
1×3 | MMM | 29 24 0 | 36 29 10 | 43 34 20 | 49 41 40 | 54 48 45
1×3 | SMT | 63 57 44 | 67 64 52 | 70 67 57 | 72 69 61 | 73 71 65
1×3 | SHP | 55 42 26 | 59 51 44 | 64 61 52 | 67 66 57 | 70 67 57
2×1 | FIR | 28 24 0 | 35 31 10 | 42 35 20 | 46 38 25 | 49 41 30
2×1 | MMM | 29 24 0 | 36 29 10 | 43 33 20 | 49 41 40 | 54 48 45
2×1 | SMT | 64 57 44 | 67 64 52 | 70 67 57 | 72 69 61 | 73 71 65
2×1 | SHP | 55 42 26 | 60 51 44 | 64 61 52 | 66 66 57 | 70 67 57
As the results show, for most of the benchmarks and qualities, the 3×2 voltage island has the lowest aging
rate reduction. For qualities below 70%, however, in some benchmarks, the 2×2 island case has the
minimum aging rate reduction. Increasing the island size from 2×1 to 3×2 decreased the aging rate
reduction by at most 14%. In addition, the results indicate that decreasing the minimum output quality
improves the aging rate reduction.
3.3.3.3.1 Folding Approach for Reliability Improvement
In addition to reducing the operating voltage, we consider here two cases for further improving the lifetime
by redistributing the impact of aging mechanisms over the PEs. These cases deal with exchanging the
operations (stresses) of PEs. Before explaining these cases, consider the CGRA with 2×1 islands shown in
Figure 3.14 (a), with a folding line in the middle. If some PEs on one side of the folding line have, e.g.,
lower voltages (or are power-gated), then every once in a while their operations (states) may be exchanged
with those of the PEs on the other side, which age at a higher rate. This makes the aging rate distribution
over the PEs more uniform and reduces the aging rate of the CGRA. For instance, in this CGRA, the pairs PE1
and PE4 (PE5 and PE8) are candidates for the exchange process. Of course, their neighbors should also be
exchanged so that the overall optimized mapping is not affected considerably (see the figure).
Now, let us describe the cases in a general form. In the first case, there is an active PE on one side and
an idle (power-gated) PE placed symmetrically with respect to the folding line on the other side. By
periodically moving the operation of the active PE (with its assigned operating voltage level) to the idle
PE, the delay drift of the active PE is reduced. In this case, each PE is under stress for half of the
lifetime. In the second case, there are two active PEs with two different operating voltages. Exchanging
them, together with their corresponding neighbors and related roles, with the other side leads to a more
balanced aging rate for both sides, which improves the CGRA lifetime.
The exchange (remapping) cases described above have some energy and runtime overheads and hence should be
performed only every once in a while, depending on the application at hand. Also, their efficacy strongly
depends on the number of used and idle PEs and on the voltage levels assigned to the islands. In this work,
the mapping/re-mapping and scheduling processes for every application and given minimum output quality are
performed in the offline phase (statically). The results of the mapping/re-mapping process are the context
words, which are used for programming the CGRA in different runtime slots (invocations). The proper
context words, which are stored in the context memory, are loaded into the context registers used for
configuring the PEs. The folding process itself, on the other hand, is simple and has an overhead similar
to that of reconfiguring the PEs, so it is performed online.
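The exchange pairs implied by a folding line can be enumerated mechanically. The sketch below is only an illustration, assuming a 4×4 array with PEs numbered row by row starting from PE1 (as in Figure 3.14) and a folding line that is vertical, horizontal, or the diagonal running from top-left to bottom-right; the function name and direction codes are hypothetical.

```python
def fold_partner(pe, direction, n=4):
    """Return the PE exchanged with `pe` under a folding line, or None.

    pe:        1-based PE index, numbered row by row in an n x n array.
    direction: "V" (vertical line), "H" (horizontal line), or
               "D" (diagonal line from top-left to bottom-right).
    A PE lying on the folding line has no partner and yields None.
    """
    row, col = divmod(pe - 1, n)
    if direction == "V":      # mirror the column index
        row2, col2 = row, n - 1 - col
    elif direction == "H":    # mirror the row index
        row2, col2 = n - 1 - row, col
    elif direction == "D":    # reflect across the main diagonal
        row2, col2 = col, row
    else:
        raise ValueError(f"unknown folding direction: {direction!r}")
    partner = row2 * n + col2 + 1
    return None if partner == pe else partner


for direction in ("V", "H", "D"):
    pairs = sorted(
        (pe, fold_partner(pe, direction))
        for pe in range(1, 17)
        if fold_partner(pe, direction) is not None and pe < fold_partner(pe, direction)
    )
    print(direction, pairs)
```

With a vertical folding line this mapping pairs PE1 with PE4 and PE5 with PE8, matching the candidates named above. Because a mirror reflection preserves adjacency, neighboring PEs stay neighbors after the exchange, and only the neighbor directions change.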
The position of the folding line and the exchange direction should preserve the adjacency of the PEs, which
considerably lowers the overhead of the remapping process. This way, only the direction of the neighbors in
the context word of the PEs changes. Also, it should be mentioned that there might be cases, such as high
minimum acceptable qualities combined with the use of most of the PEs of the CGRA, where no room is left
for optimizing the lifetime further using this technique. As shown in Figure 3.14, three possible folding
lines could be considered: vertical (Figure 3.14 (a)), horizontal (Figure 3.14 (b)), and diagonal (Figure
3.14 (c)). In the case of diagonal folding, because the line passes through the middle of some of the PEs,
the operations of those PEs are not exchanged with other PEs. There is also another possible diagonal
folding line, passing from top-right to bottom-left, which is not shown here.
[Figure 3.14 shows three copies of a 4×4 array of PEs (PE1 to PE16, numbered row by row), each with its
folding line marked.]
Figure 3.14. Folding a 4×4 array of PEs with respect to (a) vertical, (b) horizontal, and (c) diagonal
folding lines passing through the middle of the array.
Since each PE uses more than one operating voltage, (2-4) may not be applicable for predicting the BTI
impact. In this case, for modeling the threshold voltage drift of the PMOS transistor, we have used the
model proposed in [77], given by
$$\Delta V_{th} = \left( A_1^{1/a}\,\Delta t_1 + \cdots + A_n^{1/a}\,\Delta t_n \right)^{a} \tag{3-10}$$
where $\Delta t_k$ is the amount of time that the transistor spends at the $k$-th voltage level, $a$ is
equal to 0.173 in the case of NBTI, and $A_k$ is the technology parameter, which is extracted from (2-4)
and defined by
$$A_k \cong A\, e^{-\kappa/\theta}\, E_{OX}^{\gamma}\, d_f^{\beta} \tag{3-11}$$
Here, $E_{OX}$ is an operating-voltage-dependent parameter, used for considering the impact of the
operating voltage reduction on the lifetime of the system.
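To make the multi-voltage model concrete, the sketch below evaluates (3-10) for a PE that always runs at one voltage, for a PE that is idle half of the time (the first folding case), and for a PE that alternates between two voltage levels (the second folding case). The prefactor values `A_NOM` and `A_VOS` and the ten-year horizon are hypothetical illustration values, not values from this work, and an idle (power-gated) interval is assumed to contribute neither stress nor recovery.

```python
A_EXP = 0.173  # time exponent `a` for NBTI, as stated in the text


def vth_drift(segments, a=A_EXP):
    """Evaluate (3-10) for a list of (A_k, dt_k) pairs.

    A_k  : prefactor of the k-th voltage level (see (3-11))
    dt_k : total time spent at that level
    """
    return sum(a_k ** (1.0 / a) * dt_k for a_k, dt_k in segments) ** a


# Hypothetical prefactors for the nominal and one overscaled voltage level.
A_NOM = 1.0e-3
A_VOS = 0.6e-3
LIFETIME = 10 * 365 * 24 * 3600.0  # ten years in seconds (illustrative)

always_nominal = vth_drift([(A_NOM, LIFETIME)])
half_idle = vth_drift([(A_NOM, LIFETIME / 2)])  # folding, first case
alternating = vth_drift([(A_NOM, LIFETIME / 2), (A_VOS, LIFETIME / 2)])  # second case

print(f"half-idle / always-nominal   = {half_idle / always_nominal:.3f}")
print(f"alternating / always-nominal = {alternating / always_nominal:.3f}")
```

For a single voltage level the expression reduces to $A\,t^{a}$, so halving the stress time scales the drift by $0.5^{a}$. Because the exponent is small, the gain per PE from halving the stress time is modest under this model.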
Figure 3.15 presents the improvement of the aging rate reduction for all the benchmarks under different
minimum output qualities in the case of five voltage levels and the 1×3 and 2×1 voltage islands.
In these figures, H, V, and D stand for horizontal, vertical, and diagonal folding lines, respectively. For
each benchmark and quality constraint pair, the gain of the folding line resulting in the highest
improvement was selected. As the results indicate, the proposed folding approach could lead to up to about
30% improvement in the aging rate reduction for the considered benchmarks. Also, when the minimum output
quality is reduced, the efficiency of the proposed enhancement approach decreases in some cases.
This may be attributed to the fact that, when the minimum output quality decreases, the voltages assigned
to more PEs are reduced, which limits the gain of the exchange process. The results show that, with the
proposed folding approach, the average aging rate reduction was 49.5% and 65% for the 90% and 50% minimum
output quality constraints, respectively, corresponding to aging rate reduction improvements of 15.5% and
6.2%.
Figure 3.15. The increase in aging rate reduction (%) for all the benchmarks under different minimum output
qualities in the case of five voltage levels and (a) 2×1, (b) 1×3 voltage islands.
[Both panels are bar charts of the increase in aging rate reduction (%, axis from 0 to 35) versus minimum
output quality (90%, 80%, 70%, 60%, 50%) for FIR, MMM, SMT, and SHP. Each bar is labeled with the folding
direction that gave the highest improvement (H: horizontal, V: vertical, D: diagonal).]
In the case of the FIR benchmark for the quality of 70% when the voltage island size of 1×3 was used, the
improvement of the folding approach was zero. Since the folding impact strongly depends on the bindings
of the nodes to the PEs, there is no clear dependence between the aging rate reduction improvement and
the voltage island size increase.
The impact of the number of operating voltage levels on the efficiency of the folding process was also
studied. The results, reported in Table 3.5, contain the aging rate reduction improvement obtained with the
folding scheme for different minimum output qualities and voltage island sizes. As the results show,
exploiting the folding approach improves the aging rate reduction of the proposed structure in almost all
the cases. It should be mentioned that folding reduces the aging rate even in the cases where there was no
aging rate reduction without folding (e.g., the FIR benchmark for the minimum output quality of 90% when
two voltage levels were used). Because the reduction without folding was zero in these cases, the relative
improvement would be infinite (meaningless), and hence the NA notation is used for them in the table. Also,
the proposed folding approach resulted in larger improvements in the case of two operating voltages. This
may be attributed to the large difference between the operating voltage levels in that case.
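The NA entries follow directly from how a relative improvement is computed. The sketch below assumes the improvement is defined as the relative increase of the aging rate reduction due to folding; the definition is inferred from the remark about an infinite improvement, and the function name and the numbers in the usage lines are hypothetical.

```python
def folding_improvement(reduction_without, reduction_with):
    """Relative improvement (%) of the aging rate reduction due to folding.

    Returns None, reported as NA, when the reduction without folding is
    zero, because the ratio is then undefined.
    """
    if reduction_without == 0:
        return None
    return 100.0 * (reduction_with - reduction_without) / reduction_without


def format_entry(value):
    """Render a table entry, using NA for undefined improvements."""
    return "NA" if value is None else f"{value:.0f}"


# Hypothetical example values, not taken from Table 3.5.
print(format_entry(folding_improvement(40.0, 46.0)))
print(format_entry(folding_improvement(0.0, 5.0)))
```

Under this definition, a small baseline reduction inflates the relative improvement, which is consistent with the large entries in the two-voltage-level columns where the reduction without folding is low.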
On average, improvements of 12% and 34% were achieved in the cases of three and two operating
voltage levels, respectively. In Table 3.6, we compare the improvements of the proposed approach and of
the work in [38] over the conventional exact CGRA for the benchmarks. In the case of [38], we have
considered only the minimum output quality level of 90%, while for the structure proposed in this work,
the results for an additional minimum accuracy level of 50% are included. In addition to the power
reduction, the table contains the lifetime improvement due to the reduction of the electric field (supply
voltage) for the proposed approach. The results were obtained considering five voltage levels. For the
case of 90%, our approach cannot lower the voltages of most of the PEs much, because reducing the voltage
considerably also affects the MSBs. This lowers the power improvement compared to that of [38]. Finally,
the table contains the area overheads of both approaches; the overhead of the proposed method is
negligibly low.
Table 3.5. Aging rate reduction improvement (%) using the folding technique for different benchmarks,
different minimum output qualities, voltage island sizes, and operating voltage resolutions. Each entry
gives the improvement followed by the folding direction (H: horizontal, V: vertical, D: diagonal; NA: not
applicable). Each group of three entries corresponds to 5, 3, and 2 voltage levels, respectively.
Island | Benchmark | 90% | 80% | 70% | 60% | 50%
3×2 | FIR | 35 D, 48 H, NA D | 7 H, 31 H, 110 H | 15 D, 9 D, 30 H | 7 V, 11 V, 60 V | 15 D, 10 D, 47 V
3×2 | MMM | 32 V, 48 H, NA D | 29 H, 31 H, 190 V | 16 H, 27 H, 60 H | 10 H, 10 H, 20 H | 11 H, 0 -, 0 -
3×2 | SMT | 3 D, 11 V, 18 H | 6 D, 9 H, 0 - | 7 H, 6 V, 11 V | 9 H, 6 H, 8 D | 4 D, 3 V, 6 D
3×2 | SHP | 7 V, 16 H, 4 D | 5 H, 6 H, 20 V | 7 H, 5 V, 10 H | 5 D, 3 D, 2 V | 7 H, 1 V, 0 -
2×2 | FIR | 19 V, 21 V, NA H | 19 H, 21 V, 110 H | 15 D, 0 -, 45 H | 0 -, 0 -, 60 V | 10 D, 0 -, 17 V
2×2 | MMM | 28 H, 33 V, NA H | 14 D, 14 H, 100 D | 15 V, 9 H, 80 D | 9 H, 5 V, 8 D | 8 V, 7 H, 7 H
2×2 | SMT | 8 H, 11 V, 7 H | 7 V, 5 V, 8 H | 6 V, 9 D, 14 D | 4 H, 7 D, 7 H | 4 V, 4 V, 8 V
2×2 | SHP | 7 H, 10 H, 23 H | 7 H, 10 H, 18 D | 5 V, 7 H, 8 H | 4 H, 5 H, 5 H | 4 H, 7 H, 11 V
1×3 | FIR | 30 H, 8 V, NA H | 24 V, 45 D, 140 H | 0 -, 9 V, 60 H | 11 H, 11 V, 64 H | 6 V, 24 H, 30 H
1×3 | MMM | 10 D, 54 H, NA H | 14 D, 28 D, 140 H | 12 D, 24 D, 50 V | 6 H, 12 V, 18 H | 11 V, 0 -, 22 D
1×3 | SMT | 10 V, 0 -, 7 V | 3 V, 5 V, 10 V | 4 V, 4 V, 11 D | 6 H, 4 H, 8 D | 4 V, 4 D, 5 H
1×3 | SHP | 4 V, 17 D, 23 D | 3 V, 4 D, 16 H | 3 V, 5 V, 0 - | 0 V, 8 D, 11 D | 4 H, 4 V, 12 D
2×1 | FIR | 32 D, 21 H, NA V | 17 V, 16 H, 140 V | 12 H, 6 D, 85 V | 15 V, 13 H, 64 V | 8 H, 2 H, 30 V
2×1 | MMM | 21 D, 13 H, NA V | 3 H, 31 D, 190 V | 5 D, 15 H, 30 V | 12 V, 17 H, 13 H | 11 H, 8 H, 16 H
2×1 | SMT | 5 H, 2 D, 11 H | 7 H, 6 H, 2 H | 7 H, 4 D, 9 H | 4 V, 10 V, 10 D | 4 V, 4 H, 5 D
2×1 | SHP | 15 D, 12 H, 50 D | 3 H, 6 V, 5 H | 9 H, 11 H, 12 V | 3 D, 2 H, 16 V | 1 D, 4 V, 9 H
3.3.3.3.2 Overheads of the Proposed Structure
In our structure, there are, e.g., up to three additional control signals (for up to five voltage levels)
that select the voltage applied to each island. The values of the voltage control signals
(determined beforehand for a given accuracy level), along with other control signals, are stored in the
context memory and are routed to the CGRA structure through proper interconnect layers. Obviously, the
signal values differ between invocations of the CGRA structure. Since the additional signals form
a small fraction of all the signals, their overhead is negligible. Similarly, the area and power overheads
of the additional power switch boxes in the case of the 2×1 voltage island supporting five operating
voltage levels (800mV, 750mV, 700mV, 650mV, and 600mV) in the 15nm technology were estimated to be about
0.2% and 0.0003%, respectively. These overheads were extracted based on the power consumption and the
delay of the PEs obtained by synthesizing the PE using Synopsys Design Compiler.
Table 3.6. The improvements of the proposed method and the one suggested in [38] compared to those of the
conventional exact CGRA.
Method | Benchmark | Quality Constraint | Power Improvement | Lifetime Improvement | Area Overhead
[38] | FIR | 90% | 34% | Not Applicable | ~3.3×
[38] | MMM | 90% | 34% | Not Applicable | ~3.3×
[38] | SHP | 90% | 33% | Not Applicable | ~3.3×
[38] | SMT | 90% | 33% | Not Applicable | ~3.3×
Proposed (2×1 Voltage Island) | FIR | 90% | 14% | 28% | ~0.2%
Proposed (2×1 Voltage Island) | FIR | 50% | 27% | 49% | ~0.2%
Proposed (2×1 Voltage Island) | MMM | 90% | 15% | 29% | ~0.2%
Proposed (2×1 Voltage Island) | MMM | 50% | 30% | 54% | ~0.2%
Proposed (2×1 Voltage Island) | SMT | 90% | 37% | 64% | ~0.2%
Proposed (2×1 Voltage Island) | SMT | 50% | 43% | 73% | ~0.2%
Proposed (2×1 Voltage Island) | SHP | 90% | 31% | 55% | ~0.2%
Proposed (2×1 Voltage Island) | SHP | 50% | 40% | 70% | ~0.2%
By using the dynamic power consumption of the PE, the overall capacitance of the PE was estimated. In
addition, by employing the obtained total power consumption of the PE for each VOS level, the current
flowing through the PE was determined. Using this current and the saturation current of a PMOS switch for
one fin, the number of fins required for each PMOS switch was determined. Next, to reduce the timing
overhead of switching the VOS level, we assumed a larger number of fins such that the switching would occur
in one cycle (~490 ps). Then, the number of fins was increased in proportion to the number of PEs
inside the island. The power overhead also included the leakage power of the PMOS switches in the OFF
state.
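The switch-sizing steps described above can be sketched as a short calculation. This is only an illustration of the procedure, assuming binary-encoded select signals and a simple multiplicative margin for the one-cycle switching requirement; the function names and all numeric values (PE power, per-fin saturation current, margin) are hypothetical and are not taken from this work.

```python
import math


def select_signal_count(voltage_levels):
    """Binary-encoded select lines needed to choose one of the levels."""
    return max(1, (voltage_levels - 1).bit_length())


def switch_fins(pe_power_w, vdd_v, isat_per_fin_a, pes_in_island, margin=1.0):
    """Estimate the fin count of one PMOS power switch of an island.

    1. The PE current is the total PE power divided by the supply voltage.
    2. The fins per PE cover that current given the per-fin saturation
       current; `margin` > 1 models the oversizing for fast switching.
    3. The result is scaled by the number of PEs sharing the switch.
    """
    pe_current = pe_power_w / vdd_v
    fins_per_pe = math.ceil(margin * pe_current / isat_per_fin_a)
    return fins_per_pe * pes_in_island


print("select signals for 5 levels:", select_signal_count(5))
# Hypothetical numbers: 180 uW PE at 0.8 V, 50 uA per fin, 2 PEs per island.
print("fins without margin:", switch_fins(180e-6, 0.8, 50e-6, 2))
print("fins with 1.5x margin:", switch_fins(180e-6, 0.8, 50e-6, 2, margin=1.5))
```

With binary encoding, five voltage levels need three select signals, which agrees with the figure of up to three additional control signals quoted above.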
In the mapping process, which is performed offline, the voltage levels of the islands for a given
constraint as well as the mapping of the DFG nodes onto the PEs are determined. The island of each PE is
predetermined, and the island voltage setting together with the context word of the PE fully configures
the PE. This information is loaded into the context memory when the system starts. At runtime, based on
the required configuration of the PE, the corresponding context word is loaded from the context memory
into the context register. The latency of loading the context words depends on the bandwidth between the
context memory and the CGRA. Consequently, the mapping process induces no additional energy or latency in
the proposed approximate CGRA structure compared to the conventional one. The only latency that one might
conceive arises when, after one use of the CGRA at a given accuracy level, the next use requires a higher
level of accuracy. In this case, the voltage levels of some islands need to be increased, and some time
must be allowed for the voltage of their PEs to rise from the lower level to the higher level. Our
estimation showed that, in the worst case, this latency for the considered switch box was about 500ps,
which is around one clock period of the system. When the CGRA is not in use for a while, a power gating
scheme could be used for the PEs. With this scheme, setting the voltages of the islands requires some time
for all the PEs to reach their final supply voltage. Thus, the proposed scheme does not impose additional
latency compared to the conventional structure when a similar power gating scheme is used.
3.3.3.3.3 Comparing the Proposed CGRA with Some Related Prior Works
Table 3.7 compares our work to some prior ones in the areas of CGRA, approximate computing, and voltage
overscaling. The compared features are support for approximate computing, lifetime improvement, energy
reduction, accuracy configurability, runtime accuracy configurability, and runtime accuracy
configurability resolution. In this work, the use of the VOS approximation technique for improving the
lifetime/reliability as well as reducing the energy/power consumption of CGRAs based on voltage islanding
is proposed for the first time. This provides a platform in which output accuracy (quality) on one side
can be traded off against energy/power consumption and lifetime/reliability on the other side, while
exact computation (100% accuracy level) remains an option.
Among the previous works, only [53] and [38] focus on presenting accuracy-configurable (approximate
and exact mode computing) CGRA structures. Compared to the work of [38], our proposed method provides
lifetime improvement, while the structure suggested in [38] mainly concentrates on energy/power
reduction. In addition, our method supports different accuracy qualities with more resolution levels.
Also, in comparison to [53], this work proposes the use of VOS in a CGRA considering the actual mapping,
different voltage island sizes, and different voltage levels, and studies four benchmarks.
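Table 3.7. A comparison between the proposed work and some others in the areas of CGRA, approximate
computing, and voltage overscaling.
Columns: Work | CGRA Structure | Approximate Computing | Lifetime Improvement | Energy Reduction | Accuracy Configurability | Runtime Accuracy Configurability | Runtime Accuracy Configurability Resolution
[The check-mark entries of the feature columns are not reproduced in this transcript; the value listed
for each work is its Runtime Accuracy Configurability Resolution.]
[36] | -
[35] | -
[55] | Low
[37] | -
[56] | -
[57] | -
[58] | -
[19] | Medium
[59] | -
[60] | -
[61] | -
[40] | -
[38] | Low
This work | High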
3.4 Conclusion
To improve the lifetime (reliability) and energy efficiency of CGRAs, the use of the VOS technique was
proposed. This technique, which was applied to functional units of the PEs, was based on trade-offs between
the accuracy and lifetime. For the first time, the use of different overscaled voltage levels for different
operations of a DFG of a computational kernel was suggested. Two architectures of the CGRA were
proposed. The first type was equipped with power switch boxes for each row of PEs while the second one
was equipped with power switch boxes for each cluster of PEs providing the overscaled voltage levels
required for each application based on an output quality constraint. Also, an ILP based mapping algorithm
for determining the operating voltage level of each PE and binding of the operations of the input application
on the PEs of the CGRA was proposed. The efficiency of the proposed approaches was evaluated using
multiple benchmarks. For the second architecture, the impact of the number of the VOS levels on the energy
saving and lifetime improvement was investigated. Also, a folding technique for further improvement of
aging in the second structure was suggested. The results indicated considerable improvements in lifetime
and energy consumption: the proposed CGRAs could achieve up to 43% lower energy consumption and up to 73%
lower aging rate.
CHAPTER 4. ENERGY-EFFICIENT ACCURACY-CONFIGURABLE ADDER
AND MULTIPLIER WITH IMPROVED LIFETIME BASED ON VOLTAGE
OVERSCALING
In this chapter, first, a low-power accuracy-configurable block-based Carry Look Ahead adder (AC-CLA)
is proposed. The structure employs voltage overscaling and the number of approximate blocks as the
approximation knobs for improving the energy consumption as well as the reliability and lifetime of the
adder. In this adder, for a given accuracy level, some of the blocks work in the approximate mode by using
overscaled voltages. The efficacy of the adder depends on the number of approximate blocks as well as
the VOS levels used for these blocks. The AC-CLA adder provides energy consumption and lifetime
improvements compared to the exact CLA adder.
Second, we investigate an energy efficient accuracy configurable Dadda (X-Dadda) multiplier. This
structure employs the voltage overscaling and approximate width setting as the approximation knobs for
improving the energy consumption as well as the reliability and lifetime of the multiplier. For a given
accuracy level, the partial product columns and the overscaled voltage for optimizing the energy are
determined. To further improve the efficiency of the multiplier, four-bit truncation of the multiplier output
is also suggested. The results indicate considerable lifetime and energy consumption improvements over
those of the conventional exact ones. Also, the impact of process variations on the accuracy of the X-Dadda
is studied.
The rest of the chapter is organized as follows. Section 4.1 presents the proposed AC-CLA adder and its
related results. Next, the X-Dadda multiplier and its results are presented and discussed in Section 4.2.
Finally, the chapter is concluded in Section 4.3.
4.1 Low-power Accuracy-configurable Carry Look Ahead Adder
4.1.1 Proposed Approximate AC-CLA
4.1.1.1 Basic Idea Behind Bi-VOS
Based on (2-3), one possible way to further lower the energy consumption without violating the minimum
output quality constraint is to use different voltage levels for different parts of the circuit, in
contrast to applying the same reduced voltage level to the whole circuit. Here, one should determine
the parts of the circuit that affect the output quality less. More specifically, one may use different
voltage levels for the circuits generating the MSBs and the LSBs of the output [78]. The technique
called Bi-VOS is based on this concept [78][79]. Here, similar to these works, we suggest using
different VOS levels in the design of a block-based carry look-ahead adder (AC-CLA). Using Bi-VOS
enables us to apply VOS to some LSB block(s) while applying the nominal supply voltage to the MSB
block(s). This decreases the circuit energy consumption and enables us to improve the lifetime without
increasing the error of the adder output excessively.
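A first-order feel for the energy benefit of Bi-VOS can be obtained with a simple scaling estimate. The sketch below is not the energy model of this work: it assumes that dynamic energy scales with the square of the supply voltage, that every block switches the same capacitance, and that level shifter and leakage contributions are ignored. The function name is hypothetical; the 0.80V nominal voltage and the 0.50V VOS level are taken from the values used in this chapter.

```python
def relative_dynamic_energy(approx_fraction, v_approx, v_nominal=0.80):
    """Dynamic energy of a two-region design relative to the exact design.

    approx_fraction: share of the switched capacitance supplied by the
                     overscaled rail (e.g. 0.5 when half of the blocks
                     are in the approximate region).
    Assumes energy proportional to C * V^2 and equal capacitance per block.
    """
    approx_part = approx_fraction * (v_approx / v_nominal) ** 2
    exact_part = 1.0 - approx_fraction
    return approx_part + exact_part


# Half of the adder at 0.50 V, the other half at the nominal 0.80 V.
ratio = relative_dynamic_energy(0.5, 0.50)
print(f"relative dynamic energy: {ratio:.3f}")
print(f"estimated saving: {100 * (1 - ratio):.1f}%")
```

Under these assumptions, overscaling half of the blocks from 0.80V to 0.50V removes roughly 30% of the dynamic energy, while the blocks producing the MSBs keep their full timing margin.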
4.1.1.2 Proposed Structure
To build a CLA adder, blocks (with different sizes e.g. 4-bit, 6-bit, 8-bit) can be used in a multilevel
structure [131]. We consider the granularities of 4- and 8-bits for the blocks of the AC-CLA. As an example,
the structure with the block size of 8-bit is depicted in Figure 4.1.
In this 32-bit adder structure, there are two supply voltages: one for the accurate part, denoted here as
V_DD_AC, and one for the approximate part, denoted as V_DD_AP, whose value is determined by a power switch
box. V_DD_AC is set to the nominal voltage to make sure that the hardware responsible for the 16 most
significant output bits has enough time to produce its results, while V_DD_AP is connected to a VOS level
(lower supply voltage) to reduce the energy consumption of the hardware responsible for the 16 least
significant output bits of the adder. The carry generated by the approximate block is transferred to the
exact block through a level shifter, which elevates its logic-one voltage to the nominal supply voltage
(0.80V).
Figure 4.1. The structure of the 32-bit AC-CLA-B8 design.
[The figure shows four 8-bit blocks (Block 1 to Block 4) with inputs A[7:0]/B[7:0] to A[31:24]/B[31:24],
sum outputs S[7:0] to S[31:24], and carries C7, C15, C23, and Cout. The approximate region (block size of
eight, number of approximate blocks N_AP_B = 2) is supplied by V_DD_AP through a power switch box driven by
four approximate-level select signals, the accurate region is supplied by V_DD_AC, and level shifters (LS)
are marked in the figure.]
Additionally, level shifters are used for the sum outputs of the approximate block when they are
connected to a subsequent architecture component operating at the nominal supply voltage. It should be
noted that we have used the same level shifters for the carry and the sum outputs; however, simpler level
shifter designs, which have lower area and power but higher delay, can be deployed for the sum outputs,
which are not on the critical path of the adder. In our work, to reduce the latency overhead of the level
shifter, the low-latency level shifter circuit proposed in [125] was used. The level shifter circuit was
also realized using minimum-sized transistors to lower the area overhead. While different VOS levels,
corresponding to different accuracy levels, may be used for different blocks of the approximate region, in
this study, without loss of generality, we only considered one level for the whole approximate part. As
shown in Figure 4.1, there are two supply voltages for the AC-CLA adder. By adding just one extra
voltage rail and four PMOS transistors, the approximation feature is provided for the adder. Moreover,
changing the voltage level of the V_DD_AP rail through the power switch box provides the configurability
feature of the AC-CLA. The extra voltage rail and four PMOS transistors do not change the latency of the
adder and have very small area, delay, and power overheads. In contrast, RAP-CLA [119], a configurable
adder based on an architectural solution, needs to provide both exact and approximate components for the
adder. Another example of the higher overhead of other accuracy-configurable designs is GeAr [121], which
uses error detection and correction to build an accuracy-configurable adder. Its error correction
procedure requires one additional cycle for correcting a single error.
One of the main advantages of the VOS technique compared to other knobs for providing the
approximation feature (e.g., architectural knobs) is that it slows down aging. The simple added
voltage rail for V_DD_AP not only reduces the energy, it even slows down the aging of the blocks that
receive the lower supply voltage. The other reconfigurable adders use the nominal supply voltage for all
of their logic, which does not improve aging. Moreover, this simple added voltage rail reduces the error
of the AC-CLA after 10 years of aging.
4.1.2 Results and Discussion
4.1.2.1 Simulation Setup
All the studies were performed by employing the 15nm FinFET-based Open Cell Library (OCL) technology
[73] with the nominal operating voltage level of 0.80V. We considered five VOS levels for the approximate
part: 0.40V, 0.45V, 0.50V, 0.60V, and 0.70V. The level shifter cell, characterized for all the considered
VOS levels, was added to the technology file. For the evaluation of the efficiency, we first described the
32-bit AC-CLA structure in Verilog HDL for the considered block lengths of 4 bits and 8 bits. The study
includes varying the number of blocks in the approximate part from one to the total number of blocks
minus one. While we have used a 32-bit Carry Look Ahead adder in this study, the same approach may be used
for smaller adders, yielding the same trend of the results versus the number of blocks in the approximate
part.
After a design was described, it was synthesized with Synopsys Design Compiler (DC) for the considered
VOS levels. For all the designs, the operating voltage of the exact part was 0.80V. Because the AC-CLA
structure contains two voltage domains, the Unified Power Format (UPF), a standard for specifying the
low-power design intent of a chip, was used. After extracting the gate-level netlists of the designs and
their corresponding standard delay format (SDF) files from DC, the designs were simulated in ModelSim, and
their accuracies were determined using one million random inputs. To have a fair comparison, we forced DC
to generate the same structure for all the designs with the same number of approximate blocks. Also, for
extracting the accuracy of a design, we set the clock period for sampling the outputs to the delay of that
design in the exact mode (i.e., when the operating voltage of 0.80V was applied to the whole circuit).
Finally, the total energy consumption of each design was extracted from the design parameters reported by
DC.
4.1.2.2 The Accuracy Analysis
The error metrics MED, MRED, and MNED of the 32-bit AC-CLA with 4-bit and 8-bit blocks, denoted by B4 and
B8, respectively, are plotted in Figure 4.2 for different VOS levels of the approximate part (V_DD_AP) and
different numbers of approximate blocks (N_AP_B).
As expected, the output error increases as N_AP_B increases and V_DD_AP decreases. The irregular behavior
of the error in a few cases is due to the fact that the gate-level netlists contained different numbers of
level shifters, possibly at different positions. Also, we emphasize that we forced the DC synthesis tool to
generate the same structure for all the designs with the same number of approximate blocks.
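The three error metrics can be computed from paired exact and approximate outputs as sketched below. This is a minimal illustration, not the flow used in this work: the definitions are the commonly used ones (MED as the mean absolute error distance, MRED as the mean error distance relative to the exact result, and MNED as MED divided by a normalization constant), and the function name `error_metrics` and the parameter `max_output` are hypothetical.

```python
def error_metrics(exact, approx, max_output):
    """Return (MED, MRED, MNED) for paired exact/approximate outputs.

    Assumed definitions (commonly used; not taken from this section):
      MED  = mean |approx - exact|
      MRED = mean |approx - exact| / exact   (samples with exact == 0 skipped)
      MNED = MED / max_output
    """
    if len(exact) != len(approx) or not exact:
        raise ValueError("need two non-empty sequences of equal length")
    distances = [abs(a - e) for e, a in zip(exact, approx)]
    med = sum(distances) / len(distances)
    relative = [d / e for d, e in zip(distances, exact) if e != 0]
    mred = sum(relative) / len(relative) if relative else 0.0
    mned = med / max_output
    return med, mred, mned


# Tiny example with three additions, one of which is wrong by 16.
exact_sums = [100, 200, 400]
approx_sums = [100, 216, 400]
med, mred, mned = error_metrics(exact_sums, approx_sums, max_output=2**32)
```

In a flow like the one described above, `exact` and `approx` would hold the outputs sampled for the one million random inputs, and `max_output` would be whatever normalization the MNED definition in use prescribes.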
Figure 4.2. MED, MRED, and MNED of the 32-bit AC-CLA adder with 4-bit blocks ((a)-(c)) and 8-bit blocks ((d)-(f)) versus
N_AP_B for different V_DD_AP.
[Plot panels: MED, MRED, and MNED on logarithmic axes versus N_AP_B (1 to 7 for 4-bit blocks, 1 to 3 for 8-bit blocks), one
curve per V_DD_AP of 0.40V, 0.45V, 0.50V, 0.60V, and 0.70V.]
There is a small decrease in the MED of the AC-CLA-B4 at 0.70V when N_AP_B increases from 3 to 4 and from
6 to 7. This may be attributed to the fact that the sets of random inputs generated for calculating the MED
of each configuration could differ slightly from one another. Also, the results indicate that, for these
sets of inputs, the amount of error that occurs in the AC-CLA-B4 does not differ noticeably when N_AP_B
increases from 3 to 4 and from 6 to 7.
The maximum values of the error parameters belong to AC-CLA-B4 with V_DD_AP = 0.4V and N_AP_B = 7, which
are MED = 1.43 × 10^8, MRED = 4.40 × 10^-2, and MNED = 3.34 × 10^-2. The error sensitivity is the ratio of
the change in an error metric to the change in the approximate voltage. Our studies show that the
sensitivity of the MRED with respect to V_DD_AP is much higher at larger N_AP_B. Similarly, the sensitivity
of the MRED with respect to N_AP_B increases significantly at lower V_DD_AP values.
4.1.2.3 Design Parameters Analysis
The total energy consumption of the AC-CLA structures for different V_DD_AP values is shown in Figure 4.3.
The energy reduction grows as the number of approximate blocks increases and/or the VOS level decreases.
Since the energy reduction is achieved at the price of increased error, we suggest using a figure of merit
(FoM) for comparing the AC-CLA structures with one another and with other approximate adders. The figure of
merit combines energy, accuracy, area, and the delay degradation due to aging (Delay Degrad.) and is
defined as

\[
\mathrm{FoM} = \frac{-1 \times \log(\mathrm{MNED})}{\mathrm{Energy} \times \mathrm{Area} \times \mathrm{Delay\ Degrad.}} \tag{4-1}
\]

Since the MNED values are smaller than 1, and considerably smaller than the energy consumption values in
units of fJ, -log(MNED) is used in defining the FoM. The FoM values of the AC-CLA structures versus the
number of approximate blocks for different V_DD_AP values are shown in Figure 4.4.
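As a rough illustration of (4-1), the sketch below evaluates the FoM for one design point. The base-10 logarithm and the use of the delay degradation as a fraction (e.g., 0.20 for 20%) are assumptions, since the text does not specify them; the function name `figure_of_merit` and the numbers in the example are hypothetical.

```python
import math


def figure_of_merit(mned, energy_fj, area_um2, delay_degradation):
    """Evaluate FoM = -log(MNED) / (Energy * Area * Delay Degrad.).

    Assumptions: log is base 10, and delay_degradation is a fraction
    (0.20 means a 20% delay increase due to aging).
    """
    if not 0.0 < mned < 1.0:
        raise ValueError("MNED is expected to lie strictly between 0 and 1")
    if min(energy_fj, area_um2, delay_degradation) <= 0:
        raise ValueError("energy, area, and delay degradation must be positive")
    return -math.log10(mned) / (energy_fj * area_um2 * delay_degradation)


# Hypothetical design point: MNED = 1e-4, 10 fJ, 100 um^2, 20% degradation.
fom = figure_of_merit(1e-4, 10.0, 100.0, 0.20)
```

A larger FoM favors designs that reach a low error with little energy, area, and aging-induced slowdown.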
Figure 4.3. Energy of the AC-CLA structures versus N_AP_B for different V_DD_AP: (a) 4-bit block width, (b) 8-bit block width.
This FoM is defined for comparison purposes; preferring one approximate structure over another largely
depends on the tolerable error of the application as well as the available energy budget.
Figure 4.4. Figure of merit of the 32-bit AC-CLA adder structures versus N_AP_B for different V_DD_AP: (a) 4-bit block width,
(b) 8-bit block width.
[Plots for Figures 4.3 and 4.4: energy (fJ) and figure of merit versus N_AP_B (1 to 7 for 4-bit blocks, 1 to 3 for 8-bit
blocks), one curve per V_DD_AP of 0.40V, 0.45V, 0.50V, 0.60V, and 0.70V; the energy plots also include 0.80V, and the
figure-of-merit plots include an average (AVG) curve.]
The FoM increases with the number of approximate blocks until N_AP_B = 5 for the AC-CLA-B4 designs
(N_AP_B = 2 for the AC-CLA-B8 designs). For every number of approximate blocks, the designs with
V_DD_AP = 0.40V have the maximum FoMs, while the designs with V_DD_AP = 0.70V have the lowest FoMs, for
both the AC-CLA-B4 and AC-CLA-B8 designs.
Next, we compare the efficacy of the proposed approximate adder structures with that of some previously
suggested accuracy-configurable adders, namely, RAP-CLA [119] and GeAr [121], and one recently proposed
approximate adder, namely, BCSA [120]. The comparison, which is shown in Figure 4.5, includes the energy
consumption, area, MNED, and FoM values of a set of nine chosen previous designs.
These are three from [119] (RAP-CLA W2, RAP-CLA W4, and RAP-CLA W8), three from [120] (BCSA W2, BCSA W4,
and BCSA W8), and three from [121] (GeAr W2, GeAr W4, and GeAr W8), in addition to six AC-CLA structures.
The included AC-CLA structures are denoted by AC-CLA BXNYVZ, where X, Y, and Z indicate the adder block
size, N_AP_B, and V_DD_AP, respectively. The selected AC-CLA structures are among those with either the
minimum energy consumption, the minimum MNED, or the highest FoM. It can be inferred from the figure that
the minimum energy consumption, 2.52fJ, belongs to GeAr with a window size of 2, and the minimum MNED,
1.8 × 10^-9, belongs to B4N1V0.6.
Figure 4.5. Comparing the energy consumption, area, MNED, and FoM of different accuracy-configurable approximate adders.
[Bar chart: energy (fJ) and area (µm^2) on a linear axis (0 to 390), MNED and FoM on a logarithmic axis (1.00E-09 to
1.00E-01), for the fifteen compared adders. Value labels in plot order:
Energy (fJ): 20.64, 20.64, 11.66, 4.77, 10.84, 25.12, 2.52, 5.31, 13.33, 15.34, 7.79, 11.50, 22.05, 10.50, 12.56
Area (µm^2): 383.14, 152.86, 310.54, 93.83, 110.69, 142.88, 61.34, 121.26, 123.32, 116.29, 128.29, 124.29, 118.39, 126.39, 122.39
FoM: 3.04E-04, 1.20E-03, 1.66E-03, 6.17E-03, 3.77E-03, 1.96E-03, 1.56E-02, 5.61E-03, 3.65E-03, 1.18E-02, 2.08E-02, 2.96E-02,
7.39E-03, 2.27E-02, 1.53E-01]
Also, the maximum FoM belongs to B8N2V0.4 and is about 10× higher than the maximum FoM of the prior works.
Hence, when using the AC-CLA adder with a block size of 4 (8), the designer may choose the B4N5V0.4
(B8N2V0.4) design for the best FoM.
Therefore, this study shows the superiority of the AC-CLA structure in terms of FoM over the previous
state-of-the-art accuracy-configurable approximate adders. Also, the accuracy resolution of the AC-CLA is
finer than that of the prior configurable approximate adders. The AC-CLA accuracy resolution is determined
by the number of operating voltage levels applied to the approximate part and the spacing between them. The
resolution may be changed without any modification to the internal structure of the AC-CLA. As mentioned
before, one of the advantages of the AC-CLA structures over the state-of-the-art accuracy-configurable
approximate adders is their higher lifetime/reliability, obtained thanks to the lower electric field that
exists in the devices. To assess the improvement that can be achieved by using lower supply voltages, we
have plotted the BTI-induced delay degradation of the explored structures in Figure 4.6.
Figure 4.6. BTI-induced delay degradation of the explored structures for different V_DD_AP and N_AP_B: (a) 4-bit block width,
(b) 8-bit block width.
[Surface plots: delay degradation (0% to 50%, in bands of 10%) versus V_DD_AP (0.40V, 0.45V, 0.50V, 0.60V, 0.70V) and N_AP_B
(1 to 7 for 4-bit blocks, 1 to 3 for 8-bit blocks).]
The plots, for both the 4-bit and 8-bit block sizes, show the degradation versus V_DD_AP and N_AP_B. In the
best (worst) case, after 10 years, the critical path delay of the B8N2V0.4 (B4N1V0.7) design increases by
about 2% (45%). The delay degradation of the AC-CLA structures depends on the V_DD_AP and N_AP_B values
used during the lifetime of the adder.
For these results, we assumed that the AC-CLA structures experience all the considered VOS operating
voltage levels of this study with equal probability (P(V_VOS_i) = 1/7). The delay degradation averaged over
all VOS levels and numbers of approximate blocks was 23% for both AC-CLA-B4 and AC-CLA-B8. To guarantee
that the AC-CLA does not produce any errors after n years of operation in the exact mode, the clock
frequency for sampling the output of the adder should be decreased. On the other hand, as we showed, a
lower V_DD_AP mitigates the delay degradation rate of the AC-CLA. If the frequency is not lowered, the
adder becomes an erroneous (approximate) one. Hence, we expect less error to be induced if the clock period
is set to the delay of the AC-CLA in the exact mode after 10 years of operation. Clearly, the exact AC-CLA
operates without error after 10 years under this clock period.
Our simulation results show, for example, that the MRED reduction of the AC-CLA-B4 for N_AP_B = 7 is 77%,
and that of the AC-CLA-B8 for N_AP_B = 3 is 90%, when the clock period of the designs is set to the delay
of the AC-CLA in the exact mode after 10 years of operation.
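The guard-banding argument above can be illustrated with a small calculation. The sketch below assumes that aging scales the fresh critical-path delay by (1 + degradation), and it uses a hypothetical fresh delay of 500 ps together with the 45% worst-case degradation quoted above; the function name `aged_clock` is likewise hypothetical.

```python
def aged_clock(fresh_delay_ps, degradation):
    """Return (period_ps, frequency_ghz) that still meets timing after aging.

    Assumption: aging multiplies the fresh critical-path delay by
    (1 + degradation), where degradation is a fraction (0.45 for 45%).
    """
    if fresh_delay_ps <= 0 or degradation < 0:
        raise ValueError("delay must be positive and degradation non-negative")
    period_ps = fresh_delay_ps * (1.0 + degradation)
    frequency_ghz = 1000.0 / period_ps  # 1 / ps = THz, so 1000 / ps = GHz
    return period_ps, frequency_ghz


# Hypothetical 500 ps fresh delay with the 45% worst-case degradation.
period_ps, frequency_ghz = aged_clock(500.0, 0.45)
```

Sampling with the longer aged period keeps the exact mode error free over the lifetime, at the cost of a lower clock frequency from day one.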
4.1.2.4 Using the AC-CLA Adder in Neural Network Application
Now, we evaluate the efficacy of the proposed AC-CLA adder when it is used in a neural network (NN) that
classifies the MNIST dataset. The NN consisted of two hidden layers (each with 100 neurons using the ReLU
activation function) and was trained offline on 60K images. Then, the NN was described in Verilog, with
8-bit inputs, weights, and multiplier widths. For the addition operations in the neurons of the first layer
(second and third layers), 32-bit (24-bit) AC-CLA-B4 adders with a tree-style implementation were used. The
number of approximate blocks in each adder of each layer is denoted by N_AP_B. The other operations (i.e.,
the 8-bit by 8-bit multiplications and the ReLU function of the neurons) in the hardware implementation of
the NN were described at the behavioral level (exact mode). The clock period of the design was determined
based on the N_AP_B parameter and the VOS level considered for V_DD_AP. The results obtained in this study
showed a classification accuracy of 92.67% at the nominal voltage (exact adders). To extract the accuracy
of the NNs implemented with the AC-CLA, 10K test images were applied to the NN.

Table 4.1. Accuracy loss for the NN implemented with the selected designs of AC-CLA
              (N_AP_B,L1, N_AP_B,L2, N_AP_B,L3)*
              (7, 5, 5)               (6, 4, 4)               (4, 3, 3)
V_DD_AP (V)   Loss (%)   E Red. (%)   Loss (%)   E Red. (%)   Loss (%)   E Red. (%)
0.70          0          28.81        0          28.36        0          14.05
0.60          0          38.77        0          40.00        0          17.36
0.50          10.62      48.94        0          50.00        0          22.31
0.45          10.62      52.33        0          53.27        0          23.97
0.40          76.52      56.57        1.65       57.09        0          26.45

The accuracy losses of the selected designs, reported in Table 4.1, indicate that the highest loss (about
76.52%) occurs at V_DD_AP = 0.40V, while for most of the VOS levels the accuracy degradation was
negligible.
The negligible accuracy loss is attributed to the fact that the timing requirements of the pipelined
implementation were not violated for most of the input data. Finally, for each selected design, the table
contains the energy reduction associated with the addition operations. Using the AC-CLA at lower voltages
leads to larger energy reductions, which may be accompanied by more accuracy loss.
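One way to read Table 4.1 is as a constrained selection problem: pick the configuration with the largest energy reduction whose accuracy loss stays within a tolerance. The sketch below does this over the table's values; the function name `best_config` and the tolerance values are hypothetical.

```python
# Table 4.1 data: (V_DD_AP in volts, N_AP_B per layer) -> (loss %, energy reduction %)
TABLE_4_1 = {
    (0.70, (7, 5, 5)): (0.0, 28.81),
    (0.70, (6, 4, 4)): (0.0, 28.36),
    (0.70, (4, 3, 3)): (0.0, 14.05),
    (0.60, (7, 5, 5)): (0.0, 38.77),
    (0.60, (6, 4, 4)): (0.0, 40.00),
    (0.60, (4, 3, 3)): (0.0, 17.36),
    (0.50, (7, 5, 5)): (10.62, 48.94),
    (0.50, (6, 4, 4)): (0.0, 50.00),
    (0.50, (4, 3, 3)): (0.0, 22.31),
    (0.45, (7, 5, 5)): (10.62, 52.33),
    (0.45, (6, 4, 4)): (0.0, 53.27),
    (0.45, (4, 3, 3)): (0.0, 23.97),
    (0.40, (7, 5, 5)): (76.52, 56.57),
    (0.40, (6, 4, 4)): (1.65, 57.09),
    (0.40, (4, 3, 3)): (0.0, 26.45),
}


def best_config(table, max_loss_percent):
    """Return the (V_DD_AP, N_AP_B) key with the largest energy reduction
    among entries whose accuracy loss does not exceed max_loss_percent,
    or None if no entry qualifies."""
    feasible = {k: v for k, v in table.items() if v[0] <= max_loss_percent}
    if not feasible:
        return None
    return max(feasible, key=lambda k: feasible[k][1])


lossless = best_config(TABLE_4_1, max_loss_percent=0.0)
tolerant = best_config(TABLE_4_1, max_loss_percent=2.0)
```

With this data, a zero-loss requirement selects (6, 4, 4) at 0.45V (53.27% energy reduction), while tolerating up to 2% loss selects (6, 4, 4) at 0.40V (57.09%).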
4.2 The X-Dadda Structures
4.2.1 Design Based on Bi-VOS
Considering (2-3), using VOS for all the bits, including both the most significant and the least
significant bits, prevents taking full advantage of the large energy reduction potential of the VOS
technique [123]. To overcome this problem, inspired by the idea of Bi-VOS employed in [81] and [114], we
apply different VOS levels to different bit positions of parallel multipliers. We consider two designs for
the Dadda multiplier: the first uses only Bi-VOS, while the second employs Bi-VOS along with some bit
truncation. Using VOS for the LSBs decreases the energy and improves the lifetime without excessively
increasing the error of the multiplier output.
Owing to its considerable impact on the output accuracy, the hardware of the MSBs operates at the nominal
voltage to keep its operation error free. The structure of the considered Bi-VOS design is shown in
Figure 4.7 (for the cases with and without truncation). In this structure, there are two supply voltages:
one for the accurate part, denoted by V_DD_AC, and the other for the approximate part, denoted by V_DD_AP,
whose value is set through a power switch box (see Figure 4.7).
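The partitioning of the product columns into truncated, approximate, and accurate regions can be sketched as follows. This is only an illustration under the assumption that the regions are contiguous and ordered from the least significant column upward (truncated columns first, then region 2, then region 1), which is how Figure 4.7 is read here; the function name `column_regions` is hypothetical.

```python
def column_regions(n_bits, width_region2, truncated_bits=0):
    """Label each product column of an n_bits x n_bits multiplier.

    Assumed layout, from LSB to MSB: truncated columns, then the
    approximate region (region 2, supplied by V_DD_AP), then the
    accurate region (region 1, supplied by V_DD_AC).
    """
    total = 2 * n_bits
    if truncated_bits < 0 or width_region2 < 0:
        raise ValueError("widths must be non-negative")
    if truncated_bits + width_region2 > total:
        raise ValueError("regions exceed the number of product columns")
    labels = []
    for column in range(total):
        if column < truncated_bits:
            labels.append("truncated")
        elif column < truncated_bits + width_region2:
            labels.append("approximate")
        else:
            labels.append("accurate")
    return labels


# The two 8-bit cases of Figure 4.7: width_region2 = 9 without truncation,
# and width_region2 = 5 with 4-bit truncation.
without_truncation = column_regions(8, 9)
with_truncation = column_regions(8, 5, truncated_bits=4)
```

In both cases the accurate region comes out seven columns wide, matching the width_region1 = 7 label in the figure.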
Figure 4.7. The structure of the 8-bit X-Dadda design with width_region2 of 9 and 5, illustrating the cases without and with
truncation, respectively.
[Diagram: partial products (PP), partial product reduction (levels 1 and 2, merge), and the adder block, built from half
adders (HA), full adders (FA), and compressors. The accurate region (width_region1 = 7) is supplied by V_DD_AC; the
approximate region (width_region2 = 9 without truncation, or width_region2 = 5 with a truncate region) is supplied by V_DD_AP
through a power switch box with approximate level 1 to 4 select inputs. Level shifters (LS) sit between the V_DD_Low and
V_DD_High domains.]
The accuracy of the multiplier can be configured at runtime by changing V_DD_AP with a negligible delay
overhead. It should be noted that the approximate width (width_region2), which is another approximation
knob, is set only at design time.
We use level shifters to match the voltages of the outputs produced by the half adders, full adders, and
compressors of the approximate columns to the supply voltage of the half adders, full adders, and
compressors of the accurate columns. To lower the latency overhead of the level shifters, the low-latency
level shifter circuit proposed in [125] was used. Of course, depending on the design constraints and
requirements, other level shifters (e.g., [126]) may be used in the X-Dadda structure. To minimize the area
overhead, all of the transistors in the level shifter of [125] were minimum sized. Figure 4.9 depicts the
circuit diagram of the level shifter, and its delay versus V_DD_AP characteristic is shown in Figure 4.8.
Figure 4.8. The delay versus V_DD_AP of the level shifter.
[Plot: delay (ps, 0 to 400) versus V_DD_AP (V, 0.35 to 0.75).]
Figure 4.9. The circuit diagram of the level shifter employed in this work [125].
[Circuit diagram with input D_IN, output D_OUT, and supplies V_DD_AC and V_DD_AP.]
4.2.2 Simulation Setup
All the studies were performed using the 15nm FinFET-based Open Cell Library (OCL) technology [73] with a
nominal operating voltage (V_DD_AC) of 0.80V. We considered six VOS levels for the approximate part
(V_DD_AP): 0.40V, 0.45V, 0.50V, 0.55V, 0.65V, and 0.75V. A level shifter cell, characterized for all the
considered VOS levels, was added to the technology file. For the evaluation, we first described each 8-bit
X-Dadda structure separately in Verilog HDL for the considered approximate part width and truncation
length. In the case without truncation (with 4-bit truncation), the width of the approximate part was
varied from 2 bits to 14 bits (1 bit to 10 bits).
More specifically, 13 designs of the X-Dadda structure without truncation and 10 designs of the X-Dadda
structure with truncation were studied. The designs were synthesized with V_DD_AC and V_DD_AP set to 0.8V
and 0.45V, respectively, using the Synopsys Design Compiler tool. For each width_region2, the synthesized
gate-level netlist of the corresponding design was used for studying the impact of the V_DD_AP value on
the design parameters of the X-Dadda.
Thus, for each chosen VOS level of the approximate part, the DDC file (Synopsys internal database format)
of the structure with the considered approximate bit width and the proper technology file were fed to the
synthesis tool. Because the X-Dadda structure contains two voltage domains, the Unified Power Format (UPF),
a standard for specifying the low-power design intent of a chip, was used with the netlist during
synthesis.
It should be noted that the gate-level netlists had different numbers of level shifters, whose positions
in the netlist also tended to differ. These two factors caused the design parameters of the studied
X-Dadda structures to deviate slightly from the expected trend when the approximation width
(width_region2) is varied. Note that the width of the exact part is denoted by width_region1.
Table 4.2. Target performances for different designs without and with truncation at the supply voltage of 0.80V.
Designs without Truncation                  Designs with Truncation
width_region2   Target Performance          width_region2   Target Performance
                @ 0.80V (ps)                                @ 0.80V (ps)
1-4             750                         6               680
5, 6, 8, 13     760                         5               690
9, 11           770                         1               710
10, 14          780                         8               720
12              800                         9               730
NA              NA                          3               740
NA              NA                          4, 10           750
NA              NA                          2, 7            760
As mentioned before, in a VOS-based design, the target delay does not change when the operating voltage
level changes. In our investigation, the target delay of each of the 23 (13 + 10) designs was determined by
post-synthesis simulation of the gate-level netlist of the design with V_DD_AP = 0.8V (exact operation).
For the post-synthesis simulations, the proper SDF file was generated for each chosen VOS level of the
approximate part, using the synthesis tool with the technology file characterized at that VOS level. We
used the Synopsys VCS tool for the simulations, during which one million random inputs were applied to the
circuit.
The target delays of the considered designs are provided in Table 4.2 for the cases without and with
truncation. These delays were used to set the clock frequency of the flip-flops (operating at 0.80V) that
sampled the X-Dadda outputs. When V_DD_AP was lowered, this timing requirement was not satisfied,
potentially causing erroneous outputs (therefore, there was no need to consider special timing requirements
for the flip-flops (registers) used with the proposed multiplier). Finally, we emphasize that we forced the
DC synthesis tool to generate the same structure for all the designs with the same approximate part width.
The energy consumption of each design was also extracted from the design parameters reported by DC.
4.2.3 Results
4.2.3.1 Nominal X-Dadda Structures
The error metrics MRED, MED, and MNED of the 8-bit X-Dadda multiplier without truncation (w/o_t) and with
truncation (w_t) under different width_region2 values and VOS levels (V_DD_AP) are plotted in Figure 4.10.
As expected, the output error, and hence MED, MRED, and MNED, increases as width_region2 increases and
V_DD_AP decreases. The results for width_region2 equal to one are not shown, since the effect of the
technique is limited in that case. Among the studied structures, the maximum MED (~2.65E+03), MRED
(~8.22E-01), and MNED (~4.05E-02) belonged to the w_t X-Dadda multiplier with width_region2 of 10, 8, and
10, respectively, at V_DD_AP = 0.40V. It should be noted that, as discussed before, a few points in the
results do not quite follow the expected trend. This may be attributed to the different gates and gate
sizes that result from synthesizing the different X-Dadda structures independently.
To study the impact of adjusting the approximation knobs on the error metrics of the proposed multiplier,
we make use of an error sensitivity (ES) parameter.
Figure 4.10. MED, MRED, and MNED of the 8-bit X-Dadda multiplier without ((a)-(c)) and with ((d)-(f)) 4-bit truncation under
different V_DD_AP and width_region2.
[Plot panels: MED, MRED, and MNED on logarithmic axes versus width_region2 (2 to 14 bits without truncation, 1 to 10 bits
with 4-bit truncation), one curve per V_DD_AP of 0.40V, 0.45V, 0.50V, 0.55V, 0.65V, and 0.75V.]
The sensitivity of the error metric X (MED, MRED, or MNED) to a change of V_DD_AP (from V_DD_AP,i to
V_DD_AP,i+1) for a given width_region2_k is defined as

\[
ES_X\left(V_{DD\_AP,i},\, \mathit{width\_region2}_k\right) = \frac{E_X\left(V_{DD\_AP,i+1},\, \mathit{width\_region2}_k\right) - E_X\left(V_{DD\_AP,i},\, \mathit{width\_region2}_k\right)}{V_{DD\_AP,i} - V_{DD\_AP,i+1}} \tag{4-2}
\]

where E_X is the value of the error metric X for the specified V_DD_AP,i and width_region2_k. The index i
(0 ≤ i < 5) points to the voltages in the set {0.75V, 0.65V, 0.55V, 0.50V, 0.45V, 0.40V}. The results of
this investigation for MRED revealed that, on average (over all considered width_region2 values), ES_MRED
increased 319× (484×) when V_DD_AP decreased from 0.75V to 0.45V for the w/o_t (w_t) design. This implies
a much higher sensitivity of the structure at lower V_DD_AP values. In other words, at lower VOS levels,
going from one level to the next higher level improves MRED considerably. Similarly, the error sensitivity
to a change of width_region2 was examined. The obtained results indicated that, on average (over all
considered V_DD_AP values), ES_MRED increased 1204× (123×) when width_region2 was enlarged from 2 to 13
(1 to 8) for the w/o_t (w_t) design. Again, this indicates that when the number of approximate bits is
large, reducing it by even one bit improves MRED significantly.
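Equation (4-2) is a finite difference over adjacent voltage levels, which the sketch below computes for one width_region2. The MRED values in the example are made up for illustration and are not results from this work; the function name `error_sensitivity` is hypothetical.

```python
def error_sensitivity(voltages, errors):
    """Compute ES_X per (4-2) for adjacent voltage levels.

    voltages: V_DD_AP levels in strictly decreasing order (index i).
    errors:   error metric E_X at each level, for one width_region2.
    Returns one ES value per adjacent pair (i, i + 1).
    """
    if len(voltages) != len(errors) or len(voltages) < 2:
        raise ValueError("need at least two matching voltage/error entries")
    sensitivities = []
    for i in range(len(voltages) - 1):
        delta_v = voltages[i] - voltages[i + 1]
        if delta_v <= 0:
            raise ValueError("voltages must be strictly decreasing")
        sensitivities.append((errors[i + 1] - errors[i]) / delta_v)
    return sensitivities


# VOS levels of this study with made-up MRED values for illustration.
levels = [0.75, 0.65, 0.55, 0.50, 0.45, 0.40]
mred = [1e-6, 1e-5, 1e-4, 5e-4, 2e-3, 8e-3]
es = error_sensitivity(levels, mred)
```

Six voltage levels yield five sensitivity values, one for each index i in 0 ≤ i < 5; the ratio of the last to the first value is the kind of growth factor quoted in the text.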
Next, we study the energy consumption of the X-Dadda structure versus width_region2 with V_DD_AP as the
running parameter. As may be observed from the results, which are shown in Figure 4.11, the energy
consumption decreases when V_DD_AP decreases and/or width_region2 increases.
More specifically, when width_region2 increases from 2 to 14 (see Figure 4.11(a)), the energy reduction
grows from ~0.4% to 42% for both the w/o_t and w_t designs. For instance, for V_DD_AP = 0.4V, increasing
width_region2 from 2 to 14 (1 to 10) changes the energy reduction from 0% to 60% compared to the nominal
energy consumption in the case of the w/o_t (w_t) design. For V_DD_AP = 0.75V, the corresponding reductions
go from ~0% to 18% (1% to 21%), which is lower than at the other VOS levels. To further investigate the
characteristics of the structure, let us define the energy reduction sensitivity (ERS) to width_region2.
The sensitivity, which may also be considered as the energy gain efficiency, is calculated from the change
in the energy reduction when width_region2 increases by one.
Figure 4.11. Energy consumption of the 8-bit X-Dadda structures versus width_region2 for (a) without and (b) with truncation
designs for different VOS levels.
As a specific example, increasing width_region2 from 4 to 12 (2 to 7) for the w/o_t (w_t) design yields a
~2× (~2.2×) increase in the ERS. These values were obtained by averaging over all the VOS levels.
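The ERS described above is a first difference of the energy reduction curve. A minimal sketch follows, with made-up energy reduction values; the function name `energy_reduction_sensitivity` is hypothetical.

```python
def energy_reduction_sensitivity(reductions):
    """Return the change in energy reduction per one-bit increase of
    width_region2, given reductions (in %) at consecutive widths."""
    return [later - earlier for earlier, later in zip(reductions, reductions[1:])]


# Made-up energy reductions (%) at width_region2 = 2, 3, 4, 5.
ers = energy_reduction_sensitivity([0.0, 2.0, 5.0, 9.0])
```

A growing ERS, as in this made-up curve, means that each additional approximate bit saves more energy than the previous one.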
It should be noted that the delay and area of the switches that connect the lower V_DD levels are important
parameters in power-switching approaches. Since we use FinFET devices in this work, the number of fins
determines the delay and area overheads of the switches. We used seven devices for connecting the seven
voltage levels to region 2. The highest number of fins corresponds to the case of the largest width_region2
(largest capacitance of region 2), which was 14 for the 8-bit X-Dadda structure. The number of fins of each
switch was determined such that the IR drop on the VOS voltage stays below 10%. The numbers of fins for the
voltages of 400mV, 450mV, 500mV, 550mV, 650mV, and 750mV were chosen to be 37, 27, 22, 19, 14, and 12,
respectively. The area overhead of the employed switch boxes was less than 0.1% of the area of the proposed
X-Dadda multiplier.
[Plots of Figure 4.11: energy (fJ) versus width_region2 (2 to 14 bits without truncation, 1 to 10 bits with 4-bit truncation),
one curve per V_DD_AP of 0.40V, 0.45V, 0.50V, 0.55V, 0.65V, and 0.75V.]
Finally, we provide some error results for the 16-bit structure in Figure 4.12, which shows MRED, MED, and
MNED versus width_region2. As observed from the figure, this structure has characteristics similar to those
of the 8-bit structures: the error metrics increase as width_region2 increases. The energy consumption of
the structure versus width_region2 is plotted in Figure 4.13. Again, the energy consumption improves as the
width increases. While the energy reduction rate is higher for lower supply voltages, the error increase
rate is also higher for these voltages.
4.2.3.2 Impact of Process Variation on X-Dadda Structure
Operating a circuit below its nominal voltage in the presence of process variation may exacerbate the
impact of the variation on the circuit characteristics, because the devices may operate close to the
near-threshold region [127].
Figure 4.12. (a) MED, (b) MRED, and (c) MNED of the 16-bit X-Dadda multiplier under different V_DD_AP and width_region2.
[Plot panels: MED, MRED, and MNED on logarithmic axes versus width_region2 (odd widths from 5 bits upward), one curve per
V_DD_AP of 0.40V, 0.50V, 0.60V, and 0.70V.]
Figure 4.13. Energy consumption of the 16-bit X-Dadda structures versus width_region2 for different VOS levels.
[Plot: energy (fJ) versus width_region2, one curve per V_DD_AP of 0.40V, 0.50V, 0.60V, and 0.70V.]
In this part, we study the impact of process variation on the accuracy of the proposed multiplier structure
without truncation. The study was performed for two low VOS levels, 0.40V and 0.50V, applied to region 2
for different width_region2 values.
The study included both local and global variations. For the global process variation, Gaussian
distributions were considered for the channel length (L_g), the fin thickness (t_si), and the fin height
(H_fin) with 3σ = 10% of their nominal values, and for the gate oxide thickness with 3σ = 5% of its nominal
value [128]. The local variability was assumed only for t_si and L_g, due to the line edge roughness (LER)
phenomenon [128].
Based on the above variation models, the delay variations of the gates were extracted by Monte Carlo
simulation with the Synopsys HSPICE tool. Then, for each design, 5,000 SDF files were generated based on
the delay variations of the gates. To do this, we developed an in-house tool that applied the delay
variations of the gates to the SDF file of the design, which had been extracted by the synthesis tool (DC)
without considering process variation. This way, transistor-level data were used to capture the impact of
process variation on the gates, which, in turn, was used to investigate the impact of process variation on
the proposed multipliers.
Based on the generated SDF files, each design was simulated by VCS Synopsys tool for 5,000 times
(based on the SDF files), and the mean and standard deviation of the MED, MRED, and MNED parameters
were obtained. The extracted accuracy variations are reported in Table 4.3, which shows that the process variation does not have a significant impact on the error metrics. As the figures indicate, the mean to standard deviation ratio of every parameter is greater than six for all the designs, revealing that the designs are robust with respect to the process variation. In three out of the eight designs, the MRED decreases in the presence of the process variation in comparison to the corresponding error parameter of the nominal design. The maximum increase in the MRED is 1% and the largest reduction of the MRED is 2% compared to the nominal value of the MRED.
[Figure residue: energy (fJ), 350 to 900, versus width_region2 (bits), 5 to 27, for V_DD_AP = 0.40V, 0.50V, 0.60V, and 0.70V]
Table 4.3. The accuracy variation of the error parameters of the proposed approximate multiplier (without truncation) under the process variation for voltage levels of 0.40V and 0.50V for the approximate part (V_DD_AC = 0.80V in both cases).
| width_region2 | Statistic | MED (V_DD_AP = 0.40V) | MRED (0.40V) | MNED (0.40V) | MED (V_DD_AP = 0.50V) | MRED (0.50V) | MNED (0.50V) |
|---|---|---|---|---|---|---|---|
| 5 | Nominal | 1.84E+01 | 2.16E-03 | 2.82E-04 | 8.13E+00 | 5.81E-04 | 1.25E-04 |
| 5 | Mean | 1.86E+01 | 2.18E-03 | 2.86E-04 | 8.02E+00 | 5.78E-04 | 1.23E-04 |
| 5 | Std | 3.19E-01 | 3.55E-05 | 4.92E-06 | 2.45E-01 | 1.61E-05 | 3.78E-06 |
| 5 | Mean/Std | 58 | 61 | 58 | 33 | 36 | 33 |
| 8 | Nominal | 3.13E+02 | 5.89E-02 | 4.81E-03 | 1.31E+02 | 1.30E-02 | 2.02E-03 |
| 8 | Mean | 3.14E+02 | 5.97E-02 | 4.83E-03 | 1.32E+02 | 1.31E-02 | 2.03E-03 |
| 8 | Std | 2.87E+00 | 7.87E-04 | 4.41E-05 | 2.65E+00 | 1.64E-04 | 4.08E-05 |
| 8 | Mean/Std | 110 | 76 | 110 | 50 | 80 | 50 |
| 10 | Nominal | 8.14E+02 | 2.03E-01 | 1.25E-02 | 3.44E+02 | 3.23E-02 | 5.28E-03 |
| 10 | Mean | 8.14E+02 | 2.05E-01 | 1.25E-02 | 3.42E+02 | 3.20E-02 | 5.26E-03 |
| 10 | Std | 6.23E+00 | 2.64E-03 | 9.58E-05 | 4.58E+00 | 4.21E-04 | 7.04E-05 |
| 10 | Mean/Std | 131 | 78 | 131 | 75 | 76 | 75 |
| 14 | Nominal | 2.37E+03 | 5.46E-01 | 3.65E-02 | 1.23E+03 | 8.55E-02 | 1.89E-02 |
| 14 | Mean | 2.40E+03 | 5.52E-01 | 3.69E-02 | 1.21E+03 | 8.43E-02 | 1.86E-02 |
| 14 | Std | 2.37E+01 | 9.53E-03 | 3.65E-04 | 1.65E+01 | 8.47E-04 | 2.53E-04 |
| 14 | Mean/Std | 101 | 58 | 101 | 73 | 100 | 73 |
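The robustness argument above can be re-derived from the entries of Table 4.3. The following is a minimal sketch that recomputes the mean to standard deviation ratios and counts the designs whose mean MRED falls below the nominal value. The dictionary layout and the function names are illustrative assumptions; only the numbers come from the table. Because the table entries are rounded to three significant digits, recomputed ratios and percentages may differ slightly from the printed ones.

```python
# Table 4.3 entries keyed by (width_region2, V_DD_AP in volts).
# Each metric maps to (nominal, mean, std) exactly as printed in the table.
TABLE_4_3 = {
    (5, 0.40): {"MED": (1.84e+01, 1.86e+01, 3.19e-01),
                "MRED": (2.16e-03, 2.18e-03, 3.55e-05),
                "MNED": (2.82e-04, 2.86e-04, 4.92e-06)},
    (5, 0.50): {"MED": (8.13e+00, 8.02e+00, 2.45e-01),
                "MRED": (5.81e-04, 5.78e-04, 1.61e-05),
                "MNED": (1.25e-04, 1.23e-04, 3.78e-06)},
    (8, 0.40): {"MED": (3.13e+02, 3.14e+02, 2.87e+00),
                "MRED": (5.89e-02, 5.97e-02, 7.87e-04),
                "MNED": (4.81e-03, 4.83e-03, 4.41e-05)},
    (8, 0.50): {"MED": (1.31e+02, 1.32e+02, 2.65e+00),
                "MRED": (1.30e-02, 1.31e-02, 1.64e-04),
                "MNED": (2.02e-03, 2.03e-03, 4.08e-05)},
    (10, 0.40): {"MED": (8.14e+02, 8.14e+02, 6.23e+00),
                 "MRED": (2.03e-01, 2.05e-01, 2.64e-03),
                 "MNED": (1.25e-02, 1.25e-02, 9.58e-05)},
    (10, 0.50): {"MED": (3.44e+02, 3.42e+02, 4.58e+00),
                 "MRED": (3.23e-02, 3.20e-02, 4.21e-04),
                 "MNED": (5.28e-03, 5.26e-03, 7.04e-05)},
    (14, 0.40): {"MED": (2.37e+03, 2.40e+03, 2.37e+01),
                 "MRED": (5.46e-01, 5.52e-01, 9.53e-03),
                 "MNED": (3.65e-02, 3.69e-02, 3.65e-04)},
    (14, 0.50): {"MED": (1.23e+03, 1.21e+03, 1.65e+01),
                 "MRED": (8.55e-02, 8.43e-02, 8.47e-04),
                 "MNED": (1.89e-02, 1.86e-02, 2.53e-04)},
}


def mean_to_std_ratios(table):
    """Return {(design, metric): mean / std} for every table entry."""
    return {
        (design, metric): mean / std
        for design, metrics in table.items()
        for metric, (_nominal, mean, std) in metrics.items()
    }


def mred_relative_changes(table):
    """Return {design: (mean - nominal) / nominal} for the MRED metric."""
    return {
        design: (metrics["MRED"][1] - metrics["MRED"][0]) / metrics["MRED"][0]
        for design, metrics in table.items()
    }


ratios = mean_to_std_ratios(TABLE_4_3)
changes = mred_relative_changes(TABLE_4_3)
num_mred_decreases = sum(1 for change in changes.values() if change < 0)

print("smallest mean/std ratio above six:", min(ratios.values()) > 6)
print("designs with lower MRED under variation:", num_mred_decreases)
```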
4.2.3.3 Aging Effects
The threshold voltages of the NMOS and PMOS devices of the considered technology at the nominal voltage (i.e., 0.80V) are 0.175V and -0.190V, respectively. The corresponding threshold voltages at the overscaled voltage of 0.40V are 0.205V and -0.181V, respectively. The increases in the threshold voltage magnitudes of the NMOS and PMOS devices versus the supply voltage after ten years of stress are plotted in Figure 4.14.
The results, which were obtained using (2-4), indicate that the magnitudes of the NMOS and PMOS threshold voltages would increase by 0.151V and 0.190V (~0.000V and ~0.001V), respectively, for the supply voltage of 0.80V (0.40V).
Figure 4.14. The increase in the threshold voltage magnitude for NMOS and PMOS versus the supply voltage after ten years.
For the 0.40V operating voltage level, over ten years, the delay (time of switching) will increase about 0.3%
(0.2% for NMOS and 0.4% for PMOS). For determining the delay increase, (2-3) was utilized.
As mentioned before, one of the advantages of the X-Dadda structure compared to other accuracy configurable approximate multipliers is its higher lifetime/reliability thanks to operating at lower operating voltage levels. To demonstrate this, in Figure 4.15, the BTI induced delay degradation of the explored structures versus V_DD_AP has been plotted.
Let us consider the notation of TXWYVZ for different X-Dadda structures. In this notation, X indicates the number of truncation bits, Y shows the width of region2, and Z is the VOS level of region2. Based on the results presented in Figure 4.15, in the best (worst) case, after 10 years, the delay of the T0W14V0.4 (T0W1V0.8) increases about 4% (~50%). The delay degradation of the X-Dadda structure over time depends on V_DD_AP. By assuming that the X-Dadda structure experiences all the considered operating voltage levels of this study with equal probability (P(V) = 1/7), in the case of width_region2 = 14 (8), the delay degradation of the X-Dadda will be 18% (27%). It should be noted that these results are for the design without truncation (w/o_t). Also, it is clear that in the design with truncation (w_t), due to the fewer gates in the critical path, the delay degradation is smaller than in the w/o_t case.
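The TXWYVZ notation can be decoded mechanically. The sketch below is only an illustration of the naming convention; the class name, field names, and helper functions are hypothetical and do not come from the original design flow.

```python
import re
from typing import NamedTuple


class XDaddaConfig(NamedTuple):
    truncation_bits: int   # X: number of truncated bits
    width_region2: int     # Y: width of region2 in bits
    vos_level_v: float     # Z: VOS level of region2 in volts


# T<digits>W<digits>V<decimal number>, e.g. T0W14V0.4
_LABEL_PATTERN = re.compile(r"T(\d+)W(\d+)V(\d+(?:\.\d+)?)")


def parse_label(label: str) -> XDaddaConfig:
    """Split a TXWYVZ label into its three fields."""
    match = _LABEL_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"not a TXWYVZ label: {label!r}")
    x, y, z = match.groups()
    return XDaddaConfig(int(x), int(y), float(z))


def format_label(config: XDaddaConfig) -> str:
    """Build the TXWYVZ label for a configuration."""
    return (f"T{config.truncation_bits}"
            f"W{config.width_region2}"
            f"V{config.vos_level_v:g}")


best_case = parse_label("T0W14V0.4")
worst_case = parse_label("T0W1V0.8")
print(best_case)
print(format_label(worst_case))
```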
As the delay of the X-Dadda increases during its lifetime, the clock period for sampling its output should be increased over time accordingly to guarantee that the X-Dadda multiplier does not produce any error in its exact operating mode.
[Figure 4.14 plot: threshold voltage magnitude increase after 10 years (V), 0 to 0.2, versus supply voltage (V), 0.35 to 0.85; series: NMOS, PMOS]
Figure 4.15. BTI induced delay degradation of the explored structures for different V_DD_AP and width_region2 values for (a) without and (b) with truncation designs for different VOS levels.
On the other hand, as we have shown, due to the lower V_DD_AP, the delay degradation rate of the X-Dadda while operating in the approximate mode is lower. Hence, even when the clock period is not adjusted over time, the X-Dadda outputs are expected to have less error over a lifespan of 10 years than those of other approximate structures that are based on the circuit simplification/pruning approach (e.g., [104]). To illustrate this, in Figure 4.16, we have plotted the MRED reduction of the X-Dadda when the clock period of the designs was set to the delay of the X-Dadda in the exact mode after 10 years of operation. Obviously, the exact X-Dadda works precisely after 10 years under the considered clock period.
[Figure 4.15 plots: delay degradation (0% to 50%, shown in 10% bands) versus V_DD_AP (0.4V to 0.8V) and width_region2 (bits); first panel with width_region2 from 2 to 14, second panel (b, 4-bit truncation) with width_region2 from 2 to 10]
In the cases of the w/o_t and w_t designs, as V_DD_AP decreases, the MRED reduction increases for different width_region2 values. The highest reduction for the w/o_t (w_t) designs corresponds to V_DD_AP = 0.40V and width_region2 = 13 (V_DD_AP = 0.40V and width_region2 = 10).
4.2.3.4 Comparing X-Dadda Structures with Some Prior Approximate Multipliers
To evaluate the efficacy of the proposed X-Dadda multiplier, we have compared its characteristics with those of the prior art multipliers listed in Table 4.4. Different approximate multipliers may be compared in terms of design parameters such as error, energy, and delay. The critical design parameter is the energy, since reducing it is the main purpose of invoking the approximate computing paradigm in the first place.
Figure 4.16. MRED reduction of the X-Dadda when working in the approximate mode and the clocks of the designs were set as
the delay of the X-Dadda in exact mode after 10 years operation for (a) without and (b) with truncation.
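Table 4.4. Features of the studied multipliers
| Structure | Runtime Accuracy Configurable | VOS | Circuit Simplification/Pruning |
|---|---|---|---|
| YUS [116] | ✓ | | ✓ |
| YUS-V2 [117] | ✓ | | ✓ |
| EVO [111] | | | ✓ |
| BAM [110] | | | ✓ |
| PPAM [113] | | | ✓ |
| TruMD | | | ✓ |
| DQ [34] | ✓ | | ✓ |
| ICM [80] | | | ✓ |
| X-Dadda | ✓ | ✓ | |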
In the second tier of importance, the delay should also be considered. For the X-Dadda multiplier, different MRED values were obtained by varying the overscaled voltage, which also resulted in lower energy consumption.
Other approximate multipliers included in the comparison are BAM [110], two multipliers from EVO [111], ICM [80], PPAM [113], DQ [34], YUS [116], YUS-V2 [117], and TruMD (a Dadda multiplier whose 2, 4, 6, or 8 least significant bits are truncated). Among these multipliers, YUS, YUS_V2, and DQ have accuracy configurability during the runtime, which is the key feature (advantage) of the proposed X-Dadda structure. As discussed previously, in [34], four dual-quality reconfigurable approximate 4:2 compressors with two modes of exact and approximate were considered. These compressors had an approximate part and a supplementary part, where the supplementary part was power gated during the approximate mode. When the exact mode signal is asserted, the supplementary part is added to the approximate part to provide the exact mode for the multiplier. Also, the works in [116] and [117] used a carry-maskable adder (CMA) in the merge stage of the multiplication. This CMA can be configured to function as a carry propagation adder (CPA), a set of bit-parallel OR gates, or a combination of the two. This configurability is realized by using mask signals to mask the carry propagation.
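The three operating modes of a carry-maskable adder can be illustrated with a small behavioral model. The text gives no gate-level details of the CMA in [116] and [117], so the sketch below is only one plausible reading: it assumes that the mask covers a contiguous group of least significant positions, that each masked position outputs the OR of its two input bits, and that no carry leaves the masked group. The function name and its parameters are hypothetical.

```python
def cma_add(a: int, b: int, masked_bits: int, width: int = 8) -> int:
    """Behavioral sketch of a carry-maskable adder (assumed semantics).

    masked_bits = 0      -> behaves as an exact carry propagation adder
    masked_bits = width  -> behaves as a set of bit-parallel OR gates
    anything in between  -> OR gates on the low part, exact adder above it
    """
    if not 0 <= masked_bits <= width:
        raise ValueError("masked_bits must lie between 0 and width")
    low_mask = (1 << masked_bits) - 1
    # Masked positions: carries are suppressed, so each sum bit is a OR b.
    low_part = (a | b) & low_mask
    # Unmasked positions: ordinary addition with no carry-in from below.
    high_part = ((a >> masked_bits) + (b >> masked_bits)) << masked_bits
    return high_part | low_part


print(cma_add(0b1011, 0b0110, masked_bits=0))  # exact addition
print(cma_add(0b1011, 0b0110, masked_bits=2))  # two masked LSB positions
print(cma_add(0b1011, 0b0110, masked_bits=8))  # pure bitwise OR
```

Under these assumptions, increasing `masked_bits` shortens the carry chain (and hence the delay) at the cost of a larger arithmetic error, which is the accuracy knob that such designs expose at runtime.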
In Figure 4.17, we have used a 3-D plot for showing the MRED value versus the energy and delay of
the multipliers considered in the comparative study.
Figure 4.17. The MRED vs energy and delay for different approximate multipliers.
In the case of fixed accuracy designs, each point indicates the delay and energy of a structure for its given error. Different points for this type of structure correspond to different instances of the design. For the accuracy configurable multipliers, different points correspond to the change of the approximation level during the runtime through their respective configuration mechanisms. For a better illustration, Figure 4.18 and Figure 4.19 show the Energy-MRED and Delay-MRED features of the approximate multipliers using 2-D plots.
Figure 4.18. Energy vs MRED for different approximate multipliers.
[Figure 4.18 plot: energy (fJ), 0 to 250, versus MRED, -0.01 to 0.85; a second set of ticks (120 to 220 and 0.000 to 0.013) suggests a zoomed-in view of the low-MRED region; series: DQ, ICM, EVO, TruMD, PPAM, BAM, YUS, YUS_V2, T4W8, T4W10, T0W4, T0W10, T0W12, T0W14]
Figure 4.19. Delay vs MRED for different approximate multipliers.
[Figure 4.19 plot: delay (ps), 0 to 1000, versus MRED, -0.01 to 0.85; series: DQ, ICM, EVO, TruMD, PPAM, BAM, YUS, YUS_V2, T4W8, T4W10, T0W4, T0W10, T0W12, T0W14]
For accuracy configurable designs, the energy versus output error characteristic normally shows a Pareto-optimal behavior, meaning that improving the accuracy comes at the cost of an energy increase. For fixed accuracy structures, different designs are located at different points in the energy-error plane. Of course, changes might be applied to the designs themselves to obtain multipliers with different energy-error points.
The energy comparison, which is shown in Figure 4.18, reveals a Pareto-optimal behavior for most of the characteristics, as was expected. The proposed X-Dadda structure can provide lower energy consumption for lower MRED values when compared to other structures. When truncation is used for the X-Dadda structure, the energy efficiency improves. More specifically, for small MRED values, T4W10 shows the lowest energy consumption. For MREDs between 0.015 and 0.10, TruMD (b-bit truncation) has the lowest energy consumption. The TruMD structure, however, is not a configurable multiplier.
For energy consumption below 84fJ, T4W10, PPAM, BAM, EVO, and DQ are the designs that can be used. Among these designs, PPAM is the most accurate one. Our T4W10 design can also provide this amount of energy consumption with a lower accuracy. One of the DQ designs has the lowest energy consumption at the MRED value of 0.37. The second highest accuracy design is ICM, which has one of the highest energy consumptions. The YUS design has the highest energy consumption while YUS_V2 has lower energy consumption compared to that of YUS. The YUS and YUS_V2 designs also have the advantage of accuracy configurability during the runtime. Another observation is that the MRED values of the X-Dadda structure span a wide range. For instance, in T4W10, the MRED spans from 0.005 to 0.6 (energy consumption 75 to 154fJ), which provides the user with different accuracy levels based on the energy budget and application accuracy requirement.
Comparing the multipliers in terms of delay shows the lowest delays for the YUS and YUS_V2 structures. The X-Dadda structure without truncation, which keeps the critical paths of the exact multiplier, has a higher delay. The delay for the truncated version of this structure is obviously lower than that of the version without truncation. It should be noted that the delays for the X-Dadda structures were determined based on the exact mode and hence do not change with the MRED. For fixed accuracy structures, lower MRED values are achieved with less simplification/pruning and hence higher delays are expected. This would normally result in a Pareto-optimal behavior.
Another design parameter of considerably less importance is the area. In the case of approximate multipliers based on circuit simplification/pruning, the area would be lower than that of the exact circuit, while in the case of approximate structures relying on voltage overscaling, the area will be slightly larger than that of the exact circuit due to the overhead associated with the power switches and level shifters.
Finally, it should be stated that since low bit-width multipliers are widely used in image processing and machine learning applications, we focused on presenting accuracy configurable 8-bit approximate multipliers. For the applications requiring multipliers with higher numbers of bits, similar comparative studies may be performed.
4.2.3.5 Using the X-Dadda multiplier in Error Resilient Applications
Now, we evaluate the efficacy of the proposed X-Dadda multiplier for realizing neural networks in the
applications of classifying the MNIST dataset, image sharpening, image smoothing, and discrete cosine
transform (DCT).
4.2.3.5.1.1 Classifying the MNIST Dataset
The neural network structure consisted of two hidden layers (with 100 neurons with the ReLU activation function in each layer) and was trained offline with 60K images. Then, the NN was described in Verilog HDL, where the inputs and the weights as well as the width of the multipliers were 8 bits.
While we used floating-point operations for the training of the neural network, in the hardware implementation used for the evaluation study, the inputs, weights, and biases were converted to 8-bit, 8-bit, and 16-bit integer numbers, respectively. The conversion caused some accuracy degradation for the network. Obviously, training the neural network considering integer inputs, weights, and biases as well as the proposed approximate multiplier will increase the accuracy of the neural network.
Other operations (e.g., the 24-bit addition and the ReLU function in each neuron of the layers) in the hardware implementation of the NN were described at the behavioral level (exact mode). The clock period of the design was determined based on the width_region2 parameter and the VOS level considered for V_DD_AP (see Table 4.2). The results obtained in this study showed a classification accuracy of 92.67% for the nominal voltage (exact multipliers).
The accuracy loss (compared to the exact implementation) versus V_DD_AP, when the X-Dadda structures (without truncation) with width_region2 values of 8 and 10 bits were employed, is plotted in Figure 4.20.
Figure 4.20. The accuracy loss of the considered neural network by using different X-Dadda structures as well as V_DD_AP values compared to the case of the exact implementation.
[Figure 4.20 plot: accuracy loss (%), 0 to 9, versus V_DD_AP (V) at 0.75, 0.65, 0.55, 0.50, 0.45, and 0.40; series: width_region2 = 8 and width_region2 = 10; bar labels in extraction order: 0.31, 0, 0.05, 0.04, 1.23, 6.97, 0, 0, 0.08, 1.77, 2.66, 7.86]
For extracting the accuracy, 10K test images were fed to the NN. The highest accuracy loss for both selected designs belonged to V_DD_AP = 0.40V (about 7.86%), while for most of the VOS levels, the accuracy degradation was negligible or even zero. As an example, decreasing V_DD_AP from 0.80V to 0.50V led to, on average, only 1.49% accuracy loss while consuming lower energy for performing the classification. Hence, for this specific neural network structure, the user can use the X-Dadda multiplier with width_region2 = 10 and VOS level of 500mV for minimum energy consumption while having accuracy loss below 2%.
4.2.3.5.1.2 Image Sharpening
For the Sharpening application, each pixel can be extracted by [129]
$$Y(i,j) = 2X(i,j) - \frac{1}{273} \sum_{m=-2}^{2} \sum_{n=-2}^{2} X(i+m,\, j+n) \cdot Mask_{Sharpening}(m+3,\, n+3) \tag{4-3}$$
where X and Y are the input and output images and
$$Mask_{Sharpening} = \begin{bmatrix} 1 & 4 & 7 & 4 & 1 \\ 4 & 16 & 26 & 16 & 4 \\ 7 & 26 & 41 & 26 & 7 \\ 4 & 16 & 26 & 16 & 4 \\ 1 & 4 & 7 & 4 & 1 \end{bmatrix} \tag{4-4}$$
The PSNR values and the energy saving for this application for five input images are reported in Table 4.5
(where inf corresponds to no changes in the image quality).
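Equations (4-3) and (4-4), together with the PSNR metric used in the tables, can be sketched in a few lines. The code below is an illustrative software model, not the hardware implementation of the study. It assumes that border pixels are copied unchanged, that results are rounded and clamped to the 8-bit range, and that the PSNR uses the usual definition with a peak value of 255; none of these choices is stated in the text. The `multiply` argument is a hypothetical hook through which a model of an approximate multiplier could be supplied.

```python
import math

# Mask of (4-4); its entries sum to 273, the normalization factor in (4-3).
MASK_SHARPENING = [
    [1, 4, 7, 4, 1],
    [4, 16, 26, 16, 4],
    [7, 26, 41, 26, 7],
    [4, 16, 26, 16, 4],
    [1, 4, 7, 4, 1],
]


def sharpen(image, multiply=lambda a, b: a * b):
    """Apply (4-3) to the interior pixels of a 2-D list of 8-bit values."""
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]  # assumption: borders stay unchanged
    for i in range(2, rows - 2):
        for j in range(2, cols - 2):
            acc = 0
            for m in range(-2, 3):
                for n in range(-2, 3):
                    # (4-3) indexes the mask from 1, hence m + 2 here.
                    acc += multiply(image[i + m][j + n],
                                    MASK_SHARPENING[m + 2][n + 2])
            value = 2 * image[i][j] - acc / 273
            out[i][j] = min(255, max(0, round(value)))  # assumption: clamp
    return out


def psnr(reference, test, peak=255):
    """PSNR in dB; returns inf when the two images are identical."""
    squared_errors = [
        (r - t) ** 2
        for ref_row, test_row in zip(reference, test)
        for r, t in zip(ref_row, test_row)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


flat = [[100] * 6 for _ in range(6)]
exact = sharpen(flat)
print(psnr(exact, sharpen(flat)))  # identical outputs give inf
```

To estimate an entry of the kind reported in the tables, one would compare `sharpen(image)` against `sharpen(image, multiply=approx)`, where `approx` models the approximate multiplier under study.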
The results show that decreasing the VOS level from 500mV to 400mV causes, on average, the PSNR (energy saving) to decrease (increase) by 7% (15%) compared to that of the exact computation. Also, in the studied VOS levels, decreasing the width_region2 from 12 to 6 bits, on average, leads to an increase (decrease) of 30% (76%) in the PSNR (energy saving) compared to the case of the exact computation. In the case of width_region2 = 4, while the output quality does not decrease, an energy reduction of 2.5% is achieved. Hence, for the Sharpening application, the user can use the X-Dadda multiplier with width_region2 = 10 and VOS level of 400mV for minimum energy consumption while having acceptable quality (PSNR ≥ 30).
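Table 4.5. The PSNR of the Sharpening application for different images and approximate multipliers with two VOS levels.
| Image (VOS level / width_region2) | 400mV / 4 | 400mV / 6 | 400mV / 8 | 400mV / 10 | 400mV / 12 | 500mV / 4 | 500mV / 6 | 500mV / 8 | 500mV / 10 | 500mV / 12 |
|---|---|---|---|---|---|---|---|---|---|---|
| Lena | inf | 41.53 | 33.75 | 33.34 | 29.55 | inf | 47.24 | 36.37 | 32.09 | 31.04 |
| Baboon | inf | 39.94 | 33.94 | 31.11 | 27.49 | inf | 45.22 | 36.80 | 33.18 | 30.78 |
| Splash | inf | 42.72 | 37.16 | 37.33 | 31.36 | inf | 47.80 | 41.60 | 36.86 | 35.35 |
| Airplane | inf | 43.56 | 31.34 | 32.60 | 30.95 | inf | 50.05 | 39.09 | 29.73 | 31.15 |
| Pepper | inf | 40.57 | 34.06 | 33.97 | 28.92 | inf | 45.77 | 36.62 | 33.25 | 32.07 |
| Energy Reduction (%) | 3% | 10% | 23% | 36% | 44% | 2% | 9% | 20% | 32% | 36% |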
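The choice of an operating point from Table 4.5 can be expressed as a small search: keep the configurations that meet the quality threshold and pick the one with the largest energy reduction. The sketch below assumes that "acceptable quality" means every one of the five images reaches a PSNR of at least 30; the text does not say whether the threshold applies per image or on average. The data structures and the function name are illustrative.

```python
# PSNR values (dB) from Table 4.5 in the order Lena, Baboon, Splash,
# Airplane, Pepper, keyed by (VOS level in mV, width_region2).
# width_region2 = 4 is omitted because its PSNR is reported as inf.
PSNR_DB = {
    (400, 6): [41.53, 39.94, 42.72, 43.56, 40.57],
    (400, 8): [33.75, 33.94, 37.16, 31.34, 34.06],
    (400, 10): [33.34, 31.11, 37.33, 32.60, 33.97],
    (400, 12): [29.55, 27.49, 31.36, 30.95, 28.92],
    (500, 6): [47.24, 45.22, 47.80, 50.05, 45.77],
    (500, 8): [36.37, 36.80, 41.60, 39.09, 36.62],
    (500, 10): [32.09, 33.18, 36.86, 29.73, 33.25],
    (500, 12): [31.04, 30.78, 35.35, 31.15, 32.07],
}

# Energy reduction (%) from the last row of Table 4.5.
ENERGY_REDUCTION_PCT = {
    (400, 6): 10, (400, 8): 23, (400, 10): 36, (400, 12): 44,
    (500, 6): 9, (500, 8): 20, (500, 10): 32, (500, 12): 36,
}


def best_configs(min_psnr_db=30.0):
    """Return the feasible configurations with the largest energy reduction."""
    feasible = [cfg for cfg, values in PSNR_DB.items()
                if min(values) >= min_psnr_db]
    best = max(ENERGY_REDUCTION_PCT[cfg] for cfg in feasible)
    return sorted(cfg for cfg in feasible
                  if ENERGY_REDUCTION_PCT[cfg] == best)


print(best_configs())
```

With the rounded figures of the table, this search returns the configuration with width_region2 = 10 at 400mV and, at the same 36% energy reduction, width_region2 = 12 at 500mV.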
4.2.3.5.1.3 Smoothing Application
For the Smoothing application, the relation between the output and input images is given by [129]
$$Y(i,j) = \frac{1}{60} \sum_{m=-2}^{2} \sum_{n=-2}^{2} X(i+m,\, j+n) \cdot Mask_{Smoothing}(m+3,\, n+3) \tag{4-5}$$
where X and Y are the input and output images and
$$Mask_{Smoothing} = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 4 & 4 & 4 & 1 \\ 1 & 4 & 12 & 4 & 1 \\ 1 & 4 & 4 & 4 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix} \tag{4-6}$$
The PSNR values and the energy reduction for the approximate Smoothing application for five input images are reported in Table 4.6.
Similar to the Sharpening application, in the case of width_region2 = 4, there was no output quality loss while about 2.5% energy reduction was achieved. Also, based on the reported results, by decreasing the VOS level from 500mV to 400mV, the PSNR (energy saving) is, on average, reduced (increased) by 18% (14%) compared to that of the exact computation. Also, for the studied VOS levels, by decreasing the width_region2 from 12 to 6 bits, the PSNR (energy saving), on average, is increased (reduced) by 33% (76%). Hence, for the Smoothing application, the user can use the X-Dadda multiplier with width_region2 = 12 and VOS level of 500mV for minimum energy consumption while having acceptable quality (PSNR ≥ 30).
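Table 4.6. The PSNR of the Smoothing application for different images and approximate multipliers with two VOS levels.
| Image (VOS level / width_region2) | 400mV / 4 | 400mV / 6 | 400mV / 8 | 400mV / 10 | 400mV / 12 | 500mV / 4 | 500mV / 6 | 500mV / 8 | 500mV / 10 | 500mV / 12 |
|---|---|---|---|---|---|---|---|---|---|---|
| Lena | inf | 42.17 | 34.24 | 24.81 | 23.81 | inf | 44.98 | 36.83 | 34.17 | 33.83 |
| Baboon | inf | 39.39 | 31.35 | 21.05 | 20.09 | inf | 41.01 | 35.69 | 32.36 | 31.00 |
| Splash | inf | 44.84 | 36.70 | 29.45 | 28.17 | inf | 50.98 | 40.83 | 36.33 | 38.43 |
| Airplane | inf | 44.34 | 35.97 | 26.56 | 25.52 | inf | 46.46 | 41.19 | 37.46 | 36.43 |
| Pepper | inf | 41.33 | 33.92 | 25.64 | 24.40 | inf | 42.41 | 36.53 | 33.54 | 33.15 |
| Energy Reduction (%) | 3% | 10% | 23% | 36% | 44% | 2% | 9% | 20% | 32% | 36% |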
4.2.3.5.1.4 DCT Application
In this application, we approximate the two multiplication operations of the 8-point discrete cosine transform (DCT) expressed by [130]
$$DCT = D \times A \times D' \tag{4-7}$$
where $A$ is a block of an input image and $D$ is the DCT matrix ($D'$ is the inverse of $D$). In our study, the values in the $D$ matrix were transformed to integer values by multiplying them by $2^8$. The transformed $D$ matrix is as below:
$$D = \begin{bmatrix} 91 & 91 & 91 & 91 & 91 & 91 & 91 & 91 \\ 126 & 106 & 71 & 25 & -25 & -71 & -106 & -126 \\ 118 & 49 & -49 & -118 & -118 & -49 & 49 & 118 \\ 106 & -25 & -126 & -71 & 71 & 126 & 25 & -106 \\ 91 & -91 & -91 & 91 & 91 & -91 & -91 & 91 \\ 71 & -126 & 25 & 106 & -106 & -25 & 126 & -71 \\ 49 & -118 & 118 & -49 & -49 & 118 & -118 & 49 \\ 25 & -71 & 106 & -126 & 126 & -106 & 71 & -25 \end{bmatrix} \tag{4-8}$$
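The integer matrix in (4-8) can be reproduced from the scaling step described above. The sketch below assumes the standard orthonormal 8-point DCT-II definition, which the text does not spell out, and rounds each scaled coefficient to the nearest integer; under these assumptions the generated matrix matches (4-8) entry by entry.

```python
import math

N = 8           # 8-point DCT
SCALE = 2 ** 8  # scaling factor stated in the text


def scaled_dct_matrix(n=N, scale=SCALE):
    """Return the n-point DCT matrix scaled by `scale` and rounded."""
    matrix = []
    for k in range(n):
        # Orthonormal DCT-II row scaling (assumed definition).
        c = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        row = [
            round(scale * c * math.cos((2 * j + 1) * k * math.pi / (2 * n)))
            for j in range(n)
        ]
        matrix.append(row)
    return matrix


# Matrix copied from (4-8) for comparison.
D_FROM_4_8 = [
    [91, 91, 91, 91, 91, 91, 91, 91],
    [126, 106, 71, 25, -25, -71, -106, -126],
    [118, 49, -49, -118, -118, -49, 49, 118],
    [106, -25, -126, -71, 71, 126, 25, -106],
    [91, -91, -91, 91, 91, -91, -91, 91],
    [71, -126, 25, 106, -106, -25, 126, -71],
    [49, -118, 118, -49, -49, 118, -118, 49],
    [25, -71, 106, -126, 126, -106, 71, -25],
]

print(scaled_dct_matrix() == D_FROM_4_8)
```

The largest magnitude is 126, so every coefficient fits in a signed 8-bit operand, which is consistent with using an 8-bit multiplier for the first multiplication.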
For implementing this part, for the first (second) multiplication, an 8-bit (16-bit) approximate Dadda multiplier was used. To evaluate the output quality of the approximated DCT block, the results of the DCT were applied to an inverse DCT (IDCT) block to retrieve the image. The PSNR values of the output (approximate) images as well as the energy reduction of the DCT part are reported in Table 4.7. The considered width_region2 values of these two multipliers are indicated by the tuple (x, y), where x (y) denotes the width_region2 of the 8-bit (16-bit) proposed approximate Dadda multiplier. From the results, one can observe that by decreasing the widths from (6, 15) to (2, 5), the output quality (energy saving) is increased (reduced), on average, by about 24.3% (94%). Hence, for the DCT part, the user can use the X-Dadda multiplier with width_region2 (8-bit, 16-bit) = (6, 15) and VOS level of 500mV for minimum energy consumption while having acceptable quality (PSNR ≥ 30).
Table 4.7. The PSNR of the DCT-IDCT application for different images and approximate multipliers with two VOS levels.
| VOS level | Image / width_region2 (8-bit, 16-bit) | (2, 5) | (2, 9) | (4, 5) | (4, 9) | (6, 5) | (6, 9) | (6, 11) | (6, 13) | (6, 15) |
|---|---|---|---|---|---|---|---|---|---|---|
| 400mV | Lena | 49.84 | 49.83 | 49.84 | 47.13 | 49.79 | 42.04 | 38.29 | 34.75 | 30.18 |
| 400mV | Baboon | 50.66 | 50.66 | 50.66 | 49.31 | 50.58 | 41.61 | 37.37 | 34.44 | 31.90 |
| 400mV | Splash | 48.69 | 48.69 | 48.69 | 47.55 | 48.49 | 41.90 | 37.52 | 34.18 | 29.91 |
| 400mV | Airplane | 49.34 | 49.34 | 49.34 | 44.33 | 49.31 | 40.82 | 37.88 | 33.39 | 27.83 |
| 400mV | Pepper | 49.22 | 49.22 | 49.22 | 49.22 | 49.16 | 40.45 | 37.00 | 33.80 | 29.74 |
| 400mV | Energy Reduction (%) | 1% | 6% | 1% | 6% | 2% | 7% | 12% | 17% | 23% |
| 500mV | Lena | 49.83 | 49.83 | 49.83 | 49.83 | 49.83 | 49.83 | 49.09 | 48.01 | 45.45 |
| 500mV | Baboon | 50.66 | 50.66 | 50.66 | 50.66 | 50.66 | 50.66 | 48.78 | 47.95 | 46.29 |
| 500mV | Splash | 48.69 | 48.69 | 48.69 | 48.69 | 48.69 | 48.69 | 48.32 | 47.16 | 45.08 |
| 500mV | Airplane | 49.34 | 49.34 | 49.34 | 49.34 | 49.34 | 49.34 | 48.98 | 47.84 | 45.25 |
| 500mV | Pepper | 49.22 | 49.22 | 49.22 | 49.22 | 49.22 | 49.22 | 48.32 | 46.85 | 43.37 |
| 500mV | Energy Reduction (%) | 1% | 5% | 1% | 5% | 2% | 6% | 9% | 14% | 19% |
4.3 Conclusion
In this work, we have explored the design of the accuracy configurable approximate Carry Look Ahead adder (AC-CLA) based on the voltage overscaling technique. The use of VOS provided lower energy consumption and a higher lifetime. The AC-CLA adder had the flexibility of controlling the output accuracy during the runtime of the system by adjusting the operating voltage level of the approximate part and could change its operation mode to the exact mode. The error, energy consumption, and BTI induced delay degradation of the adder versus the number of approximated blocks and applied overscaled voltages were determined. Finally, we determined the efficacy of the proposed AC-CLA structure in realizing a neural network. The study revealed a small accuracy degradation compared to the case of using the exact adder.
Also, we have explored the design of the accuracy configurable approximate Dadda multiplier (X-Dadda) based on the voltage overscaling (VOS) technique. The X-Dadda multiplier had the flexibility of controlling the output accuracy during the runtime of the system by adjusting the operating voltage level of the approximate part and could change its operation mode to the exact mode. The error, energy consumption, and lifetime of the multiplier versus the approximation width and applied VOS level were determined. The exploration results indicated that, e.g., when the width of the approximate part was half of the width of the multiplier and its operating voltage was half of the nominal voltage, the multiplier consumed 21% less energy at the cost of only 0.057 MRED. In addition, in this case, the delay increase due to the BTI effect was only 28% of the delay increase of the multiplier in the exact operating mode. The study also included the investigation of the impact of the process variation on the error metric of the multiplier, where the results did not indicate a considerable deterioration of the error. Finally, we determined the efficacy of the proposed X-Dadda structure in realizing neural networks for the applications of MNIST classification, image smoothing and sharpening, and DCT.
CHAPTER 5. X-NVDLA: RUNTIME ACCURACY CONFIGURABLE NVDLA
BASED ON EMPLOYING VOLTAGE OVERSCALING APPROACH
This chapter investigates a runtime accuracy reconfigurable implementation of an energy efficient deep learning accelerator. It is based on the voltage overscaling technique, which provides dynamic adjustment of the approximation level as well as improving the lifetime/reliability of the accelerator. The technique is applied to both computing and memory units where, based on the minimum required accuracy, the applied voltage is adjusted during the runtime. The implementation of the network is performed using NVDLA, which is an open-source CNN accelerator. The approximation is applied both to the accelerator MAC array employed for the required network computations and to the accelerator SRAM memory utilized for storing the inputs (images), weights, and activation data. In addition, the characteristic of bias temperature instability, as one of the lifetime deteriorating phenomena, is determined. The study includes energy improvement versus accuracy degradation as a function of overscaled voltages and number of approximate least significant bits using a 14nm FinFET technology.
The rest of the chapter is organized as follows. Section 5.1 presents the proposed X-NVDLA accelerator.
Next, the simulation setup is discussed in section 5.2. The results and discussion are presented in section
5.3. Finally, the chapter is concluded in section 5.4.
5.1 X-NVDLA
In this section, X-NVDLA which is implemented by applying the VOS approximation technique to the
MAC array and the convolution buffer of the NVDLA is discussed.
5.1.1 Approximate MAC Array (AxC)
To reduce the energy consumption of the MAC array, either the adder or the multiplier may be approximated. Since the energy consumption of the multiplier is much larger than that of the adder, we focused on approximating the multiplier. There are different approximate multiplier structures which can be employed here. In this work, we chose the X-Dadda multiplier whose structure is shown in Figure 4.7 [132]. The energy consumption reduction as well as the accuracy reconfigurability is achieved by applying different voltages to different bits in the multiplier, which has the same structure as that of the exact one. More specifically, the X-Dadda multiplier has two approximation knobs. The first one is the width of the approximate part of the X-Dadda multiplier, which is shown as ap_width in the figure. The second knob is the VOS level of the approximate part, which is denoted as V_DD_AP in Figure 4.7.
These two parameters should be determined for each multiplier in the MAC array of the accelerator. The first parameter is the number of approximate bits of the multiplication (W_AxC, which is ap_width in Figure 4.7) and the second one is the VOS level considered for the approximate bits (V_DD,AxC, which is V_DD_AP in Figure 4.7). W_AxC can be either a design time or runtime parameter in the range of 1 to 14. In this work, we considered W_AxC as a design time parameter set to 9, 11, or 13 bits. On the other hand, V_DD,AxC is a runtime parameter which enables us to dynamically adjust the accuracy. It distinguishes this approximate multiplier from other approximate multipliers whose approximation level is either fixed or limited to a couple of accuracy levels. For the technology considered in this work, V_DD,AxC may take any value from {750mV, 650mV, 500mV, 400mV}. The granularity of assigning different VOS levels to the multipliers in the MAC array can be fine, medium, or coarse (FG, MG, and CG, respectively).
The implementation of AxC requires different VOS levels and a power switch box (PSB), whose overheads are low [123][94]. In the fine-grain design scheme, each multiplier in the MAC array can be assigned an independent VOS level. An example of the fine-grain design is shown in Figure 5.1(a) where the structure of the required power switch box is shown in Figure 5.2.
In the medium grain scheme, each row is assigned an independent voltage, while in the coarse-grain design all the multipliers in the whole array share one VOS level. An example of the medium grain type of design is demonstrated in Figure 5.1(b). It should be noted that the pipeline register which is placed between the multiplier and the adder should be a level-shifting one so that the voltage levels of the input data of the adders correspond to the nominal voltage level (V_DD,nom).
Figure 5.1. An example for (a) Fine-Grain type design of the MAC array and (b) Coarse-Grain type design of the MAC array.
While all of these design schemes may be applied to the NVDLA hardware, in this work, we only considered the fine-grain (FG) and coarse-grain (CG) ones.
The NVDLA architecture is a weight stationary structure where the weights for the multiplier units will be fixed for several cycles (see, e.g., [133]). Thus, to implement our proposed energy-efficient accelerator using the FG scheme, we suggest that the VOS level of each multiplier be chosen based on the magnitude of its weight (using an interval width of 5, 7.5, or 10), as given below:
$$VOS = \begin{cases} 400\,\mathrm{mV}, & 0 \le |w| < 5 \\ 500\,\mathrm{mV}, & 5 \le |w| < 10 \\ 650\,\mathrm{mV}, & 10 \le |w| < 15 \\ 750\,\mathrm{mV}, & 15 \le |w| < 20 \\ 800\,\mathrm{mV}, & 20 \le |w| \end{cases} \tag{5-1}$$
$$VOS = \begin{cases} 400\,\mathrm{mV}, & 0 \le |w| < 7.5 \\ 500\,\mathrm{mV}, & 7.5 \le |w| < 15 \\ 650\,\mathrm{mV}, & 15 \le |w| < 22.5 \\ 750\,\mathrm{mV}, & 22.5 \le |w| < 30 \\ 800\,\mathrm{mV}, & 30 \le |w| \end{cases} \tag{5-2}$$
$$VOS = \begin{cases} 400\,\mathrm{mV}, & 0 \le |w| < 10 \\ 500\,\mathrm{mV}, & 10 \le |w| < 20 \\ 650\,\mathrm{mV}, & 20 \le |w| < 30 \\ 750\,\mathrm{mV}, & 30 \le |w| < 40 \\ 800\,\mathrm{mV}, & 40 \le |w| \end{cases} \tag{5-3}$$
[Figure 5.1 diagrams: in both panels, the convolution buffer (SRAM bank) and the convolution sequence controller (CSC) deliver weights (W) and features (F) in groups of 128 (from W1-W128 and F1-F128 up to W896-W1024 and F896-F1024) to eight rows of multipliers (MUL) arranged along the channel and kernel directions; each row feeds an adder whose partial sum (PS) goes to the convolution accumulator. Five supply levels, VOS 1 to VOS 5 (400, 500, 650, 750, and 800 mV), feed the power switch boxes (PSB). The first panel shows one PSB per multiplier; the second panel shows one PSB and one VOS voltage level per row. Legend: W: Weight, F: Feature, PSB: Power Switch Box, MUL: Multiplier, PS: Partial Sum]
Obviously, depending on the range of the weights and the number of occurrences, one may modify the intervals in (5-1)-(5-3). In the case of the CG multipliers in the MAC array, first, the average of the absolute values of the 1024 weights loaded from the SRAM for the 1024 multipliers in the MAC array is determined. Based on this average value, the VOS level for the multipliers in the MAC array is determined from (5-1)-(5-3). The use of lower voltages as the operating voltage will reduce the energy consumption of the multiplication operation and improve the lifetime while inducing some error owing to maintaining the clock frequency fixed.
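The weight-to-voltage rules in (5-1)-(5-3) share one pattern: the weight magnitude is divided into equal intervals of width 5, 7.5, or 10, and each interval maps to the next higher supply level until 800mV is reached. The following sketch models the FG and CG assignments in software under that reading; the function names are illustrative, and the selection logic of the actual hardware is not described at this level in the text.

```python
from statistics import fmean

# Supply levels in mV, from the most aggressive VOS level to nominal.
VOS_LEVELS_MV = (400, 500, 650, 750, 800)


def vos_level_mv(weight_magnitude, interval):
    """Map a weight magnitude to a supply level as in (5-1)-(5-3).

    `interval` is the width of each magnitude range (5, 7.5, or 10).
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    index = int(abs(weight_magnitude) // interval)
    return VOS_LEVELS_MV[min(index, len(VOS_LEVELS_MV) - 1)]


def fine_grain_levels(weights, interval):
    """FG scheme: every multiplier gets a level from its own weight."""
    return [vos_level_mv(w, interval) for w in weights]


def coarse_grain_level(weights, interval):
    """CG scheme: one level from the mean absolute weight of the array."""
    return vos_level_mv(fmean([abs(w) for w in weights]), interval)


example_weights = [3, -7, 12, 18, -25]
print(fine_grain_levels(example_weights, interval=5))
print(coarse_grain_level(example_weights, interval=5))
```

For the example weights, the mean magnitude is 13, so the CG scheme selects 650mV for the whole array, whereas the FG scheme assigns each multiplier its own level.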
In both design schemes, when the weights are updated, the VOS level should be changed accordingly. For switching between different voltage levels, PMOS switches, whose timing and power overheads are negligible, may be utilized [123].
5.1.2 Approximate Convolution Memory Buffer (AxM)
The energy consumption of the convolution buffer in the accelerator may also be reduced by applying the VOS technique to the SRAM banks. On the other hand, applying a lower voltage to the SRAM cells makes the cells susceptible to hold, read, and write failures, degrading the accuracy. To limit the accuracy degradation, the VOS technique should be applied to the SRAM cells storing bits with the lower significance (similar to the X-Dadda multiplier) [82].
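The reason for restricting VOS to the low-significance cells can be seen from a simple bound: if only the w least significant bits of a stored word can fail, the stored value can change by at most 2^w - 1. The sketch below illustrates this with a bit-flip failure model. The failure model itself (independent flips confined to the overscaled cells) and all names are assumptions for illustration; the text does not specify the failure statistics.

```python
DATA_BITS = 8  # width of the stored data words, as in the text


def approximate_bit_mask(w_axm: int) -> int:
    """Bit mask covering the w_axm least significant (overscaled) bits."""
    if not 0 <= w_axm <= DATA_BITS:
        raise ValueError("w_axm must lie between 0 and DATA_BITS")
    return (1 << w_axm) - 1


def read_with_failures(stored: int, w_axm: int, flipped_bits: int) -> int:
    """Return the word read back when some overscaled cells fail.

    Assumption: failures appear as bit flips and only hit the w_axm
    least significant cells, so flips outside that region are ignored.
    """
    return stored ^ (flipped_bits & approximate_bit_mask(w_axm))


def worst_case_error(w_axm: int) -> int:
    """Largest possible change of the stored value for a given w_axm."""
    return approximate_bit_mask(w_axm)


word = 0b10110100
corrupted = read_with_failures(word, w_axm=3, flipped_bits=0b11111111)
print(bin(corrupted), abs(corrupted - word), worst_case_error(3))
```

With three approximated bits, even a failure of every overscaled cell changes the value by at most 7 out of 255, whereas a single failure in the most significant cell would change it by 128.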
Let us call this SRAM X-SRAM in the rest of the paper. In the X-SRAM, again, there are two parameters which need to be determined. The first one is the number of bits in the 8-bit data to be approximated, which is called W_AxM. The second parameter is the VOS level which should be applied to these bits, denoted as V_DD,AxM. An example of the proposed X-SRAM structure is shown in Figure 5.2, where the approximate region may take any of the four considered VOS levels. Obviously, the lower the applied voltage is, the lower the energy consumption will be. In addition, using lower supply voltages means lower electric fields in the devices, which in turn leads to weaker phenomena responsible for aging. However, applying a lower voltage to an SRAM cell makes it susceptible to possible undesired read, write, and hold value changes.
Figure 5.2. The schematic of the X-SRAM design (each box is a cell).
5.1.3 Simultaneous Use of Approximate Multiplier and Buffer
Concurrent use of AxC and AxM should reduce the energy consumption and the strength of
the aging phenomena further, while the accuracy degradation is expected to be higher as well. While separate
applications of the VOS approximation technique to deep learning accelerators have been proposed and
investigated, to the best of our knowledge, their simultaneous application has not been suggested in the
literature. For the simultaneous application of VOS to both the multiplier and the buffer, four parameters need to
be determined such that the accuracy of the neural network does not fall below an acceptable
value. The parameters are $W_{AxC}$, $W_{AxM}$, $V_{DD,AxC}$, and $V_{DD,AxM}$.
[Figure 5.2 schematic, labels recovered from the drawing: a 256 × 8 array of SRAM cells (SRAM1x1 to SRAM256x8); a power switch box with select inputs AxC 1 to AxC 4 and supply rails of 400, 500, 650, 750, and 800 mV; per-column cell supplies chosen between V_DD,Exact and V_DD,AxM; columns grouped into an exact region and an approximate region.]
5.2 Simulation Setup
The details of the simulation setup employed to assess the efficacy of the X-NVDLA are provided in the
following subsections. Note that in this work, as mentioned before, four VOS levels of 750 mV, 650 mV,
500 mV, and 400 mV have been considered for the MAC array, while for the buffer (X-SRAM), VOS levels of
525 mV, 510 mV, 500 mV, 490 mV, 480 mV, and 475 mV were applied.
5.2.1 Simulation Platform and Considered Neural Networks
All the simulations were performed using the NVDLA virtual platform, which provides a register-accurate
system. The platform is built on GreenSocs QBOX, a solution for co-simulation with QEMU
and System-C [87]. The system has the advantage of quick software development and debugging. A QEMU
emulator of the ARMv8 'virt' SoC board is included to provide high-speed emulation of the CPU and generic
devices [87]. Thus, for evaluating the X-NVDLA, the impact of the VOS technique on the precision of the
multiplication results and of the data stored in the buffers was modeled and implemented in the System-C
model of the NVDLA, whose details are provided in the next subsections.
In this work, we considered two neural networks: LeNet-5 as a small network and ResNet-50 as a very
large network [135][136]. The input datasets included images from the MNIST dataset for LeNet-5 and
ImageNet for ResNet-50 [137][138]. It should be noted that many of the common CNN architectures
such as AlexNet, VGG16, VGG19, and Inception cannot be compiled for the Int8 precision by the NVDLA
compiler. While this limited the NN models that could be used in the evaluation process here, similar
characteristics should be observed when these neural networks are executed on X-NVDLA. The time
required for the inference of an image for each NN architecture is given in Table 5.1.
Since the simulation of ResNet-50 (LeNet-5) is computationally very expensive, we considered 200 random
images from ImageNet (MNIST) in obtaining the results (similar to [140]-[141]). The simulations were
performed using the computing nodes of the Center for Advanced Research Computing [139].
Table 5.1. THE TIME NEEDED FOR EACH IMAGE INFERENCE ON THE NN MODEL
NN model      Time / CPU
LeNet-5       3 mins
ResNet-50     10 hours
5.2.1.1 Quantization of the Neural Network Models
All the Caffe models are in floating-point precision. To convert these models to Int8 precision,
the TensorRT (TRT) tool was used to analyze the dynamic range of the tensors of each layer and calculate
the scale factors [142]. The TRT tool creates a calibration table that contains the scaling factors for each
layer [134]. The table should be passed to the NVDLA compiler together with the Caffe model of the neural
network to create an Int8-precision loadable file.
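To make the role of a per-layer scale factor concrete, the following Python sketch quantizes a tensor with a symmetric max-abs rule. It is only an illustration under stated assumptions: the function names are hypothetical, and the max-abs rule is a simple stand-in, not the calibration algorithm that TensorRT actually runs nor the file format the NVDLA compiler expects.

```python
def scale_from_dynamic_range(values, qmax=127):
    """Per-tensor scale such that the largest magnitude maps to qmax (assumed rule)."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_int8(values, scale, qmax=127):
    """Round to the nearest step and clamp to the symmetric Int8 range [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Recover approximate floating-point values from the Int8 codes."""
    return [q * scale for q in q_values]
```

For a layer whose values span [-1.0, 2.54], the scale is 0.02 per step, so 0.5 is stored as the Int8 code 25.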
5.2.2 Modeling the Approximate Multiplier in the MAC Array
To evaluate the performance of the approximate multiplier in the MAC array, the approximation error
should be added to the System-C code of the MAC array. To incorporate the impact of the approximation
in the result, we replace the exact result value with the approximate one. This is done by storing all
the input combinations and the corresponding outputs of the X-Dadda multiplier for the four considered VOS
levels. These values are stored as lookup tables, where each VOS level has its own lookup table. Based on
the approximation level chosen for each multiplication, the imprecise result variable is loaded with the
corresponding value from the lookup table.
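The lookup-table substitution can be sketched as follows. This is a hedged Python illustration rather than the System-C code: the X-Dadda outputs under VOS are not available here, so `truncating_multiplier` is an invented stand-in error model, the operands are assumed to be signed 8-bit values, and all names are hypothetical.

```python
def build_lut(approx_multiply, bits=8):
    """Tabulate approx_multiply(a, b) for every pair of signed `bits`-bit operands."""
    lo, hi = -(1 << (bits - 1)), 1 << (bits - 1)
    return {(a, b): approx_multiply(a, b)
            for a in range(lo, hi) for b in range(lo, hi)}


def truncating_multiplier(dropped_bits):
    """Stand-in error model (assumption): clear the low-order bits of the product."""
    mask = ~((1 << dropped_bits) - 1)

    def multiply(a, b):
        return (a * b) & mask

    return multiply


# One table per VOS level in mV; the error severity per level is made up.
LUTS = {
    750: build_lut(truncating_multiplier(1)),
    650: build_lut(truncating_multiplier(2)),
    500: build_lut(truncating_multiplier(3)),
    400: build_lut(truncating_multiplier(4)),
}


def mac_multiply(a, b, vos_mv):
    """Return the exact product at 800 mV, otherwise the tabulated approximate one."""
    if vos_mv == 800:
        return a * b
    return LUTS[vos_mv][(a, b)]
```

In a real flow, each table would be filled from circuit-level simulation results of the multiplier at the corresponding voltage instead of a closed-form error model.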
To calculate the energy saving, a feature for counting the number of multiplications at each VOS level
was implemented in the System-C model of the NVDLA MAC unit. With these counts, the work
of [132] was invoked to calculate the energy saving of each VOS level. To obtain the energy reduction
($\Delta E_{MAC\_Array}$), the following expression has been used:
\[
\Delta E_{MAC\_Array}(\%) = \frac{\Delta E_{VOS_1} N_{VOS_1} + \Delta E_{VOS_2} N_{VOS_2} + \Delta E_{VOS_3} N_{VOS_3} + \Delta E_{VOS_4} N_{VOS_4}}{N_{MUL}} \times 100 \tag{5-4}
\]
where $\Delta E_{VOS_i}$ is the energy reduction of the multiplications when the $i$th VOS level is used. Also,
$N_{VOS_i}$ is the number of multiplications with the $i$th VOS level. Finally, $N_{MUL}$ is the total number of
multiplication operations performed by the MAC array. In our work, the values of $\Delta E_{VOS_i}$ are shown in
Figure 5.3.
Figure 5.3. The energy reduction for different 𝑉 𝐷𝐷 ,𝐴𝑥𝐶 and 𝑊 𝐴𝑥𝐶 values in the case of 8-bit X-Dadda multiplier [132].
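Expression (5-4) is a count-weighted average of the per-level energy reductions, which the following Python sketch evaluates. The function name and the numbers in the usage note are hypothetical; in particular, the per-level reductions are placeholders and are not read from Figure 5.3.

```python
def mac_array_energy_reduction(delta_e, counts, n_mul):
    """Evaluate (5-4).

    delta_e: per-level energy reduction as a fraction, keyed by VOS level (mV).
    counts:  number of multiplications executed at each VOS level.
    n_mul:   total number of multiplications, including the exact ones.
    """
    if n_mul <= 0:
        raise ValueError("n_mul must be positive")
    weighted = sum(delta_e[level] * counts.get(level, 0) for level in delta_e)
    return weighted / n_mul * 100.0
```

For example, with assumed reductions of 0.10, 0.25, 0.40, and 0.50 at 750, 650, 500, and 400 mV and counts of 100, 200, 300, and 100 out of 1000 multiplications, the expression evaluates to 23%.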
5.2.3 Modeling the Approximate Convolution Buffer
By reducing the supply voltage of the SRAM cells below the nominal voltage, the cells become
error-prone, where the error rate is technology dependent. In this work, we used the bit error rate (BER) of
the 6T SRAM cells at each VOS level obtained from [143], which is shown in Figure 5.4.
In the NVDLA, both the input activations and the weights are loaded from the SRAM cells. Thus, we
modeled the impact of applying the VOS technique by adding errors (based on the VOS level and the
obtained BER) to the exact activations and weights loaded from the buffer. More specifically, the impact
of the VOS on the imprecision of the $i$th bit of the loaded data ($Data[i]$) is modeled by
\[
Data_{APX}[i] = \left( \frac{(\mathrm{double})\,\mathrm{rand}()}{\mathrm{RAND\_MAX}} < BER \right) \oplus Data[i] \tag{5-5}
\]
where rand() generates a random number and RAND_MAX is the maximum number that rand() can
generate. The range of i is from 0 to 7.
Figure 5.4. The BER vs. VOS level for 6T SRAM cell redrawn from [143].
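The bit-flip model of (5-5) can be sketched in Python as below. This is an illustrative translation, not the System-C code: `random.random()` plays the role of `rand()/RAND_MAX`, the `w_axm` parameter (restricting the flips to the low-order bits held in the approximate region) is an assumption added here, and any BER passed in is hypothetical rather than taken from Figure 5.4.

```python
import random


def read_with_bit_errors(data, ber, w_axm=8, rng=random):
    """Return an 8-bit word whose w_axm low-order bits each flip with probability ber."""
    if not 0 <= w_axm <= 8:
        raise ValueError("w_axm must be between 0 and 8")
    word = data & 0xFF  # treat the value as a raw byte (two's complement for signed data)
    for i in range(w_axm):
        flip = 1 if rng.random() < ber else 0  # the comparison from (5-5)
        word ^= flip << i                      # XOR with the stored bit
    return word
```

Passing a seeded `random.Random` instance as `rng` makes an error-injection run repeatable, which helps when comparing approximation configurations on the same images.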
[Figure 5.3 axes: energy reduction (%), 0 to 60, versus VOS level (750, 650, 500, and 400 mV), with series for 9-bit, 11-bit, and 13-bit approximate widths.]
[Figure 5.4 axes: bit error rate, 1.0E-09 to 1.0E+00 (logarithmic), versus overscaled voltage, 0.40 to 0.60 V.]
Table 5.2. THE CACTI ENERGY CONSUMPTION REPORT FOR A 32KB SRAM ARRAY.
SRAM Parameter             Energy
Read Access Per Bit        6.16 × 10^-2 nJ
Write Access Per Bit       5.92 × 10^-2 nJ
Leakage Power Per Bank     7.97 × 10^-1 mJ
In this study, the CACTI tool was employed to obtain the energy of the SRAM memory [144]. Each bank
in the NVDLA has a capacity of 32KB, whose energy consumption report is presented in Table 5.2.
To calculate the leakage energy of the SRAM array, we need the total time spent during the inference of an
image. To obtain this time, we used the tool described in [145]. It reports the performance guideline of the
NN model running on different open-source deep learning accelerators (openDLA), including the NVDLA
architectures [145]. The performance guideline consists of the MAC utilization, the roofline factor, and the
conservative and aggressive fps (frames per second). This tool was used to obtain the $fps$ of the specific
architecture (which is nv_full) and NN model; $1/fps$ gives the total time needed for inferring an image.
The $fps$ values for LeNet-5 and ResNet-50 are reported in Table 5.3.
As the next step, we modified the System-C model of the NVDLA buffer to obtain the number of writes
to the SRAM from the DRAM (the input image and weights were read from the DRAM and written to the SRAM).
Also, the modified model reports the number of reads from the SRAM banks (input images, weights, and
activations were read from the SRAM). Using these counts together with the figures of Table 5.2, one can
calculate the total energy of the SRAM banks from
\[
E_{SRAM} = 8 E_W \left( N_W^A + N_W^W \right) + 8 E_R \left( N_R^A + N_R^W \right) + 16 P_{Leak} \times \frac{1}{fps} \tag{5-6}
\]
where $E_W$ ($E_R$) is the energy consumption for writing (reading) a bit to (from) the SRAM, and $N_W^A$ ($N_W^W$)
is the number of activation data (weights) written to the SRAM from the DRAM. Also, $N_R^A$ ($N_R^W$) is the
number of activation data (weights) read from the SRAM. Finally, $P_{Leak}$ is the leakage power of a
bank.
Table 5.3. THE FPS FOR LENET-5 AND RESNET-50
NN model      fps
LeNet-5       23640
ResNet-50     56
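The SRAM energy model of (5-6) can be written as a small Python function. This is a sketch under stated assumptions: the write energy is paired with the write counts and the read energy with the read counts, as the symbol definitions state; the factor 16 is kept as a parameter because the text does not say what it counts (presumably the number of banks); and the caller is expected to supply consistent units, since Table 5.2 lists the leakage entry in mJ.

```python
def sram_energy(n_write_act, n_write_wt, n_read_act, n_read_wt,
                e_write, e_read, p_leak, fps,
                bits_per_word=8, leak_multiplier=16):
    """Evaluate (5-6) and return the write, read, leakage, and total energy.

    e_write / e_read are per-bit energies, p_leak is a power, and 1/fps is the
    inference time of one image; all must use mutually consistent units.
    """
    if fps <= 0:
        raise ValueError("fps must be positive")
    write = bits_per_word * e_write * (n_write_act + n_write_wt)
    read = bits_per_word * e_read * (n_read_act + n_read_wt)
    leakage = leak_multiplier * p_leak * (1.0 / fps)
    return {"write": write, "read": read, "leakage": leakage,
            "total": write + read + leakage}
```

Returning the three components separately makes it straightforward to produce an energy breakdown per network in addition to the total.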
5.2.4 Lifetime Improvement Modeling
In this section, the lifetime modeling for the MAC array and the SRAM banks, based on (2-3) and (2-4), is
presented and discussed.
5.2.4.1 Lifetime Improvement of the MAC Array
To demonstrate the lifetime improvement for the MAC array, as a specific case, we considered the CG
design mode for the MAC array. The same study may be performed for the MG and FG modes.
In this mode, all the multipliers of the MAC array are assigned to a specific VOS level for a specific number
of cycles based on the average of the 1024 weights loaded from the SRAM banks of the convolution core.
Hence, the supply voltage of the multiplier units is not always 800 mV, which reduces the stress (the
electric field) on the devices in the multiplier units. To determine the lifetime improvement, the number of
cycles that the multipliers operated at each VOS level was determined. Then, by using (2-3), the
delay degradation after 10 years was calculated based on the statistics of the VOS levels using the method
described in [132].
5.2.4.2 Lifetime Improvement of the SRAM Banks
Based on the AxM technique used for the SRAM banks, the SRAM columns in the approximate region
may operate with a voltage lower than the nominal voltage (800 mV). This reduces the stress on the cells of
these columns, while the stress on the columns that always operate at the nominal supply voltage is
unchanged. To make the aging rate of the columns uniform, an auxiliary circuit can be used to switch the
bit significance of the columns of the SRAM banks. In this rotation scenario, the worst aging belongs to the
columns that spend half of their lifetime at the nominal voltage (800 mV) and the other half at one of the VOS
levels. To assess the lifetime improvement, one may utilize (3-10) [132]. Note that due to the higher impact
of the NBTI in this technology compared to the positive BTI (PBTI), we only considered the NBTI effect.
5.3 Results and Discussion
In this section, the efficiency of the proposed energy-efficient and BTI-resilient NVDLA is investigated.
First, the characteristics of the energy reduction versus the accuracy degradation for the three
scenarios of the approximate multiplication, approximate buffer, and their combination are presented. Then,
the results of suppressing the aging effect for the NVDLA accelerator are provided.
5.3.1 Results for Energy-Accuracy Characteristic
In this subsection, the results of the energy reduction versus the accuracy degradation in the cases of
multiplier approximation alone in the MAC array, memory buffer approximation alone, and simultaneous use of
both approximate multipliers and memories are presented and discussed for the LeNet-5 and ResNet-50
networks.
5.3.1.1 AxC
There are three choices affecting the energy reduction of the MAC array versus the accuracy degradation in AxC.
They are the VOS granularity mode, the width of the approximate region of the X-Dadda multiplier
($W_{AxC}$), and the weight interval (5, 7.5, and 10). For each interval, there is a predefined $V_{DD,AxC}$, which has
been defined in (5-1)-(5-3) and has been considered for the AxC. The weight interval of 10 ((5-3)) was
not considered for LeNet-5: since its weights were smaller than 40, using (5-3) would mean that no
multiplier works at the nominal voltage (exact mode). In the case of ResNet-50, for the FG mode, the weight
intervals of 5, 7.5, and 10 ((5-1)-(5-3)) were considered, while for the CG approach, due to the large accuracy
loss in the case of the interval of 10, only 5 and 7.5 were considered. The utilization numbers of the different
VOS levels for both LeNet-5 and ResNet-50 are shown in Figure 5.5. Different combinations of these choices
lead to different energy-accuracy pair values, which are given in Figure 5.6.
Each point in the graph is denoted by a tuple of ($W_{AxC}$, FG/CG, weight interval). The results reveal that,
in general, in the case of LeNet-5, the CG mode yielded a higher energy reduction at the cost of more accuracy
degradation. On the other hand, the FG approach provides a smaller energy reduction while its accuracy
reduction is lower. More specifically, in FG, fewer multipliers could take lower operating
voltages, resulting in lower energy savings and accuracy losses. Increasing the number of
approximate bits also caused more energy reduction and accuracy degradation. On the contrary, in the case
of ResNet-50, the FG mode led to a higher energy reduction with a lower accuracy reduction compared to
those of the CG mode. The higher energy saving in the case of FG may originate from the fact that the
weights of ResNet-50 are distributed more broadly than those of LeNet-5, providing more opportunity
to use lower VOS levels. Using lower VOS levels has not led to a higher accuracy loss. This could be
attributed to the lower sensitivity of the accuracy to weight variations in ResNet-50.
The study shows that in the case of LeNet-5, the tuple (13, CG, 7.5) provides about 37% MAC array
energy reduction at the price of only 1.5% accuracy reduction. In the case of ResNet-50, the tuple (13, FG,
10) provided the highest energy reduction, which was about 35% with a 5% (3%) reduction of the Top-1 (Top-5)
accuracy, while the tuple (13, FG, 7.5) provided 30% energy reduction with only 1% Top-1 (0.5% Top-5)
accuracy degradation. Finally, as the results show, the Top-5 accuracy reduction is considerably lower
than the Top-1 reduction for the same approximation configuration.
Figure 5.5. Utilization numbers of different VOS levels for a) two modes of FG (weight interval of 7.5) and CG (weight interval
of 7.5) in the case of LeNet-5 and b) two modes of FG (weight interval of 10) and CG (weight interval of 7.5) in the case of
ResNet-50.
[Figure 5.5 axes: number of occurrences (logarithmic in panel a) versus VOS level (800, 750, 650, 500, and 400 mV), with series FG and CG.]
Figure 5.6. The energy reduction of the MAC array versus accuracy reduction of AxC for a) LeNet-5 b) ResNet-50.
[Figure 5.6(a) axes: accuracy reduction (%) versus MAC array energy reduction (%) for the LeNet-5 tuples (9 | 11 | 13, FG | CG, 5 | 7.5).]
[Figure 5.6(b) axes: Top-1 and Top-5 accuracy reduction (%) versus MAC array energy reduction (%) for the ResNet-50 tuples (9 | 11 | 13, FG, 5 | 7.5 | 10) and (9 | 11 | 13, CG, 7.5).]
5.3.1.2 AxM
Here, the results of applying the VOS technique to the memory buffer in the NVDLA are presented. Different
VOS levels and numbers of approximate bits were considered as approximation knobs. The energy breakdown
of the SRAM banks for both NNs is shown in Figure 5.7. The number of reads from the SRAM is higher than
the number of writes to the SRAM. In addition, the read energy per bit is higher than the write energy per bit.
Thus, the SRAM read energy has the highest contribution to the total SRAM energy consumption
in both NNs. In addition, the higher share of the leakage energy in the case of ResNet-50 is attributed to
its longer inference time compared to LeNet-5. Moreover, due to the higher number of layers
(weights) in ResNet-50, the write energy contribution in the case of ResNet-50 is higher than in the
case of LeNet-5.
The results for the convolution buffer energy reduction versus accuracy degradation for different tuples
of ($W_{AxM}$, $V_{DD,AxM}$ (mV)) are plotted in Figure 5.8. In the case of the LeNet-5 NN, due to the almost zero
accuracy loss at the VOS level of 510 mV, we only applied the 500 mV and 475 mV voltages in the
study. On the other hand, the VOS level of 475 mV was not considered for the ResNet-50 NN owing
to its large accuracy loss. For LeNet-5, the tuple (7, 475) provides the maximum energy reduction at the
cost of a 3% accuracy reduction. On the other hand, the tuple (5, 500) results in a 35% energy reduction (15%
less energy reduction compared to the maximum case) with almost no accuracy degradation. Allowing
a 1% accuracy loss leads to an energy reduction of 47% (tuple (7, 500)). In the case of ResNet-50, both tuples
of (7, 500 mV) and (7, 510 mV) provide about 50% energy reduction, which is the maximum energy
reduction, while the Top-1 accuracy reduction for both configurations is about 2.5%.
Figure 5.7. The energy breakdown of SRAM banks for LeNet-5 and ResNet-50.
[Figure 5.7 chart labels, in extraction order: 80%, 11%, 9%, 46%, 23%, 31%; legend: read energy, write energy, leakage energy; charts: LeNet-5, ResNet-50.]
Figure 5.8. The convolution buffer energy reduction versus accuracy reduction of AxM for a) LeNet-5 b) ResNet-50.
On the other hand, considering the Top-5 accuracy of these configurations, the tuple (7, 500 mV) has the
advantage of no Top-5 accuracy degradation. Increasing the VOS level by 15 mV decreases the convolution
buffer energy reduction to about 45%, while the Top-1 (Top-5) accuracy reduction becomes about 1%
(0.5%).
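Choosing a configuration under an accuracy budget amounts to a simple constrained selection, sketched below in Python. The table holds only the three LeNet-5 operating points quoted in the text, with approximate values (the accuracy loss of (5, 500) is taken as 0 although the text says "almost" none), and the function name is hypothetical.

```python
# (W_AxM, V_DD_AxM in mV) -> (energy reduction %, accuracy reduction %), LeNet-5,
# approximate figures as quoted in the discussion above.
LENET5_AXM_POINTS = {
    (7, 475): (50.0, 3.0),
    (7, 500): (47.0, 1.0),
    (5, 500): (35.0, 0.0),
}


def best_config(points, max_accuracy_loss):
    """Return the configuration with the largest energy reduction within the budget."""
    feasible = {cfg: vals for cfg, vals in points.items()
                if vals[1] <= max_accuracy_loss}
    if not feasible:
        return None  # no configuration satisfies the accuracy constraint
    return max(feasible, key=lambda cfg: feasible[cfg][0])
```

With a 1% budget the selection returns (7, 500); relaxing the budget to 3% returns (7, 475).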
[Figure 5.8 axes: accuracy reduction (%) versus convolution buffer energy reduction (%). Panel (a), LeNet-5: tuples (3 | 5 | 7, 500) and (1 | 3 | 5 | 7, 475). Panel (b), ResNet-50: Top-1 and Top-5 series for the tuples (7, 525), (3 | 5 | 7, 510), and (3 | 5 | 7, 500).]
5.3.1.3 AxC and AxM Together
The results of employing the VOS technique for both the multipliers and the memory of the NVDLA at the same
time are provided here. For this part, the energy refers to the energy consumption of the multipliers and the
SRAM banks of the convolution buffer.
First, the energy consumption contributions of the MAC array and the SRAM buffer for both LeNet-5
and ResNet-50 are depicted in Figure 5.9; they follow the same characteristics as the results reported
in [94][133]. In the study, to reduce the energy consumption with minimum accuracy degradation,
the FG design mode was considered. We considered four VOS levels assigned to each multiplier based on
the values of the weights.
Also, three values of 9, 11, and 13 were taken as the number of approximate bits. For the number of
approximate bits of the memory in the LeNet-5 (ResNet-50) implementation, we chose 7 bits (5 bits) out
of 8 bits for both the weight and the activation data stored in the SRAMs. In the case of LeNet-5, different VOS
levels for the approximated bits of the weights and activations in the SRAMs were explored. The results
are presented in Figure 5.10 (a), where each point is denoted by the tuple of ($W_{AxC}$, $V_{DD,AxM}$). The reason for
considering this tuple was the fact that in the case of LeNet-5, the maximum weight interval that could be
considered was 7.5. It should be mentioned that the accuracy degradation (and energy reduction) was
negligible in the case of the weight interval of 5, and hence this interval was not considered in the study. Also,
the accuracy reduction resulting from the weight interval of 7.5 and the FG mode was very small.
Figure 5.9. The energy consumption contributions of the MAC array and the SRAM buffer in the total energy consumption of the
convolution core for both LeNet-5 and ResNet-50.
[Figure 5.9 chart labels, in extraction order: 19%, 81%, 10%, 90%; legend: MAC energy, SRAM energy; charts: LeNet-5, ResNet-50.]
Figure 5.10. The energy reduction versus accuracy degradation when both AxC and AxM were used for a) LeNet-5 and b)
ResNet-50.
[Figure 5.10 axes: accuracy reduction (%) versus energy reduction (%). Panel (a), LeNet-5: tuples (9 | 11 | 13, 510 | 500 | 490 | 480 | 475). Panel (b), ResNet-50: Top-1 and Top-5 series for the tuples (9 | 11 | 13, 5 | 7.5 | 10).]
Considering this weight interval, the accuracy was not sensitive to $W_{AxM}$, and hence, this width was not
included in the exploration. As expected, when $V_{DD,AxM}$ decreases, the energy saving improves at the
cost of an accuracy decrease. The maximum energy reduction (41.5%) with the lowest accuracy reduction
belongs to the tuple (13, 500). If a 2% accuracy reduction is tolerable, the tuple (13, 475) can provide about
2.5% more energy reduction. Increasing $V_{DD,AxM}$ by only 5 mV improves the accuracy while giving almost
the same energy reduction. To show the effect of $V_{DD,AxM}$ on the accuracy of the neural networks, these
small resolution steps have been considered for $V_{DD,AxM}$. For a practical implementation of the AxM,
the resolution steps for $V_{DD,AxM}$ may be larger. The results for the considered approximation configurations
in the case of ResNet-50 are shown by the tuple of ($W_{AxC}$, weight interval of 5, 7.5, or 10) in Figure 5.10
(b). In the case of ResNet-50, due to the large accuracy loss for $V_{DD,AxM}$ < 510 mV, we only
chose $V_{DD,AxM}$ = 510 mV. Since in the case of ResNet-50 the weights could be greater than 40, weight
intervals greater than 7.5 were also utilized. Also, the accuracy degradation in the case of the FG mode for
different weight intervals was not negligible. Based on this fact, we considered three weight intervals of
5, 7.5, and 10.
The variation of the energy reduction across the considered approximation configurations is small
owing to the low share of the MAC array in the energy consumption (19% of the total energy
consumption of the convolution core). For ResNet-50, the tuple (9, 5) has ~33% energy reduction while
having only 0.5% accuracy degradation for Top-1 (no accuracy degradation for Top-5). The tuple (13, 10)
has the highest energy reduction (about 36%) while having the second highest accuracy reduction. By
reducing the weight interval from 10 to 5 for the same approximation configuration, the Top-1 and Top-5
accuracies improve by 1.5% while the energy reduction decreases by 1.5% (the tuple (13, 5)). Also, for the
considered configurations, the Top-5 accuracy fluctuation range is (-1%, 1%), which is much smaller than
the (0%, 5%) fluctuation range for the Top-1 accuracy. The variation in the energy reduction for different
tuples, which were obtained by changing the approximation configuration, was small owing to the low share
of the MAC array in the energy consumption (10.5% of the total energy consumption of the convolution
core). The energy reduction range for LeNet-5 was larger than that of ResNet-50 due to the larger
contribution of the MAC array energy consumption to the total energy consumption.
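The effect of the MAC array's energy share can be illustrated with a share-weighted sum, sketched in Python below. The sketch assumes that the total energy consists only of the MAC array and the SRAM buffer, as stated for this part of the study; the function name is hypothetical, and the per-component reductions used in the usage note are example values, not reported results.

```python
def combined_energy_reduction(mac_share, mac_reduction_pct, sram_reduction_pct):
    """Total reduction (%) when the MAC array holds `mac_share` of the baseline energy."""
    if not 0.0 <= mac_share <= 1.0:
        raise ValueError("mac_share must be between 0 and 1")
    return mac_share * mac_reduction_pct + (1.0 - mac_share) * sram_reduction_pct
```

With a MAC share of 0.19, raising the MAC reduction by 10 percentage points moves the total by only 1.9 points, and with a share of 0.105 by about 1.05 points, which is consistent with the narrow energy ranges discussed above.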
5.3.2 Results for Lifetime Improvement
The reductions in the delay increase rate of the MAC array for LeNet-5 and ResNet-50 are drawn in Figure
5.11. Since the weights of ResNet-50 were larger than the weights of LeNet-5, the number of cycles that
the MAC array operated at the smaller VOS levels was higher for LeNet-5. This made the
delay degradation rate for LeNet-5 lower than that of ResNet-50. Furthermore, as $W_{AxC}$ increased for
both considered NNs, the delay degradation improvement increased. Also, with the same $W_{AxC}$, as the
weight interval increased, the delay degradation improvement increased. For the SRAM banks with
$V_{DD,AxM}$ values of 510 mV and 475 mV, the aging rate was reduced by about 11%. It should be noted that the
aging rate reduction depends on the value of $W_{AxM}$. This is explained by the fact that $W_{AxM}$ is always
smaller than 8, and hence, there is always at least one column that spends a given amount of time at
the nominal voltage (800 mV) and the rest of the time at a VOS level. The results provided here are for
$W_{AxM}$ = 4, where the time spent at the nominal voltage is 50%.
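The 50% figure can be reproduced with a one-line calculation, sketched here in Python under the assumption that the rotation is uniform, so that every column spends an equal amount of time in each of the 8 bit positions. Both the uniform-rotation assumption and the function name are illustrative and are not taken from the aging model of [132].

```python
def nominal_time_fraction(w_axm, word_bits=8):
    """Fraction of time a column spends at the nominal voltage under uniform rotation."""
    if not 0 <= w_axm < word_bits:
        raise ValueError("w_axm must satisfy 0 <= w_axm < word_bits")
    return (word_bits - w_axm) / word_bits
```

For an approximate width of 4 the fraction is 0.5, and it shrinks to 0.125 for a width of 7, which is why the aging rate reduction depends on the chosen width.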
5.3.3 Comparing X-NVDLA with Some Prior Related Works
Finally, important parameters of the previous works reviewed in this section are compared with those of
X-NVDLA in Table 5.4. The parameters used for comparison are as follows:
AGR: approximation granularity, AxC: use of approximate multiplier/logic, AxM: use of approximate
memory, AO: area overhead, AC: accuracy reconfigurability, RAC: runtime accuracy reconfigurability,
EC: use of error correction unit, RTNN: retraining neural network for incorporating AxC and AxM errors,
FS: frequency scaling, PRC: precision of weights and activations, and LNN: study of large neural networks.
Also, fixed-point and floating-point are referred to as FXP and FP, respectively.
Figure 5.11. Delay degradation improvement of the MAC array after 10 years.
[Figure 5.11 axes: delay degradation improvement (%), 0 to 7, versus ($N_{AxC,Bits}$, weight interval) for the tuples (9, 5), (9, 7.5), (11, 5), (11, 7.5), (13, 5), and (13, 7.5), with series LeNet-5 and ResNet-50.]
Table 5.4. Comparison of different parameters of X-NVDLA with prior related works
Ref\Par AGR AxC AxM AO AC RAC EC RTNN FS PRC LNN
[103] FG ✓ small ✓ ✓ ✓ Int8 ✓
[102] CG ✓ high Int8 ✓
[88] CG ✓ medium ✓ Int8
[94] CG ✓ small Int16
[89] CG ✓ small ✓ Int8 ✓
[90] CG ✓ none ✓ FP
[95] CG ✓ none FP?
[100] CG ✓ large ✓ Bin
[91] MG ✓ small ✓ ✓ Int8 ✓
[96] CG ✓ ✓ none ✓ ✓ FP
[93] CG ✓ medium ✓ ✓ Int8
[101] CG ✓ ✓ none ✓ Int8 ✓
[97] CG ✓ ✓ small ✓ ✓ ✓ FP
[98] CG ✓ none ✓ FP
[99] CG ✓ none FXP4/8/16 and FP
[92] CG ✓ small ✓ ✓ ✓ Int8
X-NVDLA FG/CG ✓ ✓ small ✓ ✓ Int8 ✓
It should be mentioned that runtime accuracy reconfigurability differs from accuracy
reconfigurability in the ability to change the accelerator accuracy on the fly without changing
any hardware unit in use.
5.4 Conclusion
In this chapter, to lower the energy consumption of deep learning accelerators, we applied the voltage
overscaling technique to the MAC array and the memory unit (on-chip buffers) of this computing fabric. The use
of the VOS technique provided runtime accuracy reconfigurability as well as lifetime improvement.
Having applied the structure to NVDLA, we called it X-NVDLA.
Concurrently applying VOS to the computation and memory units lowered the energy consumption under
the predefined accuracy constraint compared to the case of applying VOS to only one of them. The energy
versus accuracy characteristic of X-NVDLA was assessed by executing the inference phase of the LeNet-5
and ResNet-50 NNs. The results, which were obtained using a 15-nm FinFET technology, revealed that the
energy consumption of the convolution core of the NVDLA decreased by about 35% with a 3% reduction
in the Top-1 accuracy. Also, the lifetime of the NVDLA can be improved by up to 7%.
CHAPTER 6. CONCLUSION
In this chapter, first a summary of the research performed in this dissertation is presented. Then, some
suggestions for future work are given.
6.1 Dissertation Conclusion
In this dissertation, we investigated the use of the voltage overscaling (VOS) technique to lower the energy
consumption of digital computing systems. Also, the impact of the technique on improving the lifetime of
the circuits was studied. First, the VOS technique was applied to the functional units of the processing elements
of coarse-grained reconfigurable architecture (CGRA) fabrics. CGRAs are used as accelerators for
low-power, error-tolerant applications, where using VOS reduces the (strongly voltage-dependent) wearout
effects and the energy consumption of the processing elements (PEs) whenever the error impact on the output
quality degradation can be tolerated. Multiple degrees of computational accuracy were achieved by using
different overscaled voltage levels for the PEs. The use of different overscaled voltage levels for different
operations of the dataflow graph (DFG) of a computational kernel was suggested. The architecture of the CGRA
was equipped with power switch boxes for each row of PEs, providing the overscaled voltage levels required
for each application based on an output quality constraint. Then, the approximation level of each processing
element in the CGRA was determined by the applied overscaled voltage level. By employing the technique,
the architecture was configured for accurate or approximate modes of computation depending on a
user-specified output quality-of-service target for a given application. More precisely, the operating voltages
used for performing the various operations in the application dataflow graph were minimized subject to the
output quality constraint using an energy-quality trade-off algorithm built on an ILP-based mapping algorithm.
The efficacy of the VOS technique in improving the lifetime was studied by considering the bias temperature
instability (BTI). The CGRAs were implemented with a 15nm FinFET technology operating at a nominal
supply voltage of 0.8V. In addition, supply voltages of 0.75, 0.7, 0.65, and 0.6V were considered as
overscaled voltage levels for this technology. Based on the quality constraint requirements of the
benchmarks, optimum overscaled voltage levels for the various PEs were determined and utilized. The
efficiency of the proposed approach was evaluated using two benchmarks of FIR and PoE. The results
indicated considerable improvements in the energy consumption (up to 44%) and lifetime (up to 77%) of
the fabric. At the end, the VOS results were implemented in Verilog to verify the variance error model for
both benchmarks.
To make the hardware implementation of the scheme more efficient, PEs were clustered into groups of
(e.g., 3 × 1 and 2 × 1) voltage islands. To assess the efficacy of the proposed method in improving the
power (energy) consumption and reliability of CGRAs, different combinations of minimum output quality
constraints, voltage levels, and cluster sizes for several benchmarks were studied. The simulation results
indicated considerable reductions in energy consumption (up to 43%) and aging rate (up to 73%) when
compared to the conventional CGRA with perfect output quality (i.e., with no approximate computations).
As the next step, the application of the VOS technique to a block-based carry look-ahead (CLA) adder was
explored and optimized. The technique yielded a low-power accuracy-configurable CLA adder (AC-CLA).
The structure employed the voltage overscaling level and the number of approximate blocks as approximation
knobs to improve the energy consumption as well as the reliability and lifetime of the adder. While the
former knob may be set at design time as well as at runtime, the latter may only be set at
design time. In this adder, for a given accuracy level, some of the blocks worked in the approximate mode
by using overscaled voltages. The block-based structure enabled applying the overscaled voltage for each
block independently. The efficacy of the adder depended on the number of the approximate blocks as well
as the VOS levels used for these blocks. Using lower VOS levels for the blocks responsible for the less
significant bits, which had higher switching activities, was the key to reducing the power consumption of the
adder while keeping the error within a tolerable limit. The structure required few level shifters, making the
realization overhead low. The efficiency of the AC-CLA structure was again studied using a 15nm FinFET
technology. The results of the study indicated that, in the approximate mode, up to 57% energy saving
might be achieved. In addition, for this adder, the BTI induced delay degradation of the adder over 10 years
decreased by up to 7% compared to 50% in the case of the exact operating mode. Finally, the efficacy of
the AC-CLA adder was assessed in a neural network (NN) for a classification application. The study revealed
a small accuracy degradation compared to the case of using an exact adder.
As another approximate arithmetic unit based on VOS, we proposed and evaluated an energy-efficient
accuracy-configurable Dadda (X-Dadda) multiplier. The structure employed voltage overscaling and
approximate width setting as the approximation knobs to improve the energy consumption as well as the
reliability and lifetime of the multiplier. The former could be set at design time as well as at runtime,
while the latter was fixed at design time. For a given accuracy level, the partial product columns and the
overscaled voltages for optimizing the energy were determined. The error was kept within a
tolerable limit by applying the VOS level to the least significant bit (LSB) columns, which had higher
switching activities. The structure made use of a low number of level shifters for a low-overhead
realization. The approximate columns were contiguous, starting from the first column. To further
improve the efficiency of the multiplier, four-bit truncation of the multiplier output was also suggested. The
efficiency of the X-Dadda structure was investigated again using a 15-nm FinFET technology. The results
indicated that, for example, when the approximate mode with the mean relative error distance (MRED) of
0.11 was considered, up to 43% energy saving was achieved. Also, when the width of the approximate part
was half of the width of the multiplier and its operating voltage was half of the nominal voltage, the
multiplier consumed 21% less energy at the cost of only 0.057 MRED. In addition, in this
case, the delay increase due to the BTI effect was only 28% of the delay increase in the multiplier in the
exact operating mode. Also, the impact of process variations on the accuracy of the X-Dadda was studied
where the results did not indicate a considerable deterioration of the error. Finally, we determined the
efficacy of the proposed X-Dadda structure in realizing NNs for the applications of MNIST classification,
image smoothing and sharpening, and Discrete Cosine Transform (DCT).
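For reference, the MRED metric quoted above is commonly computed as the average of |approximate − exact| / exact over the operand pairs. The sketch below evaluates it for output truncation only, the one approximation in this paragraph that has a simple arithmetic model. It does not model VOS timing errors. The 8-bit operand width, the exclusion of zero operands (to avoid division by zero), and the function names are assumptions made for illustration.

```python
def truncated_product(a, b, drop_bits):
    """Exact product with its drop_bits least significant bits forced to zero."""
    return ((a * b) >> drop_bits) << drop_bits


def mred(width, drop_bits):
    """Mean relative error distance over all nonzero unsigned operand pairs."""
    total = 0.0
    count = 0
    for a in range(1, 2 ** width):
        for b in range(1, 2 ** width):
            exact = a * b
            approx = truncated_product(a, b, drop_bits)
            total += abs(approx - exact) / exact
            count += 1
    return total / count


# Four-bit truncation of an (assumed) 8-bit unsigned multiplier output.
print(mred(width=8, drop_bits=4))
```

Because the truncation error of each product is its remainder modulo 2^drop_bits, the MRED computed this way cannot decrease as more output bits are dropped.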
Finally, we investigated a runtime accuracy-reconfigurable implementation of an energy-efficient
deep learning accelerator. It was based on the voltage overscaling technique, which provided dynamic
adjustment of the approximation level as well as improving the lifetime/reliability of the accelerator. The technique
was applied to both the computing and memory units, where the applied voltage was adjusted at runtime
based on the minimum required accuracy. The implementation of the network was performed using
NVDLA, which is an open-source convolutional neural network (CNN) accelerator. The approximation was
applied to both the accelerator multiply-and-accumulate (MAC) array employed for the required network
computations and to the accelerator SRAM memory utilized for storing the inputs (images), weights, and
activation data. To control the accuracy degradation of the approximate accelerator (called X-NVDLA), the
reduced voltage was applied only to the LSBs of the MAC array and SRAM unit. To assess the efficacy of
the proposed energy efficient accelerator, the energy-accuracy characteristics of X-NVDLA when running
LeNet-5 and ResNet-50 networks with 8-bit (integer) precision were investigated. Applying VOS concurrently
to the computation and memory units lowered the energy consumption under the predefined accuracy
constraint when compared to the case of applying VOS to only one of them. The energy versus accuracy
characteristic of X-NVDLA was assessed by executing the inference phase of the LeNet-5 and ResNet-50
NNs. The results, which were obtained using a 15-nm FinFET technology, revealed that the energy consumption
of the convolution core of the NVDLA decreased by about 35% with a 3% reduction in the top-1 accuracy.
In addition, the characteristic of bias temperature instability, one of the lifetime-deteriorating phenomena,
showed improvements of up to 7%.
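The reason for restricting the reduced voltage to the LSBs can be seen in a small fault-injection sketch. It assumes a simple error model that is not taken from this work: each of the low-order bits of an unsigned 8-bit word flips independently with a fixed probability. The function names, the flip probability, and the sample data are hypothetical.

```python
import random


def flip_lsbs(value, num_lsbs, flip_prob, rng):
    """Flip each of the num_lsbs low bits of an 8-bit word with probability flip_prob."""
    for bit in range(num_lsbs):
        if rng.random() < flip_prob:
            value ^= 1 << bit
    return value & 0xFF


def inject(values, num_lsbs, flip_prob, seed=0):
    """Apply the assumed LSB bit-flip model to a list of 8-bit words."""
    rng = random.Random(seed)
    return [flip_lsbs(v, num_lsbs, flip_prob, rng) for v in values]


weights = list(range(0, 256, 5))  # hypothetical 8-bit data words
noisy = inject(weights, num_lsbs=3, flip_prob=0.1)
worst = max(abs(a - b) for a, b in zip(weights, noisy))
print(worst)  # never exceeds 2**3 - 1
```

Under this model the upper bits are untouched, so the magnitude of any single-word error is bounded by 2^num_lsbs − 1 regardless of the flip probability.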
6.2 Future Work
This section presents directions for continuing the work presented in this thesis. In chapter 3,
the idea of using VOS for CGRAs was presented. The idea of Bi-VOS, which was presented in
chapter 4, can be applied to CGRAs to compare the resulting improvement in energy efficiency and output quality
with that of the VOS implementation of the CGRA. Moreover, a complete Verilog implementation of the CGRA,
which uses VOS and Bi-VOS, can be developed using the open-source tool OpenCGRA [148].
In chapter 4, CLA adder and Dadda multiplier were used for exploring the idea of Bi-VOS. Different high-
speed adders and multipliers can be used for applying the idea of Bi-VOS to explore the efficacy of this
technique on other types of adders and multipliers. Moreover, different error parameters for different adders
and multipliers can be investigated to identify the error parameter that best reflects the
output quality degradation in a given application. This will give a clear indication of which adders and
multipliers are suitable for the Bi-VOS technique, i.e., which give the maximum energy efficiency with the
minimum value of the error parameters. Also, the X-SRAM can be implemented, and
the energy and lifetime improvements can be reported for different VOS levels.
In chapter 5, as stated in the thesis, the NVDLA compiler cannot compile several important
neural network models, such as AlexNet and VGG16. One important step is to modify the NVDLA
compiler so that it can compile these networks. To reduce the long runtime
of the inference phase of the neural networks implemented on the NVDLA, the next step is to change the
SystemC code of the NVDLA hardware to provide a zero-latency SystemC simulation. Another
direction is to implement the medium-grain approach for approximating the MAC array of the convolution
core of the NVDLA. Moreover, the SystemC code can be changed to provide the granularity of
approximation at the level of neural network layers. Also, the SystemC code of the hardware and the virtual
platform of NVDLA should be modified to account for the energy consumption of the MAC array and SRAM
banks of the convolution core. This way, the energy consumption of the convolution core can be reported at the
end of each round of virtual platform simulation. Additionally, the errors of the X-Dadda multiplier can
be incorporated in the training phase of the neural networks to reduce the accuracy degradation of the
inference phase. Finally, the adder unit inside the MAC array can be replaced by the AC-CLA adder to
investigate the energy/lifetime improvement opportunities.
REFERENCES
[1] Katherine Compton and Scott Hauck. 2002. Reconfigurable computing: a survey of systems and
software. ACM Comput. Surv. 34, 2 (June 2002), 171-210.
DOI=http://dx.doi.org/10.1145/508352.508353.
[2] K. Keutzer, S. Malik and A. R. Newton, "From ASIC to ASIP: the next design
discontinuity," Proceedings. IEEE International Conference on Computer Design: VLSI in
Computers and Processors, Freiberg, Germany, 2002, pp. 84-90.
doi: 10.1109/ICCD.2002.1106752.
[3] K. Karuri et al., "A Design Flow for Architecture Exploration and Implementation of Partially
Reconfigurable Processors," in IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 16, no. 10, pp. 1281-1294, Oct. 2008.
doi: 10.1109/TVLSI.2008.2002685
[4] X. Tang, E. Giacomin, G. D. Micheli and P. Gaillardon, "FPGA-SPICE: A Simulation-Based
Architecture Evaluation Framework for FPGAs," in IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 27, no. 3, pp. 637-650, March 2019.
[5] M. B. Taylor, "Is dark silicon useful? Harnessing the four horsemen of the coming dark silicon
apocalypse," DAC Design Automation Conference 2012, San Francisco, CA, 2012, pp. 1131-1136.
[6] S. Oh, H. Lee, and J. Lee, “Efficient execution of stream graphs on coarse-grained reconfigurable
architectures,” in IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 36,
no. 12, pp. 1978-1988, Dec. 2017.
[7] http://cccp.eecs.umich.edu/research/cgra.php.
[8] Yoonjin Kim, Rabi N. Mahapatra, Design of Low-Power Coarse-Grained Reconfigurable
Architectures.
[9] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov, Benjamin C. Lee,
Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. 2010. Understanding sources of
inefficiency in general-purpose chips. SIGARCH Comput. Archit. News 38, 3 (June 2010), 37-47.
DOI: https://doi.org/10.1145/1816038.1815968.
[10] J. Koomey, S. Berard, M. Sanchez, H. Wong, “Implications of Historical Trends in the Electrical
Efficiency of Computing” IEEE Annals of the History of Computing, pp. 46-54, March 2011.
[11] M. Alioto, “Designing (Relatively) Reliable Systems with (Highly) Unreliable Components” keynote
at NEWCAS 2016 conference, Vancouver (CA), June 24-26, 2016.
[12] International Technology Roadmap for Semiconductors: 2015 edition
[Online]. http://www.semiconductors.org/main/2015_international_technology_roadmap_for_semiconductors_itrs, 2015.
[13] Robert Colwell (DARPA) at HotChips 2013,
http://www.hotchips.org/wpcontent/uploads/hc_archives/hc25/HC25.15-keynote1-Chipdesign-epub/HC25.26.190-Keynote1-ChipDesignGameColwell-DARPA.pdf.
[14] J. M. Shalf and R. Leland, "Computing beyond Moore's Law," in Computer, vol. 48, no. 12, pp. 14-23,
Dec. 2015.
[15] https://semiengineering.com/transistor-aging-intensifies-10nm/
[16] J. Angermeier, D. Ziener, M. Glaß, and J. Teich, “Stress-aware module placement on reconfigurable
devices,” in Proc. Int. Conf. Field Programmable Logic Appl., 2011, pp. 277–281.
[17] E. A. Stott, J. S. Wong, P. Sedcole, and P. Y. Cheung, “Degradation in FPGAs: Measurement and
modelling,” in Proc. 18th Annu. ACM/SIGDA Int. Symp. Field Programmable Gate Arrays,
2010, pp. 229–238.
[18] J. Gu, S. Yin, L. Liu and S. Wei, "Stress-Aware Loops Mapping on CGRAs with Dynamic Multi-
Map Reconfiguration," in IEEE Transactions on Parallel and Distributed Systems, vol. 29, no. 9,
pp. 2105-2120, 1 Sept. 2018.
[19] F. Nakhaee et al., “Lifetime improvement by exploiting aggressive voltage scaling during runtime
of error-resilient applications,” Integration, The VLSI Journal, 2017.
[20] K. Kang, S. P. Park, K. Roy, and M. A. Alam, “Estimation of statistical variation in temporal NBTI
degradation and its impact on lifetime circuit performance,” in Proc. IEEE/ACM Int. Conf. Comput.-
Aided Des., 2007, pp. 730–734.
[21] D. Lorenz, G. Georgakos, and U. Schlichtmann, “Aging analysis of circuit timing considering NBTI
and HCI,” in Proc. 15th IEEE Int. On-Line Testing Symp., 2009, pp. 3–8.
[22] M. Noda, S. Kajihara, Y. Sato, K. Miyase, X. Wen, and Y. Miura, “On estimation of NBTI-induced
delay degradation,” in Proc. 15th IEEE Eur. Test Symp., 2010, pp. 107–111.
[23] U. Schlichtmann, "Frontiers of timing," 2017 ACM/IEEE International Workshop on System Level
Interconnect Prediction (SLIP), Austin, TX, 2017, pp. 1-4.
[24] K. Usami and M. Horowitz, “Clustered voltage scaling technique for low-power design,” ISLPD,
pp. 3-8, 1995.
[25] N. Rohrer, et al., “A 480MHz RISC microprocessor in a 0.12um Leff CMOS technology with copper
interconnects,” ISSCC, pp. 240-241, 1998.
[26] Kihwan Choi, Ramakrishna Soma, and Massoud Pedram, “Dynamic voltage and frequency scaling
based on workload decomposition,” in Proc. ISLPED, 2004, pp. 174-179. DOI:
[27] Mutoh et al., “1-V power supply high-speed digital circuit technology with multithreshold-voltage
CMOS,” JSSC, vol. SC-30, pp. 847–854, Aug. 1995.
[28] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, “Near-Threshold
Computing: Reclaiming Moore's Law Through Energy Efficient Integrated Circuits,” Proceedings
of the IEEE, vol. 98, no. 2, Feb. 2010.
[29] M. Alioto, "Energy-quality scalable adaptive VLSI circuits and systems beyond approximate
computing," Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017,
Lausanne, 2017, pp. 127-132.
[30] V. K. Chippa et al., “Analysis and characterization of inherent application resilience for approximate
computing,” in Proc of DAC 2013.
[31] Q. Xu, T. Mytkowicz, N. Sung Kim, “Approximate Computing: a survey”, in IEEE Design & Test
journal, vol. 33, issue 1, pp. 8-22, February 2016.
[32] Sparsh Mittal, “A Survey of Techniques for Approximate Computing,” ACM Comput. Surv. 48, 4,
Article 62 (March 2016), 33 pages.
[33] L. Sekanina et al., "Special session: How approximate computing impacts verification, test and
reliability," 2018 IEEE 36th VLSI Test Symposium (VTS), San Francisco, CA, 2018, pp. 1-1.
[34] O. Akbari, M. Kamal, A. Afzali-Kusha and M. Pedram, "Dual-Quality 4:2 Compressors for Utilizing
in Dynamic Accuracy Configurable Multipliers," in IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 25, no. 4, pp. 1352-1361, April 2017.
[35] H. Esmaeilzadeh, A. Sampson, L. Ceze, and D. Burger, “Architecture support for disciplined
approximate programming,” in ASPLOS, London, UK, 2012, pp. 301–312.
[36] K. He, A. Gerstlauer, and M. Orshansky, “Controlled timing-error acceptance for low energy IDCT
design,” in Proc. DATE, pp. 1-6, 2011.
[37] G. Tziantzioulis, A. Gok, et al., “Lazy pipelines: Enhancing quality in approximate computing,” in
Proc. of DATE, 2016.
[38] O. Akbari, M. Kamal, A. Afzali-Kusha, M. Pedram and M. Shafique, “PX-CGRA: Polymorphic
approximate coarse-grained reconfigurable architecture,” in Proc. DATE, Dresden, 2018, pp. 413-
418.
[39] C. Li, D. Sengupta, F. S. Snigdha, W. Xu, J. Hu, and S. S. Sapatnekar, “Special session: a quantifiable
approach to approximate computing,” in Proc. CASES, Seoul, South Korea, 2017, pp. 1–2.
[40] J. Gu, S. Yin, and S. Wei, “Stress-aware loops mapping on CGRAs with considering NBTI aging
effect,” in DAC, Austin, TX, USA, 2017, pp. 1–6.
[41] Y. Kim, R. N. Mahapatra and K. Choi, "Design Space Exploration for Efficient Resource Utilization
in Coarse-Grained Reconfigurable Architecture," in IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 18, no. 10, pp. 1471-1482, Oct. 2010.
[42] Moghaddam M.S., Cho JM., Choi K. (2017) Reconfigurable Architectures. In: Ha S., Teich J. (eds)
Handbook of Hardware/Software Codesign. Springer, Dordrecht.
[43] M. J. P. Walker and J. H. Anderson, “Generic Connectivity-Based CGRA Mapping via Integer
Linear Programming”, arXiv:1901.11129 [cs.AR].
[44] M. Ghasemazar, H. Goudarzi, and M. Pedram, “Robust optimization of a chip multiprocessor's
performance under power and thermal constraints,” in ICCD, Montreal, Canada, 2012, pp. 108–114.
[45] C. Piguet, Low-Power CMOS Circuits: Technology, Logic Design and CAD Tools. Boca Raton, FL,
USA: CRC Press, 2006.
[46] J. W. McPherson, Reliability Physics and Engineering Time-to-Failure Modeling, 2nd Edition ed.:
Springer, 2013.
[47] Kukner, H. et al. Scaling of BTI reliability in presence of time-zero variability. In IRPS (2014).
[48] Goel, N. et al. Impact of time-zero and NBTI variability on sub-20nm FinFET based SRAM at low
voltages. In IRPS (2015).
[49] S. A. Chin and J. H. Anderson, "An Architecture-Agnostic Integer Linear Programming Approach
to CGRA Mapping," 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), San
Francisco, CA, 2018, pp. 1-6.
[50] Z. Zhao, Y. Liu, W. Sheng, T. Krishna, Q. Wang and Z. Mao, "Optimizing the data placement and
transformation for multi-bank CGRA computing system," 2018 Design, Automation & Test in
Europe Conference & Exhibition (DATE), Dresden, 2018, pp. 1087-1092.
[51] S. Yin, J. Gu, D. Liu, L. Liu and S. Wei, "Joint Modulo Scheduling and Vdd Assignment for Loop
Mapping on Dual-Vdd CGRAs," in IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 35, no. 9, pp. 1475-1488, Sept. 2016.
[52] M. Brandalero, A. C. S. Beck, L. Carro and M. Shafique, “Approximate On-The-Fly Coarse-Grained
Reconfigurable Acceleration for General-Purpose Applications,” in Proc. DAC, 2018, pp. 1-6.
[53] H. Afzali-Kusha, O. Akbari, M. Kamal, and M. Pedram, “Energy consumption and lifetime
improvement of coarse-grained reconfigurable architectures targeting low-power error-tolerant
applications,” in GLSVLSI, Chicago, IL, USA, 2018, pp. 431–434.
[54] M. Hamzeh, A. Shrivastava, and S. Vrudhula, “Epimap: using epimorphism to map applications on
CGRAs,” in DAC, San Francisco, CA, USA, 2012, pp. 1284–1291.
[55] V. K. Chippa, D. Mohapatra, K. Roy, S. T. Chakradhar, and A. Raghunathan, “Scalable effort
hardware design,” in IEEE Trans. on VLSI Syst., vol. 22, no. 9, pp. 2004–2016, Sept. 2014.
[56] R. Ragavan, B. Barrois, C. Killian, and O. Sentieys, “Pushing the limits of voltage over-scaling for
error-resilient applications,” in DATE, Lausanne, Switzerland, 2017, pp. 476–481.
[57] D. Mohapatra, V. K. Chippa, A. Raghunathan and K. Roy, “Design of voltage-scalable meta-
functions for approximate computing,” in DATE, Grenoble, France, 2011, pp. 1–6.
[58] D. Ernst et al., “Razor: A low-power pipeline based on circuit-level timing speculation,” in MICRO,
San Diego, CA, USA, 2003, pp. 7–18.
[59] S. Lee, L. K. John and A. Gerstlauer, “High-level synthesis of approximate hardware under joint
precision and voltage scaling,” in Proc of DATE, Lausanne, 2017, pp. 187-192.
[60] S. Xu and B. C. Schafer, “Exposing Approximate Computing Optimizations at Different Levels:
From Behavioral to Gate-Level,” in IEEE TVLSI, vol. PP, no. 99, pp. 1-12.
[61] S. Lee, D. Lee, K. Han, E. Shriver, L. K. John and A. Gerstlauer, “Statistical quality modeling of
approximate hardware,” in Proc of ISQED, 2016.
[62] G. Zervakis, S. Xydis, V. Tsoutsouras, D. Soudris, and K. Pekmestzi, “Multi-Level approximation
for Inexact accelerator synthesis under voltage island constraints,” in Workshop of Approximate
Computing, Pittsburgh, PA, USA, 2016.
[63] M. W. Chen et al., “A dual-edged triggered explicit-pulsed level converting flip-flop with a wide
operation range,” in Proc. SOCC, 2013.
[64] Chaofan Li, Wei Luo, S. S. Sapatnekar and Jiang Hu, “Joint precision optimization and high level
synthesis for approximate computing,” in Proc. DAC, Jun. 2015, pp. 1-6.
[65] P.M. Heysters and G.J.M. Smit, “Mapping of DSP algorithms on the montium architecture”, in
IPDPS, Nice, France, 2003.
[66] W. K. Mak and J. W. Chen, “Voltage island generation under performance requirement for SoC
designs,” in ASPDAC, Yokohama, Japan, 2007, pp. 798–803.
[67] J. M. Lin and Z. X. Hung, “SKB-Tree: a fixed-outline driven representation for modern floorplanning
problems,” IEEE Trans. VLSI Syst., vol. 20, no. 3, pp. 473–484, Mar. 2012.
[68] J. M. Lin and J. H. Wu, “F-FM: fixed-outline floorplanning methodology for mixed-size modules
considering voltage-island constraint,” IEEE Trans. Computer-Aided Design of Integrated Circuits
and Systems, vol. 33, no. 11, pp. 1681–1692, Nov. 2014.
[69] K. Usami et al., “Automated low-power technique exploiting multiple supply voltages applied to a
media processor,” IEEE JSSC, vol. 33, no. 3, pp. 463–472, Mar. 1998.
[70] Q. Chen, J. A. Davis, P. Zarkesh-Ha, and J. D. Meindl, “A compact physical via blockage model,”
IEEE Trans. VLSI Syst., vol. 8, no. 6, pp. 689–692, Dec. 2000.
[71] R. Sarvari, A. Naeemi, P. Zarkesh-Ha, and J. D. Meindl, “Design and optimization for nanoscale
power distribution networks in gigascale systems,” in IITC, Burlingame, CA, USA, 2007, pp. 190–
192.
[72] T. Peyret, G. Corre, M. Thevenin, K. Martin, and P. Coussy, “Efficient application mapping on
CGRAs based on backward simultaneous scheduling / binding and dynamic graph transformations,”
in ASAP, Zurich, Switzerland, 2014, pp. 169–172.
[73] M. Martins, J. M. Matos, R. P. Ribas, A. Reis, G. Schlinker, L. Rech and J. Michelsen, “Open cell
library in 15nm FreePDK technology,” in ISPD, Monterey, CA, USA, 2015.
[74] LLC Gurobi Optimization, “Gurobi Optimizer Reference Manual”, 2018.
[75] M. Balasubramanian, S. Dave, A. Shrivastava, and R. Jeyapaul, “LASER: a hardware/software
approach to accelerate complicated loops on CGRAs,” in DATE, Dresden, Germany, 2018, pp.
1069–1074.
[76] M. Karunaratne, A. K. Mohite, T. Mitra, and L. S. Peh, “HyCUBE: a CGRA with reconfigurable
single-cycle multi-hop interconnect,” in DAC, Austin, TX, USA, 2017, pp. 1–6.
[77] R. Zheng, J. Velamala, V. Reddy, V. Balakrishnan, E. Mintarno, and S. Mitra, “Circuit aging
prediction for low-power operation,” in CICC, Rome, Italy, 2009, pp. 427–430.
[78] J. George, B. Marr, B. E. S. Akgul, and K. V. Palem, “Probabilistic arithmetic and energy efficient
embedded signal processing” in Proc. CASES, 2006, New York, NY, USA, 158-168.
[79] Z. M. Kedem, V. J. Mooney, K. K. Muntimadugu and K. V. Palem, “An approach to energy-error
tradeoffs in approximate ripple carry adders,” in Proc. ISLPED, Fukuoka, 2011, pp. 211-216.
[80] C. H. Lin and I. C. Lin, “High accuracy approximate multiplier with error correction,” in Proc. of
ICCD, Oct. 2013, pp. 33–38.
[81] A. Gupta et al., "Low Power Probabilistic Floating Point Multiplier Design," 2011 IEEE Computer
Society Annual Symposium on VLSI, Chennai, 2011, pp. 182-187.
[82] M. Cho, J. Schlessman, W. Wolf and S. Mukhopadhyay, "Accuracy-aware SRAM: A reconfigurable
low power SRAM architecture for mobile multimedia applications," 2009 Asia and South Pacific
Design Automation Conference, Yokohama, 2009, pp. 823-828.
doi: 10.1109/ASPDAC.2009.4796582
[83] G. Zhou, J. Zhou and H. Lin, “Research on NVIDIA Deep Learning Accelerator,” in Proc.
International Conference on Anti-counterfeiting, Security, and Identification, 2018, pp. 192-195.
[84] http://nvdla.org/hw/v1/ias/unit_description.html
[85] http://nvdla.org/hw/v1/hwarch.html
[86] S. -M. Liu, L. Tang, N. -C. Huang, D. -Y. Tsai, M. -X. Yang and K. -C. Wu, "Fault-Tolerance
Mechanism Analysis on NVDLA-Based Design Using Open Neural Network Compiler and
Quantization Calibrator," 2020 International Symposium on VLSI Design, Automation and Test
(VLSI-DAT), Hsinchu, Taiwan, 2020, pp. 1-3.
[87] http://nvdla.org/vp.html
[88] M. A. Hanif, F. Khalid, and M. Shafique, “CANN: Curable approximations for high-performance
deep neural network accelerators,” in Proc. 56th Annu. Design Autom. Conf., Jun. 2019, pp. 1–6.
[89] D. Shin, W. Choi, J. Park and S. Ghosh, "Sensitivity-Based Error Resilient Techniques With
Heterogeneous Multiply–Accumulate Unit for Voltage Scalable Deep Neural Network
Accelerators," in IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no.
3, pp. 520-531, Sept. 2019, doi: 10.1109/JETCAS.2019.2933862
[90] Vojtech Mrazek, Syed Shakib Sarwar, Lukas Sekanina, Zdenek Vasicek, and Kaushik Roy. “Design
of power-efficient approximate multipliers for approximate artificial neural networks” in Proc of
ICCAD 2016.
[91] R. Elangovan, S. Jain, A. Raghunathan “Ax-BxP: Approximate Blocked Computation for Precision-
Reconfigurable Deep Neural Network Acceleration,” 2020,
arXiv:2011.13000. [Online]. Available: https://arxiv.org/abs/2011.13000
[92] G. Zervakis, H. Amrouch, and J. Henkel, “Design automation of approximate circuits with runtime
reconfigurable accuracy,” IEEE Access, vol. 8, pp. 53522–53538, 2020.
[93] Jeff Zhang, Kartheek Rangineni, Zahra Ghodsi, and Siddharth Garg, “Thundervolt: enabling
aggressive voltage underscaling and timing error resilience for energy efficient deep learning
accelerators,” in Proc. of DAC, New York, NY, USA, 2018.
[94] M. Ha, Y. Byun, S. Moon, Y. Lee and S. Lee, "Layerwise Buffer Voltage Scaling for Energy-
Efficient Convolutional Neural Network," in IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems,
[95] L. Yang, D. Bankman, B. Moons, M. Verhelst and B. Murmann, “Bit Error Tolerance of a CIFAR-
10 Binarized Convolutional Neural Network Processor,” in Proc. ISCAS, Florence, 2018, pp. 1-5.
[96] S. Kim, P. Howe, T. Moreau, A. Alaghi, L. Ceze and V. S. Sathe, "Energy-Efficient Neural Network
Acceleration in the Presence of Bit-Level Memory Errors," in IEEE TCAS I, vol. 65, no. 12, pp.
4285-4298, Dec. 2018.
[97] N. Chandramoorthy et al., “Resilient Low Voltage Accelerators for High Energy Efficiency,” in Proc
of HPCA, Washington, DC, USA, 2019.
[98] L. Yang and B. Murmann, "SRAM voltage scaling for energy-efficient convolutional neural
networks," in Proc. ISQED, Santa Clara, CA, 2017, pp. 7-12.
[99] B. W. Denkinger et al., "Impact of Memory Voltage Scaling on Accuracy and Resilience of Deep
Learning Based Edge Devices," in IEEE Design & Test, vol. 37, no. 2, pp. 84-92, April 2020.
[100] A. D. Mauro, F. Conti, P. D. Schiavone, D. Rossi and L. Benini, "Always-On 674μW@4GOP/s
Error Resilient Binary Neural Networks With Aggressive SRAM Voltage Scaling on a 22-nm IoT
End-Node," in IEEE TCAS I, vol. 67, no. 11, pp. 3905-3918, Nov. 2020.
[101] B. Salami et al., "An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for
Neural Network Acceleration," 2020 50th Annual IEEE/IFIP International Conference on
Dependable Systems and Networks (DSN), Valencia, Spain, 2020, pp. 138-149,
[102] V. Mrazek, Z. Vasicek, L. Sekanina, M. A. Hanif, and M. Shafique, “ALWANN: Automatic layer-
wise approximation of deep neural network accelerators without retraining,” in Proc. IEEE/ACM
Int. Conf. Comput.-Aided Design (ICCAD), Nov. 2019, pp. 1–8.
[103] Z. Tasoulas, G. Zervakis, I. Anagnostopoulos, H. Amrouch and J. Henkel, “Weight-Oriented
Approximation for Energy-Efficient Neural Network Inference Accelerators,” in IEEE TCASI,
2020.
[104] D. Esposito, A. G. M. Strollo, E. Napoli, D. De Caro and N. Petra, “Approximate Multipliers Based
on New Approximate Compressors,” in IEEE TCAS I, vol. 65, no. 12, pp. 4169-4182, Dec. 2018.
[105] M. S. Ansari, H. Jiang, B. F. Cockburn and J. Han, “Low-Power Approximate Multipliers Using
Encoded Partial Products and Approximate Compressors,” in IEEE JETCAS, vol. 8, no. 3, pp. 404-
416, Sept. 2018.
[106] W. Liu, J. Xu, D. Wang, C. Wang, P. Montuschi and F. Lombardi, “Design and Evaluation of
Approximate Logarithmic Multipliers for Low Power Error-Tolerant Applications,” in IEEE
Transactions on Circuits and Systems I: Regular Papers, vol. 65, no. 9, pp. 2856-2868, Sept. 2018.
[107] W. Liu, L. Qian, C. Wang, H. Jiang, J. Han and F. Lombardi, “Design of Approximate Radix-4 Booth
Multipliers for Error-Tolerant Computing,” in IEEE Transactions on Computers, vol. 66, no. 8, pp.
1435-1441, 1 Aug. 2017.
[108] A. Gorantla and P. Deepa, “Design of approximate compressors for multiplication,” ACM Journal
on JETC, vol. 13, no. 3, May 2017.
[109] S. Venkatachalam and S.-B. Ko, “Design of power and area efficient approximate multipliers,” in
IEEE TVLSI, vol. 25, no. 5, pp. 1782–1786, May 2017.
[110] H.R. Mahdiani, A. Ahmadi, S.M. Fakhraie and C. Lucas, “Bio-Inspired Imprecise Computational
Blocks for Efficient VLSI Implementation of Soft-Computing Applications”, in IEEE TCASI:
Regular Papers, vol. 57, no. 4, pp. 850-862, 2010.
[111] V. Mrazek, R. Hrbacek, Z. Vasicek and L. Sekanina, “EvoApprox8b: Library of Approximate
Adders and Multipliers for Circuit Design and Benchmarking of Approximation Methods,” in Proc.
of DATE, 2017, pp. 258-261.
[112] A. Momeni, J. Han, P. Montuschi and F. Lombardi, “Design and Analysis of Approximate
Compressors for Multiplication,” in IEEE Transactions on Computers, vol. 64, no. 4, pp. 984-994,
April 2015.
[113] G. Zervakis, K. Tsoumanis, S. Xydis, D. Soudris and K. Pekmestzi, "Design-Efficient Approximate
Multiplication Circuits Through Partial Product Perforation," in IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, vol. 24, no. 10, pp. 3105-3117, Oct. 2016.
[114] M. Lau, K. V. Ling, and Y. C. Chu, “Energy-Aware Probabilistic Multipliers: Design and Analysis,”
in Proc. of CASES, Oct. 2009, pp. 281-290.
[115] V. Mrazek, Z. Vasicek, and L. Sekanina, “Design of quality-configurable approximate multipliers
suitable for dynamic environment,” in Proc. AHS, Aug. 2018, pp. 264–271.
[116] T. Yang, et. al., “A low-power high-speed accuracy-controllable approximate multiplier design,” in
Proc. ASP-DAC, 2018, pp. 605-610.
[117] H. Baba, et. al., “A Low-Power and Small-Area Multiplier for Accuracy-Scalable Approximate
Computing”, in Proc. of ISVLSI, 2018.
[118] H. Jiang, C. Liu, F. Lombardi and J. Han, "Low-Power Approximate Unsigned Multipliers With
Configurable Error Recovery," in IEEE TCAS I: Regular Papers, vol. 66, no. 1, pp. 189-202, Jan.
2019.
[119] O. Akbari, M. Kamal, A. Afzali-Kusha and M. Pedram, “RAP-CLA: A Reconfigurable Approximate
Carry Look-Ahead Adder,” in IEEE TCAS II, vol. 65, no. 8, pp. 1089-1093, Aug. 2018.
[120] F. Ebrahimi-Azandaryani, O. Akbari, M. Kamal, A. Afzali-Kusha and M. Pedram, “Block-based
Carry Speculative Approximate Adder for Energy-Efficient Applications,” in IEEE TCAS II:
Express Briefs.
[121] M. Shafique, W. Ahmad, R. Hafiz and J. Henkel, “A low latency generic accuracy configurable
adder,” in Proc. DAC, 2015, pp. 1-6.
[122] A. Shafaei, Y. Wang, L. Chen, S. Chen and M. Pedram, “Maximizing the performance of NoC-based
MPSoCs under total power and power density constraints,” in Proc. of ISQED, March 2016, pp. 49-
56.
[123] H. Afzali-Kusha, O. Akbari, M. Kamal and M. Pedram, “Energy and Reliability Improvement of
Voltage-Based, Clustered, Coarse-Grain Reconfigurable Architectures by Employing Quality-
Aware Mapping,” in IEEE JETCAS, vol. 8, no. 3, pp. 480-493, Sept. 2018.
[124] C.-H. Chang, J. Gu, and M. Zhang, “Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors
for fast arithmetic circuits,” in IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 51, no. 10, pp. 1985–
1997, Oct. 2004.
[125] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy and S. Borkar, “Near-threshold voltage
(NTV) design — Opportunities and challenges,” in Proc. of DAC, 2012, pp. 1149-1154.
[126] A. Shapiro and E.G. Friedman, “Power Efficient Level Shifter for 16 nm FinFET Near Threshold
Circuits,” in IEEE TVLSI, vol. 24, no. 2, pp. 774-778, 2016.
[127] P. Pandey, P. Basu, K. Chakraborty and S. Roy, “GreenTPU: Improving Timing Error Resilience of
a Near-Threshold Tensor Processing Unit,” in Proc. of DAC, June 2019, pp. 1-6.
[128] M. Ansari, H. Afzali-Kusha, B. Ebrahimi, Z. Navabi, A. Afzali-Kusha, M. Pedram, “A near-
threshold 7T SRAM cell with high write and read margins and low write time for sub-20nm FinFET
technologies,” Integration, The VLSI Journal, vol. 50, June 2015, pp 91-106.
[129] H. R. Myler and A. R. Weeks, The Pocket Handbook of Image Processing Algorithms in C.
Englewood Cliffs, NJ, USA: Prentice-Hall, 2009.
[130] R.J. Cintra, F.M. Bayer and C.J. Tablada, “Low-complexity 8-point DCT approximations based on
integer functions,” Signal Processing, vol. 99, 2014, pp. 201-214.
[131] B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, Inc.,
New York, NY, USA, 2005.
[132] H. Afzali-Kusha, M. Vaeztourshizi, M. Kamal and M. Pedram, “Design exploration of energy-
efficient accuracy-configurable Dadda multipliers with improved lifetime based on voltage
overscaling,” in IEEE TVLSI vol. 28, no. 5, pp. 1207-1220, May 2020.
[133] R. Venkatesan et al., “MAGNet: A modular accelerator generator for neural networks,” in Proc.
ICCAD, 2019, pp. 1-8.
[134] https://github.com/nvdla/sw/blob/master/LowPrecision.md
[135] https://github.com/mravendi/caffe-test-mnist-jpg/tree/master/model
[136] https://github.com/KaimingHe/deep-residual-networks
[137] Y. LeCun, C. Cortes, and C. Burges, The MNIST Database of Handwritten Digits. [Online].
Available: http://yann.lecun.com/exdb/mnist/
[138] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical
image database,” in Proc. CVPR, 2009, pp. 248–255.
[139] https://carc.usc.edu/
[140] P. N. Whatmough, S. K. Lee, D. Brooks and G. Wei, “DNN engine: A 28-nm timing-error tolerant
sparse deep neural network processor for IoT applications,” in IEEE JSSC, vol. 53, no. 9,
pp. 2722-2731, Sept. 2018.
[141] A. Sayal, S. S. T. Nibhanupudi, S. Fathima and J. P. Kulkarni, “A 12.08-TOPS/W all-digital time-
domain CNN engine using bi-directional memory delay lines for energy efficient edge computing,”
in IEEE JSSC, vol. 55, no. 1, pp. 60-75, Jan. 2020.
[142] TensorRT: https://developer.nvidia.com/tensorrt. (2021).
[143] S. Ganapathy, J. Kalamatianos, K. Kasprak and S. Raasch, “On characterizing near-threshold SRAM
failures in FinFET technology,” in Proc. DAC, 2017, pp. 1-6.
[144] A. Shafaei, Y. Wang, X. Lin and M. Pedram, “FinCACTI: Architectural analysis and modeling of
caches with deeply-scaled FinFET devices,” 2014 IEEE Computer Society Annual Symposium on
VLSI, Tampa, FL, 2014, pp. 290-295.
[145] https://github.com/SCLUO/Open-DLA-Performance-Profiler
[146] S. Amanollahi, M. Kamal, A. Afzali-Kusha and M. Pedram, “Circuit-level techniques for logic and
memory blocks in approximate computing systems,” in Proceedings of the IEEE, vol. 108, no. 12,
pp. 2150-2177, Dec. 2020.
[147] M. D. Ercegovac and T. Lang, Digital Arithmetic. Amsterdam, The Netherlands: Elsevier, 2004.
[148] C. Tan, C. Xie, A. Li, K. Barker, and A. Tumeo, “OpenCGRA: An
Open-Source Framework for Modeling, Testing, and Evaluating CGRAs,” in Proc. ICCD, Oct. 2020.
[149] W. Lin, C. Hsieh and C. Chou, “ONNC-Based Software Development Platform for Configurable
NVDLA Designs,” in Proc. VLSIDAT, 2019, pp. 1-2.
Asset Metadata
Creator: Afzalikusha, Hassan (author)
Core Title: Energy consumption and lifetime/reliability improvement of computing systems using voltage overscaling (VOS) approximation technique
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Degree Conferral Date: 2022-05
Publication Date: 04/19/2022
Defense Date: 03/01/2021
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: aging reduction, approximate computing, approximate memory, CGRA, CNN accelerator, energy-efficient multiplier, finFET, NVDLA, OAI-PMH Harvest, runtime accuracy configurable, voltage clustering, voltage overscaling
Format: theses (aat)
Language: English
Advisor: Haas, Stephan (committee member), Nuzzo, Pierluigi (committee member), Pedram, Massoud (dissertation committee chair)
Creator Email: afzaliku@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-oUC11666498
Unique identifier: UC11666498
Legacy Identifier: etd-Afzalikush-9494
Dmrecord: 447006
Document Type: Dissertation
Rights: Afzalikusha, Hassan
Internet Media Type: application/pdf
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email: cisadmin@lib.usc.edu