An asynchronous resilient circuit template and automated design flow
AN ASYNCHRONOUS RESILIENT CIRCUIT TEMPLATE AND
AUTOMATED DESIGN FLOW
by
Dylan Hand
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2019
Copyright © 2019
Dedication
To my parents and Jenny
Acknowledgements
This dissertation would not have been possible without the continued guidance and
unwavering support of Professor Peter A. Beerel. From our first conversation when
he encouraged me to pursue a doctoral degree to the weeks leading up to gradua-
tion, Peter has been an advisor, colleague, and friend throughout. No matter how
many other obligations he has, he always makes time to offer insights, opinions,
or suggestions regardless of the topic. His enthusiasm for mentoring and research
has made a lasting impression that I hope to emulate throughout my career.
I would like to thank my committee members, Massoud Pedram and Leana
Golubchik, for their assistance over the course of my research. When struggling to
balance my academic goals with my career ambitions, they responded only with
complete support and encouragement. Their feedback has made this dissertation
stronger and more impactful.
I am also thankful for the technical discussions and support of my two col-
leagues, Fei Huang and Matheus Moreira. Their insights and feedback helped
shape my research direction, and their willingness to lend support for last minute
deadlines was always appreciated. A special thank you to Diane Demetras for
her indispensable help in navigating the intricacies and non-technical challenges of
pursuing a graduate degree.
Finally, my deepest appreciation to my parents and my girlfriend, Jenny, for
their unconditional love and encouragement.
Table of Contents
Dedication
Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
1.1 Timing Resilient Design
1.2 Asynchronous Solutions
1.2.1 Asynchronous Background
1.2.2 Bundled Data Communication Channel
1.2.3 Bundled Data Design Template
1.2.4 Conditional Communication
1.2.5 Existing Design Flows
1.3 Contributions of Thesis
1.4 Organization
Chapter 2: Glitch Analysis on Error Detecting Sequentials
2.1 Glitch Model
2.2 Application to TDTB
2.2.1 Modified TDTB
Chapter 3: Blade Template
3.1 Overview
3.2 Error Detection Logic
3.3 Speculative Handshaking Protocol
3.4 Metastability Analysis
3.4.1 Practical Overview
3.4.2 Analytical Approach
3.5 Blade Controllers
3.5.1 Petri Net Model
3.5.2 Burst Mode Implementation
3.5.3 Click Implementation
3.6 Timing Constraints
3.6.1 Timing Resiliency Window
3.6.2 Propagation Delay
3.6.3 Contamination Delay
3.6.4 Hiding Handshaking Overhead
3.6.5 Maximum Timing Resiliency Window
3.6.6 Q-Flop Setup and Hold
3.7 Performance Modeling
3.7.1 Delay distributions
3.7.2 Systematic error rate
3.7.3 Optimal Blade Performance
3.7.4 Performance impact of non-ideal effects
Chapter 4: Blade Design Flow
4.1 Flow Overview
4.1.1 Handling Macros
4.1.2 Resynthesis
4.2 Retiming
4.2.1 Synchronous Latch Retiming
4.2.2 Synchronous Flop Retiming
4.2.3 Resilient-Aware Latch Retiming
4.3 SystemVerilogCSP Front End
4.3.1 Overview
4.3.2 Slackless Controllers
Chapter 5: Case Studies
5.1 Plasma 3-Stage CPU
5.1.1 Resynthesis
5.1.2 Area and Performance Comparisons
5.1.3 Power and Energy Comparison
5.2 Encryption Cores
5.2.1 Design and Fabrication
5.2.2 Performance
5.2.3 Power and Energy
Chapter 6: Summary and Conclusions
6.1 Academic Impact and Research
6.1.1 Extended Blade
6.1.2 Testability
6.1.3 EDL and Delay Line Design
6.2 Commercial Impact
6.2.1 Automatic Place & Route
Bibliography
List of Figures
Figure 1.1 Bundled Data Communication Channel
Figure 1.2 (a) Four phase handshaking (b) Two phase handshaking
Figure 1.3 Bundled Data design templates
Figure 1.4 Conditional communication blocks
Figure 1.5 Simplified conditional pipeline
Figure 1.6 Proteus Flow
Figure 2.1 Transition detecting with time borrowing (TDTB) latch schematic
Figure 2.2 Glitch Sensitivity Analysis of Original TDTB
Figure 2.3 Modified TDTB
Figure 2.4 Glitch Sensitivity Analysis of Modified TDTB
Figure 3.1 The Blade template
Figure 3.2 Error detection logic including block level diagram of error detecting latch
Figure 3.3 Timing diagram of Blade template
Figure 3.4 Speculative handshaking protocol
Figure 3.5 Expected stage delays including metastability
Figure 3.6 Petri Net description of Blade Controller
Figure 3.7 Burst-mode state machines for Blade controller with error detection
Figure 3.8 Handshaking between two controllers when TRW is shorter than stage delay
Figure 3.9 Click Controller Implementation
Figure 3.10 Timing constraints in Blade
Figure 3.11 Logic delay distribution model
Figure 3.12 Normalized expected cycle time versus size of timing resiliency window for Normal and Log-Normal Distributions
Figure 3.13 p_opt versus variation and systematic error rate
Figure 3.14 Effect of delay line quantization on the Expected Cycle time for normal and log-normal distributions with σ/μ = 0.1, 0.2, and 0.3
Figure 4.1 Blade design flow
Figure 4.2 Receive Controller
Figure 4.3 Send Controller
Figure 5.1 Block Diagram of Plasma CPU
Figure 5.2 Resynthesis to improve area and decrease error rate
Figure 5.3 Area overheads as percentage of total overhead for 666MHz design
Figure 5.4 Average case performance over time for Plasma CPU using Blade. Original synchronous frequency of 666MHz
Figure 5.5 Area overheads as percentage of total overhead for 500MHz design
Figure 5.6 Average case performance over time for Plasma CPU using Blade. Original synchronous frequency of 500MHz
Figure 5.7 Delay versus supply voltage in 28nm FD-SOI
Figure 5.8 Histogram of encryption core performance across chip batch
Figure 5.9 Performance versus voltage for (a) simon 128/128, (b) simon 48/72, (c) speck 128/128, and (d) speck 48/72
Figure 5.10 Energy consumption of encryption cores per (a) encrypted bit and (b) encrypted block
Figure 5.11 Power versus voltage for all encryption blocks
List of Tables
2.1 Curve fit parameters for original TDTB and Latch
2.2 Curve fit parameters for modified TDTB
5.1 Power overhead in Plasma
5.2 Area and Throughput at 1.2V
Abstract
As advancements in process technology slow and the ubiquity of mobile and embed-
ded devices increases, chip designers are looking to new technologies to stretch the
energy efficiency of the standard silicon-based semiconductor. The increased focus
on energy efficiency and decreased ability to provide efficiency through process
shrinks have led the industry to look towards near-threshold and sub-threshold
design paradigms to fuel the next generation of low-power devices. However, adding
margins to address increased delay variation in these regions of operation often
overshadows efficiency gains. Resilient circuit templates have been proposed as one
general solution to reducing the required margins of synchronous designs, where
timing violations are allowed and errors introduced in the datapath by these vio-
lations are corrected at a later time to ensure data integrity.
However, none of these synchronous resilient solutions have gained industry
traction. Many have suffered from metastability or require modifying the
architecture to add replay-based logic that recovers from timing errors, which leads to high
timing error penalties and poses a design challenge in modern processors.
Therefore, an asynchronous resilient circuit template, Blade, is proposed that addresses
the concerns and problems of existing resilient solutions; it is robust to
metastability issues, requires no replay-based logic, and has low timing error penalties. The
Blade template's timing assumptions and requirements are analyzed and a generic
approach for modeling the performance benefits of applying Blade to an existing
circuit is provided.
In addition, complex design flows are a common barrier to using asynchronous
circuits in industry. Therefore, an automated design flow is proposed that
synthesizes synchronous RTL designs to gate-level asynchronous Blade designs using
existing, commercially available synchronous tools. The various steps of the
design flow are described in detail and its application to a 3-stage MIPS CPU and
light-weight encryption cores is explored. In addition, simple optimizations to re-
duce the area overhead associated with converting an existing design to Blade are
examined.
Chapter 1
Introduction
The computing power of silicon devices has increased dramatically over the past
several decades. The semiconductor industry has historically followed Moore's
law, achieving a large portion of performance and power saving benefits through
process shrinks combined with advancements at the process, transistor, gate, and
system level. With the end of modern lithography methods approaching [1] and
the meteoric rise of mobile devices, chip designers are looking to new technologies
to stretch the energy efficiency of the standard silicon-based semiconductor. From
1999 to 2007, the single-threaded performance of Intel processors jumped approxi-
mately 500% while from 2007-2015 their processors increased only 2.4x in the same
metric [2], partially due to being power limited [3].
With increasing focus on energy efficiency and decreasing ability to provide
efficiency through process shrinks, the industry has looked towards near-threshold
and sub-threshold design paradigms to fuel the next generation of low-power de-
vices. However, as the voltage dips to these ranges, the delay variation of transis-
tors increases substantially [4]. Traditional synchronous design must incorporate
timing margin to ensure the correct operation under worst-case delay conditions,
which consequently reduces performance at a given voltage level. Delay variation
increases from 50% at nominal supply to around 2,000% in the near-threshold
domain [5], thus the efficiency benefits obtained via scaling voltage are often lost to
these margins. Synchronous latch-based designs are becoming popular (e.g. [6])
to address variation, but hold times can still become problematic in performance-
driven applications. Other approaches involve pruning cells from standard libraries
that exhibit greater variation than other logically equivalent options [7]. In recent
years, industry research has shifted towards timing resilient circuits as a potential
solution. These circuits can be broken into two classes: the first adjusts voltage
or frequency in reaction to variations in voltage or temperature, such as adaptive
voltage scaling (AVS) techniques [8–11], and the second permits timing violations,
which are then subsequently resolved. This thesis focuses on the latter class
of timing resilient circuit templates to reduce the required margins of synchronous
designs.
1.1 Timing Resilient Design
In a timing resilient design, timing violations are allowed (to a point) and errors
introduced in the datapath by these violations are corrected at a later time to
ensure data integrity. The most popular of these design templates is the Razor
family of resilient circuits [12–15]. The original Razor template consists of error
detecting flip-flops that record the occurrence of timing violations by comparing
the input value of the flop at two points in time: the initial instance when data
is passed to the next stage and a short time later, before the next clock cycle.
The period between these points is known as the timing resiliency window (TRW).
If a timing violation is detected during the TRW in Razor, an error signal is
asserted and handled at the system level. Most commonly, this would be the
hazard detection/correction block in a pipelined CPU. The reliance on a system-
level error governor restricts Razor's implementation to designs that can be altered
in such a way to globally correct timing violations.
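
The two-sample comparison described above can be sketched behaviorally. The following Python model is only an illustration under simplifying assumptions, not the Razor circuit itself: the function name `razor_sample`, the abstract time units, and the assumption that the data input makes exactly one transition at time `arrival` are all hypothetical.

```python
def razor_sample(arrival, clock_period, trw, old_value=0, new_value=1):
    """Return (main_sample, shadow_sample, error) for one error detecting flop.

    Assumption: the data input switches from old_value to new_value exactly
    once, at time `arrival`, measured from the launching clock edge.
    """
    if clock_period <= 0 or trw < 0:
        raise ValueError("clock_period must be positive and trw non-negative")
    # First sample: taken at the capturing clock edge.
    main = new_value if arrival <= clock_period else old_value
    # Second sample: taken at the end of the timing resiliency window (TRW).
    shadow = new_value if arrival <= clock_period + trw else old_value
    # A mismatch means the data changed inside the TRW: a timing violation.
    error = main != shadow
    return main, shadow, error


# On-time data: both samples agree, no error.
print(razor_sample(arrival=0.9, clock_period=1.0, trw=0.3))
# Late data inside the TRW: samples differ, error is flagged.
print(razor_sample(arrival=1.2, clock_period=1.0, trw=0.3))
# Data arriving after the TRW: samples agree on the stale value, so the
# violation goes undetected in this simplified model.
print(razor_sample(arrival=1.5, clock_period=1.0, trw=0.3))
```

The third call illustrates why violations are tolerated only "to a point": in this model, an arrival later than the end of the TRW is indistinguishable from no transition at all.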
RazorII [16] and Razor-Lite [17] were proposed in part to fix some inherent
problems of the original Razor template, including unsafe operation due to
metastability, the large size of the error detecting sequential elements, and the limited
duration of the timing resiliency window. Timber [18], similar to Razor-II, har-
nessed the time-borrowing nature of adding latches to the datapath to enable error
correction across multiple stages. For example, an error occurring in stage 1 may
be resolved as it propagates through non-critical paths in stage 2, thereby
preventing an error from being flagged in stage 2. While improvements over the original
Razor template, these designs still require architectural changes to adjust the clock
frequency on a cycle-by-cycle basis, which in many cases may not be possible (e.g.
existing IP or custom logic). Additionally, the system-level correction of timing vi-
olations incurs a large error penalty (i.e. multiple cycles required to correct a single
error), which incentivizes reducing the TRW and subsequently reducing resiliency.
Thus, architecturally-independent timing violation resilient templates were de-
veloped to improve resiliency and increase applicability. Bubble Razor [14] was
proposed as an architecture agnostic alternative to Razor, i.e. Bubble Razor does
not require a global circuit to govern when data errors are corrected. Rather data
integrity is ensured by locally stalling the pipeline and inserting "bubbles"
(no-ops) when a timing violation is detected, allowing sufficient time for the error to
be corrected. However, Bubble Razor does not feature any mechanism to handle
metastability, which ultimately limits its potential in commercial products.
Other solutions have been proposed that attempt to resolve the architectural
modification problem but ignore metastability, potentially leading to designs with
MTBF too low to be commercially viable. In [19], metastability is addressed
through a questionable "metastability detector", a circuit that may not exist [20].
The authors in [21] do not address metastability at all, which raises questions
about the feasibility of their design to correctly resolve errors within one clock
cycle.
A semi-asynchronous approach, SafeRazor, wherein a global ring-oscillator
drives locally synchronous logic has also been proposed to address the architec-
tural modication and metastability problems of the existing Razor solutions [22].
Like [19], SafeRazor relies on detecting metastability in a bounded amount of time.
However, their detection circuit directly alters the clock period in an analog fash-
ion, which the authors claim to be metastable safe. The downfall of the SafeRazor
approach is using latches directly in the datapath that may be open simultaneously
in a chain or loop. Therefore, hold buffers are necessary to ensure short paths do
not induce data contamination. As margins become worse as supply voltage drops,
this hold requirement may become more difficult to satisfy.
Beyond research on template design itself, there has been increased focus on
improving the area overhead and power consumption of resilient designs, specifically
with specialized CAD tools or approaches. For example, research on reducing
timing errors [23–26], improving performance [27], and minimizing area and power
overheads [28–32] has shown both great promise and interest in timing resilient
circuits.
1.2 Asynchronous Solutions
1.2.1 Asynchronous Background
Asynchronous circuits have long been trumpeted as an energy efficient solution for
a variety of reasons. Generally, asynchronous circuits rely on one stage of logic
communicating to its neighboring stages when data is available for processing using
local handshaking signals rather than using a global clock signal. Because the clock
network in a typical synchronous circuit may account for about 20% to 40% of the
total power consumption [33], using the local handshaking signals can eliminate the
power consumption of clock generation circuits and the global clock splines. The
ability to detect data availability in asynchronous circuits falls into two main
categories - completion sensing and matched delay lines. Delay insensitive (DI)
and quasi-delay insensitive (QDI) asynchronous circuits, which rely on completion
sensing, can also promise a reduction in margins as each block of logic times itself
rather than relying on a worst-case delay clock period [34]. Likewise, each block of
a design can be individually optimized for a desired power, performance, or area
target. The local communication between blocks inherently provides the ability
for conditional execution at a fine-grain level, facilitating reductions in dynamic
power as well [35].
While fully asynchronous circuits have not yet become mainstream in commercial
applications, asynchronous Network On Chip designs [36–38] and globally
asynchronous locally synchronous (GALS) solutions [39, 40] have gained traction.
Some limited research and industrial efforts have shown that asynchronous
techniques can improve the throughput of a circuit while being more robust to variability
introduced by process or environmental factors [41–44].
As an alternative to completion sensing used by the DI and QDI solutions
described earlier, another class of asynchronous circuits use delay lines matched to
the critical path of the corresponding stage's logic to set when data will be available
for the subsequent stage [45–47]. As with DI and QDI designs, each pipeline stage
informs neighboring stages as soon as it has completed processing data. This is
in contrast to the global worst case delay assumption in synchronous circuits. In
theory, completion detection and delay lines can provide both average case delay
and better tolerance against process variability.
The benefits of asynchronous circuits, however, come at the expense of
incorporating extra overhead. For example, completion detection is commonly
implemented using dual-rail logic, which greatly increases the required logic area [48]
and switching activity [49]. Handshaking signals, controllers, and matched delay
lines also add to the required overhead of the circuit. Thus, it is essential to identify
overhead intensive aspects of the design and mitigate them when possible.
Bundled-data (BD) designs, which pair traditional sequential elements, standard
logic cells, and matched delay lines, have switching activity similar to that of their
synchronous counterparts because the combinational logic is often unchanged, and
the total area is also similar because the area of the control circuits and delay
lines is comparable to that of a clock tree [45]. However, one challenge in BD
designs is that the delay line must be conservatively designed to be longer than
the worst case delay of its corresponding logic under all possible process, voltage,
and temperature (PVT) corners, and this can squander much of the advantage.
Researchers have proposed ideas to mitigate this problem, such as duplicating the
BD delay lines [50], constraining the design to regular structures such as PLAs [51],
and using soft latches [52].
Figure 1.1: Bundled Data Communication Channel
1.2.2 Bundled Data Communication Channel
Asynchronous circuits are comprised of multiple processes communicating via
ports. Connections between ports facilitate passing messages to and from processes
through standardized communication channels. While many implementations ex-
ist, the bundled data (BD) channel is of particular interest in this thesis.
There are two parts of a standard BD channel: handshaking signals and data
signals. Furthermore, the handshaking signals typically consist of two sub-signals,
request and acknowledgement, while the data signals may be comprised of an
arbitrary number of data bits. As shown in Figure 1.1, the bundled data channel
connects the input and output ports of a Receiver and Sender process, respectively.
The request signal is asserted to indicate that the data signals available on the
channel have been updated by the output port and are ready to be consumed
by the next process. The receiving process monitors this signal to determine the
appropriate time to sample the data signals. Once sampling has finished, the
receiving process can then acknowledge its receipt of the data by asserting the
channel's acknowledgement signal.
Figure 1.2: (a) Four phase handshaking (b) Two phase handshaking
Two flavors of handshaking protocols are shown in Figure 1.2 [53]. Four phase
handshaking employs a return to zero paradigm to ensure that both the request
and acknowledgement are received at both sides of the channel. As shown in Figure
1.2(a), the request is lowered after the acknowledgement is asserted, which only
then allows the acknowledgement to return to zero. In this way, every cycle of op-
eration starts and ends with the handshaking signals set to zero. Conversely, the
state of the handshaking signals in two phase handshaking depends on the number
of previously communicated messages, assuming the channel starts from a known
initial state. When a message is communicated, the logical value of the handshak-
ing signal changes only once to indicate either the request or acknowledgement, as
seen in Figure 1.2(b).
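
The difference between the two protocols can be made concrete by listing the (request, acknowledgement) states a channel passes through. The sketch below is a behavioral illustration only; the function names and the list-of-tuples representation are assumptions made for this example, and both channels are assumed to start from the all-zero state.

```python
def four_phase(n_messages):
    """Return the (req, ack) states visited with return-to-zero handshaking."""
    trace = [(0, 0)]
    for _ in range(n_messages):
        trace.append((1, 0))  # sender raises request: data is valid
        trace.append((1, 1))  # receiver raises acknowledgement: data sampled
        trace.append((0, 1))  # sender lowers request
        trace.append((0, 0))  # receiver lowers acknowledgement: idle again
    return trace


def two_phase(n_messages):
    """Return the (req, ack) states visited with transition signaling."""
    req = ack = 0
    trace = [(req, ack)]
    for _ in range(n_messages):
        req ^= 1              # one transition on request signals new data
        trace.append((req, ack))
        ack ^= 1              # one transition on acknowledgement signals receipt
        trace.append((req, ack))
    return trace


print(four_phase(1))  # four signal transitions per message, ends at (0, 0)
print(two_phase(3))   # two signal transitions per message
```

After an odd number of messages the two phase channel rests at (1, 1) rather than (0, 0), which is the dependence on message count noted above.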
Figure 1.3: Bundled Data design templates: (a) Micropipeline and (b) Click template
1.2.3 Bundled Data Design Template
As described in the previous section, asynchronous circuits eliminate the global
clock signal typically used in digital circuits, instead using communication chan-
nels and handshaking signals to pass messages between logic blocks. To simplify
and standardize the implementation details of asynchronous circuits, designers will
often use asynchronous templates. A template defines a standard circuit form, typically
a pipeline stage, which receives data, computes a result, and sends information
to another circuit implemented using the same template. As long as each connected
block follows the same general template, data can flow freely through the system.
A general bundled data template, commonly known as a micropipeline [54], is
shown in Figure 1.3a. A more specific implementation of the micropipeline, known
as the Click template [55], is shown in Figure 1.3b as an example.
The micropipeline template describes the generic implementation of a bundled
data stage, containing single-rail combinational logic, state holding sequential el-
ements, handshaking signals, and an asynchronous controller. The output of the
state holding sequential elements drives the data inputs of the next stage while the
asynchronous controller handles the handshaking signals and generates the clock
signal for the state holding elements. A delay line matched to the critical path
of the combinational logic is used to sufficiently delay the request signal so that
stable data is sampled into the sequential elements.
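
As a rough illustration of how such a matched delay line might be sized, the sketch below computes how many identical delay elements are needed to cover a critical path plus a safety margin. The function name, the use of integer picoseconds, and the percentage margin are assumptions made for this example; a real flow would derive these values from static timing analysis across PVT corners.

```python
def delay_line_stages(critical_path_ps, element_delay_ps, margin_percent=10):
    """Smallest number of delay elements whose total delay covers the
    critical path plus margin. Integer arithmetic avoids rounding surprises."""
    if critical_path_ps < 0 or element_delay_ps <= 0 or margin_percent < 0:
        raise ValueError("delays must be positive and margin non-negative")
    target_scaled = critical_path_ps * (100 + margin_percent)  # target * 100
    step_scaled = element_delay_ps * 100
    # Ceiling division: round up so the line is never shorter than the target.
    return -(-target_scaled // step_scaled)


# A 1000 ps critical path with 10% margin and 50 ps delay elements.
stages = delay_line_stages(1000, 50, margin_percent=10)
print(stages, "elements,", stages * 50, "ps total")
```

Rounding up to a whole number of elements is itself a source of extra margin, since the delay line can only be adjusted in steps of one element.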
An example implementation template of the more general micropipeline design
template, Click [55], is shown in Figure 1.3b. Click uses two-phase handshaking,
which requires the use of an internal state holding element (in this case a flip-flop)
in the asynchronous controller. Initially, the outputs of all gates are set to zero
via reset. When L.req becomes high, indicating new data is available, the right
half of the AND-OR gate will resolve to 1, generating a rising transition on the
flip-flop's clock input. Because the input to the flop is its own inverted output,
the L.ack and (after some delay) R.req signals will both become high. The rising
transition on L.ack disables the right side of the AND-OR gate, creating a falling
transition on the flop's clock input. Only once R.ack is asserted and L.req becomes
zero, indicating that the forward data has been received by the next stage and new
data is available in the current stage, respectively, will another pulse appear on
the clock pin to sample new data - this time via the left side of the AND-OR gate.
While this template is presented in the context of a linear pipeline, it can easily
be extended to more complex pipelines with forks, joins, cycles, and conditional
communication.
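
The Click firing behavior described above can be sketched as a small behavioral model. The exact gate inputs are inferred from the prose rather than taken from the figure, so the AND-OR terms below are an assumption: the right half is taken to be active when L.req is high while the flop state and R.ack are low, and the left half when L.req is low while the flop state and R.ack are high. The flop output is assumed to drive both L.ack and (after the delay line) R.req. All function and variable names are hypothetical.

```python
def click_fire(l_req, r_ack, phase):
    """AND-OR gate driving the flop clock (gate inputs assumed from the text)."""
    right = l_req and (not phase) and (not r_ack)
    left = (not l_req) and phase and r_ack
    return bool(left or right)


def simulate(events):
    """Apply (signal, value) events in order; return the flop state after
    each clock pulse. The flop state stands in for L.ack and delayed R.req."""
    l_req = r_ack = phase = 0
    pulses = []
    for name, value in events:
        if name == "L.req":
            l_req = value
        elif name == "R.ack":
            r_ack = value
        else:
            raise ValueError(f"unknown signal {name!r}")
        if click_fire(l_req, r_ack, phase):
            phase ^= 1  # D input is the inverted output, so the flop toggles
            # The new L.ack value disables the gate, ending the clock pulse.
            assert not click_fire(l_req, r_ack, phase)
            pulses.append(phase)
    return pulses


# First token fires via the right half; the second needs R.ack before it
# can fire via the left half.
print(simulate([("L.req", 1), ("R.ack", 1), ("L.req", 0)]))
# Without R.ack, the second request does not produce a clock pulse.
print(simulate([("L.req", 1), ("L.req", 0)]))
```

In this model the second clock pulse is withheld until the next stage has acknowledged the first token, which is the property that keeps unconsumed data from being overwritten.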
While the area overheads of this template can be low, the margins necessary
in the delay lines to accommodate the worst-case path in the combinational logic
reduce the potential for significant performance improvements in many applications,
especially when the supply voltage is reduced. More generally, a significant
advantage of bundled data design styles is that the datapath can be designed from
standard gates that can be found in any synchronous library. This means the area
overhead can also be low. With the datapath potentially being the same as the
synchronous circuit, the largest contributors to area overhead become the asyn-
chronous controllers and delay lines, which can be balanced with the removal of
the global clock tree or clock generation circuits.
While the delay lines make using a standard cell datapath feasible, they can
also become a significant portion of the area if not carefully designed.
Subsequently, there exists a rich body of research into delay line synthesis, testing, and
verification. Recent works have addressed the intricacies of delay lines in
low-power, voltage scaled designs [56–58] as well as verification across multiple PVT
corners [59]. Additionally, some applications may warrant the use of programmable
delay lines [60], which can be tuned post-silicon to mitigate process variation at
the expense of further area overhead and system-level complexity.
Figure 1.4: Conditional communication blocks: (a) SEND and (b) RECV
1.2.4 Conditional Communication
Conditional communication, similar to clock gating a traditional synchronous cir-
cuit, allows an unused block of logic to be disabled during certain operations and
thereby reduce dynamic power. SEND and RECEIVE (RECV) blocks offer one
generalized approach for implementing conditional communication in asynchronous
circuits. Both the SEND (Fig 1.4a) and RECV (Fig 1.4b) blocks feature an input,
output, and enable channel. For SEND, one transaction is required on both the
input and enable channels for every cycle of operation. Based on the data value of
the enable channel, a transaction is selectively generated on the output channel.
For RECV, one transaction is required on both the output and enable channels
for every cycle of operation. In this case, the data value of the enable channel will
determine whether a transaction is also required on the input channel.
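
The token-level behavior of the two blocks can be sketched as follows. This is a behavioral illustration over lists of tokens, not a circuit description; the function names are hypothetical, and the `idle` placeholder that RECV emits when its enable is false is an assumption, since the text does not specify the output data in that case.

```python
from collections import deque


def send(inputs, enables):
    """SEND: consume one input token and one enable token per cycle and
    forward the input to the output channel only when enable is true."""
    if len(inputs) != len(enables):
        raise ValueError("SEND needs one enable token per input token")
    return [data for data, enable in zip(inputs, enables) if enable]


def recv(inputs, enables, idle=None):
    """RECV: produce one output token per enable token, consuming an input
    token only when enable is true (idle placeholder is an assumption)."""
    pending = deque(inputs)
    outputs = []
    for enable in enables:
        outputs.append(pending.popleft() if enable else idle)
    if pending:
        raise ValueError("input tokens left unconsumed")
    return outputs


enables = [1, 0, 1]
forwarded = send(["a", "b", "c"], enables)   # "b" is dropped
print(forwarded)
print(recv(forwarded, enables))              # one output per cycle
```

Pairing a SEND and a RECV that share the same enable stream keeps the token counts on both sides of the conditional path consistent.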
Figure 1.5: Simplified conditional pipeline
These blocks are inserted between unconditional asynchronous stages and
selectively enable/disable communication between adjacent stages, thereby creating a
conditional pathway. As an example, the simple pipeline system in Figure 1.5
consists of a decoder, arithmetic unit, multiplier, and merge. Note that some channels
and copy blocks are omitted for clarity. Once the decoder stage has determined
the operation to be performed, it can selectively activate either the arithmetic
or multiplier units. Likewise, the merge stage can selectively receive from the
appropriate stage. By completely bypassing the multiplier stage when it is not
needed, the overall system can potentially perform faster (i.e. if the arithmetic
unit is faster than the multiplier) and save dynamic power that would otherwise
be wasted. Therefore, conditional communication in asynchronous design can be
a powerful tool in optimizing designs for both power and performance.
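
The potential speed benefit of bypassing the multiplier can be estimated with a simple weighted average. The sketch below is a first-order illustration only: it assumes each operation uses exactly one of the two units, ignores handshaking and decode/merge overhead, and uses made-up delay values; the function name is hypothetical.

```python
def expected_stage_time(p_multiply, t_arith, t_mult):
    """Average execute time when a fraction p_multiply of operations use the
    multiplier and the rest use the arithmetic unit."""
    if not 0.0 <= p_multiply <= 1.0:
        raise ValueError("p_multiply must be between 0 and 1")
    return p_multiply * t_mult + (1.0 - p_multiply) * t_arith


# Illustrative numbers: arithmetic unit 1.0 ns, multiplier 3.0 ns,
# one operation in four is a multiply.
conditional = expected_stage_time(0.25, t_arith=1.0, t_mult=3.0)
worst_case = max(1.0, 3.0)  # every operation waits for the slower unit
print(conditional, worst_case)
```

Under these assumed numbers the conditional pipeline averages half the time of a design in which every operation waits for the multiplier.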
1.2.5 Existing Design Flows
Historically, a common barrier to adoption of asynchronous circuits in industry is the lack of mature tools to synthesize, analyze, and verify these circuits. While a number of tools from multiple vendors exist for synchronous circuits at each stage of the design process, very few design flows have been developed commercially to build asynchronous circuits, with the majority of tools coming solely from academia for research purposes.
Design flows built around re-using commercial synchronous tools are significantly more attractive for asynchronous commercialization efforts, especially when targeting modern nanometer processes. Early efforts include Theseus Logic's Null Convention Logic (NCL) flow [61] and Minatec's Weak Conditioned Half Buffer (WCHB) flow [62]. While these flows focused on non-BD forms of asynchronous logic, they still benefited heavily from the mature logic synthesis and timing engines the commercial tools provided. On the back-end of the design flow, robust place and route, power analysis, and design optimization tools are essential to developing commercial chips. An academic Bundled Data flow, ACDC, was recently introduced which uses Synopsys EDA tools to generate GDSII from RTL while ensuring the desired timing constraints are maintained throughout the process [63].
There have also been a number of attempts that relied more heavily on custom EDA tools. Balsa [64] was both a hardware description language and asynchronous synthesis flow developed in the early 2000s capable of producing various types of asynchronous circuits, including bundled data micropipelines.
While not formalized into a CAD tool, the desynchronization framework in [45] provides a thorough analysis of converting existing synchronous designs to asynchronous circuits by removing and replacing the global synchronous clock with localized handshaking controllers. Techniques derived from this work have formed the basis of more recent bundled data flows, including the Blade design flow that will be introduced in Chapter 4.
SVC2RTL
The widespread adoption of SystemVerilog in industry has increased efforts in making it the preferred HDL for asynchronous circuits. The addition of tasks and interfaces provides a supported framework for describing and abstracting asynchronous communication between blocks. Tiempo developed a method of describing asynchronous circuits in SystemVerilog [65], and SystemVerilogCSP (SVC) [66] was developed as an alternative to proprietary CSP-based languages, such as CAST [67], aimed at reducing the barrier to adopting asynchronous design in front-end design. However, designs written in these HDLs cannot be directly synthesized using commercial logic synthesis tools. In the case of SystemVerilogCSP, a secondary preprocessor, SVC2RTL, was developed at USC that converts an SVC design to synthesizable RTL by identifying a synchronous core of each model (state and combinational logic) and tracking the inferred communication on each I/O channel on a cycle-by-cycle basis. Once a conditional channel has been identified, the appropriate SEND or RECV blocks are inserted into the RTL to wrap the synchronous RTL body. The final output can then be synthesized using standard tools, and support for recognizing the SEND/RECV blocks can be added to an asynchronous flow.
Figure 1.6: Proteus Flow
Proteus
One example of a successful commercial solution is the Proteus flow [68]. Proteus is a commercially proven asynchronous ASIC flow that combines a number of commercially available synchronous design tools with custom-designed scripts and programs to take an asynchronous design from a high-level language, such as SystemVerilogCSP [66], or a synchronous RTL design to a post-place and route implementation using a library of custom asynchronous cells. The entire process is shown in Figure 1.6.
There are four main steps in the Proteus flow:
• Converting to RTL (Optional): If an asynchronous design is specified using SystemVerilogCSP, it will be converted into a block capable of being processed by standard synthesis tools.
• RTL synthesis: A synchronous logic synthesis tool implements the specification using image library cells for a target clock frequency with preset I/O delays and output loads.
• Clustering & Pipeline Optimization: Asynchronous control cells are inserted and logic is clustered into stages to optimize power and performance of the synthesized design.
• Place & Route: A commercial place-and-route tool is used to complete the physical design.
While the commercial variant of the Proteus flow implements designs using a QDI template, specifically the precharge half buffer (PCHB) template, which still suffers from the high area and power penalties discussed earlier for QDI circuits, the idea of reusing commercial design tools whenever possible is generally applicable. The Proteus flow could have been extended to use Bundled Data templates or other asynchronous templates; however, a major downside of the existing flow is its reliance on a completely custom C++ program, which reads in the netlist at various stages in the flow to perform modifications and optimizations outside of the commercial tools' frameworks. Given the advantages of using the existing tools wherever possible to increase industry adoption, it would therefore be beneficial to transition this functionality into a natively supported framework.
1.3 Contributions of Thesis
This thesis proposes a new asynchronous circuit template, Blade, which uses the benefits of asynchronous circuits to realize the promises of resilient circuit design, with a focus on low power and high energy efficiency. To aid building circuits using the Blade template, an automated design flow for generating Blade circuits from synchronous RTL or asynchronous SystemVerilogCSP descriptions is described that relies on industry standard design tools to synthesize, verify, and analyze the asynchronous circuits using a standard cell library. In addition, a framework for comparing the efficacy of error detecting sequentials (EDS) is shown, which can critically influence the implementation of the Blade template.
More specifically, the contributions of this work are:
• A model for comparing the ability of error detecting sequentials to correctly identify and flag input glitches as errors
• A resilient asynchronous template, Blade, that achieves average case performance by correcting timing violations that were allowed to occur in the same cycle of operation while maintaining robustness to metastability
• Analysis of timing assumptions and requirements of the Blade template, including the impact of metastability
• Development of an RTL-to-gate level automated design flow that creates Blade circuits from both synchronous RTL and SystemVerilogCSP circuit descriptions
  - Simple approaches to reduce the area overhead of a Blade design through retiming and resynthesis
• Case studies of the Blade template and design flow to prove the performance, energy efficiency, and automation claims of the thesis
  - Area overhead and energy savings analysis of a MIPS-based 3-stage CPU from synchronous RTL source. Comparisons will be made to a synchronous implementation of the same RTL design, including a qualitative analysis of expected performance and energy consumption in near-threshold operation
  - Application to SystemVerilogCSP-based designs shown using four encryption cores
1.4 Organization
The remainder of this dissertation is organized as follows. Chapter 2 presents the glitch analysis model for error detecting sequentials and shows its application to compare potential improvements to existing EDS designs. Chapter 3 presents the Blade template, including two possible implementations of the Blade controller, timing equations, an analysis of the impact of metastability, and a general framework for timing analysis of Blade circuits. Chapter 4 introduces the Blade design flow, including techniques for handling macros, latch retiming, and conditional communication in SystemVerilogCSP designs. Chapter 5 covers the case studies chosen to evaluate the benefits, trade-offs, and impacts of the Blade template and Blade design flow on a 3-stage MIPS CPU and four cryptography cores, including performance and power analysis for each. Chapter 6 concludes the dissertation with a summary of work, general conclusions about resilient asynchronous design, and a call for future research that will strengthen and improve the Blade template.
Chapter 2
Glitch Analysis on Error
Detecting Sequentials
An error detecting element is a critical component of any resilient circuit architecture. In particular, error detecting sequential (EDS) circuits allow resilient designs to operate at frequencies higher than those restricted by combinational path worst-case delays by monitoring timing faults (or errors). As resilient design gains popularity, various works have proposed different approaches to designing EDSs and their respective benefits, including improving performance and increasing yield [14, 69-71]. Of the EDS circuits published thus far, the transition detector with time borrowing (TDTB) error detecting latch (EDL), proposed by Bowman et al. in [70], is one we consider to be particularly interesting. For instance, they showed that the TDTB provides the best energy efficiency among several related circuit options. More importantly, the TDTB stands out by easing the task of dealing with metastability, a critical requirement of the Blade template, by preventing possible metastable signals from propagating through the datapath. Instead, these signals propagate to the control block, where they can be more easily handled.
Of the works published to address the usage and design of TDTBs [70, 71], none evaluated the sensitivity of the circuit to glitches, which can jeopardize circuit functionality if not detected and signaled as errors. Therefore, an analytical model is developed to explore its behavior for different timing violations, including glitches, and quantify its sensitivity to such effects. This model is used to show the impact of a few straightforward optimizations which greatly increase the sensitivity of the TDTB to glitches.
2.1 Glitch Model
To develop a model of glitch sensitivity for EDSs, the analytical approach of modeling the glitch sensitivity of combinational gates proposed by Gili et al. in [72] has been extended. The model relies on fitting data obtained through simulation of the circuits under evaluation (CUEs) to a three dimensional surface. Additionally, the authors present $V_0$ as the pulse height at the input of a CUE for a specific pulse width such that $V_{out}$, the height of the pulse generated at the output of the CUE, is equal to $V_{dd}/2$. In other words, $V_0$ represents the switching threshold of CUEs, i.e. the minimum height for an input pulse width that creates a pulse in the output. In [72], the authors define $V_0$ as:
$$V_0 = V_{DC}\left(1 + \left(\frac{t_d}{t_{win}}\right)^{\alpha}\right) \qquad (2.1)$$
where $\alpha$ is a curve-fit parameter and $t_{win}$ is the input pulse width. $t_d$ and $V_{DC}$ are the propagation delay and switching threshold, respectively, and come from simulation.
Because the primary concern is quantifying the switching threshold of CUEs, we focus on $V_0$, which quantifies the minimum pulse width and height combination that causes the propagation of a timing violation to the datapath for the latch and that enables flagging an error. The drawback is that $V_{DC}$ cannot be easily determined with a high level of accuracy because a DC analysis does not capture the sequential behavior of the circuit well. Therefore, $\gamma$ is introduced to replace $V_{DC}$ and allow the curve-fitting algorithm to determine this value based on our simulation data. Accordingly, $V_0'$ is:
$$V_0' = \gamma\left(1 + \left(\frac{t_d}{t_{win}}\right)^{\alpha}\right) \qquad (2.2)$$
2.2 Application to TDTB
To collect data, we created a TDTB targeting a 65nm bulk CMOS technology using conventional cells from the core library and a C-element from the ASCEnD library [73]. The width and height of pulses in V were varied from 1ps to 200ps and 50mV to 1V (nominal voltage), respectively, and C from 1fF to 15fF. The combination of values enabled the analysis of over 90,000 glitch scenarios. This allowed a comprehensive exploration of the behavior of the evaluated circuits, enhancing the precision of curve fitting. We simulated all scenarios for each circuit using Cadence Spectre and measured the height and width of the glitch generated by the input inverter together with the pulse propagated through the CUE. We tested two variations of the original TDTB: 4I TDTB, with 4 inverters in the transition detector, and 6I TDTB, with 6 inverters in the transition detector.
Figure 2.1: Transition detecting with time borrowing (TDTB) latch schematic
Curve fitting was completed using Matlab's lsqcurvefit function, although any general method of curve fitting should provide similar results. Through simulation and fitting, we obtained Figure 2.2 with the parameters shown in Table 2.1.
Figure 2.2: Glitch Sensitivity Analysis of Original TDTB
Table 2.1: Curve fit parameters for original TDTB and Latch
            $\alpha$   $\gamma$   $t_d$ (ps)   $R^2$
Latch       0.8029     0.4409     47.4         0.9914
4I TDTB     0.4963     0.3196     111.3        0.9933
6I TDTB     0.7024     0.3284     122.8        0.9752
Observing Figure 2.2, the most sensitive design will have a curve as close to the (0, 0) intersection as possible, indicating that it responds to input glitches that present small heights and widths. As Figure 2.2 shows, the 6I TD+EL is clearly not suited for safe operation, as its sensitivity is worse than that of the latch. The 4I TD+EL, on the other hand, safely captures small glitches, but presents similar sensitivity to wide glitches as the latch. As discussed in [16], glitches propagated through combinational logic can have different widths and heights. Therefore, the obtained results indicate that the classic TDTB does not guarantee safe operation, i.e. some glitches could be propagated through the latch without generating an error signal.
2.2.1 Modified TDTB
Based on the results from this analysis, two optimizations to the original TDTB were developed in [74]. The first optimization (1) involves careful sizing of the transistors implementing the XOR gate to elongate the pulse created by the transition detector. The second optimization (2) involves replacing the semi-static keeper in the original design's C-Element with a fully static implementation, as shown in Figure 2.3. This change eliminates contention when asserting the error signal.
Figure 2.3: Modified TDTB
Two new TDTB circuits were analyzed: the OX-TD+EL, comprising only optimization (1); and the SOX-TD+EL, utilizing both optimizations (1) and (2).
Figure 2.4: Glitch Sensitivity Analysis of Modified TDTB
Table 2.2: Curve fit parameters for modified TDTB
             $\alpha$   $\gamma$   $t_d$ (ps)   $R^2$
OX-TD+EL     0.3914     0.2510     95.9         0.976
SOX-TD+EL    0.3499     0.2485     77.9         0.9982
The parameters obtained via curve fitting are provided in Table 2.2. $V_0'$ as a function of $t_{win}$ for these circuits is also plotted in Figure 2.4. As the chart shows, combining optimizations (1) and (2) yields the best glitch sensitivity, 37.6% and 30.3% higher than the latch at $t_{win}$ of 50ps and 100ps, respectively. Moreover, it enables modest reductions in power and energy overheads. Accordingly, a SOX-TDTB presents a 30% reduction in leakage power and 8% reduction in energy per operation compared to the original TDTB, at the cost of only 3 extra transistors.
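As a sanity check, the fitted model in (2.2) can be evaluated directly from the tabulated parameters. The sketch below makes two assumptions that are not stated in the text: the first numeric column of Tables 2.1 and 2.2 is taken to be the exponent of (2.2) and the second the scale factor that replaces the switching threshold, and "higher sensitivity" is read as a proportionally lower threshold height. Under that reading the quoted 37.6% and 30.3% figures are reproduced; the function and dictionary names are illustrative only.

```python
def v0_prime(t_win_ps, exponent, scale, t_d_ps):
    """Threshold pulse height from (2.2): scale * (1 + (t_d / t_win) ** exponent)."""
    return scale * (1.0 + (t_d_ps / t_win_ps) ** exponent)

# (exponent, scale, t_d in ps) copied from Tables 2.1 and 2.2.
# The column-to-parameter mapping is an assumption (see the lead-in).
PARAMS = {
    "Latch": (0.8029, 0.4409, 47.4),
    "4I TDTB": (0.4963, 0.3196, 111.3),
    "6I TDTB": (0.7024, 0.3284, 122.8),
    "OX-TD+EL": (0.3914, 0.2510, 95.9),
    "SOX-TD+EL": (0.3499, 0.2485, 77.9),
}

def improvement_over_latch(name, t_win_ps):
    """Percentage by which the named circuit's threshold is below the latch's."""
    latch = v0_prime(t_win_ps, *PARAMS["Latch"])
    other = v0_prime(t_win_ps, *PARAMS[name])
    return 100.0 * (1.0 - other / latch)

if __name__ == "__main__":
    for width in (50, 100):
        print(width, round(improvement_over_latch("SOX-TD+EL", width), 1))
```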
Chapter 3
Blade Template
3.1 Overview
Figure 3.1: The Blade template (a Blade stage comprising combinational logic, error detecting latches, error detection logic, reconfigurable delay lines δ and Δ, and the Blade controller with its L, R, LE, and RE channels)
The Blade template, as shown in Figure 3.1, uses single-rail logic followed by error detecting latches (EDLs), two reconfigurable delay lines, and an asynchronous Blade controller. The first delay line is of duration δ and controls when the EDL becomes transparent, allowing the data to propagate through the latch. The Blade controller speculatively assumes that the data at the input of the EDL is stable when it becomes transparent and thus sends an output request along the typical bundled data channel L/R. The second delay line, with duration Δ, defines the time window during which the EDL is transparent. If data changes during this window, but stabilizes before the latch becomes opaque, it is recorded as a timing violation, which can subsequently be corrected. Consequently, Δ defines a timing resiliency window (TRW) after δ during which the speculative timing assumption may be safely violated.
In particular, if the combinational output transitions during the TRW, the error detection logic flags a timing violation by asserting its Err signal, which is sampled by the controller. The Blade controller then communicates with its right neighbor using a novel handshaking protocol implemented with an additional error channel (RE/LE) to recover from the timing violation by delaying the opening of the next stage's latch, as will be described in more detail in Section 3.3.
3.2 Error Detection Logic
As illustrated in Figure 3.2, the error detection logic consists of EDLs, generalized C-elements, and Q-Flops [75]. While there are many possible implementations of EDLs (e.g., [15, 18, 70, 74]), we implemented a custom design based on the Transition Detecting Time Borrowing (TDTB) latches proposed in [70], a functional block diagram of which is shown in Figure 3.2. The already low overhead of the TDTB is further reduced by integrating the transition detector into the pass-gate latch circuit, where inherent internal latch delays are repurposed to replace the $t_{TD}$ delay line connected to the XOR gate. The XOR gate itself is also optimized at the transistor level to improve the transition detector's sensitivity [74].
The generalized C-elements in Figure 3.2 are also designed at the transistor level using the flow proposed in [76] and act to temporarily remember violations detected by the EDL during the high phase of CLK. While the input connected to CLK is symmetric, i.e. required for both low-to-high and high-to-low output transitions, the X signal from the EDL feeds a positive asymmetric input, which can only affect low-to-high transitions. Accordingly, the generalized C-element will switch to 0 if CLK is at 0 and to 1 only if both CLK and the X input are at 1. This creates a memory cell that temporarily stores any violation detected by the EDL during the high phase of CLK, i.e. during the TRW. Note that a compensation delay is added by the $t_{comp}$ delay line, the purpose of which will be explained in Section 3.6.
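The switching rule of the generalized C-element can be written out as a purely behavioral model. The following is a minimal sketch of that rule only (no delays, no metastability); the function names are illustrative and do not come from the thesis.

```python
def gc_element(clk, x, prev_out):
    """Behavioral model of the generalized C-element described above.

    CLK is a symmetric input; X is a positive asymmetric input, so it can
    only help the output rise and is ignored when the output falls.
    """
    if clk == 0:
        return 0          # output resets whenever CLK is low
    if x == 1:
        return 1          # CLK high and X high: record a violation
    return prev_out       # CLK high, X low: hold the stored value

def trace(clk_seq, x_seq):
    """Apply gc_element over paired CLK/X samples, starting from output 0."""
    out, outs = 0, []
    for clk, x in zip(clk_seq, x_seq):
        out = gc_element(clk, x, out)
        outs.append(out)
    return outs
```

For example, a pulse on X during the high phase of CLK is remembered until CLK falls, while a pulse on X during the low phase is ignored.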
Figure 3.2: Error detection logic including block level diagram of error detecting latch
Under normal operation, the pulse on X will be sufficiently large to guarantee the output node of the C-element is fully charged, indicating an error has occurred while CLK is high, as outlined in [74]. However, because the data may violate the setup time of the EDLs, the X signal and the C-element may exhibit metastability, as will be further discussed in Section 3.4. To ensure safe operation, this metastability must be filtered out before reaching the main controller. In synchronous designs, the filtering would be handled through multi-stage synchronizers, increasing the latency of error detection dramatically.
In contrast, the output of the C-element in the Blade template is sampled at
the end of the TRW using a Q-Flop, which contains a metastability lter that
prevents the dual rail output signal, Err, from ever becoming metastable, even if
the C-element is in a metastable state.
The Blade controller simply waits for the dual-rail Err signal to evaluate to
determine whether or not an error occurred, gracefully stalling until metastability
is resolved.
To minimize area overheads due to error detection, it is desirable to amortize the cost of the C-elements and Q-Flops across multiple EDLs. As shown in Figure 3.2, a 4-input generalized C-element can combine the X signals of 3 EDLs using parallel inputs such that an error from any of the three EDLs triggers the C-element output to fire. An OR gate can further combine 4 C-elements before reaching a Q-Flop. In this scenario, a single Q-Flop will accurately catch errors and filter metastability from 12 EDLs. Counterintuitively, this added delay provides timing benefits in addition to multifaceted area savings, as will be further explored in Sections 3.6 and 3.6.5.
Note that the C-element's static implementation [53] makes it undesirable to have more than 4 inputs as the PMOS stack grows too large.
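The amortization arithmetic above (3 EDLs per C-element, 4 C-elements per Q-Flop) generalizes to a simple cell count. This is a back-of-the-envelope sketch that assumes every C-element and Q-Flop is filled as far as possible; real grouping would also depend on placement and timing, and the function name is hypothetical.

```python
import math

def detection_cells(num_edls, edls_per_c=3, c_per_qflop=4):
    """Return (C-elements, Q-Flops) needed to cover `num_edls` EDLs,
    assuming cells are packed as densely as the fan-in limits allow."""
    c_elements = math.ceil(num_edls / edls_per_c)
    q_flops = math.ceil(c_elements / c_per_qflop)
    return c_elements, q_flops
```

With the default fan-ins, 12 EDLs need 4 C-elements and a single Q-Flop, matching the scenario in the text.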
To further reduce area and power overheads of the error detection logic, two additional micro-architectural optimizations are considered. First, not every pipeline stage need be error-detecting, and non error-detecting stages can time borrow. Time-borrowing stages permit data to pass through the latch during the entire time it is transparent without flagging any violations. In particular, we found alternating between error-detecting and time-borrowing stages can work well as this effectively halves the overhead of error detection logic while still providing sufficient resiliency. Secondly, we define a stage's critical path as the longest possible input to output path in the combinational logic, which sets the endpoint of the TRW. If another path has delay within the TRW it is said to be "near-critical". Only latches that terminate near-critical paths (note that by definition a critical path is also "near-critical") need be error detecting, further reducing the number of EDLs required in the entire design.
3.3 Speculative Handshaking Protocol
Figure 3.3: Timing diagram of Blade template
The proposed Blade template implements a new form of asynchronous handshaking: speculative handshaking. To understand this protocol, we first introduce the expected behavior of the CLK signals of four Blade stages in a pipeline, shown in Figure 3.3. As Instructions 1 and 2 flow through the pipeline, the arrows indicate the dependency of one clock signal on another. Instruction 1, shown in red, launches from Stage 1 at time zero. While Stage 2's latch is transparent, a timing violation occurs, indicating the delay line in Stage 1 was shorter in duration than the combinational logic path. The rising edge of Stage 3's CLK signal is nominally scheduled to occur δ time units after Stage 2's, shown as the dotted gray region; however, the timing violation extends this time, giving Instruction 1 a total of δ + Δ to pass from Stage 2 to Stage 3. Conversely, Instruction 2 does not suffer a timing violation in Stage 2, which allows Stage 3's CLK signal to activate δ time units after Stage 2's.
Figure 3.4: Speculative handshaking protocol: (a) without extension, (b) with extension
An example of the speculative handshaking protocol that achieves this behavior using two-phase signaling is shown in Figure 3.4. Here, a Blade stage speculatively receives a request and data value on its L channel. The request passes through the δ delay line before reaching the Blade controller while the speculative data propagates through the combinational logic. The Blade controller then checks with the previous stage's controller if the speculative request was sent before the input data was actually stable, i.e., if the previous stage experienced a timing violation. This action is implemented via a second handshake on the pull-channel LE. When no timing violations occur in the previous stage (Figure 3.4a), the LE.req signal is immediately acknowledged by LE.ack, indicating the speculative request was correct and no extension is required. In Figure 3.4b, on the other hand, a timing violation occurs in the previous stage, causing the LE.ack signal to be delayed by Δ time units while the final, committed input data passes through the stage's combinational logic. In both cases this stage is given a nominal delay of δ to process stable data.
In addition, notice that the information of whether a timing violation occurred is not directly transmitted between stages; rather, this information is encoded into the variable response time between LE.req and LE.ack. Additionally, the R.req signal of the controller, not shown in Figure 3.4, is coincident with the arrival of LE.ack, which forces the R channel request to be delayed by Δ as well when an extension is necessary.
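The clocking behavior that this protocol produces can be summarized with a small timing calculation. The sketch below is an abstraction of the description above, not the controller itself: it assumes the next stage's CLK rises δ after the current stage's, plus Δ whenever the current stage flags a timing violation, and it ignores controller and wire delays. The function and parameter names are illustrative.

```python
def clk_rise_times(delta, resiliency_window, violations, start=0):
    """CLK rising-edge times seen by one instruction as it moves down a pipeline.

    violations[i] is True when a timing violation is flagged while the latch
    of the i-th stage (counting from the launching stage) is transparent,
    which postpones the next stage's CLK by the resiliency window.
    """
    times = [start]
    for violated in violations:
        extension = resiliency_window if violated else 0
        times.append(times[-1] + delta + extension)
    return times
```

With `delta=10` and `resiliency_window=3`, a violation in the second of four stages gives rise times `[0, 10, 23, 33]`, the pattern followed by Instruction 1 in Figure 3.3; without the violation the times are `[0, 10, 20, 30]`.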
3.4 Metastability Analysis
3.4.1 Practical Overview
Since the input data may stabilize sometime after the opening of the latch, Blade's susceptibility to metastability (MS) must be examined. MS in the datapath is not a concern as we ensure Δ is set sufficiently large to avoid closing the latch while the datapath is still evaluating. However, certain internal nodes of the error detection logic can become metastable in several different scenarios:
• Scenario M1: A data transition occurring near the rising edge of CLK will cause a pulse on the X output of the EDL to occur before the rising edge of CLK arrives at the generalized C-element. In this case, the C-element may only partially discharge its internal dynamic node, resulting in metastability at the output. Fortunately, the width of the timing window in which this can occur is sufficiently small that timing violations caused by these transitions are short in duration and their impact can be absorbed by the following stage. Consequently, the value to which metastability resolves is not critical and the circuit will work correctly regardless of the value to which the Q-Flop eventually resolves.
• Scenario M2: Late transitions in the datapath can cause pulses on the EDL's X output that are coincident with the falling edge of CLK. Similarly, the rising edge of the C-element's output may coincide with the rising edge of the Q-Flop's sampling signal. Timing violations in this case indicate the datapath is so slow that it exceeds our timing resiliency window, and such circuits should be filtered out during post-fabrication testing.
• Scenario M3: Datapath glitches that occur in the middle of the TRW may also induce metastability in the C-element. However, through careful design of the EDL, these input glitches will only cause glitches on the X output and not the data output [74], i.e. the transition detector is more sensitive to glitches than the data latch itself. Consequently, metastability in this scenario only affects performance but not correctness, just as MS in Scenario M1. Moreover, the probability of entering MS can be reduced by making the generalized C-element more sensitive to glitches than the transition detector.
In rare cases, the output of the Q-Flop will take an arbitrarily long time to resolve due to internal MS. In a robust synchronous design, similar resolution delays translate directly into increased margins or extra clock cycles and synchronizers to wait for this rare occurrence to resolve. However, due to the asynchronous nature of our template, the Blade controller will gracefully wait for the metastable state to resolve before allowing the next stage to open its latch, effectively stalling the stage and ensuring correct operation. This is a significant benefit of asynchronous design which, to the best of our knowledge, cannot be easily approximated in synchronous alternatives.
Figure 3.5: Expected stage delays including metastability
3.4.2 Analytical Approach
To analytically analyze the impact of metastability on performance we look at possible scenarios, as illustrated in Figure 3.5, and create a weighted sum of expected stage delays based on the probability that each scenario will occur. We define an event, met, in which MS has occurred in the error detection logic, and thus the probability of this event as $P_R(\mathrm{met})$. Accordingly, the probability that MS does not occur is then $1 - P_R(\mathrm{met})$.
Metastability scenarios
We define an expected delay associated with each of the nine scenarios. The expected delays of the two MS-free scenarios, highlighted in checkered blue, are easily obtained based on the analysis in Section 3.7.3. The remaining scenarios are divided into two categories: MS occurs in the TDTB's E output only and MS occurs in both the TDTB and Q-Flop. When MS occurs in the TDTB but resolves before the Q-Flop samples its output at time Δ, it should be noted that it is impossible to know whether MS resolved randomly or due to another datapath transition arriving at the TDTB's D input that set the E output to '1'. Therefore, three separate conditions, shown in the red horizontally lined region of Figure 3.5, should be evaluated: i) a new timing violation occurred with probability p; ii) no violation occurred but MS randomly resolved to '1' with 0.5 probability; or iii) no violation occurred and MS resolved to '0' with 0.5 probability. In the first and second conditions, the total stage delay will be δ + Δ, while the last condition has expected delay of δ.
If MS in the TDTB lasts longer than Δ, then the Q-Flop will sample the unknown value and become metastable itself. However, a stable output from the Q-Flop is not required until the R.req signal propagates through the δ delay line and the next stage issues a request on its LE channel, as explained in Section 3.3. This allows up to δ for MS in the Q-Flop to resolve before impacting the performance, shown in the green vertically lined region. Only when MS propagates from the TDTB to the Q-Flop and persists longer than δ does the time to resolve, $t_{MSQ}$, appear in the expected delay value, shown in the purple region.
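As a worked illustration of how one branch of Figure 3.5 is weighted, consider the case in which MS occurs in the TDTB only. Assuming that conditions ii) and iii) each take half of the remaining no-violation probability $1 - p$ (our reading of the three conditions, not a formula quoted from the source), the conditional expected stage delay for this branch is:

```latex
\begin{aligned}
E[\text{delay} \mid \text{MS in TDTB only}]
  &= p\,(\delta + \Delta)
   + (1 - p)\left[\tfrac{1}{2}(\delta + \Delta) + \tfrac{1}{2}\,\delta\right] \\
  &= \delta + \frac{1 + p}{2}\,\Delta .
\end{aligned}
```

Under the same reading, the MS-free branch would contribute $\delta + p\,\Delta$, so in this branch MS costs at most an extra $\Delta/2$ on average.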
Analytical Model
As demonstrated in Section 3.4.1, a transition in the datapath must occur during the $W_1$ time window to induce MS in the error detection logic. Therefore, we can define the probability of event met based on a normal distribution as:
P
R
(met) =
Z
+
W
1
2
W
1
2
N(x;;
2
)dx (3.1)
To analyze the individual components of this probability, we must define the probability that MS does not resolve in a certain amount of time. As shown in [77], this can be defined using two parameters: $t_r$, the time to resolve MS; and τ, a time constant that is derived from simulation of the circuit experiencing MS. Accordingly, we use $t_{MST}$ and $t_{MSQ}$ as the time to resolve MS in the TDTB and Q-Flop, respectively, and $\tau_T$ and $\tau_Q$ as the time constants, respectively. As an example, the probability that MS lasts longer than a time T in the TDTB conditioned on event met occurring is given by:
P
R
(t
MST
Tjmet) =e
C
T
(3.2)
Using the same form as (3.2), the probabilities of each of the branches shown in Figure 3.5 can be derived in a similar fashion. To simplify our results we set the time constants for the C-element and Q-Flop to be equal, i.e. $\tau_T = \tau_Q = \tau$. Taking all conditions into consideration and assuming delays are normally distributed, the expected delay per stage can then be calculated as:
E[delay] = (ab + 1) +a
1p(c
2
a
)
2
b
+
ab
(3.3)
where
a =Q
W
1
2
Q
+
W
1
2
(3.4)
b =e
(3.5)
c =e
(3.6)
The Q function in (3.4) is a well-known function that computes the area under
the tail of a normal distribution beyond a given value. The difference
between two Q functions is therefore the probability of landing in the interval
between the two arguments, in our case μ ± W_1/2. To quantify the impact of MS, we
look at the throughput ratio, defined as the expected delay with MS (3.3) divided
by the nominal delay (3.18), versus variation. Here we set μ = 1 and δ, p, and Δ
according to the analysis presented in Section 3.7.3. The time constant τ and MS
window W_1 can be derived from either SPICE simulation or, more accurately, from
a physical circuit, as shown in [78], where the authors obtained τ = 3 and W_1 =
0.07 using an older process. As an example, using these values we can compute
that the expected impact on throughput for normally distributed data delays with
σ/μ of 0.1, 0.2, and 0.3 is 1.5%, 1.1%, and 0.9%, respectively. In addition, modern
processes will tend to feature a larger τ, smaller W_1, and greater variation due to
PVT and unbalanced propagation delays, further reducing the performance impact
of MS [79]. In other words, we conclude that it is reasonable to use (3.18) directly
to model performance because the impact on stage delay due to MS is exceedingly
small.
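As a rough numerical illustration of the window probability discussed above, the
following Python sketch evaluates the area of a normal distribution that falls inside
a window of width W_1. It is only a sketch under stated assumptions: the window is
assumed to be centred on the mean of the delay distribution, μ = 1 and W_1 = 0.07 are
taken from the text, and the function name p_metastable is hypothetical.

```python
from statistics import NormalDist


def p_metastable(mu: float, sigma: float, w1: float) -> float:
    """Area of N(mu, sigma^2) inside a window of width w1.

    Assumption: the window is centred on mu.
    """
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + w1 / 2) - dist.cdf(mu - w1 / 2)


# mu = 1 and W1 = 0.07 follow the text; sigma/mu is swept over the three ratios.
for ratio in (0.1, 0.2, 0.3):
    prob = p_metastable(mu=1.0, sigma=ratio, w1=0.07)
    print(f"sigma/mu = {ratio:.1f}: window probability = {prob:.4f}")
```

The window probability shrinks as σ/μ grows, which is in line with the decreasing
throughput impact reported for larger variation.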
Figure 3.6: Petri Net description of Blade Controller
3.5 Blade Controllers
3.5.1 Petri Net Model
A Petri Net (PN) is a common method to describe controllers for synthesis. PNs
can be formally analyzed for correctness and delay sensitivity. PNs can also be
synthesized to library gates and C-Elements using well-known methods and tools
[80].
The PN in Figure 3.6 shows just one of many possible realizations of the Blade
controller. Unlabeled transitions are internal states and shown only for completeness.
Places with delay due to the Blade protocol are labeled with that delay, while
unlabeled places have no additional delay. The place between LE.req and LE.ack is
labeled "0 or Δ" to indicate the environment's variable delay of acknowledging the
request on the extend channel, which is dependent on an error occurring in the
previous stage, as described in Section 3.3.
3.5.2 Burst Mode Implementation
Figure 3.7: Burst-mode state machines for Blade controller with error detection
(channel signals: L.req, L.ack, LE.req, LE.ack, R.req, R.ack, RE.req, RE.ack;
datapath signals: Err[1], Err[0], Sample, CLK; internal signals: goL, goR, goD,
delay, edi, edo; delay lines: δ, Δ)
The Blade controller can also be implemented as a set of three interacting
Burst-Mode state machines [81] and synthesized using the tool 3D [82].
Figure 3.7 shows these state machines for pipeline stages with EDLs. Note
that intermediate signals goL, goR, and goD are communication signals between
the three individual state machines, and signals delay, edi, and edo are used to
add the delay line into the controller. For simplicity, the delay line is duplicated
between CLK→delay and edo→edi.
Extending this controller to a token version enables generating an output request
after reset. In addition, simplified versions for stages without error detection
logic can be generated from the more general specification, creating four distinct
Blade controllers. Reset is added manually to the unmapped netlists and the design
can then be mapped to a standard cell library. For all cases, the implicit fundamental
mode timing assumption [81] was validated using a simulation environment
with random environmental delays.
3.5.3 Click Implementation
Improved Blade Protocol
The controller in Figure 3.7 can be further simplified by making a few observations
about Blade's timing and handshaking signals². In particular, when the stage
delay between controllers matches or exceeds the timing resiliency window, the
handshake on the error channel becomes enclosed by the normal handshake on
Rreq/Rack and Lreq/Lack, as shown in Figure 3.8.
Figure 3.8: Handshaking between two controllers (A and B) when TRW is shorter
than stage delay
Because controller A waits for a transition on its REreq before continuing and
controller B waits for a transition on its LEack before opening its latches, there
is no need for this extra communication. Controller A can be simplified to send a
request only after it has determined whether an error has occurred, reducing the
number of signals between stages to two, i.e. similar to a typical bundled data
channel. To compensate for the difference between the stage delay and the timing
resiliency window, the delay line between controllers will be shortened to δ − Δ.
In the case of no error, the request is sent after Δ time units, passes through the
δ − Δ delay line, and arrives at Controller B exactly δ time units after Controller
A received the token. When an error occurs, Controller A will delay the request
an additional Δ time units, pushing the arrival of the token at Controller B to
δ + Δ, just as the Blade specification dictates.
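The arrival times in the improved protocol can be checked with a few lines of Python.
This is a minimal sketch that assumes the reading of the protocol given above (request
sent after Δ, or 2Δ on an error, followed by a delay line of δ − Δ); the function name
and the example values are hypothetical.

```python
def token_arrival(delta: float, trw: float, error: bool) -> float:
    """Time from Controller A receiving a token to Controller B seeing the request."""
    send_time = trw + (trw if error else 0.0)  # A waits one window, or two on an error
    delay_line = delta - trw                   # shortened delay line between controllers
    return send_time + delay_line


delta, trw = 1.0, 0.25  # hypothetical stage delay and resiliency window
print(token_arrival(delta, trw, error=False))  # nominal arrival: delta
print(token_arrival(delta, trw, error=True))   # penalised arrival: delta + trw
```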
² The simplified Blade protocol was independently derived by myself and Moises Herrera.
Consequently, the reduction in handshaking signals enables a simplification of
controller logic. For example, the three burst mode machines of Figure 3.7 can
be combined into a single machine, but the implementation details are outside the
scope of this thesis. Instead, the improved Blade protocol will be realized in a
Click controller.
Design
While burst mode controllers may have area and power benefits, they can be
problematic to handle in modern synchronous-based CAD tools. In particular, the
internal combinational logic self-loops are incompatible with the timing engines
used in synthesis, timing analysis, and place and route tools. These loops must be
broken either by the user after careful analysis and consideration or automatically,
which can often result in confusing or misleading results. Therefore, an alternative
implementation is presented based on the Click template [55], which incorporates a
traditional sequential flip-flop in the controller. The flip-flop provides a natural break
point in the control path that the timing engines recognize and are optimized to
handle. These controllers are therefore more suitable for use in automatic synthesis
and place and route tools.
The Blade Click controller, shown in Figure 3.9, builds off the improved burst
mode implementation in Sec. 3.5.3 by also eliminating the Error channels (i.e. LE
and RE). An AND-OR gate collects the input signals from the left (L.req) and
right channels (R.ack) as well as the controller's internal state (S0) to generate
the clock signal that drives the sequential gate, which in this case is a standard D
flip-flop. Note that the two halves of the AND-OR structure are complementary, as
the click controller is a two-phase controller, yet a complete clock pulse (rising and
falling edges) must be generated for every token. To implement the error channel's
handshake, sample is generated by inverting the clock output, and Err[0]/Err[1]
drive a larger AND-OR structure which controls a second state flop. In practice, a
small asymmetric delay line may be necessary on sample to meet timing constraints
in the error detection logic, as explained in Section 3.6. This second click element
is necessary to achieve Blade's goal of metastable safe operation, i.e. when the
Q-Flop enters metastability and Err[0] and Err[1] are both held low, the second
click element will prevent a request from being sent to subsequent stages for as long
as is necessary to resolve the metastable condition.
Figure 3.9: Click Controller Implementation
Both d1 and d2 delay lines should be roughly set to Δ. d1 sets the high
pulse width of CLK while d2 sets the error penalty. Note how d2 is activated
independently of Err[0] or Err[1]. This allows faster operation in most cases, but
does introduce a timing assumption, i.e. in the no error case, a new transition
on d1's output cannot occur until the previous transition has passed through the
entire d2 delay line. There is a similar timing assumption around the S0 flop,
i.e. the next transition on L.req should not arrive before L.ack's transition has
propagated through the AND-OR structure and created a sufficient low pulse on
CLK. In practice, this is easily met as the L.req transition is dependent on L.ack
being received at the previous controller and there is a delay line between R.req
and L.req.
This particular controller implementation is roughly equivalent in area to a
burst mode controller implementing the improved Blade protocol, and it was ulti-
mately chosen as the best option for the test cases in Chapter 5 due to the CAD
tool compatibility benefits.
3.6 Timing Constraints
The datapath in Blade most closely resembles a standard time borrowing de-
sign [83]. However, the introduction of error detecting stages as well as the error
detection logic itself alters these constraints making the analysis of Blade timing
constraints similar to that of Bubble Razor [15].
The annotated timing diagram of the CLK, X, and D signals for a single error
detecting Blade stage in Figure 3.10 shows the overheads associated with the error
detection logic. The delay through the error detection logic is comprised of five
components: (i) propagation delay from D to X of the EDL, t_{X,pd}; (ii) output pulse
width of pin X, t_{X,pw}; (iii) C-element propagation delay, t_{CE,pd}; (iv) Q-Flop setup
time, t_{QF,setup}; and (v) propagation delay of the OR gate between the C-elements
and Q-Flop, t_{OR,pd}.
Note that t_{X,pd} and t_{X,pw} would enforce a large setup time before the EDL
becomes transparent to ensure a transition before the rising edge of CLK is not
flagged as a timing violation. Similarly, another t_{X,pd} and t_{X,pw} would decrease the
time allowed for detecting violations near the end of the clock pulse. Therefore, a
small compensation delay t_{comp} = t_{X,pd} + t_{X,pw} is added to the CLK input of the
C-element, as seen in Figure 3.2, to shift the timing resiliency window and prevent
unintended errors on the rising edge and missing errors on the falling edge.
Figure 3.10: Timing constraints in Blade (waveforms CLK, X, and D, annotated with
t_{X,pd}, t_{X,pw}, and t_{CE,pd} + t_{OR,pd} + t_{QF,setup})
3.6.1 Timing Resiliency Window
The actual size of the timing resiliency window is affected by each of the error
detection logic delays. In particular, the TRW can be defined as:
TRW = \Delta + t_{X,pw} - (t_{CE,pd} + t_{OR,pd} + t_{QF,setup})    (3.7)
Note that t_{X,pd} impacts the TRW in two ways: positively for transitions occurring
near the rising edge of the CLK and negatively for transitions at the falling edge.
Hence this term cancels out in (3.7).
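The following Python sketch evaluates the window size for one set of delays. It
assumes the reading TRW = Δ + t_{X,pw} − (t_{CE,pd} + t_{OR,pd} + t_{QF,setup});
the function name and the picosecond values are hypothetical and are not taken from
the thesis.

```python
def timing_resiliency_window(clk_pulse: int, t_x_pw: int, t_ce_pd: int,
                             t_or_pd: int, t_qf_setup: int) -> int:
    """Usable window: clock pulse width plus X pulse width minus detection overheads."""
    return clk_pulse + t_x_pw - (t_ce_pd + t_or_pd + t_qf_setup)


# Hypothetical delays in picoseconds.
trw = timing_resiliency_window(clk_pulse=300, t_x_pw=40,
                               t_ce_pd=30, t_or_pd=25, t_qf_setup=20)
print(f"TRW = {trw} ps")
```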
3.6.2 Propagation Delay
When using the optimizations described in Section 3.2, there are three potential
logic path end points. First, pipeline stages that do not have error detection
use regular latches that allow time borrowing. Second, latches in error detecting
pipeline stages that are not on near-critical paths are not converted to EDLs and
have constraints similar to flops. Finally, the EDLs in error detecting stages are
the end points for paths with delay longer than δ.
For paths ending at non-error detecting stages, the maximum propagation delay
is simply:
t_{pd,TB} \leq \delta - t_{latch,CQ} + t_B    (3.8)
t_B \leq \Delta - t_{latch,setup}    (3.9)
where t_{latch,CQ} is the clock to Q delay of the source latch, t_{latch,setup} is the setup
time of the sink latch, and t_B is the time borrowed from the subsequent stage³.
(3.9) gives the maximum time borrowing allowed for an alternating set of non-error
detecting and error detecting stages.
For paths ending at non-error detecting latches in an error detecting stage, the
propagation delay is also straightforward:
t_{pd,NE} \leq \delta - t_{latch,CQ} - t_B    (3.10)
Note that latch setup time is not included in this constraint because the data is
arriving at the rising edge of clock, i.e. when the latch becomes transparent. t_B
represents the time borrowed, if any, from the source latch by the preceding stage.
Finally, the propagation delay of paths ending at EDLs can be derived as:
t_{pd,E} \leq \delta + TRW - t_{latch,CQ} - t_B    (3.11)
³ These equations assume that each stage can borrow the maximum amount of Δ, which
occurs when time borrowing and non-time borrowing stages are alternated. See [83] for the more
general time borrowing constraints.
where TRW is defined as in (3.7). Note that latch setup time does not appear
here either as the requirement to meet the TRW is always stricter than the latch's
setup time.
Given these maximum propagation delays are met, the total cycle time of the
interleaved time borrowing / error detecting configuration is thus 2δ in the no error
case⁴ and 2δ + Δ when a timing violation occurs. The overheads associated with
controllers, which may decrease throughput, are discussed in Section 3.6.4.
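The three path constraints can be collected into a small Python helper. This is a
sketch only: it assumes the constraints read t_{pd,TB} ≤ δ − t_{latch,CQ} + t_B,
t_B ≤ Δ − t_{latch,setup}, t_{pd,NE} ≤ δ − t_{latch,CQ} − t_B, and
t_{pd,E} ≤ δ + TRW − t_{latch,CQ} − t_B, and the class name, field names, and
picosecond values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BladeStageTiming:
    delta: int          # nominal stage delay
    pulse: int          # clock pulse width (error penalty)
    trw: int            # timing resiliency window
    t_latch_cq: int     # clock-to-Q delay of the source latch
    t_latch_setup: int  # setup time of the sink latch

    def max_borrow(self) -> int:
        return self.pulse - self.t_latch_setup

    def max_pd_time_borrowing(self, t_b: int) -> int:
        return self.delta - self.t_latch_cq + t_b

    def max_pd_non_edl(self, t_b: int) -> int:
        return self.delta - self.t_latch_cq - t_b

    def max_pd_edl(self, t_b: int) -> int:
        return self.delta + self.trw - self.t_latch_cq - t_b


# Hypothetical values in picoseconds.
stage = BladeStageTiming(delta=1000, pulse=300, trw=265, t_latch_cq=50, t_latch_setup=40)
t_b = stage.max_borrow()
print("max borrow:", t_b)
print("path ending at time-borrowing latch:", stage.max_pd_time_borrowing(t_b))
print("path ending at non-EDL latch:", stage.max_pd_non_edl(t_b))
print("path ending at EDL:", stage.max_pd_edl(t_b))
print("cycle time without / with error:", 2 * stage.delta, 2 * stage.delta + stage.pulse)
```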
3.6.3 Contamination Delay
The Blade controller enforces a condition that latches of neighboring stages cannot
be transparent at the same time, which provides significant hold time margin.
When including the clock tree delays, t_{CLK,pd}, the hold time constraint between
two stages is:
t_{cd} \geq (t_{CLK_R,pd} - t_{CLK_L,pd}) - t_{Rack to Lclk}    (3.12)
where L and R represent two neighboring stages and t_{Rack to Lclk} is the delay from
R's controller generating an acknowledgement signal to L's controller raising its
clock signal. In practice, t_{Rack to Lclk} is around 4 gate delays, making the bound on
t_{cd} small or even negative for balanced local clock trees. This is in contrast to many
resiliency schemes which exacerbate hold time issues (e.g. [15]).
⁴ This assumes the two stages are equally divided in time (i.e. δ + δ), which is only done for
simplicity and not a requirement. Delay between stages can be balanced as desired, but meeting
some constraints may become difficult.
Furthermore, t_{Rack to Lclk} can become tunable by adding an additional delay line
on the acknowledgement paths between controllers. This can be desirable in highly
variable environments or when variability cannot be easily estimated a priori.
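A short Python sketch of the hold check follows. It assumes the constraint reads
t_{cd} ≥ (t_{CLK_R,pd} − t_{CLK_L,pd}) − t_{Rack to Lclk}; the function name and the
picosecond values are hypothetical.

```python
def min_contamination_delay(t_clk_r_pd: int, t_clk_l_pd: int, t_rack_to_lclk: int) -> int:
    """Smallest contamination delay that still satisfies the hold constraint."""
    return (t_clk_r_pd - t_clk_l_pd) - t_rack_to_lclk


# Nearly balanced clock trees: the bound is negative, so any path satisfies it.
print(min_contamination_delay(t_clk_r_pd=80, t_clk_l_pd=60, t_rack_to_lclk=40))
# Heavily skewed clock trees: a positive minimum delay is needed.
print(min_contamination_delay(t_clk_r_pd=150, t_clk_l_pd=60, t_rack_to_lclk=40))
```

Lengthening the acknowledgement path, as suggested above, corresponds to raising
t_rack_to_lclk, which lowers the bound.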
3.6.4 Hiding Handshaking Overhead
The Blade controller itself of course has inherent logic delays that can impact overall
system performance; however, nearly all of these delays can be hidden if properly
handled. For simplicity, the modified Blade protocol of Section 3.5.3 will be analyzed
here, but the process would apply equally to the generic template.
• t_{Lreq to clk} is the delay between a controller receiving a request and raising its
  clock signal. As long as this delay does not exceed δ − Δ, it can be completely
  hidden by reducing the delay line between controllers, i.e. a delay line of
  length δ − Δ − t_{Lreq to clk} rather than δ − Δ.
• t_{Lreq to Lack} represents the time necessary to produce an acknowledgement
  signal for the previous stage. For the controllers implemented in this thesis,
  this value should be equal to Δ to preserve non-overlapping phases between
  controllers. On its upper-bound, this value cannot exceed 2δ, which provides
  sufficient margin in most applications as Δ < δ.
• t_{clk to Rreq} encapsulates the delay through a Blade stage's error detection logic,
  as Rreq is dependent on Err1 or Err0. In the non-error case, this value
  should thus be smaller than the post-compensated delay line between controllers,
  i.e. t_{clk to Rreq} ≤ δ − Δ − t_{Lreq to clk}, to ensure it is hidden and does not affect
  performance.
• t_{ack to req} is the controller's internal delay from receiving an acknowledgement
  to processing the next request. As previously mentioned, the acknowledgement
  in Blade will normally arrive sufficiently earlier than the next request,
  hiding this small overhead, which is only 1-2 gate delays in many controller
  implementations.
Out of these delays, t_{clk to Rreq} will pose the biggest challenge as it contains
the entire error collection tree in addition to the TRW. In applications where
it becomes impossible to hide this delay, the total cycle time will increase to
δ + t_{clk to Rreq} + t_{Lreq to clk}. This is due to the fact that t_{clk to Rreq} and
t_{Lreq to clk} are both normally hidden by the nominal delay between stages. Without
that overlapping delay, both quantities are observed and additive.
3.6.5 Maximum Timing Resiliency Window
To compute the maximum width of the timing resiliency window, TRW_{max}, we
first define a few additional delays:
• t_{QF,pd}: the nominal propagation delay from the sample input to the outputs
  of the Q-Flop without metastability.
• t_{ET,pd}: the maximum propagation delay of the AND and OR trees that
  collect the individual dual-rail error signals from the Q-Flops.
To find TRW_{max}, it is also helpful to first define Δ_{max}, the maximum clock
pulse width for a Blade stage. Because opening the latch of one stage depends
on checking if an error occurred in a previous stage, Δ cannot be equal to δ
and still achieve the expected cycle time including overheads. Therefore, Δ_{max} is
conservatively set as:
\Delta_{max} = \delta - t_{ET,pd} - t_{QF,pd} - t_{Err[0] to clk}    (3.13)
where t_{Err[0] to clk} is the internal controller delay from receiving Err[0] in one
controller to raising the clock signal in the subsequent stage. Combining (3.7) and
(3.13) we find:
TRW_{max} = \delta - t_{ET,pd} - t_{QF,pd} - t_{Err[0] to clk} + t_{X,pw} - (t_{CE,pd} + t_{OR,pd} + t_{QF,setup})    (3.14)
In some cases, a large TRW may not be ideal and setting it to 20-30% may be
sufficient, as was done in [15]. In addition, reasonable estimates of t_{CE,pd} and
t_{QF,setup} in a modern process are on the order of tens of ps. However, the magnitudes
of t_{ET,pd} and t_{OR,pd} depend on multiple factors, including the number of EDLs per
stage and the degree to which the EDLs are amortized across Q-Flops. This
presents an interesting optimization problem in which reducing the number of
EDLs may also maximize the potential performance of the design.
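The bound on the window can be evaluated with the short Python sketch below. It
assumes the readings Δ_max = δ − t_{ET,pd} − t_{QF,pd} − t_{Err[0] to clk} and
TRW_max = Δ_max + t_{X,pw} − (t_{CE,pd} + t_{OR,pd} + t_{QF,setup}); the function
names and the picosecond values are hypothetical.

```python
def max_clock_pulse(delta: int, t_et_pd: int, t_qf_pd: int, t_err0_to_clk: int) -> int:
    """Largest clock pulse width that still leaves room for error collection."""
    return delta - t_et_pd - t_qf_pd - t_err0_to_clk


def max_trw(delta: int, t_et_pd: int, t_qf_pd: int, t_err0_to_clk: int,
            t_x_pw: int, t_ce_pd: int, t_or_pd: int, t_qf_setup: int) -> int:
    """Largest timing resiliency window for the given overheads."""
    pulse = max_clock_pulse(delta, t_et_pd, t_qf_pd, t_err0_to_clk)
    return pulse + t_x_pw - (t_ce_pd + t_or_pd + t_qf_setup)


# Hypothetical values in picoseconds; a deeper error tree (larger t_et_pd) shrinks both.
print(max_clock_pulse(1000, t_et_pd=120, t_qf_pd=60, t_err0_to_clk=50))
print(max_trw(1000, 120, 60, 50, t_x_pw=40, t_ce_pd=30, t_or_pd=25, t_qf_setup=20))
```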
3.6.6 Q-Flop Setup and Hold
While meeting setup and hold on the Q-Flop is not functionally necessary, ignoring
these constraints could impact system performance by inducing false timing
violations or creating unnecessary metastable events. The Q-Flop is triggered on
sample's rising edge, which is generated by CLK's falling edge. As in a typical
setup constraint, sample must arrive at the Q-Flop sufficiently after data stabilizes,
which gives:
t_{CLK_f to sample_r} \geq t_{CE,pd} + t_{OR,pd} + t_{QF,setup}    (3.15)
Likewise, the data value must be held at the Q-Flop:
t_{QF,hold} \leq t_{comp} + t_{CE,pd} + t_{OR,pd}    (3.16)
Note how (3.16) exemplifies t_{comp} easing margins, in this case increasing the hold
margin by delaying the clock signal into the C-Element.
3.7 Performance Modeling
3.7.1 Delay distributions
Delay variations in the datapath can be attributed to three main sources: global
variation, local variations, and data dependency. It is common to model random
local and global variations in circuits using normal distributions. However, it has
been shown that heavy tail distributions, such as log-normal, are more suitable
in near-threshold domains [84, 85]. Therefore, we analyze both normal and log-
normal distributions with the proposed performance model. Data dependency,
on the other hand, cannot be as well dened; it is determined by many factors,
including architectural description, logic synthesis, and input data.
To simplify the analysis and abstract the various sources of variation, it is desir-
able to consider a single delay distribution. According to [85,86], it is reasonable to
represent the sum of two normal or log-normal random variables as another normal
or log-normal random variable, respectively. In this way, the analyses presented in
this paper are based on combined distributions with a σ/μ that can be considered
to encompass all sources of variation.
3.7.2 Systematic error rate
In both the normal and log-normal distributions, there is a non-zero probability of
experiencing an infinitely large delay value, i.e. it is impossible to set a traditional
clock cycle time that would catch all variations with 100% probability. Therefore,
a notion of Systematic Error Rate (ξ) must be introduced to define an upper bound
on the worst case performance of the circuit. ξ sets an acceptable amount of errors
that may be allowed during operation of the circuit, which is typically a very small
value, e.g. in [87] the authors assume a value of 0.1%. For traditional circuits, ξ is
calculated as:
\xi = 1 - [P_R\{D \leq C\}]^N    (3.17)
where D is a random variable representing the delay of the worst case path
between two sequential elements, C is the clock period, P_R(x) is defined to be the
probability of event x occurring, and N is the number of stages in the circuit.
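For normally distributed delays, the systematic error rate and the clock period that
meets a target rate can be computed as in the Python sketch below. It assumes that
the same delay distribution applies independently to each of the N stages; the
function names and the example values are hypothetical.

```python
from statistics import NormalDist


def systematic_error_rate(clock_period: float, mu: float, sigma: float,
                          n_stages: int) -> float:
    """xi = 1 - P(D <= C)^N for a normal delay distribution."""
    p_ok = NormalDist(mu, sigma).cdf(clock_period)
    return 1.0 - p_ok ** n_stages


def clock_period_for(xi: float, mu: float, sigma: float, n_stages: int) -> float:
    """Smallest clock period whose systematic error rate equals xi."""
    p_ok_per_stage = (1.0 - xi) ** (1.0 / n_stages)
    return NormalDist(mu, sigma).inv_cdf(p_ok_per_stage)


# Hypothetical example: sigma/mu = 0.1 and a target xi of 0.1%.
for n in (1, 4):
    c = clock_period_for(xi=0.001, mu=1.0, sigma=0.1, n_stages=n)
    print(f"N = {n}: C = {c:.4f}, xi = {systematic_error_rate(c, 1.0, 0.1, n):.6f}")
```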
3.7.3 Optimal Blade Performance
Performance model
Figure 3.11: Logic delay distribution model (probability versus logic delay, with
δ and δ + Δ marked on the delay axis and the probability of a timing violation shaded)
There are two main timing parameters of Blade: the δ and Δ delay lines, where
Δ sets the length of the TRW. Compared to a traditional synchronous circuit with
clock period C, we can set C = δ + Δ. Therefore, a trade off in setting these
values emerges as decreasing δ allows the system to operate faster if no timing
violations (errors) occur; however, the shorter stage-to-stage delay means that
more transitions will occur while the latch is transparent, thereby increasing the
frequency of errors that force subsequent pipeline stages to be delayed by the now
larger Δ value, as C remains constant. To quantify this optimization problem,
consider a delay distribution of a combinational logic block between two latches
as shown in Figure 3.11. The area of the blue vertically-lined region represents
the probability that an error occurs at a previous output latch, defined as p, such
that the effective delay of this pipeline stage is δ + Δ. The area of the green
horizontally-lined region is thus 1 − p. We propose to model the performance of
a pipeline stage as a discrete two-valued distribution, which yields the following
equation for average delay of a Blade stage:
\bar{d} = \delta + p\Delta    (3.18)
The optimal performance of simple structures, such as N-stage rings, occurs
when each stage's average-case delay is minimized, i.e. when \bar{d} = \bar{d}_{min}.
Furthermore, in practice this equals the effective cycle time (EC) of the design, as
introduced in [87]. In this way, the asynchronous and synchronous implementations
can be compared directly by their ECs, where EC = C for traditional synchronous
designs.
Normal and log-normal distributions
We explore normal and log-normal distributions to analyze the impact on performance
due to differences in distributions. Both are defined by two parameters:
μ, the mean; and σ, the standard deviation. To generalize our analysis for any
possible delay values, we define a distribution by its σ/μ ratio instead of these
individual components.
The probability of an error occurring depends on the chosen TRW and the
expected ξ. Setting ξ according to Section 3.7.2, we can then sweep the TRW
from 0 to to plot the EC using (3.18).
The results for both normal and log-normal distributions at three different σ/μ
values are shown in Figure 3.12. In Blade, the TRW is limited to
2
; however, the
proposed model can obtain the expected performance at all TRWs.
This difference is denoted by the colored lines turning gray at the maximum
TRW. The optimal EC is obtained by finding the minimum of these curves; from
that, the optimal error rate, p_{opt}, can be trivially derived. By varying the σ/μ ratio
and recomputing the minimum EC and p_{opt}, the relationship between variation and
optimal error rate can be plotted, as shown in Figure 3.13.
Interestingly, for normal distribution p_{opt} appears to be constant as variation
increases. In fact, p_{opt} of normal distribution is independent of both μ and σ [88].
Figure 3.12: Normalized expected cycle time versus size of timing resiliency window
for Normal and Log-Normal Distributions (panels: Normal, Log-Normal; axes: TRW (%)
versus EC (%); curves: σ/μ = 0.10, 0.20, 0.30)
Figure 3.13: p_{opt} versus variation and systematic error rate (axes: p_{opt} (%) versus
σ/μ and p_{opt} (%) versus ξ; curves: normal, log-normal)
This means that, given solely an expected ξ and knowledge that the distribution
of delays resembles a normal curve, the optimal performance can be obtained by
tuning the circuit to always achieve a pre-defined error rate.
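The sweep described in this section can be sketched in Python as follows. This is a
minimal illustration under stated assumptions, not the model used to generate the
figures: it assumes a single stage (N = 1 in (3.17)), a normal delay distribution,
p taken as the probability that the logic delay exceeds δ, and C = δ + Δ held
constant. The function name and grid resolution are hypothetical.

```python
from statistics import NormalDist


def optimal_trw(mu: float, sigma: float, xi: float, steps: int = 20000):
    """Sweep the window from 0 to C and return (C, best EC, best window, p at best)."""
    dist = NormalDist(mu, sigma)
    c = dist.inv_cdf(1.0 - xi)          # synchronous clock period for one stage
    best_ec, best_trw, best_p = float("inf"), 0.0, 0.0
    for i in range(steps + 1):
        trw = c * i / steps             # candidate window (error penalty)
        delta = c - trw                 # nominal stage delay, since C = delta + window
        p = 1.0 - dist.cdf(delta)       # probability that the logic is slower than delta
        ec = delta + p * trw            # average stage delay from the two-valued model
        if ec < best_ec:
            best_ec, best_trw, best_p = ec, trw, p
    return c, best_ec, best_trw, best_p


for ratio in (0.1, 0.2, 0.3):
    c, ec, trw, p_opt = optimal_trw(mu=1.0, sigma=ratio, xi=0.001)
    print(f"sigma/mu = {ratio:.1f}: EC/C = {ec / c:.3f}, TRW/C = {trw / c:.3f}, "
          f"p_opt = {p_opt:.3f}")
```

Under these assumptions the minimizing error rate should come out the same for every
σ/μ, up to the grid resolution, because the optimality condition depends only on how
many standard deviations C lies above the mean, which is fixed by ξ.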
3.7.4 Performance impact of non-ideal effects
Robustness to delay line accuracy
In Blade, the δ and Δ delays are typically implemented using simple delay lines
composed of inverters or buffers, which imposes a limit on the accuracy of the
delay line. In other words, the total delay of the delay line may be up to one gate
delay off from the ideal value. Even if the delay lines are tunable, there will
still be a quantization of the delay line such that the ideal delay is unobtainable.
To quantify the impact, the variation in δ versus the resulting variation in EC is
plotted in Figure 3.14. For a 10% variation in δ, we only see a 6.3% and 4.7% change
in performance for normally and log-normally distributed delays, respectively. At
30% variation, the impact drops to 2.3% and 1.3%, respectively.
Figure 3.14: Effect of delay line quantization on the Expected Cycle time for normal
and log-normal distributions with σ/μ = 0.1, 0.2, and 0.3 (panels: Normal, Log-Normal;
axes: change in δ (%) versus change in EC (%))
Chapter 4
Blade Design Flow
4.1 Flow Overview
Figure 4.1: Blade design flow (inputs: RTL Specification, Synch Library, Custom Blade
Cells; steps: Synchronous Synthesis, FF to Latch Conversion, Retiming, Resynthesis,
Latch to EDL Conversion + Controller Insertion, Physical Design; intermediate netlists:
FF-Based Design, Master-Slave Latch-Based Design, Balanced Latch-Based Design,
Resiliency-Aware Optimized Latch-Based Design)
An automated flow to convert single CLK domain synchronous RTL designs to
asynchronous Blade using industry standard tools, including DesignCompiler and
PrimeTime from Synopsys (for synthesis and STA) and NC-Sim from Cadence
(for simulation), was developed to analyze the benefits of the proposed template
on a variety of case studies, as described in Chapter 5. The flow consists of
various Tcl and shell scripts, a library of custom cells, and a Verilog co-simulation
environment for verification and analysis that are wrapped in a Makefile system,
which provides multiple configuration knobs to control the synthesized frequency,
TRW, compensation for overheads, and other aspects of the design. This work
targeted a 28-nm FD-SOI process, but the majority of scripts are process agnostic
and process design kits (PDKs) can be interchanged. The flow has 5 main steps:
1) Synchronous Synthesis: The synchronous RTL is synthesized to a flip-flop
(FF) based design at a given clock frequency with preset I/O delays and
output load values.
2) FF to Latch Conversion: The FFs are converted to master-slave latches
by synthesizing the design using a fake library of standardized D-Flip Flops
(DFFs) that can be easily mapped to standard-cell latches.
3) Latch Retiming: The latch-based netlist is then retimed using a target
TRW that defines the maximum time borrowing allowed, where the combined
path delay constraint of any two stages equals the given clock period.
4) Resynthesis: The retimed netlist is then resynthesized to optimize the expected
area and performance of the final resilient netlist, as will be described
in Section 4.1.2.
5) Blade Conversion: The resynthesized latch-based netlist is then converted
to the Blade template by removing clock trees and replacing them with Blade
controllers. The control logic, delay lines, and error detection logic are also
inserted to create a final Blade netlist.
The final Blade netlist is validated via co-simulation with the synchronous netlist
from step 1 to verify correct operation and measure performance. In particular,
to verify correct operation the stream of inputs is forked to both the synchronous
and Blade netlists and the stream of outputs is compared.
4.1.1 Handling Macros
In many designs there may be logic blocks that are either implemented using hard
macros or would be problematic to convert to the Blade template directly. Therefore,
it is beneficial to capture errors at the inputs to these cells and ensure the
timing for the macro is satisfied at the ideal target clock frequency, i.e. the given
clock period minus the TRW. Fortunately, an important advantage of asynchronous
design is that we can add new pipeline stages to the design without changing
functionality. For Blade, we take advantage of this feature by adding an error-detecting
pipeline stage at the input of the macro controlled by a non-token-buffer pipeline
controller. These controllers only pass tokens through the system; unlike token
controllers, they do not generate tokens on reset. Therefore, the functional behavior
of the design is unchanged. In synchronous designs, this would not be possible
without major architectural modifications as adding a pipeline stage changes the
functionality greatly.
As an example of this process, the Plasma CPU contains a 32 entry register file
(RF) that can be implemented using a memory generator or synthesized directly
as 32 flip-flops per register. It is not uncommon for either the input or output of
the RF to be on a critical path in the CPU; however, it is often the case that the
majority of this critical delay occurs outside of the macro boundary (e.g. an ALU's
result being stored into the RF). With Blade, if a near-critical path ends at the
RF, all internal registers would need to be converted to EDLs, resulting in large
area overheads. But we can exploit the fact that the decoding logic inside the RF
macro is quick in comparison to the rest of the input path by adding a non-token
Blade stage on the data and address inputs to the RF. We therefore achieve the
same resiliency benefits while reducing the number of EDLs drastically without
changing the macro itself; for a 32-bit RF, only 37 EDLs are required when placed
at the input (32 for data, 5 for address) instead of 1024 when the internal flops are
converted to EDLs. The nominal datapath delay from the added error detecting
Blade stage, through the RF, and to the subsequent Blade stage must be faster
than the ideal target frequency for this method to be effective, which was easily
met in our case.
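The EDL counts in this example follow from simple arithmetic, reproduced in the
Python sketch below. The function name is hypothetical, and the sketch assumes one
EDL per data bit plus one per address bit at the macro input, as in the text.

```python
def edl_counts(num_registers: int, width: int) -> tuple:
    """Return (EDLs if every internal flop is converted, EDLs if placed at the RF inputs)."""
    internal = num_registers * width
    address_bits = (num_registers - 1).bit_length()  # bits needed to select one register
    at_inputs = width + address_bits
    return internal, at_inputs


internal, at_inputs = edl_counts(num_registers=32, width=32)
print(f"internal conversion: {internal} EDLs, input stage: {at_inputs} EDLs")
```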
4.1.2 Resynthesis
Each EDL adds overhead in timing and area in multiple ways: i) the EDL itself
is larger than a latch; ii) the number of C-elements and Q-Flops increases; and
iii) the size of the OR/AND trees needed to combine error signals also increases.
Therefore, it is desirable to minimize the number of EDLs while maintaining both
the robustness to timing violations and the expected performance increases. One
method to achieve these goals is through resynthesis.
During retiming, the flow generates a report of latches that should be converted
to EDLs, i.e. all latches that are on a near-critical path, such that the static timing
analysis indicates a timing violation would occur when running at the ideal target
frequency. Resynthesis involves constraining the delay to one or more of these
latches to be no greater than the period of the ideal target frequency and running
logic optimization on the design, thus removing the selected latch from the EDL
report and allowing it to be implemented using a standard latch rather than an EDL.
Although the combinational area may increase due to tighter constraints on certain
paths, this overhead can be offset if multiple latches that were slated to become EDLs
are no longer on near-critical paths as well. Unfortunately, the high degree of shared
paths in the combinational logic makes it challenging to estimate the reduction
in EDLs, i.e. constraining one latch may also speed up shared paths to many other
latches. Moreover, the reduction of EDLs combined with faster combinational logic
may lead to a reduced frequency of timing violations during simulation, which
affects the maximum performance of the circuit.
Without reliable methods of estimating these two effects, it is difficult to know
a priori which latch(es) in the EDL report to further constrain; therefore, a brute-force
approach in which all latches marked EDL are tested one by one is employed
to find a suitable candidate latch. The initial results of this approach on one case
study are provided in Section 5.1.1. While outside the scope of this thesis, approaches
using more sophisticated path selection have achieved similar benefits for
some designs while greatly reducing runtime compared to the brute-force approach
shown here [89].
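The brute-force search can be pictured with the Python sketch below. It is purely
illustrative: trial_cost is a hypothetical stand-in for constraining one latch,
rerunning logic optimization, and scoring the resulting netlist (for example by area
or EDL count), and the latch names and cost values are invented for the example.
The selection criterion used by the actual flow is not specified here.

```python
from typing import Callable, Iterable, Optional


def pick_latch_to_constrain(edl_latches: Iterable[str],
                            trial_cost: Callable[[str], float],
                            baseline_cost: float) -> Optional[str]:
    """Try each EDL-marked latch in turn and keep the one with the lowest cost.

    Returns None when no trial improves on the unconstrained baseline.
    """
    best_latch: Optional[str] = None
    best_cost = baseline_cost
    for latch in edl_latches:
        cost = trial_cost(latch)  # hypothetical: one resynthesis run per candidate
        if cost < best_cost:
            best_latch, best_cost = latch, cost
    return best_latch


# Toy cost table standing in for real resynthesis results.
toy_costs = {"alu_out_reg[3]": 1040.0, "pc_reg[7]": 985.0, "rf_wdata_reg[0]": 1010.0}
choice = pick_latch_to_constrain(toy_costs, toy_costs.__getitem__, baseline_cost=1000.0)
print("latch to constrain:", choice)
```

Because every candidate costs one full optimization run, the runtime grows linearly
with the number of EDL-marked latches, which is the motivation for the more selective
approaches cited above.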
4.2 Retiming
4.2.1 Synchronous Latch Retiming
The retiming step of the Blade conversion flow may reduce the performance of
Blade and increase area overhead of the final netlist. This opens the door to
optimization problems that involve retiming to maximize average case performance.
For example, a traditional synchronous retiming algorithm may prefer unbalanced
paths between time-borrowing latches in order to save area without sacrificing
performance. However, the final placement of the latches also affects the number of
near-critical paths in the circuit. For resilient designs, poor latch placement could
unnecessarily inflate the number of EDLs, resulting not only in larger area overheads
but also higher error rates and lower performance. Finding ways to exploit
the positive benefits of retiming in Blade may lead to significant savings in area
overhead.
4.2.2 Synchronous Flop Retiming
In practice, the existing industry-leading synthesis tools tend to be better
optimized for retiming flops rather than latches, producing much higher quality of
results in terms of area overhead and balancing timing between stages. Therefore,
switching to flop-based retiming is the simplest approach to improving retiming
in the flow. First, the FF to Latch conversion step is split into two parts.
Pre-retiming, the original FFs are converted to two stages of FFs (rather than two
latches) clocked using the single original clock signal. The two sets of flops must
be tracked throughout retiming such that the appropriate master and slave clock
signals can be introduced in a later step. One straightforward solution is designating
one set of flops as immovable (i.e. set a dont_retime attribute). After retiming
the FF-only design, every FF can be converted to a latch while connecting the
corresponding clock signal for its set.
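The bookkeeping implied by this two-part conversion can be illustrated with a small sketch. The `Cell` record, the `_m`/`_s` naming, and the choice to pin the master set are all assumptions made for the example; the text only requires that one of the two sets be immovable so the sets remain distinguishable after retiming.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    name: str
    kind: str                  # "FF" or "LATCH"
    clock: str                 # net driving the clock pin
    phase: str = ""            # "master" or "slave"; bookkeeping only
    dont_retime: bool = False  # True if the retimer must not move the cell


def split_flops(flops: List[Cell]) -> List[Cell]:
    """Pre-retiming: replace every FF with two FFs on the original clock."""
    out: List[Cell] = []
    for ff in flops:
        # Assumption: the master set is the one pinned in place.
        out.append(Cell(ff.name + "_m", "FF", ff.clock, "master", dont_retime=True))
        out.append(Cell(ff.name + "_s", "FF", ff.clock, "slave"))
    return out


def flops_to_latches(cells: List[Cell], master_clk: str, slave_clk: str) -> List[Cell]:
    """Post-retiming: turn each FF into a latch clocked by its set's clock."""
    clocks = {"master": master_clk, "slave": slave_clk}
    return [Cell(c.name, "LATCH", clocks[c.phase], c.phase, c.dont_retime)
            for c in cells]


staged = split_flops([Cell("r0", "FF", "clk"), Cell("r1", "FF", "clk")])
# ... the synthesis tool would retime the movable set here ...
latched = flops_to_latches(staged, master_clk="clk_m", slave_clk="clk_s")
```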
While often an improvement in area over latch-based retiming, this approach
not only ignores the possible timing benefits of using latches but, more importantly,
the timing benefits possible when considering the timing resiliency of the final
netlist.
4.2.3 Resilient-Aware Latch Retiming
Therefore, it is necessary to further explore alternatives to our current retiming
approach, which relies solely on the synchronous synthesis tools. Note that while
theoretical resilient-aware retiming algorithms could be devised, achieving retiming
optimality is not the goal of this research. Consequently, two approaches using
heuristic algorithms and tricks that can persuade the synchronous tools to achieve
a desired retiming result will be discussed.
First, a custom, heuristic retiming algorithm was developed, where latches are
moved a designed distance (measured in delay) away from their original location
by first analyzing the circuit using a linear program and then moving latches
semi-automatically by generating a Tcl script to invoke inside the synthesis tool [31]. A
benefit of this approach is that the cost of introducing and removing an EDL on
near-critical paths can be taken into account by simply adding additional edges and
nodes to the traditional retiming graph. It also shows how the runtime-consuming
integer linear program (ILP) can be replaced with a network flow formulation
without loss of quality.
An alternative approach involves generating a virtual library (VL) that enables
the synthesis tools to automatically select between non-error-detecting and error-
detecting latches [32], similar to the virtual library approach for resynthesis shown
in [89]. In the VL approach, two virtual latches are added to the existing standard
cell latch in the library. The first virtual latch has an extended setup time that
captures the resiliency window to imitate a non-error-detecting latch. The second
virtual latch features an enlarged area cost to represent the EDL overhead. Since
the library enables the tool to understand the area overhead of EDL and the
timing constraints of non-error-detecting latches, it can, in principle, efficiently
optimize the total area of the design during retiming. In practice, the authors
in [32] and [89] describe inefficiencies in the tool's ability to properly select the
ideal latch and approaches to mitigate these factors.
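The trade-off the virtual library exposes to the tool can be sketched in a few lines. This is a simplified model rather than the library format used in [32]: the cell names, the numeric values, and the single capture deadline at the end of the resiliency window are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VirtualLatch:
    name: str
    setup: float  # time the data must be stable before the capture deadline
    area: float


def make_virtual_library(setup: float, area: float, trw: float,
                         edl_area_overhead: float) -> List[VirtualLatch]:
    """Derive the two virtual cells from one standard latch (assumed model)."""
    return [
        # Non-error-detecting: data must settle before the window opens,
        # modeled as a setup time extended by the resiliency window.
        VirtualLatch("LATCH_NOEDL", setup + trw, area),
        # Error-detecting: may capture inside the window but costs more area.
        VirtualLatch("LATCH_EDL", setup, area + edl_area_overhead),
    ]


def cheapest_feasible(library: List[VirtualLatch], arrival: float,
                      deadline: float) -> Optional[VirtualLatch]:
    """Pick the smallest cell whose setup requirement is met, if any."""
    feasible = [c for c in library if arrival + c.setup <= deadline]
    return min(feasible, key=lambda c: c.area, default=None)


lib = make_virtual_library(setup=0.05, area=1.0, trw=0.30, edl_area_overhead=0.8)
early = cheapest_feasible(lib, arrival=0.60, deadline=1.0)  # standard latch fits
late = cheapest_feasible(lib, arrival=0.90, deadline=1.0)   # only the EDL fits
```

A synthesis tool performs this kind of selection implicitly for every latch while it moves logic across latch boundaries, which is why encoding the two costs in the library lets an unmodified tool trade area against resiliency.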
4.3 SystemVerilogCSP Front End
4.3.1 Overview
In addition to translating from synchronous RTL, this flow also supports directly
synthesizing asynchronous designs written in SystemVerilogCSP (SVC) that contain
conditional communication. These high-level circuit descriptions are automatically
converted to synthesizable RTL by calling the existing SVC2RTL tool
prior to Step 1 (Synchronous Synthesis) in the traditional design flow. This enables
reusing much of the existing flow infrastructure, with only a few additions
throughout to handle the unique aspects of conditional communication. In particular,
conditional communication with the outer bundled-data (BD) environment
is handled by placing special asynchronous controllers in the control path, denoted
in this text as SEND and RECV.
4.3.2 Slackless Controllers
Conventionally, SEND and RECV stages are implemented using sequential elements
to store data between the conditional and unconditional logic blocks. These
sequential gates increase the area and power overheads associated with adding
conditional communication and would further increase overheads associated with
adding timing resiliency if these paths became near-critical. An alternative option
for iterative designs where the frequency of performing a send or receive action
is low compared to the overall number of cycles executed was devised using
slackless conditional controllers [53]. These controllers enclose the unconditional
handshake(s) within the conditional handshake, i.e. if a token is passed to a RECV
block that is not set to receive a token, the controller will simply not acknowledge
the preceding stage until the communication is allowed. This eliminates the
sequential gates, and the system appears as a single long BD pipeline stage from the
perspective of the preceding and succeeding logic stages through these slackless
conditional controllers.
While there is a performance impact on the surrounding stages due to the
enclosed handshake, this delay could be hidden in many systems where the
surrounding environment is sufficiently faster than the total computation time of the
internal logic block. In other words, as long as the next input token is waiting
at the RECV controller before the logic block finishes its internal computation,
there is no major loss in performance due to this slackless optimization.
Two example implementations of the slackless SEND and RECV blocks using
Click style controllers are shown below:
Figure 4.2: Receive Controller [schematic not reproduced; labeled signals: L.req, L.ack, R.req / E.ack, R.ack, E.data, E.req; two scan FFs]
The receive controller, Figure 4.2, requires three channels: L, the input channel;
R, the output channel; and E, the enable channel. E, a 1-bit channel, has its data
wire (E.data) fed directly into the RECV controller along with its request (E.req)
and acknowledgement (E.ack). Because the channels are two-phase, it is necessary
to store the phase of each channel independently; hence two scan FFs are used.
The left FF stores the current phase of the L channel while the right FF stores
the phase of the R and E channels, which operate in the same phase.
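The need for a stored phase bit per channel can be seen in a small behavioral sketch. This is not a model of the gate-level controller in Figure 4.2; the `PhaseTracker` class and its method names are hypothetical and only illustrate how a two-phase request is recognized by comparing the request level against the stored phase.

```python
class PhaseTracker:
    """Stores one channel's phase bit, as a phase flip-flop would."""

    def __init__(self) -> None:
        self.phase = 0  # level of req after the last accepted token

    def has_new_token(self, req: int) -> bool:
        # In two-phase signalling every req transition is an event, so a new
        # token is present whenever req differs from the stored phase.
        return req != self.phase

    def accept(self, req: int) -> int:
        """Consume the pending token and return the new ack level."""
        if not self.has_new_token(req):
            raise ValueError("no pending token on this channel")
        self.phase = req
        return self.phase  # ack is driven to match req


left = PhaseTracker()
acks = []
for req in (1, 1, 0, 0, 1):  # req level sampled over time
    if left.has_new_token(req):
        acks.append(left.accept(req))
```

Two channels that always toggle together, such as R and E in the receive controller, can share a single stored bit, which is consistent with the controller needing only two FFs for three channels.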
Figure 4.3: Send Controller [schematic not reproduced; labeled signals: L.req, L.ack, R.req, R.ack, E.data, single_cycle; two scan FFs]
The SEND controller, Figure 4.3, also requires two scan FFs to store the phases
of the L and R channels. If it can be assumed the logic stage that generates the L
token is also generating the E token, the E channel's request and acknowledgement
then become superfluous. By sufficiently constraining the E.data signal to arrive
before L.req, E.req and E.ack can be completely eliminated.
Chapter 5
Case Studies
To prove the efficacy of the Blade Template, a number of case studies will be
examined. These sample designs will be used to explore various aspects of the
template, including providing practical implementation insight, testing potential
design flow improvements, and collecting performance and power data.
5.1 Plasma 3-Stage CPU
The first case study involves a 3-stage MIPS based CPU called Plasma [90]. As
shown by the block diagram in Figure 5.1 from [90], Plasma contains an ALU,
multiplier, register file, control unit, and system bus. It also supports external memory
through a memory control unit. Using the flow described in Section 4.1, Plasma
was converted twice in a 28nm FDSOI process. First, a 666MHz synchronous
flop-based design was converted to Blade with a timing resiliency window of 30%
using behavioral versions of the Burst Mode controllers from Section 3.5.2 and
delay lines. The second test run started from a 500MHz design, also targeting a
30% TRW but using fully gate-level Click controllers (see Section 3.5.3) and delay
lines. In both cases, new library cells were created and characterized for the EDLs,
C-elements, and Q-Flops to obtain accurate area and timing information for the
synthesis tools and our simulations.
Figure 5.1: Block Diagram of Plasma CPU
5.1.1 Resynthesis
The brute-force approach to reclaiming area by resynthesizing paths ending at
selected error detecting latches, as described in Section 4.1.2, was applied to the
first 666MHz design. Figure 5.2 shows the results of this approach with a given
starting frequency of 666MHz and a target frequency of 952MHz. After retiming,
there were 456 latches required to be converted to EDLs. A max delay constraint
equal to the target clock period was placed on each latch separately to ensure no
timing violations would occur. Then the netlist was resynthesized, converted to
Blade, and simulated in the co-simulation environment to obtain both the
post-conversion area and error rate, i.e. the frequency of timing violations averaged
over the entire simulation. The best point, highlighted in red in Figure 5.2, yields
a 27% decrease in number of EDLs with a 1.79% decrease in overall area and a
39% improvement in error rate. Note that the potential benefits of this resynthesis
approach will depend heavily on the initial starting frequency, i.e. a design that
is already heavily constrained cannot easily be constrained further to achieve area
and performance benefits.
Figure 5.2: Resynthesis to improve area and decrease error rate [scatter plot; axes: Change in Error Rate (%), Change in Area (%)]
5.1.2 Area and Performance Comparisons
666MHz Burst Mode Design
While a behavioral model of the burst-mode Blade controller was used for simula-
tion, a preliminary gate-level design was also mapped to our technology to estimate
controller area and timing. The timing information generated through synthesis
was then used to inform delays in our behavioral controllers and delay lines.
The final asynchronous control logic and error detection overheads are depicted
in Figure 5.3. The overall area overhead from the original synchronous design is
8.4% after one pass of the resynthesis method presented in Section 4.1.2.
Figure 5.3: Area overheads as percentage of total overhead for 666MHz design
[pie chart: FF to Latch + EDL 32%, C-elements 24%, Delay Elements 24%,
Q-Flops 12%, Controllers 6%, AND / OR Tree 2%]
Figure 5.4: Average case performance over time for Plasma CPU using Blade. Original
synchronous frequency of 666MHz [plot of Frequency (MHz) versus Cycle Count (millions);
series: Average, Instantaneous]
To compare the performance between the synchronous and asynchronous designs,
one iteration of an industry standard benchmark, CoreMark [91], was executed
on both CPUs. The Blade design achieved an average frequency of 793MHz
with a peak frequency of 950MHz, an increase of 19% and 42%, respectively. A plot
of the performance over time is shown in Figure 5.4, where average performance
is measured across the entire benchmark while the instantaneous performance is
measured only over the previous 1,000 cycles. The Blade design quickly switches
operating frequencies, benefiting from large variations in data dependent delays
near the beginning of the benchmark before the overall performance averages to
just under 800MHz.
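The two curves can be derived from a list of per-cycle durations as sketched below. The function name and the input format (cycle times in nanoseconds taken from a simulation log) are assumptions; only the definitions of the average and the 1,000-cycle instantaneous measurement come from the text.

```python
from typing import List, Tuple


def frequency_traces(cycle_times_ns: List[float],
                     window: int = 1000) -> Tuple[List[float], List[float]]:
    """Return (average, instantaneous) frequency in MHz after each cycle.

    average[i]       uses all cycles 0..i
    instantaneous[i] uses only the most recent `window` cycles
    """
    average: List[float] = []
    instantaneous: List[float] = []
    total = 0.0
    window_total = 0.0
    for i, t in enumerate(cycle_times_ns):
        total += t
        window_total += t
        if i >= window:
            # Drop the cycle that just left the sliding window.
            window_total -= cycle_times_ns[i - window]
        n_window = min(i + 1, window)
        average.append(1e3 * (i + 1) / total)  # cycles per ns -> MHz
        instantaneous.append(1e3 * n_window / window_total)
    return average, instantaneous


# Toy trace: 4 slow cycles (1.5 ns) followed by 4 fast ones (1.0 ns).
avg, inst = frequency_traces([1.5] * 4 + [1.0] * 4, window=2)
```

In the toy trace the instantaneous value reaches 1000 MHz as soon as the window holds only fast cycles, while the average settles at 800 MHz, mirroring how the instantaneous curve in Figure 5.4 swings more widely than the average.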
500MHz Click Design
To demonstrate a fully gate-level design ready for automatic place-and-route
(PnR), a switch from Burst Mode-based controllers to Click-based controllers was
necessary due to difficulty with handling timing of the burst mode controllers (see
Section 3.5.3). The 500MHz design was chosen to demonstrate Blade's ability to
improve performance from a range of starting points. The final area overhead was
8.82%, on par with the 8.4% overhead of the 666MHz design; however, the breakdown of area
overhead was quite different. As shown in Figure 5.5, the FF to Latch + EDL
conversion accounted for the largest change, from 32% to 50%. This may be due
to skipping the resynthesis step while converting the 500MHz design.
Performance improvements were similar to those of the 666MHz design. Figure
5.6 again shows the instantaneous and average performance over one iteration
of the CoreMark benchmark running on the converted processor. In this case
the average performance improvement was slightly higher, reaching 22%, and the
fastest instantaneous performance was 44% better, reaching a maximum frequency
of 722MHz.
Figure 5.5: Area overheads as percentage of total overhead for 500MHz design
[pie chart: FF to Latch + EDL 50%, C-elements 17%, Delay Elements 12%,
Q-Flops 9%, Controllers 9%, AND / OR Tree 2%]
Figure 5.6: Average case performance over time for Plasma CPU using Blade. Original
synchronous frequency of 500MHz [plot of Frequency (MHz) versus Cycle Count (millions);
series: Average, Instantaneous]
5.1.3 Power and Energy Comparison
Power analysis of the fully gate-level 500MHz design was performed using
Synopsys's PrimeTime [92] for both synchronous and asynchronous versions. These
tools typically work using activity factors, probabilities of a particular gate being
activated on a given cycle, to estimate the average design power. However, this
analysis assumes a synchronous system with constant rate cycle times and can thus
produce misleading estimates for Blade designs. To achieve more accurate results,
it is recommended to capture a value change dump (VCD) file (or equivalent) from
an actual simulation of the design and apply it in PrimeTime using the time-based
analysis mode. In this mode, the tool replays the simulation data and computes
the energy consumed by every gate for every transition at the time it happened.
Table 5.1: Power overhead in Plasma
Component            Power %
Delay Lines          2.32
Q-Flops              1.80
Controllers          1.42
EDLs (footnote 1)    1.01
AND / OR Trees       0.46
C-Elements           0.45
Total                7.46
Table 5.1 shows the power overhead associated with various Blade components
as a percentage of total power of the asynchronous design. Overall power overhead
is only 7.46% running a small benchmark on the converted processor, while
performance improved 19% on average. The delay lines contribute the largest individual
portion to this overhead at 2.32%, which is likely due to the extra delay elements
used in the tunable delay line. A purpose-built or iterative delay line could
potentially decrease overhead further. The Q-Flops, with their integrated metastability
filter, constitute the next largest overhead. The EDL overhead is relatively small,
but would vary for alternative EDL implementations.
Footnote 1: EDL power was computed by taking the total power of all EDLs and subtracting the power
of an equal number of regular latches averaged across the total number of latches. Thus this
value represents roughly the overhead of adding error detection to a typical latch.
When compared to the original synchronous version, total power increases from
5.7 mW to 7.28 mW; however, this represents only a 1% increase in energy
consumption as the asynchronous design finishes processing data 22% faster, thus using
less energy for the same computation.
Figure 5.7: Delay versus supply voltage in 28nm FD-SOI
This increase in processing rate can be converted into energy savings through
voltage scaling. Figure 5.7 shows circuit delay versus supply voltage in the 28nm
process used to synthesize Plasma, which was collected using SPICE. The 22%
increase in performance by the Blade version of Plasma can be traded off for
iso-performance at a 13% lower supply voltage. Given dynamic power greatly
dominates static power in this FD-SOI process, the decrease in average power can
roughly be estimated using the switching power formula: $\frac{1}{2}CV^{2}F$. In this scenario,
the Blade design achieves a 41% reduction in switching power leading to an
estimated 4.3 mW average power consumption (24.5% lower than the synchronous
design), thus showing the power saving potential of using the Blade template.
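The estimate can be reproduced with a few lines of arithmetic. The sketch below is a reconstruction, not the author's calculation: it assumes the voltage factor is 0.87 and the frequency factor is 0.78 (a 22% lower cycle rate), which is the combination that yields the quoted 41%. Treating the 22% gain as a factor of 1/1.22 instead would give a reduction of roughly 38%.

```python
def switching_power(c: float, v: float, f: float) -> float:
    """Dynamic switching power, P = 1/2 * C * V^2 * F."""
    return 0.5 * c * v * v * f


blade_power_mw = 7.28  # Blade power at nominal voltage (from the text)
sync_power_mw = 5.7    # synchronous baseline (from the text)

# Assumption: capacitance is unchanged, so C and the nominal V and F are
# normalised to 1 and only the scaling factors matter.
scale = switching_power(1.0, 0.87, 0.78) / switching_power(1.0, 1.0, 1.0)

reduction = 1.0 - scale                   # fraction of switching power saved
scaled_power_mw = blade_power_mw * scale  # estimated Blade power after scaling
saving_vs_sync = 1.0 - scaled_power_mw / sync_power_mw
```

Under these assumptions the three values land close to the 41%, 4.3 mW, and 24.5% figures quoted above.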
5.2 Encryption Cores
5.2.1 Design and Fabrication
The next case study was the result of a joint research project with Galois, Inc
and Reduced Energy Microsystems, Inc [93]. Four encryption cores were designed
asynchronously from the start using SystemVerilogCSP and converted to the Blade
template. The cores implemented lightweight block ciphers focused on enabling
low power encryption in constrained devices through a reduction in hardware
complexity [94]. Two flavors of the Simon and Speck ciphers were fabricated: one with
128-bit keys and 128-bit blocks and another with 48-bit blocks and 72-bit keys.
The cipher is iterative: every block and key require multiple internal cycles to
encrypt the data. The algorithm was implemented using an asynchronous ring
with conditional communication to receive new unencrypted blocks into the ring
and route encrypted outputs to the environment. It allows input tokens to back
up while the encryption core is still processing and prevents dynamic power usage
when no tokens are flowing through the system.
Figure 5.8: Histogram of encryption core performance across chip batch
The final circuits were fabricated in a 130nm process. Out of a Multi-Project
Wafer (MPW) run of 39 chips, 38 returned operational with the remaining chip
failing due to a suspected catastrophic manufacturing defect. Figure 5.8 shows a
histogram of chip performance across the batch, which was successfully achieved
without individual per-chip delay tuning. The chips achieved full operation across
a range from roughly 0.625V to 1.35V, with nominal voltage at 1.2V. The minimum
voltage appeared to be limited by the simple level shifters used to communicate
between logic and memory power islands.
Figure 5.9: Performance versus voltage for (a) simon 128/128, (b) simon 48/72, (c)
speck 128/128, and (d) speck 48/72
5.2.2 Performance
Throughput scaled with voltage nicely without adjusting delay lines or altering
the clock trees, as seen in Figure 5.9 for each encryption core. This is in
stark contrast to synchronous designs, where a frequency scaling solution would
be required to track the supply voltage and ensure timing closure across the voltage
range.
To compare with results found in the original Simon and Speck publication [94], Table 5.2
shows each design's area based on "Gate Equivalent", where 1 GE = 5.76 µm² for
our process. While the designs are much larger than the originally estimated design
sizes in [94], measured throughput per area was significantly higher. For example,
the Simon 128/128 implementation achieves a throughput/area ratio of 23,200
bps/GE compared to the estimate of 87.5 bps/GE in [94], a 265x improvement
(see footnote 2).
Table 5.2: Area and Throughput at 1.2V
Design           Area (GE)    Throughput (kbps)
simon 128/128    10,799       250,643
simon 48/72      7,237        192,537
speck 128/128    19,345       523,377
speck 48/72      7,560        299,112
Footnote 2: It should be noted that the improvement in throughput per area was achieved through a
variety of improvements, including algorithmic enhancements, in addition to using Blade. The
impact breakdown of each enhancement was not available.
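The throughput/area figures follow directly from Table 5.2, as the short calculation below shows. The dictionary layout and function name are illustrative choices; the numbers are copied from the table and from the 87.5 bps/GE estimate attributed to [94].

```python
# Area (GE) and throughput (kbps) at 1.2V, copied from Table 5.2.
TABLE_5_2 = {
    "simon 128/128": (10_799, 250_643),
    "simon 48/72": (7_237, 192_537),
    "speck 128/128": (19_345, 523_377),
    "speck 48/72": (7_560, 299_112),
}


def throughput_per_area(area_ge: float, throughput_kbps: float) -> float:
    """Throughput/area in bps per gate equivalent."""
    return throughput_kbps * 1e3 / area_ge


ratios = {name: throughput_per_area(*row) for name, row in TABLE_5_2.items()}

# 87.5 bps/GE is the estimate for Simon 128/128 quoted from [94].
improvement = ratios["simon 128/128"] / 87.5
```

For Simon 128/128 this gives roughly 23,200 bps/GE and a ratio of about 265 against the estimate, matching the values in the text.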
5.2.3 Power and Energy
Each encryption core was supplied power via separate power pins and grids. The
energy measurement included all logic in the encryption core, including the Blade
controllers, resilient latches, and delay lines. I/O pads, testing logic, and memories
were on a separate power domain to independently enable voltage scaling. Figure
5.10 shows an increase in energy efficiency as voltage supply drops on a per bit and
per block basis, i.e. the core required less energy to encrypt the same information
at a lower voltage.
Even at minimum voltage, Simon 128/128 still managed to encrypt data at a
very reasonable rate of 48 Mbps using just 0.8 nJ for each block, or 0.3 mW
(see footnote 3). Figure
5.11 shows the power consumption for each encryption core across the measured
voltage range.
Footnote 3: Unfortunately, these are currently the only published hardware implementations of the
Simon and Speck ciphers and thus there are no comparison points for the reported energy consumption.
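The relationship between the Simon 128/128 rate, energy per block, and power figures at minimum voltage can be checked with a short calculation. The function name is illustrative; the inputs are the 48 Mbps rate, the 128-bit block size, and the 0.8 nJ per block quoted above.

```python
def average_power_w(throughput_bps: float, block_bits: int,
                    energy_per_block_j: float) -> float:
    """Average power = blocks per second * energy per block."""
    blocks_per_second = throughput_bps / block_bits
    return blocks_per_second * energy_per_block_j


# Simon 128/128 at minimum voltage: 48 Mbps, 128-bit blocks, 0.8 nJ per block.
power_w = average_power_w(48e6, 128, 0.8e-9)
energy_per_bit_j = 0.8e-9 / 128
```

This evaluates to 0.3 mW, in agreement with the text, and corresponds to 6.25 pJ per encrypted bit.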
Figure 5.10: Energy consumption of encryption cores per (a) encrypted bit and (b)
encrypted block
Figure 5.11: Power versus voltage for all encryption blocks
Chapter 6
Summary and Conclusions
The shift from desktop, always plugged-in devices to mobile, battery-powered
devices has centered both academic and industrial focus on entirely new fields of
research in low-power design. Timing resilient circuits were slated to solve many
problems of near-threshold design, but wide-spread industry support has not yet
materialized. This thesis provides a viable timing resilient circuit template, Blade,
rooted in asynchronous design that is poised to solve the many issues of previous
attempts.
The Blade template combines a latch-based datapath with error detecting
latches, asynchronous bundled data-style controllers, and programmable delay lines
to achieve safe same-cycle recovery from timing violations due to datapath
delay variation. The combination of Q-Flops and asynchronous controllers ensures
metastability cannot propagate through the control path and allows the circuit to
gracefully pause until the metastable event is resolved. An analytical model of the
performance impact of metastability in Blade shows a less than 1.5% change in
throughput, which decreases for more advanced process nodes, making its effects
negligible.
Glitch sensitivity is introduced as an important metric for comparing error
detecting sequentials. Our analysis of existing EDL designs shows that, with modest
changes, the sensitivity to errors can be increased by nearly 38% while also
decreasing static power consumption by up to 30%.
In addition to a verifiable Petri Net model of a Blade controller, two distinct
controller implementations are presented both to show the versatility of the
template and to spur its adoption within different design goals. The Burst Mode
specifications can be synthesized to many cell libraries through existing tools, but
may suffer from difficult-to-manage timing constraints in modern automatic PnR
flows. The Click implementation eases integration into existing design flows by
breaking the control paths cleanly at logical boundaries with a standard flip-flop.
Very similar implementations have been successfully taped out in both the
encryption core test chip presented here and an external commercial prototype.
To better understand the performance of Blade, a full analytical model is
presented and used to derive the optimal Blade performance given a particular delay
distribution. This analysis proved that for a given systematic error rate the
optimal probability of incurring a timing violation is constant for normally distributed
delays. Beyond the analytical model, a set of timing equations is presented from
a designer perspective to aid in building timing constraints for physical design.
Historically, poor CAD tool support has been a major hurdle in gaining traction
for asynchronous approaches in large industrial projects. Therefore, it was
important to create a front-end design flow that largely re-uses existing synchronous
tools to create asynchronous designs. To further improve compatibility, the design
flow accepts both asynchronous and synchronous HDL specifications.
Two case studies were conducted to explore an implementation of Blade and
demonstrate the design flow. On a 3-stage CPU, Blade provided up to a 44%
increase in cycle-to-cycle performance with a sustained 18-22% increase in
throughput. Area overheads associated with converting the existing synchronous design to
Blade were a modest 8-9% with a negligible increase in overall energy consumption
for the same workload.
6.1 Academic Impact and Research
This work has already spawned a number of distinct research threads that have
benefited Blade and the timing resilient design field in general. In particular,
there have been advancements in retiming [31, 32] and resynthesis [30, 89] that
were successfully demonstrated on various resilient design templates, including
Blade.
On-going research of the Blade template focuses on extending the resiliency
window, improving testability, and designing for near- and sub-threshold operation.
6.1.1 Extended Blade
The Blade template and speculative handshaking protocol presented sets no
limitation on TRW duration. Although the particular implementation explored in
this thesis was simplified with the assumption that TRW would be shorter than
a single stage's nominal delay, i.e. roughly 50% of the total cycle time, in some
applications a longer TRW would lead to improved performance, particularly in
environments with extreme variability. An intuitive extension of Blade could thus
involve sets of three stages (rather than the two explored here), where the TRW
extends to roughly 2/3 of the total cycle time. Early research on this template,
dubbed Blade-OC, has been published [95]; however, the implementation of such
a protocol raises serious challenges. The controller design becomes complex as
timing violation information must be collected and communicated across
intermediate stages. This added complexity increases internal logic delay of the controller
and reduces the ability to hide controller overhead. Additionally, a three stage
template would be at an immediate disadvantage when converting from existing
synchronous designs as three latches would be required for every original flop. In
turn, the additional set of latches would only exacerbate retiming difficulty.
6.1.2 Testability
Blade design-for-test (DFT) is necessary for commercial success, but academia is
best suited to solve it. The asynchronous control path introduces unique challenges
and possibilities when designing testability into a system. For example,
cycle-by-cycle control could be added to the asynchronous system by making controller
modifications that pause and restart operation arbitrarily, similar to [96]. Careful
consideration must be given to how additional input signals are added to the
controller such that they do not introduce new points of metastability. The controllers
used in the encryption core case study in Section 5.2 featured a simple one-time
settable signal that would allow only one cycle of operation after reset. This,
combined with a standard synchronous scan chain through the datapath
sequentials, enabled rudimentary testing of on-die faults. Researchers at PUCRS
in Brazil have shown more sophisticated approaches for addressing stuck-at fault
and delay fault testing in Blade [97, 98].
6.1.3 EDL and Delay Line Design
The overhead of adding error detection can be significant. For applications
targeting near-threshold or sub-threshold, increases in delay variation and uncertainty in
circuit libraries at low voltages may dictate adding error detection to the majority
of sequential elements in a design. Therefore, energy efficient and area compact
error detecting sequentials will become increasingly important. Early work has
shown promise by integrating detection logic into the latch circuit and amortizing
the detection across multiple input bits on a single latch [99].
6.2 Commercial Impact
Except for an R&D push in the early-2010s, timing resilient circuits have not
achieved widespread use in industry. We believe this is not due to any general
issue with resilient circuits; rather, the inherent problems in the currently published
synchronous resilient solutions have hindered their acceptance. Blade tackles the
biggest obstacles of these synchronous solutions.
Blade has already made headway in industrial applications, namely through a
start-up company, Reduced Energy Microsystems (REM) [100]. REM has licensed
the underlying Blade technology to create a derivative timing resilient architecture,
called Sharp [101], for use in its products. The company has raised over $2M in
venture funding to create the most energy efficient computer vision processor on the
market. These efficiency benefits are enabled by their proprietary implementation
of Blade, which allows voltage scaling down to near-threshold levels. They have
licensed the design flow and extended it to include a full PnR backend for physical
implementation. REM also secured an NSF Small Business Innovation Research
(SBIR) award for improving asynchronous CAD tools [102].
6.2.1 Automatic Place & Route
Outside of REM's industrial PnR flow, academic approaches to PnR will be
necessary to increase commercial adoption. One promising foundation may be ACDC
[103], a bundled-data design flow that features automatic generation of delay lines
and enforces relative timing constraints necessary to finalize an asynchronous
design. Another approach uses propagated clocks, a feature in many static timing
tools, to create clocking relationships between connected asynchronous stages,
allowing the synchronous tools to time the circuits appropriately [104]. Extending a
flow like ACDC to support Blade's timing constraints and incorporate constraints
using propagated clocks would be ideal future work towards an open source PnR
flow for timing resilient asynchronous designs. However, there are a number of
challenges that would benefit from collaboration with industry researchers. In
particular, finding the solutions to complex routing or placement problems may
often still rely on tribal knowledge of how a particular tool works or carefully
constructed tricks to work around issues. From experience at REM, these problem
areas centered around the routing requirements of the error collection trees,
placing the delay lines, and maintaining constraints across multiple asynchronous
blocks. An unsolved issue arises when a Blade stage grows, whether in area or
number of EDLs, to the point that it becomes impossible to collect the error
information before stalling the next stage. A problem like this requires expertise across
disciplines.
Bibliography
[1] J. M. Shalf and R. Leland, "Computing beyond moore's law," Computer,
vol. 48, no. 12, pp. 14-23, Dec 2015.
[2] Single thread performance: http://www.cpubenchmark.net/singlethread.html.
[3] H. P. Hofstee, "Future microprocessors and off-chip SOP interconnect,"
IEEE Transactions on Advanced Packaging, vol. 27, no. 2, pp. 301-303, 2004.
[4] S. Seo, R. Dreslinski, M. Woh, Y. Park, C. Charkrabari, S. Mahlke,
D. Blaauw, and T. Mudge, "Process variation in near-threshold wide SIMD
architectures," in DAC, June 2012, pp. 980-987.
[5] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge,
"Near-threshold computing: Reclaiming moore's law through energy efficient
integrated circuits," Proceedings of the IEEE, vol. 98, no. 2, pp. 253-266, Feb
2010.
[6] M. Pons, T. Le, C. Arm, D. Séverac, J. Nagel, M. Morgan, and S. Emery,
"Sub-threshold latch-based icyflex2 32-bit processor with wide supply range
operation," in 2016 46th European Solid-State Device Research Conference
(ESSDERC), Sept 2016, pp. 33-36.
[7] S. Jain, S. Khare, S. Yada, V. Ambili, P. Salihundam, S. Ramani,
S. Muthukumar, M. Srinivasan, A. Kumar, S. K. Gb, R. Ramanarayanan,
V. Erraguntla, J. Howard, S. Vangal, S. Dighe, G. Ruhl, P. Aseron, H. Wilson,
N. Borkar, V. De, and S. Borkar, "A 280mv-to-1.2v wide-operating-range
ia-32 processor in 32nm cmos," in 2012 IEEE International Solid-State
Circuits Conference, Feb 2012, pp. 66-68.
[8] K. A. Bowman, C. Tokunaga, T. Karnik, V. K. De, and J. W. Tschanz, "A
22 nm all-digital dynamically adaptive clock distribution for supply voltage
droop tolerance," IEEE Journal of Solid-State Circuits, vol. 48, no. 4, pp.
907-916, April 2013.
[9] K. Wilcox, R. Cole, H. R. F. III, K. Gillespie, A. Grenat, C. Henrion, R. Jotwani,
S. Kosonocky, B. Munger, S. Naffziger, R. S. Orefice, S. Pant, D. A.
Priore, R. Rachala, and J. White, "Steamroller module and adaptive clocking
system in 28 nm cmos," IEEE Journal of Solid-State Circuits, vol. 50,
no. 1, pp. 24-34, Jan 2015.
[10] M. Cho, S. T. Kim, C. Tokunaga, C. Augustine, J. P. Kulkarni, K. Ravichandran,
J. W. Tschanz, M. M. Khellah, and V. De, "Postsilicon voltage guardband
reduction in a 22 nm graphics execution core using adaptive voltage
scaling and dynamic power gating," IEEE Journal of Solid-State Circuits,
vol. 52, no. 1, pp. 50-63, Jan 2017.
[11] C. Gonzalez, M. Floyd, E. Fluhr, P. Restle, D. Dreps, M. Sperling, R. Rao,
D. Hogenmiller, C. Vezyrtis, P. Chuang, D. Lewis, R. Escobar, V. Ramadurai,
R. Kruse, J. Pille, R. Nett, P. Owczarczyk, J. Friedrich, J. Paredes,
T. Diemoz, S. Islam, D. Plass, and P. Muench, "The 24-core power9 processor
with adaptive clocking, 25-gb/s accelerator links, and 16-gb/s pcie gen4,"
IEEE Journal of Solid-State Circuits, vol. 53, no. 1, pp. 91-101, Jan 2018.
[12] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw,
T. Austin, K. Flautner, and T. Mudge, "Razor: a low-power pipeline based
on circuit-level timing speculation," in Microarchitecture, 2003. MICRO-36.
Proceedings. 36th Annual IEEE/ACM International Symposium on, Dec
2003, pp. 7-18.
[13] S. Kim, I. Kwon, D. Fick, M. Kim, Y.-P. Chen, and D. Sylvester, "Razor-lite:
A side-channel error-detection register for timing-margin recovery in
45nm soi cmos," in Solid-State Circuits Conference Digest of Technical Papers
(ISSCC), 2013 IEEE International, Feb 2013, pp. 264-265.
[14] M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D. Harris, D. Blaauw, and
D. Sylvester, \Bubble razor: An architecture-independent approach to
timing-error detection and correction," in Solid-State Circuits Conference
Digest of Technical Papers (ISSCC), 2012 IEEE International, Feb 2012,
pp. 488{490.
[15] ||, \Bubble razor: Eliminating timing nargins in an ARM cortex-M3 pro-
cessor in 45 nm CMOS using architecturally independent error detection and
correction," IEEE JSCC, vol. 48, no. 1, pp. 66{81, Jan 2013.
[16] S. Das, C. Tokunaga, S. Pant, W.-H. Ma, S. Kalaiselvan, K. Lai, D. Bull,
and D. Blaauw, \Razor II: In situ error detection and correction for PVT
and SER tolerance," IEEE JSCC, vol. 44, no. 1, pp. 32{48, Jan 2009.
100
[17] I. Kwon, S. Kim, D. Fick, M. Kim, Y.-P. Chen, and D. Sylvester, \Razor-lite:
A light-weight register for error detection by observing virtual supply rails,"
Solid-State Circuits, IEEE Journal of, vol. 49, no. 9, pp. 2054{2066, 2014.
[18] M. Choudhury, V. Chandra, K. Mohanram, and R. Aitken, \Timber: Time
borrowing and error relaying for online timing error resilience," in DATE,
March 2010, pp. 1554{1559.
[19] S. Valadimas, Y. Tsiatouhas, and A. Arapoyanni, \Timing error tolerance
in small core designs for soc applications," Computers, IEEE Transactions
on, vol. 65, no. 2, pp. 654{663, Feb 2016.
[20] On the (non) existence of bounded time metastability detectors:
http://www.csl.cornell.edu/ rajit/meta.html.
[21] S. Kim and M. Seok, \Variation-tolerant, ultra-low-voltage microprocessor
with a low-overhead, within-a-cycle in-situ timing-error detection and cor-
rection technique," Solid-State Circuits, IEEE Journal of, vol. 50, no. 6, pp.
1478{1490, June 2015.
[22] M. Cannizzaro, S. Beer, J. Cortadella, R. Ginosar, and L. Lavagno, \SafeR-
azor: Metastability-robust adaptive clocking in resilient circuits," IEEE
TCAS-I, vol. 62, no. 9, pp. 2238{2247, Sep 2015.
[23] Y. Liu, R. Ye, F. Yuan, R. Kumar, and Q. Xu, \On logic synthesis for timing
speculation," in ICCAD. IEEE, 2012, pp. 591{596.
[24] M. Nakai, S. Akui, K. Seno, T. Meguro, T. Seki, T. Kondo, A. Hashiguchi,
H. Kawahara, K. Kumano, and M. Shimura, \Dynamic voltage and fre-
quency management for a low-power embedded microprocessor," IEEE
JSCC, vol. 40, no. 1, pp. 28{35, 2005.
[25] R. Ye, F. Yuan, H. Zhou, and Q. Xu, \Clock skew scheduling for timing
speculation," in DATE. IEEE, 2012, pp. 929{934.
[26] A. B. Kahng, S. Kang, J. Li, and J. Pineda De Gyvez, \An improved
methodology for resilient design implementation," ACM Trans. Des. Autom.
Electron. Syst., vol. 20, no. 4, pp. 66:1{66:26, Sep. 2015. [Online]. Available:
http://doi.acm.org.libproxy1.usc.edu/10.1145/2749462
[27] B. Greskamp, L. Wan, U. R. Karpuzcu, J. J. Cook, J. Torrellas, D. Chen,
and C. Zilles, \Blueshift: Designing processors for timing speculation from
the ground up." in 2009 IEEE 15th International Symposium on High Per-
formance Computer Architecture. IEEE, 2009, pp. 213{224.
101
[28] A. B. Kahng, S. Kang, R. Kumar, and J. Sartori, \Recovery-driven design:
a power minimization methodology for error-tolerant processor modules," in
Proceedings of the 47th Design Automation Conference. ACM, 2010, pp.
825{830.
[29] ||, \Slack redistribution for graceful degradation under voltage overscal-
ing," in 2010 15th Asia and South Pacic Design Automation Conference
(ASP-DAC). IEEE, 2010, pp. 825{831.
[30] H. Huang, H. Cheng, C. Chu, and P. A. Beerel, \Area optimization of re-
silient designs guided by a mixed integer geometric program," in 2016 53nd
ACM/EDAC/IEEE Design Automation Conference (DAC), June 2016, pp.
1{6.
[31] H.-L. Wang, M. Zhang, and P. A. Beerel, \Retiming of two-phase latch-based
resilient circuits," in 2017 54th ACM/EDAC/IEEE Design Automation Con-
ference (DAC), June 2017, pp. 1{6.
[32] H. Cheng, H. Wang, M. Zhang, D. Hand, and P. A. Beerel, \Automatic
retiming of two-phase latch-based resilient circuits," IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, pp. 1{1, 2018.
[33] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez, \Re-
ducing power in high-performance microprocessors," in Design Automation
Conference, 1998. Proceedings, June 1998, pp. 732{737.
[34] A. Martin and M. Nystrom, \Asynchronous techniques for system-on-chip
design," Proceedings of the IEEE, vol. 94, no. 6, pp. 1089{1120, June 2006.
[35] M. Laurence, \Low-power high-performance asynchronous general purpose
ARMv7 processor for multi-core applications," 13th International Forum on
Embedded MPSoC and Multicore, pp. 304{314, 2013.
[36] A. Ghiribaldi, D. Bertozzi, and S. M. Nowick, \A transition-signaling
bundled data NoC switch architecture for cost-eective GALS multicore
systems," in Design, Automation Test in Europe Conference Exhibition
(DATE), 2013, March 2013, pp. 332{337.
[37] E. Kasapaki and J. Spars, \Argo: A time-elastic time-division-multiplexed
NoC using asynchronous routers," in Asynchronous Circuits and Systems
(ASYNC), 2014 20th IEEE International Symposium on, May 2014, pp. 45{
52.
[38] W. Jiang, K. Bhardwaj, G. Lacourba, and S. Nowick, \A lightweight early
arbitration method for low-latency asynchronous 2D-mesh NoC's," in Design
102
Automation Conference (DAC), 2015 52nd ACM/EDAC/IEEE, June 2015,
pp. 1{6.
[39] Y. Thonnart, P. Vivet, and F. Clermidy, \A fully-asynchronous low-power
framework for GALS NoC integration," in Design, Automation Test in Eu-
rope Conference Exhibition (DATE), 2010, March 2010, pp. 33{38.
[40] R. Soares, N. Calazans, F. Moraes, P. Maurine, and L. Torres, \A robust
architectural approach for cryptographic algorithms using GALS pipelines,"
Design Test of Computers, IEEE, vol. 28, no. 5, pp. 62{71, Sept 2011.
[41] I. Sutherland and S. Fairbanks, \GasP: A minimal FIFO control," in Asyn-
chronus Circuits and Systems, 2001. ASYNC 2001. Seventh International
Symposium on, 2001, pp. 46{53.
[42] M. Ferretti, \Single-track asynchronous pipeline template," Ph.D. disserta-
tion, University of Southern California, 2004.
[43] M. Davies, A. Lines, J. Dama, A. Gravel, R. Southworth, G. Dimou,
and P. Beerel, \A 72-port 10G ethernet switch/router using quasi-delay-
insensitive asynchronous design," in Asynchronous Circuits and Systems
(ASYNC), 2014 20th IEEE International Symposium on. IEEE, 2014, pp.
103{104.
[44] J. Teifel and R. Manohar, \Highly pipelined asynchronous FPGAs," in Pro-
ceedings of the 2004 ACM/SIGDA 12th international symposium on Field
programmable gate arrays. ACM, 2004, pp. 133{142.
[45] J. Cortadella, A. Kondratyev, L. Lavagno, and C. Sotiriou, \Desynchroniza-
tion: Synthesis of asynchronous circuits from synchronous specications,"
IEEE Trans. on CAD, vol. 25, no. 10, pp. 1904{1921, Oct 2006.
[46] D. Hand, M. Trevisan Moreira, H.-H. Huang, D. Chen, F. Butzke, Z. Li,
M. Gibiluka, M. Breuer, N. Vilar Calazans, and P. Beerel, \Blade { a timing
violation resilient asynchronous template," in Asynchronous Circuits and
Systems (ASYNC), 2015 21st IEEE International Symposium on, May 2015,
pp. 21{28.
[47] R. Diamant, R. Ginosar, and C. Sotiriou, \Asynchronous sub-threshold ultra-
low power processor," in Proceedings of PATMOS, June 2015.
[48] M. Ferretti, \Single-track Asynchronous Pipeline Template," Ph.D. disser-
tation, University of Southern California, August 2004.
103
[49] A. Yakovlev, P. Vivet, and M. Renaudin, \Advances in asynchronous logic:
From principles to GALS & NoC, recent industry applications, and commer-
cial CAD tools," in DATE, March 2013, pp. 1715{1724.
[50] I. J. Chang, S. P. Park, and K. Roy, \Exploring asynchronous design
techniques for process-tolerant and energy-ecient subthreshold operation,"
IEEE JSSC, vol. 45, no. 2, pp. 401{410, Feb 2010.
[51] N. Jayakuma, R. Garg, B. Gamache, and S. Khatri, \A PLA based asyn-
chronous micropipelining approach for subthreshold circuit design," in DAC,
2006, pp. 419{424.
[52] J. Liu, S. Nowick, and M. Seok, \Soft mousetrap: A bundled-data asyn-
chronous pipeline scheme tolerant to random variations at ultra-low supply
voltages," in ASYNC, May 2013, pp. 1{7.
[53] P. Beerel, R. Ozdag, and M. Ferretti, A Designer's Guide to Asynchronous
VLSI. Cambridge University Press, 2010.
[54] I. E. Sutherland, \Micropipelines," Commun. ACM, vol. 32, no. 6, pp. 720{
738, Jun. 1989.
[55] A. Peeters, F. te Beest, M. de Wit, and W. Mallon, \Click elements: An
implementation style for data-driven compilation," in Asynchronous Circuits
and Systems (ASYNC), 2010 IEEE Symposium on, May 2010, pp. 3{14.
[56] G. Heck, L. S. Heck, A. Singhvi, M. T. Moreira, P. A. Beerel, and N. L. V.
Calazans, \Analysis and optimization of programmable delay elements for
2-phase bundled-data circuits," in International Conference on VLSI Design
(VLSID), 2015, pp. 321{326.
[57] A. Singhvi, M. T. Moreira, R. N. Tadros, N. L. V. Calazans,
and P. A. Beerel, \A ne-grain, uniform, energy-ecient delay
element for 2-phase bundled-data circuits," J. Emerg. Technol. Comput.
Syst., vol. 13, no. 2, pp. 15:1{15:23, Nov. 2016. [Online]. Available:
http://doi.acm.org/10.1145/2948067
[58] R. N. Tadros, W. Hua, M. Gibiluka, M. T. Moreira, N. L. V. Calazans, and
P. A. Beerel, \Analysis and design of delay lines for dynamic voltage scal-
ing applications," in 2016 22nd IEEE International Symposium on Asyn-
chronous Circuits and Systems (ASYNC), May 2016, pp. 11{18.
[59] A. Moreno and J. Cortadella, \Synthesis of all-digital delay lines," in 2017
23rd IEEE International Symposium on Asynchronous Circuits and Systems
(ASYNC), May 2017, pp. 75{82.
104
[60] M. Maymandi-Nejad and M. Sachdev, \A digitally programmable delay ele-
ment: Design and analysis," IEEE Transactions on VLSI Systems, vol. 11,
no. 5, pp. 871{878, Oct. 2003.
[61] M. Ligthart, K. Fant, R. Smith, A. Taubin, and A. Kondratyev, \Asyn-
chronous design using commercial hdl synthesis tools," in Proceedings Sixth
International Symposium on Advanced Research in Asynchronous Circuits
and Systems (ASYNC 2000) (Cat. No. PR00586), 2000, pp. 114{125.
[62] Y. Thonnart, E. Beigne, and P. Vivet, \A pseudo-synchronous implemen-
tation
ow for WCHB QDI asynchronous circuits," in ASYNC, 2012, pp.
73{80.
[63] M. Gibiluka, M. T. Moreira, and N. L. V. Calazans, \A bundled-data asyn-
chronous circuit synthesis
ow using a commercial eda framework," in 2015
Euromicro Conference on Digital System Design, Aug 2015, pp. 79{86.
[64] D. Edwards and A. Bardsley, \Balsa: An asynchronous hardware synthesis
language," The Computer Journal, vol. 45, no. 1, pp. 12{18, 2002.
[65] M. Renaudin and A. Fonkoua, \Tiempo asynchronous circuits systemverilog
modeling language," in Asynchronous Circuits and Systems (ASYNC), 2012
18th IEEE International Symposium on, 2012, pp. 105{112.
[66] A. Saifhashemi and P. A. Beerel, \SystemVerilogCSP: Modeling digital asyn-
chronous circuits using SystemVerilog interfaces," in Proceedings of Commu-
nicating Process Architectures - WoTUG-33, 2011, pp. 287{302.
[67] A. Martin and M. Nystr om, \CAST: Caltech asynchronous synthesis tools."
[68] P. A. Beerel, G. D. Dimou, and A. M. Lines, \Proteus: An ASIC
ow for
GHz asynchronous designs," IEEE Design and test of Computers, vol. 28,
no. 5, pp. 36{51, 2011.
[69] S. Kim, I. Kwon, D. Fick, M. Kim, Y.-P. Chen, and D. Sylvester, \Razor-
lite: A side-channel error-detection register for timing-margin recovery in
45nm soi cmos," in Solid-State Circuits Conference Digest of Technical Pa-
pers (ISSCC), 2013 IEEE International, Feb 2013, pp. 264{265.
[70] K. Bowman, J. Tschanz, N. S. Kim, J. Lee, C. Wilkerson, S. Lu, T. Karnik,
and V. De, \Energy-ecient and metastability-immune resilient circuits for
dynamic variation tolerance," IEEE JSCC, vol. 44, no. 1, pp. 49{63, Jan
2009.
[71] T. Sato and Y. Kunitake, \A simple
ip-
op circuit for typical-case designs
for dfm," in ISQED, March 2007, pp. 539{544.
105
[72] X. Gili, S. Barcelo, S. Bota, and J. Segura, \Analytical modeling of single
event transients propagation in combinational logic gates," Nuclear Science,
IEEE Transactions on, vol. 59, no. 4, pp. 971{979, Aug 2012.
[73] M. Moreira, B. Oliveira, F. Moraes, and N. Calazans, \Impact of c-elements
in asynchronous circuits," in Quality Electronic Design (ISQED), 2012 13th
International Symposium on, March 2012, pp. 437{343.
[74] M. T. Moreira, D. Hand, N. L. V. Calazans, and P. A. Beerel, \TDTB error
detecting latches: Timing violation sensitivity analysis and optimization," in
Quality Electronic Design, 2015. ISQED '15. International Symposium on,
2015.
[75] F. Rosenberger, C. Molnar, T. Chaney, and T.-P. Fang, \Q-modules: inter-
nally clocked delay-insensitive modules," IEEE Trans. on Computers, vol. 37,
no. 9, pp. 1005{1018, Sep 1988.
[76] M. Moreira, B. Oliveira, J. Pontes, F. Moraes, and N. Calazans, \Adapting
a C-element design
ow for low power," in ICECS, Dec 2011, pp. 45{48.
[77] D. M. Chapiro, \Globally-asynchronous locally-synchronous systems," Ph.D.
dissertation, Stanford Univ., CA., October 1984.
[78] C. Foley, \Characterizing metastability," in ASYNC, Mar 1996, pp. 175{184.
[79] S. Beer, R. Ginosar, M. Priel, R. Dobkin, and A. Kolodny, \An on-chip
metastability measurement circuit to characterize synchronization behavior
in 65nm," in ISCAS, May 2011, pp. 2593{2596.
[80] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev,
\Methodology and tools for state encoding in asynchronous circuit synthe-
sis," in DAC, Jun 1996, pp. 63{66.
[81] R. Fuhrer, B. Lin, and S. Nowick, \Symbolic hazard-free minimization and
encoding of asynchronous nite state machines," in ICCAD, Nov 1995, pp.
604{611.
[82] K. Yun, D. Dill, and S. Nowick, \Synthesis of 3D asynchronous state ma-
chines," in ICCD, Oct 1992, pp. 346{350.
[83] K. Sakallah, T. Mudge, and O. Olukotun, \Analysis and design of latch-
controlled synchronous digital circuits," IEEE Trans. on CAD, vol. 11, no. 3,
pp. 322{333, Mar 1992.
106
[84] B. Zhai, S. Hanson, D. Blaauw, and D. Sylvester, \Analysis and mitigation
of variability in subthreshold design," in Low Power Electronics and Design,
2005. ISLPED '05. Proceedings of the 2005 International Symposium on,
Aug 2005, pp. 20{25.
[85] J. Kwong and A. Chandrakasan, \Variation-driven device sizing for minimum
energy sub-threshold circuits," in ISPLED, Oct 2006, pp. 8{13.
[86] S. C. Schwartz and Y. S. Yeh, \On the distribution function and
moments of power sums with log-normal components," Bell System
Technical Journal, vol. 61, no. 7, pp. 1441{1462, 1982. [Online]. Available:
http://dx.doi.org/10.1002/j.1538-7305.1982.tb04353.x
[87] G. Zhang and P. Beerel, \Stochastic analysis of bubble razor," in DATE,
March 2014, pp. 1{6.
[88] D. Hand, H.-H. Huang, B. Cheng, Y. Zhang, M. Trevisan Moreira, M. Breuer,
N. Vilar Calazans, and P. Beerel, \Performance optimization and analysis of
blade designs under delay variability," in Asynchronous Circuits and Systems
(ASYNC), 2015 21st IEEE International Symposium on, May 2015, pp. 61{
68.
[89] H. Huang, H. Cheng, C. Chu, and P. A. Beerel, \Area optimization of timing
resilient designs using resynthesis," IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. 37, no. 6, pp. 1197{1210,
June 2018.
[90] Plasma CPU, 2014. Available: http://opencores.org/project,plasma.
[91] Embedded microprocessor benchmark consortium - coremark, 2014. avail-
able: http://www.eembc.org/coremark.
[92] \Primetime static timing analysis." [Online]. Available: https://www.
synopsys.com/implementation-and-signo/signo/primetime.html
[93] J. Kiniry, \Galois ultra low power high assurance asynchronous
crypto," 2016, nIST Lightweight Cryptography Workshop. [Online].
Available: https://www.nist.gov/sites/default/les/documents/2016/10/
19/kiniry-presentation-lwc2016.pdf
[94] R. Beaulieu, S. Treatman-Clark, D. Shors, B. Weeks, J. Smith, and
L. Wingers, \The simon and speck lightweight block ciphers," in 2015 52nd
ACM/EDAC/IEEE Design Automation Conference (DAC), June 2015, pp.
1{6.
107
[95] M. Herrera, T. Wang, and P. A. Beerel, \Blade-oc asynchronous resilient
template," in 2018 28th International Symposium on Power and Timing
Modeling, Optimization and Simulation (PATMOS), July 2018, pp. 147{154.
[96] M. Roncken, S. M. Gilla, H. Park, N. Jamadagni, C. Cowan, and I. Suther-
land, \Naturalized communication and testing," in 2015 21st IEEE Inter-
national Symposium on Asynchronous Circuits and Systems, May 2015, pp.
77{84.
[97] F. A. Kuentzer and A. M. Amory, \Fault classication of the error detection
logic in the blade resilient template," in 2016 22nd IEEE International Sym-
posium on Asynchronous Circuits and Systems (ASYNC), May 2016, pp.
37{42.
[98] F. A. Kuentzer, L. R. Juracy, and A. M. Amory, \On the reuse of timing
resilient architecture for testing path delay faults in critical paths," in 2018
Design, Automation Test in Europe Conference Exhibition (DATE), March
2018, pp. 379{384.
[99] W. Hua, R. N. Tadros, and P. A. Beerel, \Low area, low power,
robust, highly sensitive error detecting latch for resilient architectures," in
Proceedings of the 2016 International Symposium on Low Power Electronics
and Design, ser. ISLPED '16. New York, NY, USA: ACM, 2016, pp. 16{21.
[Online]. Available: http://doi.acm.org/10.1145/2934583.2934600
[100] REM: Reduced Energy Microsystems: http://www.remicro.com.
[101] M. Waugaman and W. Koven, \Sharp - a resilient asynchronous template,"
in 2017 23rd IEEE International Symposium on Asynchronous Circuits and
Systems (ASYNC), May 2017, pp. 83{84.
[102] \SBIR Phase I: An automated design
ow to build energy ecient vision
processing and machine learning chips for the internet of things." [Online].
Available: https://www.sbir.gov/sbirsearch/detail/1309819
[103] M. Gibiluka, M. T. Moreira, and N. L. V. Calazans, \A bundled-data asyn-
chronous circuit synthesis
ow using a commercial eda framework," in 2015
Euromicro Conference on Digital System Design, Aug 2015, pp. 79{86.
[104] G. Gimenez, A. Cherkaoui, G. Cogniard, and L. Fesquet, \Static timing
analysis of asynchronous bundled-data circuits," in 2018 24th IEEE Inter-
national Symposium on Asynchronous Circuits and Systems (ASYNC), May
2018, pp. 110{118.
108
Abstract
As advancements in process technology slow and the ubiquity of mobile and embedded devices increases, chip designers are looking to new technologies to stretch the energy efficiency of the standard silicon-based semiconductor. The increased focus on energy efficiency and decreased ability to provide efficiency through process shrinks has led the industry to look towards near-threshold and sub-threshold design paradigms to fuel the next generation of low-power devices. However, adding margins to address increased delay variation in these regions of operation often overshadows efficiency gains. Resilient circuit templates have been proposed as one general solution to reducing the required margins of synchronous designs, where timing violations are allowed and errors introduced in the datapath by these violations are corrected at a later time to ensure data integrity.

However, none of these synchronous resilient solutions have gained industry traction. Many have suffered from metastability or require modifying the architecture to add replay-based logic that recovers from timing errors, which leads to high timing error penalties and poses a design challenge in modern processors. Therefore, an asynchronous resilient circuit template, Blade, is proposed that addresses the concerns and problems of existing resilient solutions
Asset Metadata
Creator: Hand, Dylan (author)
Core Title: An asynchronous resilient circuit template and automated design flow
School: Viterbi School of Engineering
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publication Date: 02/14/2019
Defense Date: 10/26/2018
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: asynchronous circuits, automated design flow, Blade, bundled data, CAD flow, circuit templates, low-power circuits, OAI-PMH Harvest, Razor, resilient circuits, voltage scaling
Format: application/pdf (imt)
Language: English
Contributor: Electronically uploaded by the author (provenance)
Advisor: Pedram, Massoud (committee chair), Beerel, Peter (committee member), Golubchik, Leana (committee member)
Creator Email: d_hand@mac.com, dhand@usc.edu
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c89-121177
Unique identifier: UC11675774
Identifier: etd-HandDylan-7072.pdf (filename), usctheses-c89-121177 (legacy record id)
Legacy Identifier: etd-HandDylan-7072.pdf
Dmrecord: 121177
Document Type: Dissertation
Rights: Hand, Dylan
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA