Formal equivalence checking and logic re-synthesis for asynchronous VLSI designs
(USC Thesis Other)
FORMAL EQUIVALENCE CHECKING AND LOGIC RE-SYNTHESIS FOR
ASYNCHRONOUS VLSI DESIGNS
by
Hsin-Ho Huang
Doctor of Philosophy
UNIVERSITY OF SOUTHERN CALIFORNIA
FACULTY OF THE USC GRADUATE SCHOOL
(ELECTRICAL ENGINEERING)
December 2016
Copyright 2016 Hsin-Ho Huang
Dedication
To my parents, my family and my boyfriend
for their continuous love and support.
Acknowledgements
I would like to offer my sincere gratitude to my advisor, Professor Peter A. Beerel,
for his guidance and support, and for all the invaluable material that I learned
from him. I especially appreciate his enthusiastic dedication to my research and his
availability even during weekends and while traveling. I also thank Professor Chris
Chu, who guided the development of our mathematical programs based on gate
sizing. In addition, I extend my appreciation to the rest of my defense committee
members, Professor Shahin Nazarian and Professor Leana Golubchik, for their
valuable technical feedback on my work. I am also very thankful to my master's
degree advisor, Professor Youn-Long Lin, who introduced me to the VLSI/CAD
world.
Also, I acknowledge the industry and government support for my research. This
work was funded by a grant from Intel and a grant from NSF.
Next, I would like to thank my senior Arash Saifhashemi and my colleagues,
Dylan Hand and Matheus Trevisan, for their insightful feedback on my work.
Many ideas in this thesis are the result of hours of rewarding discussions with
them. Moreover, I wish to thank the other students in our research group who
have helped me implement my ideas. Specifically, Huimei Cheng developed and
implemented the resynthesis heuristic algorithm and also helped me with the
experimental results. I also appreciate the help of the USC Ming Hsieh Department
of Electrical Engineering staff, Annie Yu and Diane Demetras, for their support
and help.
Finally, I am very thankful to everyone who supported me during graduate
school, my parents, my family and my boyfriend and especially my wonderful
friends who made the graduate life at USC a joyful and unforgettable experience.
I specially thank Wonkyung Na, Jessica Wang, Anthony Liu, Yueh-tung Chao,
Yu-Hsin Wang, Yi-Shan Lee, Yolanda Lee, Chunny Lai, Kevin Tien, Willy Chen
and Jessi Chao.
Table of Contents
Dedication
Acknowledgements
List of Figures
List of Tables
Abstract
Chapter 1: Introduction
1.1 Asynchronous Design Flows
1.1.1 Forms of Asynchronous Communication Channels
1.1.2 Asynchronous Design Templates
1.1.3 Previous Existing Design Flows
1.1.4 Design Challenges
1.2 Contributions of Thesis
1.3 Organization
Chapter 2: Logic Equivalent Checking
2.1 Introduction
2.2 The Three-Valued Logic Model
2.2.1 3VL Operators
2.2.2 3VL Finite State Machine
2.2.3 3V FSM and Combinational Equivalence
2.2.4 Decomposed SVC-based Approach
2.2.5 Verification Scope And Limitations
2.3 Binary Coded Three Valued Logic (BC3VL)
2.3.1 Generating Data and Valid Bits for an Image Netlist
2.3.2 Generating Valid Bits for an Image RTL
2.4 Experimental Results
2.4.1 QDI Designs
2.4.2 Blade Designs
2.4.3 Microprocessor Case Study
Chapter 3: Resynthesis
3.1 Introduction
3.2 Brute-force Approach
3.3 The Naive Brute-force Approach
3.4 Area Optimization Notation and Terminology
3.5 Proposed Model-Based Area Optimization Approach
3.5.1 Mixed Integer Geometric Program Formulation
3.5.2 Extending to Blade latch-based design
3.5.3 Geometric Program Iterative Algorithm
3.5.4 Calculating Gate Resistance and Capacitance
3.6 Virtual Resynthesis Cell Library Method
3.7 Experimental Results
3.7.1 Flop-based Designs
3.7.2 Blade latch-based design
Chapter 4: Summary and Conclusions
4.1 Formal Verification of Asynchronous Design Flow
4.2 Optimization of Timing Resilient Design
Bibliography
List of Figures
Figure 1.1 Forms of asynchronous communication channels
Figure 1.2 (a) Four phase handshaking (b) Two phase handshaking
Figure 1.3 Asynchronous design templates
Figure 1.4 Proteus Flow
Figure 1.5 Sample specifications of Proteus flow
Figure 1.6 Blade design flow
Figure 2.1 ALU and accumulator conditional operation
Figure 2.2 Primitive functions in three-value logic
Figure 2.3 Asynchronous gates described in SVC
Figure 2.4 Finite state machine in three-value logic
Figure 2.5 Decomposed SVC
Figure 2.6 Values of tokens on channel C in each iteration of an SVC block (a) and the values of the corresponding variable C in the image RTL during each clock cycle (b)
Figure 2.7 An example of a token stalled across iterations
Figure 2.8 The 3V network model of the asynchronous netlist
Figure 2.9 Examples of BC3VL encoded cells: (a) logic cells, (b) RECEIVE and SEND, (c) DFF
Figure 2.10 Top-level decomposition of the CPU
Figure 3.1 Area improvements and error rate changes by brute-force method of s1196 in ISCAS89 benchmark
Figure 3.2 Area improvements and error rate changes by naive brute-force method of s9234 in ISCAS89 benchmark
Figure 3.3 An example circuit with arrival times: Combinational gates are in yellow and sequential gates are in red. T represents the minimum arrival time of output of gates.
Figure 3.4 Three kinds of resilient designs in [1]
Figure 3.5 Clock diagrams of master and slave latches
Figure 3.6 Pseudo-code of relaxation-based iterative algorithm
Figure 3.7 Example of how high and low thresholds vary across iterations
Figure 4.1 Ring implementation in SVC and gate-level
Figure 4.2 Multiple communications in one iteration and solution in SVC
Figure 4.3 Histogram of slack and cycles of latch 14 in s1196
List of Tables
2.1 BC3VL (Binary Coded 3VL)
2.2 LEC example statistics of QDI and Blade designs
2.3 LEC run times of QDI designs
2.4 LEC run times of Blade designs
3.1 Area improvement and run-time of different threshold settings of high EDL overhead
3.2 Virtual resynthesis cell library flop-based experimental results with different input netlist settings
3.3 Virtual resynthesis cell library latch-based experimental results with different input netlist settings
3.4 Circuit information after initial synthesis
3.5 Area improvement of flop-based design (%): (IA: Iterative algorithm, MIGP: Mixed integer geometric program, NBF: Naive Brute-force, BF: Brute-force, VL: Virtual Library)
3.6 Comparison of worst run-time over three EDL overheads of flop-based design
3.7 Error-rate (%) of flop-based design
3.8 Circuit information after retiming
3.9 Area improvement of Blade latch-based design (%): (IA: Iterative algorithm, MIGP: Mixed integer geometric program, NBF: Naive Brute-force, BF: Brute-force, VL: Virtual resynthesis cell library)
3.10 Comparison of worst run-time over three EDL overheads of Blade latch-based design
3.11 Error-rate (%) of Blade latch-based design
4.1 EDST of s1196 running at 2.85 GHz in a 28nm technology
4.2 Error type table of s1196 running at 2.85 GHz in a 28nm technology
4.3 Error correlation table of s1196 running at 2.85 GHz in a 28nm technology
Abstract
Asynchronous circuit design has long been considered a promising alternative to
synchronous design due to its potential for achieving lower power consumption,
higher robustness to process variations, and higher throughput. The lack of
commercial computer-aided design (CAD) tools, however, has been a major obstacle
to its widespread adoption. This thesis addresses two important CAD sub-problems
for asynchronous design: logical equivalence checking and re-synthesis.
Although some well-developed asynchronous design flows exist, they rely on
extensive dynamic simulation with unique test-bench and coverage tools to show
functional correctness, increasing risk and hampering the widespread adoption of
this technology. To address this issue, we propose a method for logical
equivalence checking (LEC) of asynchronous circuits using commercial synchronous
tools. In particular, we verify the equivalence of asynchronous circuits modeled
either with Communicating Sequential Processes (CSP) in SystemVerilog or at a
micro-architectural level using conditional communication library primitives. Our
approach is based on a novel three-valued logic model that abstracts the detailed
handshaking protocol and is thus agnostic to different gate-level implementations,
making it applicable to both Quasi Delay Insensitive (QDI) and bundled-data design
styles.
Process, voltage, and temperature variations force margins in synchronous
design, particularly at low and near-threshold voltages. Bundled-data resilient
designs promise to remove these large margins while at the same time taking
advantage of average-case path activity, at the cost of adding error detecting
logic (EDL) to near-critical paths to detect timing errors within a resiliency
window. Hence, we propose a logic optimization strategy called resynthesis in
which near-critical paths are sped up with a tighter max delay constraint to
reduce the amount of EDL needed and lower error rates at the cost of increasing
logic area. We propose several optimization approaches to solve this problem,
including synthesis heuristics and mixed integer geometric programming (MIGP)
formulations targeting area cost.
Chapter 1
Introduction
As the demand for faster and more integrated mobile electronic circuits grows, low
power consumption becomes an increasingly important design goal. With
the high popularity and versatility of mobile electronic devices, lower battery
usage has become a key requirement. In addition to increasing battery usage, higher
power consumption generates more heat, which is detrimental to the performance and
reliability of the system. The cost of packaging and cooling increases rapidly with
power dissipation. In today's VLSI technology, power consumption concerns
are so significant that the performance of microprocessors is now largely limited
by power [2].
It has been reported that the clock network accounts for about 20% to 40% of the
total power consumption in a typical synchronous circuit [3]. Asynchronous
circuits, in which there is no central clock signal, have been proposed to achieve
lower power consumption, higher throughput, and higher tolerance of process
variability. In particular, removing the clock network and replacing it with local
handshaking signals, if carefully done, can reduce power. Moreover, local
handshaking signals provide modularity to a design such that different blocks can
be individually optimized for the power, performance, and area they need when they
are active and remain idle when they are not needed, saving dynamic power [4].
This feature has led to many efficient asynchronous Network-on-Chip designs
[5–7] and other globally asynchronous locally synchronous solutions [8,9]. It
has been particularly useful in neuromorphic computing, in which the computing
platform is a massive 2D network of digital or analog neurons that send and
receive messages that model neuron interactions via an asynchronous communication
fabric [10–12].
Moreover, as semiconductor technology migrates to smaller geometries,
process variability increases, which makes it harder for designers to
distribute a global clock and global interconnect signals efficiently in their
circuits.
Synchronous designs must incorporate timing margin to ensure correct operation
under worst-case delays caused by not only process variation but also voltage
and temperature variations, and moreover cannot take advantage of average-case
path activity [1]. This is particularly problematic in low-power, low-voltage
designs, as performance uncertainty due to PVT variations grows from as much as
50% at nominal supply to around 2,000% in the near-threshold domain [13].
Past research and industrial efforts have shown that asynchronous
techniques can improve the throughput of a circuit [14–17] while at the same time
being more robust to process variability and environmental changes. Some
asynchronous circuits use completion detection [18] to detect the completion
of an operation in a pipeline stage, while others use matched delay lines [19–21].
In both cases, each pipeline stage informs neighboring stages as soon as its
processing is over. This is in contrast to the global worst-case delay assumption
in synchronous circuits. In theory, completion detection and delay lines can
provide both average-case delay and better tolerance against process variability.
The benefits of asynchronous circuits, however, come at the expense of
extra overhead, such as handshaking signals, completion detection trees,
distributed controllers, and at times matched delay lines and extra timing
assumptions. The extra overhead might even lead to a circuit with more area and
higher power consumption compared to synchronous implementations. Therefore,
it is essential for asynchronous design flows targeting low power to carefully
avoid excessive overhead to be able to compete with the equivalent synchronous
implementations.
1.1 Asynchronous Design Flows
1.1.1 Forms of Asynchronous Communication Channels
Asynchronous circuits are defined to be a network of processes communicating
through ports. When an input port and an output port are connected, there is a
communication channel between them. Hence, processes send and receive messages
using their communication channels.
Figure 1.1: Forms of asynchronous communication channels: (a) bundled-data channel, (b) 1-of-N channel
In a bundled-data channel, the sender uses an extra signal, request, to indicate
that the data has been updated on its output channel. The receiver monitors the
request signal and samples the data when it is high. Then, the receiver acknowledges
the receipt of the data by raising the acknowledgment signal. Since an arbitrary
number of data bits can be bundled together and sent using only two extra signals,
this protocol is called bundled data. Figure 1.1a shows a block diagram of Sender
and Receiver processes communicating on a bundled-data channel. Handshaking
protocols can be two-phase or four-phase. In four-phase handshaking, both the
request and acknowledgment lines reset to zero once the communication is done. In
two-phase handshaking, each communication action happens when the logical value of
the handshaking signal changes. Figure 1.2 shows timing diagrams for four- and
two-phase handshaking.
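To make the difference between the two protocols concrete, the following minimal Python sketch enumerates the (request, acknowledge) wire states visited while transferring tokens. The function names and the exact ordering of the return-to-zero events are illustrative assumptions and are not taken from any particular template in this thesis.

```python
def four_phase(n_tokens):
    """(req, ack) states visited by a return-to-zero handshake (assumed ordering)."""
    states = [(0, 0)]
    for _ in range(n_tokens):
        # req rises, ack rises, req falls, ack falls: both wires end at zero
        states += [(1, 0), (1, 1), (0, 1), (0, 0)]
    return states


def two_phase(n_tokens):
    """(req, ack) states visited when every signal transition carries meaning."""
    req = ack = 0
    states = [(req, ack)]
    for _ in range(n_tokens):
        req ^= 1                    # sender toggles request: new data is available
        states.append((req, ack))
        ack ^= 1                    # receiver toggles acknowledge: data was consumed
        states.append((req, ack))
    return states


def transitions(states):
    """Number of single-wire transitions in a state sequence."""
    return len(states) - 1


if __name__ == "__main__":
    print(transitions(four_phase(2)), transitions(two_phase(2)))
```

Under these assumptions, four-phase handshaking spends four signal transitions per token and always returns both wires to zero, while two-phase handshaking spends two transitions per token and leaves the wires at alternating levels.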
Figure 1.2: (a) Four-phase handshaking (b) Two-phase handshaking
Another common channel is the delay-insensitive channel. Instead of using
a separate request signal, in delay-insensitive channels the data itself implies
its validity. One form of delay-insensitive data coding is called 1-of-N, in which
$\log_2 N$ bits of data are encoded in N wires. Only one wire in a 1-of-N channel
can be high at any time. Figure 1.1b shows a 1-of-N channel.
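As a small illustration of 1-of-N coding, the Python sketch below encodes a value onto N wires and decodes it again. The helper names and the convention that wire i represents value i are assumptions made for this example; the all-low codeword is treated as "no data present".

```python
def encode_1_of_n(value, n):
    """Drive exactly one of n wires high; wire i high means value i (assumed)."""
    if not 0 <= value < n:
        raise ValueError("value does not fit in a 1-of-%d code" % n)
    return [1 if i == value else 0 for i in range(n)]


def decode_1_of_n(wires):
    """Return the encoded value, or None when all wires are low (no data)."""
    high = [i for i, w in enumerate(wires) if w]
    if not high:
        return None                      # neutral codeword: nothing to sample yet
    if len(high) > 1:
        raise ValueError("illegal codeword: more than one wire is high")
    return high[0]


if __name__ == "__main__":
    word = encode_1_of_n(2, 4)           # two bits of data carried on four wires
    print(word, decode_1_of_n(word))
```

Because validity is carried by the data wires themselves, the receiver needs no separate request signal: a wire going high is both the value and the notification.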
1.1.2 Asynchronous Design Templates
In an asynchronous digital circuit, processes are not synchronized using a global
clock signal. Instead, some form of handshaking is used, which implements
the communication actions on channels. To simplify the design, asynchronous
designers usually use asynchronous templates. An asynchronous template
is a standard form for a pipeline stage which, generally speaking, receives data,
performs computation on the input, and sends the result to the output. Here, we
introduce three asynchronous templates, the first using a dual-rail 1-of-2 channel
and the last two using the bundled-data protocol.
One common template for 1-of-N channels is the pre-charged half buffer
(PCHB), which implements gate-level pipelined quasi delay-insensitive (QDI)
circuits [22]. Figure 1.3a shows the schematic of a PCHB template. Each gate has a
completion detection unit at both its input (LCD) and its output (RCD). The
acknowledgement signals (Lack and Rack) are active-low signals. The operation of
the buffer is as follows. After the buffer has been reset, all data lines are low
and the acknowledgment lines, Lack and Rack, are high. Consequently, the en signal
is also high and the functional logic is ready to evaluate its inputs. When an
input arrives by one of the input rails going high, the functional block evaluates
and generates the output. The RCD and LCD check the validity of all the outputs
and inputs, respectively, and the corresponding C-element output goes low, lowering
the left-side acknowledgment Lack. After the right environment asserts Rack low
(pc = 0), acknowledging that the data has been received, the functional block
enters the pre-charge phase and resets its outputs to 0. After the left
environment resets the inputs and the right environment asserts Rack high, the
functional block is ready to compute on the next set of inputs.
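The PCHB description relies on a C-element to combine the two completion detectors. As a reminder of its behavior, the following Python sketch models a generic Muller C-element: the output changes only when all inputs agree and otherwise holds its previous value. The class name is hypothetical, and the sketch ignores the active-low polarity and exact wiring of the PCHB template.

```python
class CElement:
    """Behavioral model of a Muller C-element (a generic sketch, not the PCHB cell)."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        if all(v == 1 for v in inputs):
            self.out = 1            # every input is high: output rises
        elif all(v == 0 for v in inputs):
            self.out = 0            # every input is low: output falls
        # inputs disagree: keep the previous output (state-holding behavior)
        return self.out


if __name__ == "__main__":
    c = CElement()
    for lcd, rcd in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print((lcd, rcd), c.update(lcd, rcd))
```

The state-holding case is what lets the acknowledgment wait until both the inputs and the outputs have become valid, and later until both have returned to neutral.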
The biggest advantage of this template is its ability to achieve high throughput.
In particular, the cycle time of this template depends upon the functional logic but
generally varies from 14 to 18 transitions, which translates to over 1 GHz in 65 nm
technology [23]. QDI designs are also very robust to variations in process, voltage,
and temperature [24, 25]. However, dual-rail logic is area expensive and has high
switching activity due to its return-to-zero paradigm [26].
The second template is called Click [27]. The control circuitry of this template
follows a two-phase protocol which works as follows: initially, the outputs of all
gates (2 AND + 1 OR + 1 FF) are 0. When L.req goes high, the OR output
becomes 1 and triggers the flop output to become 1, which also makes L.ack and
R.req turn to 1. After some delay, R.ack turns to 1 and L.req becomes
0. This triggers the controller flop and changes R.req and L.ack to 0. Data
is stored in a flop that is triggered by the controller, and the request is
delay-matched with the slowest combinational path in order to avoid setup failures
at the flops. This template has been presented for linear pipelines; however, it
can be extended to more complex pipelines with forks, joins, cycles, and
conditional communication. The template has low area overhead, but the margins
necessary in the delay lines reduce the potential for significant performance
improvements in many applications.
Figure 1.3: Asynchronous design templates: (a) PCHB template, (b) Click template, (c) Blade template
The third template, called Blade [20], is implemented with bundled-data resilient
two-phase handshaking channels. In order to handle error recovery, two extra
error control signals are added beyond the regular request and acknowledgement,
as illustrated in Figure 1.3c. The pipeline stages in Blade use single-rail logic
followed by Transition Detecting Time Borrowing (TDTB) based error detecting
logic and two reconfigurable delay lines (of duration δ and Δ) instead of the one
delay line needed in Click. The stage-to-stage delay line is of duration δ and
controls when the TDTB goes transparent and begins to propagate data at the
output of the combinational logic to the next stage. The second delay line is of
duration Δ and defines a time window during which late transitions that violate
this assumption (i.e., timing errors) are allowed, which is called the timing
resiliency window (TRW). The Blade template improves average-case performance
by taking advantage of the fact that errors have a low probability of occurrence.
In the Click template, the cycle time is always limited by δ+Δ to cover critical
paths, while the Blade template can run at δ when there are no errors. A case study
on a Plasma CPU demonstrates over 20% average-case performance improvement
from taking advantage of this data dependency [20].
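The cycle-time comparison above can be sketched numerically. In the Python fragment below, `stage_delay` stands for the stage-to-stage delay line and `trw` for the timing resiliency window. The linear error model (average cycle time = stage delay + error rate × per-error penalty) and all parameter names are assumptions introduced for illustration; they are not the performance model of [20].

```python
def click_cycle_time(stage_delay, trw):
    """Click must always cover the critical paths: stage delay plus the window."""
    return stage_delay + trw


def blade_avg_cycle_time(stage_delay, error_rate, error_penalty):
    """Assumed model: error-free cycles take stage_delay; each timing error
    adds error_penalty on top of it."""
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be a probability")
    return stage_delay + error_rate * error_penalty


def speedup(stage_delay, trw, error_rate, error_penalty):
    """Ratio of Click cycle time to the assumed Blade average cycle time."""
    return click_cycle_time(stage_delay, trw) / blade_avg_cycle_time(
        stage_delay, error_rate, error_penalty
    )


if __name__ == "__main__":
    # Hypothetical numbers in arbitrary time units.
    print(speedup(stage_delay=1.0, trw=0.3, error_rate=0.1, error_penalty=0.3))
```

The sketch makes the tradeoff visible: the benefit shrinks as the error rate or the per-error penalty grows, which is why reducing error rates matters for average-case performance.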
More generally, a big advantage of both bundled-data design styles is that
the data path can be designed from standard gates that can be found in any
synchronous library. That means that no custom asynchronous cell library design
is necessary to support the protocol and also that it might be easier to use existing
tools for an automated synthesis flow. However, the bundled-data resilient style
still requires a few custom cells to efficiently build the error-detecting latch and
Blade control circuits.
The general disadvantage of bundled-data design styles is that they require
the use of delay-matched request lines and have complex timing assumptions that
need to be verified. These timing assumptions generally require more cumbersome
verification both pre- and post-layout. Moreover, in many applications the
delay lines need to be programmable and tuned post-silicon to mitigate process
variations. Finally, for two-phase design styles such as Blade, the fact that
successive tokens require requests with alternating values becomes a complication
that needs to be handled with care in complex non-linear pipelines with conditional
communication.
1.1.3 Previous Existing Design Flows
The design of asynchronous circuits often starts from a high-level specification
based on Communicating Sequential Processes (CSP) [28] or one of its variants [29].
The high-level description is decomposed into smaller communicating processes
until each process is sufficiently small or regular to be easily mapped to an existing
library of specialized asynchronous leaf cells and/or macros [23,30–32]. A netlist of
such small CSP cells is sometimes called an image netlist. These leaf cells can then
be implemented with a variety of asynchronous pipeline templates, as described
in Section 1.1.2.
One industrialized asynchronous ASIC CAD flow, based on the PCHB template, is
Proteus, a complete and commercially proven asynchronous ASIC flow [23] shown in
Figure 1.4. Its current academic version takes a high-level
description in a communicating sequential processes (CSP)-based [28] language
called SystemVerilogCSP (SVC) [33] at its front-end and generates a netlist of
asynchronous gates, which are then automatically placed and routed using
commercial physical design CAD tools.
Figure 1.4: Proteus flow
The Proteus flow has four steps:
• Converting to RTL (SVC2RTL): Convert the SVC specification, shown in
Figure 1.5a, into a synthesizable single-rail block, as shown in Figure 1.5b.
always begin
forever begin
L.Receive(d);
R.Send(d);
end
end
(a) Buffer in SVC description
(b) Synchronous RTL specification
Figure 1.5: Sample specifications of Proteus flow
• RTL synthesis: A synchronous logic synthesis tool treats the asynchronous
RECEIVE and SEND cells in WRAPPER as hard macros and synthesizes only
RTL-BODY into a network of image library cells at a target
clock frequency with preset I/O delays and output loads.
• Clustering & Pipeline Optimization: Replace the RECEIVE and SEND cells with
their real asynchronous equivalents and optimize the power and performance of
the synthesized WRAPPER.
• Place & Route: Instantiate the asynchronous netlist and pass it to a commercial
place-and-route tool for physical design.
Figure 1.6: Blade design flow
The other existing asynchronous design flow is Blade [20], which uses the
bundled-data resilient template. Unlike Proteus, the basic Blade flow starts with
any synchronous RTL specification, and it also generates a netlist of asynchronous
gates. The basic proposed CAD flow is shown in Figure 1.6 and has five steps:
• Synchronous Synthesis: The synchronous RTL is synthesized using
industry-standard tools to a standard flip-flop based design at a target clock
frequency with preset I/O delays and output loads.
• FF to Latch Conversion: The FFs are converted to conventional master-slave
latches by synthesizing the design using a fake library of standardized
D flip-flops (DFFs) that can be easily mapped to standard-cell latches.
• Latch Retiming: The latch-based netlist is then automatically retimed
using a target resiliency window that defines the maximum time borrowing
allowed. The retiming is average-case aware and optimized not only to hide
control overhead and improve area efficiency but also to minimize error rates.
• Resynthesis: The retimed netlist is then resynthesized to optimize the
expected area and performance of the final resilient netlist. In particular, a
naive brute-force approach, in which all latches that might trigger timing errors
are sped up one by one, is employed to find a suitable candidate latch.
• Blade Conversion: The resynthesized latch-based netlist is then converted
to the Blade template by removing clock trees and the near-critical normal
latches and replacing them with Blade controllers, delay lines, and error
detecting latches.
1.1.4 Design Challenges
This thesis addresses two challenges in previous design flows: one in the domain of
formal verification and the other in logic resynthesis.
Logical equivalence checking (LEC) has become an accepted requirement of
commercial synchronous design flows to quickly find bugs and add confidence in the
taped-out netlist. However, existing asynchronous design flows still rely
on extensive dynamic simulation with unique test-bench and coverage tools to show
functional correctness, increasing risk and hampering the widespread adoption of
this technology [34]. In the formal verification chapter, we extend the application
of LEC to asynchronous circuits, a requirement for widespread adoption of this
technology.
However, asynchronous designs that are specified with CSP-like specifications
support channels with conditional communication, which means tokens are sent
along channels only some of the time. Thus, in addition to needing a Boolean
value of 0 or 1 to model the value of data, as in synchronous designs, we also
need to model the absence of tokens to be able to reason about the equivalence
of such circuits. Towards this goal, we propose to use the value N to represent
the case when a channel sends no token. The challenge, however, is that
commercial synchronous LEC tools do not understand the value N. A later chapter
of this thesis addresses how to solve this issue.
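To illustrate why a third value helps, the Python sketch below models a channel value as 0, 1, or N and maps it to a pair of ordinary bits that a binary tool could compare. The (valid, data) pairing used here is an assumed encoding chosen for this example; the binary coding actually used in this thesis is defined in Chapter 2 and may differ.

```python
N = "N"  # symbol for "no token on the channel in this iteration"


def encode(value):
    """Map a three-valued symbol to an assumed (valid, data) bit pair."""
    if value == N:
        return (0, 0)               # data bit forced to 0 so N has one codeword
    if value in (0, 1):
        return (1, value)
    raise ValueError("expected 0, 1 or N")


def decode(bits):
    """Inverse of encode; any pair with valid == 0 is read as N."""
    valid, data = bits
    return data if valid else N


def conditional_send(condition, data):
    """A conditional communication: a token is sent only when condition holds."""
    return data if condition else N


def traces_equivalent(trace_a, trace_b):
    """Two per-iteration channel traces match if their binary encodings match."""
    return [encode(a) for a in trace_a] == [encode(b) for b in trace_b]


if __name__ == "__main__":
    spec = [conditional_send(c, d) for c, d in [(True, 1), (False, 1), (True, 0)]]
    print(spec, traces_equivalent(spec, [1, N, 0]))
```

Note that a trace that sends 0 and a trace that sends nothing are distinguished, which a plain Boolean model of the data wires cannot do.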
In the domain of resilience, it is well understood that resilient designs can
achieve higher average-case performance by removing margins [35]. However, this
benefit comes at the cost of the area and power overhead of error detecting logic,
which can be quite significant. For example, [36] reported that 55% of the
flip-flops in an ARM Cortex-M3 processor are sufficiently close-to-critical to
require them to be error-detecting. Moreover, [20] reports that about 46% of its
area overhead over the baseline synchronous design was due to error-detecting
logic. Choudhury et al. reported that TIMBER flip-flop error replay logic has an
area overhead of around 20% [37]. If these overheads are not properly managed, the
overall power savings that are achievable may be far from optimal. Minimizing the
amount of error detecting logic is thus critical to achieving the maximum desired
power reduction.
This thesis explores the domain of resynthesis to address this challenge. The
general idea is that some close-to-critical paths can be re-synthesized to a faster
target, making them non-critical. This reduces the amount of error-detecting logic
needed and may simultaneously reduce the error rate. However, these improvements
come at the cost of increasing logic area, and it is this tradeoff that represents
the underlying design space in resynthesis. In particular, the challenge in
resynthesis is that it is not obvious which combination of paths should be sped up
to achieve the best gains. Moreover, the performance of resilient designs varies
across different benchmarks that may exercise various critical paths with different
relative frequencies. Resynthesis thus needs to be guided by not only static timing
analysis but also dynamic timing analysis with typical usage patterns, which
becomes challenging to formally model and optimize.
Note that resynthesis methods are applicable to both synchronous and
asynchronous resilient design templates, but here the focus will be on the Blade
template. In the existing Blade [20] flow, we explored a naive brute-force method
in which near-critical paths were sped up one end-point at a time to find a
suitable single candidate end-point to speed up during logic resynthesis. This
method achieved significant area and performance benefits and was demonstrated to
be computationally practical, but it does not explore the benefits of speeding up
multiple end-points simultaneously and lacks a formal model from which we can
explore any notion of optimality. Hence, this thesis introduces a gate-sizing based
mixed integer geometric program to guide logic resynthesis in speeding up
combinations of near-critical paths. However, due to the complexity of the mixed
integer geometric program, the thesis also presents a heuristic algorithm and a
low-complexity virtual resynthesis cell library method, which manipulates the
synthesis cell library to model the tradeoff between logic area and EDL
flops/latches area.
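The limitation of speeding up one end-point at a time can be seen in a toy example. In the Python sketch below, every name and number is hypothetical: each end-point saves some EDL area if it becomes non-critical, and speeding it up requires upsizing a set of gates whose cost is paid once even when several end-points share them. This additive area model and the exhaustive subset search are illustrations only; they are not the MIGP formulation or the heuristics of this thesis.

```python
from itertools import combinations

# Hypothetical data: EDL area saved per end-point and the gates to upsize for it.
ENDPOINTS = {
    "e1": {"edl_saving": 4.0, "gates": {"g1", "g2"}},
    "e2": {"edl_saving": 4.0, "gates": {"g2", "g3"}},
    "e3": {"edl_saving": 2.0, "gates": {"g4"}},
}
GATE_COST = {"g1": 1.0, "g2": 3.0, "g3": 1.0, "g4": 1.0}  # extra logic area


def net_gain(subset):
    """EDL area saved minus logic area added; shared gates are counted once."""
    saved = sum(ENDPOINTS[e]["edl_saving"] for e in subset)
    gates = set().union(*(ENDPOINTS[e]["gates"] for e in subset))
    return saved - sum(GATE_COST[g] for g in gates)


def naive_brute_force():
    """Try each end-point alone and keep the single best candidate."""
    best = max(ENDPOINTS, key=lambda e: net_gain([e]))
    return best, net_gain([best])


def exhaustive():
    """Try every subset of end-points (exponential; only viable for toy sizes)."""
    names = sorted(ENDPOINTS)
    subsets = (c for r in range(len(names) + 1) for c in combinations(names, r))
    best = max(subsets, key=net_gain)
    return set(best), net_gain(best)


if __name__ == "__main__":
    print(naive_brute_force())
    print(exhaustive())
```

In this made-up instance the single-end-point search settles for a gain of 1.0, while speeding up all three end-points together yields 4.0 because e1 and e2 share the expensive gate g2. The exponential cost of the exhaustive search is the reason a formal model and cheaper heuristics are of interest.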
1.2 Contributions of Thesis
In the area of formal verification, this work provides a formal framework for the
logical equivalence verification of such manual and automatic decompositions and
optimizations. Our contributions are:
• Theory:
  - Model CSP-like specifications and implementations of asynchronous circuits
    using three-valued logic and define a notion of their logical equivalence.
  - Enable the use of commercial synchronous LEC tools to prove the equivalence
    of these circuits by showing how these three-valued models can be encoded
    into binary-valued logic networks.
• Application: Demonstrate the theory on several example designs using two
  different asynchronous templates: the Proteus QDI template and the Blade
  Resilient Bundled Data template. This involves transforming asynchronous
  SVC and gate-level netlists of the Proteus QDI and Blade BD flows into
  binary-encoded SVC and gate-level netlists that enable the use of synchronous
  commercial LEC tools to verify their equivalence.
In the area of re-synthesis for resilient circuits, our contributions are:
• Introduce a mixed integer geometric program (MIGP) framework to guide
  re-synthesis. The program identifies which paths logic synthesis tools should
  speed up in order to reduce the number of error-detecting latches as well as
  the error rate, reducing total area while increasing average performance.
  We also propose heuristic relaxation algorithms for solving the MIGP that reduce
  computation times and increase the size of solvable problems, while still
  achieving close to optimal results.
• Develop low-complexity alternatives using a novel virtual resynthesis cell
  library to trick commercial logic synthesis tools into automatically selecting
  between normal sequential gates and EDL flops/latches.
• Build all the above resynthesis approaches into the Blade design flow. Using this
  flow, we show extensive comparisons between the methods on a large set of
  benchmark circuits.
1.3 Organization
The remainder of this thesis is organized as follows. Chapter 2 presents our
logic equivalence checking model and its application to QDI circuits and
bundled-data/Blade designs. Chapter 3 presents our work on the mixed integer geometric
programming framework, a heuristic algorithm for the geometric program, and an
alternative library-based approach. It then compares all resynthesis optimization
methods on the ISCAS89 benchmarks. Chapter 4 concludes the work, reviewing
the thesis's contributions and describing possible future work.
Chapter 2
Logic Equivalent Checking
2.1 Introduction
Existing research efforts in formal verification of asynchronous circuits have been
divided into three broad camps: hazard-freedom or conformance checking,
equivalence checking, and property verification. In particular, many techniques try to
verify hazard-freedom/conformance of a gate-level implementation of a leaf-level
CSP cell, a requirement unique to asynchronous design [38,39]. Others target
notions of equivalence between two designs [40–42]. These approaches generally
cannot be used to compare CSP with decomposed versions because the decomposition
often introduces pipelining that changes the allowed sequence of events at
the external interface. Therefore, some researchers only check critical properties
on the final decomposed design [43,44].
Our proposed approach differs from previous work in the following ways. First,
it supports CSP-level behavioral designs and multiple gate-level
implementations based on various asynchronous templates. Second, compared
to [39], we explicitly support modules that have channels with conditional
communication. Third, our flow is based on commercial synchronous LEC tools,
which can validate moderate-sized and complex circuits in a reasonable time and
will likely continue to improve over time.
2.2 The Three-Valued Logic Model
An asynchronous system consists of a netlist of CSP processes that communicate
with each other via asynchronous channels. Each CSP process can be specified using
a form of SystemVerilog called SystemVerilogCSP (SVC) [33], or implemented as
a collection of asynchronous operators. We assume the behavior of each process
can be described in terms of iterations during which it repeatedly performs the
following three actions:
1. It receives from none, some, or all input channels. The receive is a blocking
action.
2. It performs calculations.
3. It sends to none, some, or all output channels. The send is a blocking action.
For example, Figure 2.1 shows a design in which, at each iteration, the ALU
conditionally performs an operation if the value received from E is 1. In each
iteration, the SPLIT module unconditionally receives the input data but
conditionally sends it to the ALU or Accumulator (ACC) module based on the value
of E. The ALU and ACC modules unconditionally wait for data on their inputs
before performing their operations. They may or may not receive an input in a
given iteration. The MERGE module conditionally receives input data from the
ALU or ACC module based on the same E value and unconditionally sends
the value to the output.
Figure 2.1: ALU and accumulator conditional operation
Unlike synchronous circuits, where each primary input/output has a Boolean
value of 0 or 1, asynchronous channels may either not communicate at all or
communicate a value of 0 or 1. We thus model the behavior of CSP netlists using
three-valued logic (3VL), which is based on the more general notion of multi-valued
logic [45]. In particular, 3V logic variables can take a value from the set
$T = \{0, 1, N\}$, where the value N models the condition of no communication
action on a channel.¹
¹ The value N should not be confused with the traditional concept of a don't care.
2.2.1 3VL Operators
Basic 2VL functions and operators can be defined on 3V variables, as shown
in Figure 2.2. The $\wedge$ and $\vee$ operators behave like the Boolean
AND and OR functions as long as none of the inputs is N; otherwise, their
output is N. The inverting operator $\neg$ is the same as the Boolean
inverter when the input is 0 or 1, but its output is N if its input is N. The output
of the equivalence operator is always 0 or 1 and never N. The combinational
operator (subscript $c$) is a general representation of a logic function with
multiple inputs and outputs. Each output has its own set of inputs, which is a
subset of all inputs. If none of an output's inputs is N, the output value is that
of the Boolean logic function; otherwise, the output value is N.
To model conditional communication, we define the RECEIVE operator (subscript $r$)
to behave like the identity function when the enable input E is 1. When E is 0,
the output R is 0 irrespective of the value of L, as in Figure 2.3a, and when E is N,
the output is N (Figure 2.2f). The SEND operator (subscript $s$) behaves like the
identity function when E is 1. When E is 0 or N, its output is N, as in Figure 2.3b.
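The truth tables of Figure 2.2 can be sketched as ordinary functions. The following is a minimal illustration, not the thesis tooling: the function names and the choice of the string "N" to represent the no-communication value are assumptions made for this example.

```python
# Three-valued logic (3VL) primitives over {0, 1, N}.
# Representing N by the string "N" is an assumption of this sketch.
N = "N"

def and3(a, b):
    """AND: N if any input is N, otherwise the Boolean AND."""
    return N if N in (a, b) else a & b

def or3(a, b):
    """OR: N if any input is N, otherwise the Boolean OR."""
    return N if N in (a, b) else a | b

def not3(a):
    """NOT: N stays N, otherwise the Boolean inverse."""
    return N if a == N else 1 - a

def equiv3(a, b):
    """EQUIVALENCE: always 0 or 1, never N (N is equivalent only to N)."""
    return 1 if a == b else 0

def receive3(l, e):
    """RECEIVE: identity on L when E is 1, constant 0 when E is 0, N when E is N."""
    if e == N:
        return N
    return l if e == 1 else 0

def send3(l, e):
    """SEND: identity on L when E is 1, N when E is 0 or N."""
    return l if e == 1 else N
```

Note the asymmetry between the two conditional primitives: RECEIVE always produces a valid value when its enable is 0, whereas SEND produces N.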
2.2.2 3VL Finite State Machine
We model the iteration-based behavior of the system using a 3VL finite state machine
(FSM), as defined below.
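          A
      | 0  1  N
  ----+---------
    0 | 0  0  N
B   1 | 0  1  N
    N | N  N  N
(a) $A \wedge B$ (AND)

          A
      | 0  1  N
  ----+---------
    0 | 0  1  N
B   1 | 1  1  N
    N | N  N  N
(b) $A \vee B$ (OR)

          A
      | 0  1  N
  ----+---------
    0 | 1  0  0
B   1 | 0  1  0
    N | 0  0  1
(c) $A \equiv B$ (EQUIVALENCE)

      | $\neg A$
  ----+---------
    0 | 1
A   1 | 0
    N | N
(d) $\neg A$ (NOT)

Inputs $(I_1, I_2, \ldots, I_n)$           | Outputs: $O_1$, $O_2$, ..., $O_n$
-------------------------------------------+------------------------------------------------------------
$\forall I_i \in I(O_j) \neq N$            | $f(\forall I_i \in I(O_1))$, $f(\forall I_i \in I(O_2))$, ..., $f(\forall I_i \in I(O_n))$
$\exists I_i \in I(O_j) = N$               | N, N, ..., N
(e) $(O_1, O_2, \ldots, O_n) = f_C(I_1, I_2, \ldots, I_n)$ (Combinational Logic)

          E
      | 0  1  N
  ----+---------
    0 | 0  0  N
L   1 | 0  1  N
    N | 0  N  N
(f) L r E (RECEIVE)

          E
      | 0  1  N
  ----+---------
    0 | N  0  N
L   1 | N  1  N
    N | N  N  N
(g) L s E (SEND)

Figure 2.2: Primitive functions in three-value logic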
    always begin
      forever begin
        E.Receive(e);
        if (e==1) L.Receive(d);
        else d=0;
        R.Send(d);
      end
    end
(a) RECEIVE

    always begin
      forever begin
        E.Receive(e);
        L.Receive(d);
        if (e==1)
          R.Send(d);
      end
    end
(b) SEND

Figure 2.3: Asynchronous gates described in SVC
Definition 2.1 (3VL finite state machine). A 3VL finite state machine is a 6-tuple
\[ (\Sigma, S, \delta, S_0, \Gamma, \lambda), \]
where:
• $\Sigma$ is a finite non-empty set of input minterms.
• $S$ is a finite non-empty set of states.
• $\delta : \Sigma \times S \to S$ is the next state function.
• $S_0 \subseteq S$ is the set of initial states.
• $\Gamma$ is a finite non-empty set of output minterms.
• $\lambda : \Sigma \times S \to \Gamma$ is the output function.
          Q
      | 0  1  N
  ----+---------
    0 | 0  0  0
D   1 | 1  1  1
    N | N  N  N
(a) $D\,F_n\,Q$ (non-persistent FF)

          Q
      | 0  1  N
  ----+---------
    0 | 0  0  0
D   1 | 1  1  1
    N | 0  1  N
(b) $D\,F_p\,Q$ (persistent FF)

Figure 2.4: Finite state machine in three-value logic
The input and output minterms and the next state function are derived from the
structure of the design. The input and output minterms are 3V assignments to the
primary input and output signals of the design. The state is the 3V assignment to
designated state variables. The next state function is obtained from the structure
of the logic driving the state variables but must be modeled with care.
In particular, consider the update of the Accumulator state in Figure 2.1 during
iterations in which the ALU operation is selected. Since there is no communication
with the Accumulator, the input variables to the Accumulator are N. This value
naturally propagates to the input of the Accumulator state variables. However,
to accurately model the physical circuit, we do not want the Accumulator state
variables to update to N in the next state, as would be the typical assumption in
traditional FSM models. Rather, we want them to preserve their original state value
because the asynchronous tokens emanating from this state are preserved until
consumed by the next valid inputs to the Accumulator. We refer to such special
state variables as persistent: by definition, a persistent state variable that starts
from a non-N value never updates to N. Persistent state variables (P-states) in our
model are similar to flip-flops in synchronous circuits whose clock is gated in a
given cycle, so that the flip-flop does not update its state.
2.2.3 3V FSM and Combinational Equivalence
Two FSMs are equivalent if starting from their respective initial states, they will
produce the same output sequence when they are given the same input sequence
[46].
In Boolean logic, if two sequential circuits share the same set of inputs, outputs,
and state-holding elements (e.g., flip-flops), then it can be shown that it is
sufficient to check their combinational portions for equivalence [47].
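The input/output definition of FSM equivalence given above can be illustrated with an explicit search over reachable state pairs. This is only a sketch under simplifying assumptions (one initial state per machine, deterministic machines, a small explicit input alphabet); it is not how commercial LEC tools work, and all names in it are hypothetical.

```python
from collections import deque

def fsm_equivalent(m1, m2, inputs):
    """Return True if two deterministic machines agree on every input sequence.

    Each machine is (initial_state, step), where step(state, x) returns
    (next_state, output). This interface is an assumption of the sketch.
    """
    (s1, step1), (s2, step2) = m1, m2
    seen = {(s1, s2)}
    queue = deque([(s1, s2)])
    while queue:
        a, b = queue.popleft()
        for x in inputs:
            next_a, out_a = step1(a, x)
            next_b, out_b = step2(b, x)
            if out_a != out_b:          # outputs diverge on a reachable state pair
                return False
            if (next_a, next_b) not in seen:
                seen.add((next_a, next_b))
                queue.append((next_a, next_b))
    return True

def toggle_bits(state, x):
    """Toggle machine whose state is a bit; the output is the new state."""
    nxt = state ^ x
    return nxt, nxt

def toggle_names(state, x):
    """Same behavior with differently named states."""
    nxt = {"even": "odd", "odd": "even"}[state] if x else state
    return nxt, 1 if nxt == "odd" else 0

def stuck_at_zero(state, x):
    """A machine that always outputs 0."""
    return 0, 0

same = fsm_equivalent((0, toggle_bits), ("even", toggle_names), [0, 1])
different = fsm_equivalent((0, toggle_bits), (0, stuck_at_zero), [0, 1])
```

When the two designs share the same state-holding elements, this search over state pairs is unnecessary, which is why checking only the combinational portions suffices in that case.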
To extend this notion to 3V FSMs with persistent states, we define the
persistent state operator $F_p$ in Figure 2.4: if the value of a persistent state
is non-N, it never updates to N, and an N on its input leaves the stored value
unchanged. We also define the $F_n$ operator for non-persistent state variables,
whose next state always equals its input.
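The difference between the two state operators of Figure 2.4 can be seen by replaying the same input sequence through both. This is a minimal sketch; the function names and the string encoding of N are assumptions of the example.

```python
N = "N"  # "no communication"; the string encoding is an assumption of this sketch

def f_n(q, d):
    """Non-persistent state (F_n): the next state is D, even when D is N."""
    return d

def f_p(q, d):
    """Persistent state (F_p): keep Q when D is N, otherwise take D."""
    return q if d == N else d

def run(update, q0, d_sequence):
    """Apply one update per iteration and return the visited states."""
    trace, q = [], q0
    for d in d_sequence:
        q = update(q, d)
        trace.append(q)
    return trace

inputs = [1, N, N, 0]                       # two iterations without communication
persistent_trace = run(f_p, 0, inputs)      # [1, 1, 1, 0]
nonpersistent_trace = run(f_n, 0, inputs)   # [1, "N", "N", 0]
```

The persistent trace mirrors the Accumulator behavior described earlier: the stored value survives the iterations in which the ALU path is selected.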
2.2.4 Decomposed SVC-based Approach
To model our SVC-based specifications using 3VL operators, we automatically
decompose them into a special form that explicitly exposes the conditional
communication and state variables, referred to as an image RTL [23]. This form
naturally defines the 3VL FSM by decomposing every SVC module into a clock-driven
RTL-Body wrapped in SEND and RECEIVE operators, as shown in Figure 2.5.
Figure 2.5: Decomposed SVC
The SVC description of the RTL-Body contains all the logic, including the state
variables, of the original SVC module, with additional logic to control the enable
signals of the RECEIVE/SEND modules. The SEND and RECEIVE modules are asynchronous
modules modeled by the 3VL RECEIVE and SEND operators (subscripts $r$ and $s$)
defined in Figure 2.2. The RTL-Body can be modeled by the 3VL combinational
operator (subscript $c$) and the state operators $F_p$ and $F_n$. In particular, in
each iteration, based on the value received on E, the RECEIVE primitive may or may
not receive from L, but it always sends a value on R.
The SEND primitive always receives from L, but based on the value received from
E, it may or may not send on R. Since we moved all conditional communication
actions into the surrounding RECEIVE/SEND modules, the RTL-Body becomes
an unconditional module, i.e., at each iteration it unconditionally receives on all
inputs and unconditionally sends on all outputs. Therefore, the transformation of
the RTL-Body into a synthesizable, synchronous FSM is straightforward [23].
Each iteration of the SVC description is thus conceptually mapped into one
clock cycle of the RTL image, as shown in Figure 2.6. If there is no communication
action on a channel C at iteration i, as illustrated in Figure 2.6a, the value
of the corresponding variable C in the image RTL at clock cycle i is N, as
illustrated in Figure 2.6b. This effectively translates our verification problem into
the synchronous domain.
Figure 2.6: Values of tokens on channel C in each iteration of an SVC block
(a, asynchronous SVC) and the values of the corresponding variable C in the image
RTL during each clock cycle (b, synchronous image RTL)
2.2.5 Verification Scope And Limitations
As mentioned earlier, in this work we do not verify the low-level handshaking
controllers, which is outside the scope of this thesis; we only verify equivalence
at the CSP level using the 3VL model. Our
work is somewhat analogous to what is done in synchronous circuits, where logical
equivalence is tested under the assumption that the clock tree operates correctly
and all timing assumptions, including setup and hold times, are met. In particular,
by comparing the 3VL models of SVC descriptions, we check logical equivalence
assuming that the handshaking circuitry used to implement the asynchronous
communication between modules is functioning correctly. This abstraction facilitates
the verification of SVC decomposition and channel-based optimizations and makes
the verification independent of the variety of potential gate-level implementations.
The gate-level implementations can be separately verified using a variety of known
techniques (e.g., [39]).
Figure 2.7: An example of a token stalled across iterations. (a) Iteration 1: the
token on $C_1$ can move no further than $C_3$. (b) Iteration 2: both tokens have
arrived. AND can complete its iteration.
In addition, our 3VL model cannot emulate the flow-control nature of asynchronous
processes; therefore, it cannot capture behaviors in which tokens are
stalled at the inputs of asynchronous blocks across iterations. This limitation is
formalized as iteration stall freedom (ISF) in [48] and is illustrated in Figure 2.7.
In this example, the environment module ENV alternates between sending tokens
on its two outputs on each iteration. The problem is that the first token stalls at
the AND module, as illustrated in Figure 2.7a, waiting for the token on its second
input, which only arrives during the second iteration, as illustrated in Figure 2.7b.
To model this behavior accurately, the value of this token must be stored across
iterations, which does not happen in our 3VL FSM model because the input of the
AND gate is not a state variable.
Figure 2.8: The 3V network model of the asynchronous netlist. (a) First cycle.
(b) Second cycle.
The corresponding 3VL model, however, does not reflect this behavior. Figures
2.8a and 2.8b show the first and the second iterations in the 3VL model, respectively.
In both iterations, the value of $C_4$ is N rather than 0. The reason for this
difference is that the asynchronous AND gate can delay the Receive on $C_3$ (and
hence stall the BUF gate) until the second token arrives on $C_2$, which does not
take place until the second iteration of the S module. During this stall, the BUF
keeps its output token on $C_3$. In the 3VL model, however, each node is a function
and cannot keep a value on its output when the values of its inputs change. Therefore,
the value 1 on $C_3$ in Figure 2.8 in the first iteration will not be stored and
cannot be reused in the second iteration.
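The stalled-token example can be replayed in a few lines. This is a sketch under assumptions: the netlist shape (C1 feeds a BUF that drives C3, and C3 and C2 feed the AND that drives C4) is read from the description above, and the token values 1 and 0 are chosen only so that the asynchronous circuit would produce the 0 mentioned in the text.

```python
N = "N"  # no communication; string encoding is an assumption of this sketch

def and3(a, b):
    """3VL AND: N if any input is N."""
    return N if N in (a, b) else a & b

def buf3(a):
    """3VL buffer: a pure function of its input, so it cannot hold a token."""
    return a

# (C1, C2) per iteration: ENV alternates between its two outputs.
# The token values 1 and 0 are assumed for illustration.
env_tokens = [(1, N), (N, 0)]

c4_trace = []
for c1, c2 in env_tokens:
    c3 = buf3(c1)                  # recomputed every iteration, nothing is held
    c4_trace.append(and3(c3, c2))

# What the asynchronous circuit computes once the stalled token meets the second one.
stalled_result = and3(env_tokens[0][0], env_tokens[1][1])
```

Here `c4_trace` is N in both iterations, while `stalled_result` is 0, which is the mismatch between the 3VL model and the asynchronous circuit.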
In short, this limitation amounts to requiring that the effective state of the
system be captured only in the state-holding elements and, in particular, not in
stalled tokens. We refer to systems that do not have this complication as Primary
Input ISF (PIISF) to emphasize that the stall-freedom requirement
is limited to primary inputs; that is, tokens propagated via state-holding elements
are allowed to stall because the state-holding elements are able to capture their
values across iterations. In practice, a large class of asynchronous systems,
including the asynchronous CPU and its decomposition described in our experimental
results section, are PIISF. Moreover, many non-PIISF systems likely consist of
sub-processes that satisfy PIISF and can thus be decomposed, optimized, and
verified separately. Further, non-PIISF systems can be formally verified to be
equivalent for all input patterns that satisfy PIISF.
Also, in asynchronous design, a channel can communicate more than once
per system iteration. However, because synchronous LEC tools do not support
multiple communications in one iteration, we constrain every channel to communicate
at most once per iteration; otherwise, the synchronous LEC tool would overwrite the
results of the earlier communications with the result of the last communication.
2.3 Binary Coded Three Valued Logic (BC3VL)
In order to use commercial Boolean equivalence checking tools, the next step of
our verification procedure is to encode each variable of a 3VL FSM using two bits,
valid (v) and data (d), as shown in Table 2.1. The resulting network is called a
Binary Coded Three Valued Logic (BC3VL) network.
Table 2.1: BC3VL (Binary Coded 3VL)

      BC3VL      |
  Valid   Data   | 3VL
  ---------------+-----
    1      0     |  0
    1      1     |  1
    0      0     |  N
    0      1     |  N
2.3.1 Generating Data and Valid Bits for an Image Netlist
Before we generate data and valid bits for an image netlist, we need to transform
the gate-level netlist into an image netlist. For QDI designs, the image netlist
is obtained by removing handshaking logic and converting the dual-rail datapath
into single-rail wires that represent the channels between the SEND, RECV, and
logic gates. For Blade designs, handshaking logic must also be removed, but the
datapath is already single-rail and thus need not be modified. After generating the
image netlist, we can start the BC3VL conversion. Figure 2.9 shows the BC3VL
encoding for a generic unconditional logic gate such as an AND, as well as a
RECEIVE, a SEND, and a DFF cell. Each port is encoded using two bits: valid
(v) and data (d). Each DFF is modeled with two bits: one for the data signal and one
for the valid signal.
The valid and data bit outputs of every other library cell are a function of the
valid and data bits of their inputs, as defined below.
Figure 2.9: Examples of BC3VL encoded cells: (a) logic cells, (b) RECEIVE and
SEND, (c) DFF
• A gate implementing an unconditional function $o = f(i_1, i_2, \ldots, i_n)$:
\[ o.v = i_1.v \wedge i_2.v \wedge \cdots \wedge i_n.v \]
\[ o.d = f(i_1.d, i_2.d, \ldots, i_n.d) \]
• A state variable function $s = f(i_1, i_2, \ldots, i_n)$:
  non-persistent:
\[ s.v = i_1.v \wedge i_2.v \wedge \cdots \wedge i_n.v \]
  persistent:
\[ s.v = 1 \]
  and in both cases:
\[ s.d = f(i_1.d, i_2.d, \ldots, i_n.d) \]
• A RECEIVE cell:
\[ r.v = (e.v \wedge \neg e.d) \vee (l.v \wedge e.v \wedge e.d) \]
\[ r.d = l.d \wedge e.d \]
• A SEND cell:
\[ r.v = l.v \wedge e.v \wedge e.d \]
\[ r.d = l.d \]
Notice that the input data signal of RECV ($l.d$) and the output data signal of
SEND ($r.d$) can be dual-rail for a QDI implementation.
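As a sanity check, the BC3VL equations above can be compared exhaustively against the 3VL truth tables of Figure 2.2 by decoding each (valid, data) pair with Table 2.1. The sketch below does this for an AND gate, a RECEIVE cell, and a SEND cell; the helper names and the string encoding of N are assumptions of the example.

```python
from itertools import product

N = "N"  # no communication; string encoding is an assumption of this sketch

def decode(v, d):
    """Table 2.1: valid=1 carries the data bit, valid=0 means N."""
    return d if v == 1 else N

# BC3VL cells: every port is a (valid, data) pair of bits.
def and_cell(a, b):
    (av, ad), (bv, bd) = a, b
    return av & bv, ad & bd

def receive_cell(l, e):
    (lv, ld), (ev, ed) = l, e
    rv = (ev & (1 - ed)) | (lv & ev & ed)
    return rv, ld & ed

def send_cell(l, e):
    (lv, ld), (ev, ed) = l, e
    return lv & ev & ed, ld

# 3VL reference behavior taken from Figure 2.2.
def and3(a, b):
    return N if N in (a, b) else a & b

def receive3(l, e):
    if e == N:
        return N
    return l if e == 1 else 0

def send3(l, e):
    return l if e == 1 else N

def matches(cell, reference):
    """Compare a BC3VL cell with its 3VL reference on all 16 input encodings."""
    codes = list(product((0, 1), repeat=2))   # all (valid, data) pairs
    return all(decode(*cell(x, y)) == reference(decode(*x), decode(*y))
               for x in codes for y in codes)

results = {
    "AND": matches(and_cell, and3),
    "RECEIVE": matches(receive_cell, receive3),
    "SEND": matches(send_cell, send3),
}
```

Both encodings of N, (0, 0) and (0, 1), are covered by the enumeration, so the check also confirms that the data bit is irrelevant whenever the valid bit is 0.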
2.3.2 Generating Valid Bits for an Image RTL
We transform the RTL-Body into a BC3VL RTL using SystemVerilog preprocessor
macros to add declarations of valid bits for each primary input, primary output,
and state variable, such that the resulting description can be used as the input
of commercial Boolean equivalence checkers. In particular, for each primary input
(output) $i$ ($o$), a new one-bit primary input $i_v$ ($o_v$) is added. Further, for
each state variable $s$, a new one-bit state variable $s_v$ is added. The value of
each added state variable is set to 1 upon reset, since our SVC semantics requires
that all state variables have non-N (i.e., valid) values upon reset [48]. The values
of $o_v$ and $s_v$, on the other hand, depend on the valid bits of primary inputs and
other state variables. If the state variable is persistent, however, the value of
$s_v$ is always 1. In particular, since the RTL-Body is unconditional, the valid bit
of each primary output $o$ is the AND of the valid bits of all inputs in the support
of $o$ (i.e., the set of variables on which $o$ actually depends). For example, for an
ALU with primary inputs I1, I2, and OP, the valid bit of the primary output O should
be added to the RTL-Body description as:
\[ O.v = I_1.v \wedge I_2.v \wedge OP.v \]
Our tool, called SVC23VLRTL, estimates the support set of each primary
output automatically. When it parses the SVC code, it records all inputs of the
statements or assignments as the support of their outputs. After parsing,
it flattens all support sets, leaving only primary inputs and state variables.
This support set may overestimate that of some possible implementations, which
can cause a false failure during equivalence checking. For this reason, SVC23VLRTL
also enables the support set estimate to be overridden by the user.
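The record-then-flatten idea can be sketched as follows. This is not the SVC23VLRTL implementation: the dictionary format, the function names, and the intermediate signals `sum` and `prod` are hypothetical and serve only to illustrate the flattening step on the ALU example.

```python
def support_of(signal, direct, leaves, _visiting=frozenset()):
    """Flatten the recorded dependencies of `signal` down to leaf variables.

    direct maps each assigned signal to the signals read by its assignments;
    leaves is the set of primary inputs and state variables.
    """
    if signal in leaves:
        return {signal}
    if signal in _visiting:            # guard against cyclic dependencies
        return set()
    result = set()
    for source in direct.get(signal, ()):
        result |= support_of(source, direct, leaves, _visiting | {signal})
    return result

def valid_expression(output, direct, leaves):
    """Build the AND of the valid bits of every variable in the support."""
    terms = sorted(support_of(output, direct, leaves))
    return f"{output}.v = " + " & ".join(f"{t}.v" for t in terms)

# Hypothetical dependencies recorded while parsing an ALU description.
direct = {
    "sum": {"I1", "I2"},
    "prod": {"I1", "I2"},
    "O": {"sum", "prod", "OP"},
}
leaves = {"I1", "I2", "OP"}

alu_valid = valid_expression("O", direct, leaves)
```

Because every signal read by any assignment is recorded, the result is a conservative superset of the true support, which is the overestimation discussed above.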
2.4 Experimental Results
To evaluate the effectiveness of the proposed approach, we first examine its
application on small computational blocks and then provide a case study verifying
the top-level decomposition of an asynchronous microprocessor whose design is
inspired by the seminal work of Martin [29]. We can check logical equivalence
between two SVC images, between an SVC image and a gate-level image netlist, and
between two gate-level image netlists.
We tested our proposed BC3VL method on several computational modules,
comparing different levels of their design:
1. ALU: A four-mode hierarchical arithmetic unit that implements AND, ADD,
SUB, and MULT operations.
2. ACCMULT: An accumulator/multiplier dual-mode arithmetic unit. Using
conditional communication, in each mode, data is only sent to the sub-unit
that is performing useful calculation.
3. LCM: A block that calculates the least common multiple of two inputs using
Euclid's iterative algorithm. This block has state variables.
4. DES-X: The DES-X algorithm is a variant of the Data Encryption Standard
(DES) in which a technique called key whitening is used to increase
the complexity of a brute-force attack. In particular, it takes 16 iterations
to compute the encrypted data and adds key whitening on the first and last
iterations, as suggested by the following equation.
\[ \text{DES-X}(M) = k_2 \oplus \text{DES}_k(M \oplus k_1) \tag{2.1} \]
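The structure of Equation 2.1 can be sketched independently of the block cipher itself. In the example below, `toy_cipher` is a placeholder and is not DES; it only stands in for a keyed 64-bit permutation so that the two whitening XORs can be shown. All names are assumptions of the sketch.

```python
MASK64 = 2**64 - 1

def toy_cipher(block, key):
    """Placeholder for DES_k. NOT DES: XOR with the key, then reverse the bytes."""
    x = (block ^ key) & MASK64
    return int.from_bytes(x.to_bytes(8, "big"), "little")

def des_x(cipher, message, k, k1, k2):
    """Key whitening as in Equation 2.1: k2 XOR cipher_k(message XOR k1)."""
    return k2 ^ cipher(message ^ k1, k)

message = 0x0123456789ABCDEF
k, k1, k2 = 0x133457799BBCDFF1, 0x0F0F0F0F0F0F0F0F, 0xA5A5A5A5A5A5A5A5

ciphertext = des_x(toy_cipher, message, k, k1, k2)
```

With `k1 = k2 = 0` the construction reduces to the underlying cipher, which makes the role of the two extra keys explicit.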
Table 2.2 shows more detailed information of these circuits.
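Example  | # of Inputs | # of Outputs | Data Width | # of Gates (QDI/Blade)
---------+-------------+--------------+------------+-----------------------
ALU      | 2           | 1            | 32         | 2100/-
ACCMULT  | 5           | 2            | 16         | 1800/543
         |             |              | 32         | 7500/1500
LCM      | 2           | 1            | 8          | 1950/363
         |             |              | 16         | 12000/1200
         |             |              | 32         | 70000/4400
DES-X    | 5           | 1            | 64         | 4842/2153
Table 2.2: LEC example statistics of QDI and Blade designs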
We separate the detailed comparison results for these circuits into QDI design
and Blade design subsections because of some implementation limitations. Note
that we chose smaller data widths for some SVC and image netlist comparisons because
verification exceeded our 12-hour timeout limit, and for some designs we had to use
partitioned comparison to avoid aborts in the LEC tool, whose run-time increases
exponentially.
2.4.1 QDI Designs
For the ALU, we first compared a 32-bit ALU described at the SVC level with
a decomposed implementation. The two ALU designs were given to the Cadence
Conformal verification tool and were declared to be equivalent. Second, we
checked the equivalence of these designs after each was fully synthesized into image
library cells. Further, we performed automatic decomposition based on operand
isolation [49] on the ALU design for power optimization and confirmed the
equivalence of the result with the original ALU and the decomposed ALU image netlists.
For the ACCMULT, LCM, and DES-X examples, we not only verified the SVC level
against the gate-level image netlist but also verified the results at the image
netlist level before and after reconditioning, an automatic power optimization
technique that moves logic through conditional communication primitives to save
power [48]. Table 2.3 shows the resulting CPU run times for all these comparisons.
Golden          | Revised           | G Level | R Level | Data Width | Run-Time (s)
----------------+-------------------+---------+---------+------------+-------------
ALU             | Decomposed        | SVC     | SVC     | 32         | 1.64
                | Decomposed        | Gate    | Gate    | 32         | 77.76
                | Operand Isolation | Gate    | Gate    | 32         | 3.42
ALU Decomposed  | Operand Isolation | Gate    | Gate    | 32         | 77.62
ACCMULT         | Reconditioning    | Gate    | Gate    | 32         | 1854
                |                   | SVC     | Gate    | 16         | 3.14
LCM             | Reconditioning    | Gate    | Gate    | 32         | 2625
                |                   | SVC     | Gate    | 16         | 12960
                |                   | SVC     | Gate    | 8          | 1.2
DES-X           | Reconditioning    | SVC     | Gate    | 64         | 2.95
                |                   | Gate    | Gate    | 64         | 2.455
Table 2.3: LEC run times of QDI designs
2.4.2 Blade Designs
In the Blade design flow, we start with an SVC flop-based design, then synthesize
it and replace every flop with master and slave latches. We then retime the slave
latches to meet our timing constraints and insert asynchronous resilient controllers.
To verify the logical equivalence of Blade designs, we separate the task into two
steps. First, we check equivalence between the SVC flop-based design and the
gate-level netlist of the un-retimed latch-based design. Here, we use the latch fold
command in Conformal to fold each master and slave latch pair back into a flop.
Second, we check equivalence between the un-retimed latch-based design and the
retimed latch-based design with transparent slave latches. Recall that in the Blade
design flow, only master latches represent the architectural state, and the slave
latches are retimed to improve performance. We compare the ACCMULT, LCM, and DES-X
examples both at SVC versus gate level and at gate versus gate level. Note that we
did not evaluate the ALU design example with the Blade flow because it has no
state-holding variables. Table 2.4 shows the CPU run-times of these comparisons.
Only the ACCMULT SVC-to-gate-level comparison required partitioned comparison, and
its run-time is much higher than the others. Overall, LEC of a Blade design is faster
than LEC of the QDI design with the same data width because QDI is dual-rail and,
once mapped to gate level, has a much larger number of gates than the Blade design,
as shown in Table 2.2.
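Golden   | Revised    | G Level | R Level | Data Width | Run-Time (s)
---------+------------+---------+---------+------------+-------------
ACCMULT  | un-retimed | SVC     | Gate    | 16         | 1000
         | retimed    | Gate    | Gate    | 16         | 1
         | retimed    | Gate    | Gate    | 32         | 2
LCM      | un-retimed | SVC     | Gate    | 8          | 1.23
         | retimed    | Gate    | Gate    | 8          | 0.75
DES-X    | un-retimed | SVC     | Gate    | 64         | 2.16
         | retimed    | Gate    | Gate    | 64         | 1.5
Table 2.4: LEC run times of Blade designs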
2.4.3 Microprocessor Case Study
As a more complex case study, we re-implemented Caltech's first asynchronous
microprocessor [29] in SVC with both QDI and Blade channel interfaces. Guided
by the known limitations of the formal verification process, we ensured that our
decomposition and top-level implementation have a one-to-one correspondence of state
variables. This means that instead of having additional local state variables, we
chose to synchronize computation through message passing on additional channels,
as illustrated in Figure 2.10.
Figure 2.10: Top-level decomposition of the CPU
Interestingly, our initial decomposition falsely failed to conform to the top-level
implementation because of some unreachable states of the design. However, after we
added constraints to the source Verilog telling the verification tool not to consider
these unreachable states, a common practice in synchronous LEC flows [50], the
tool proved equivalence in 31 seconds.
Chapter 3
Resynthesis
3.1 Introduction
Resilient designs offer the promise of removing the increasingly large margins due to
process, voltage, and temperature variations while at the same time taking advantage
of average-case path activity, gracefully slowing down in the presence of timing
errors. However, to do this, resilient schemes must add expensive error-detecting
logic (EDL) to near-critical paths to detect and flag timing errors when execution
completes within a resiliency window. Thus, the error-detecting logic must represent
a relatively small fraction of the design, and the error rate must be relatively
small, to achieve the expected power and performance benefits. Several EDA techniques
have been proposed to minimize the probability of timing errors [51–54]. Liu et
al. [51] proposed reshaping the delay distribution of near-critical paths to reduce
timing errors. Dynatune [52] optimizes throughput by selectively choosing low-$V_{th}$
gates to speed up near-critical paths. Ye et al. [53] and Kahng et al. [54] both use
clock skew scheduling to reduce timing errors.
One proposed EDA technique to minimize EDL overhead and error rates is
resynthesis: speeding up near-critical paths during logic synthesis with a tighter
max-delay constraint to reduce the EDL, and subsequently the error rate, at the cost
of increased logic area. During resynthesis, many types of optimizations may
be explored, including gate sizing, logic restructuring, and repeater insertion. We
consider four alternative optimization approaches: a brute-force (all combinations)
approach, a naive brute-force speed-up-one-path-at-a-time approach, a mixed integer
geometric programming (MIGP) framework based on gate sizing, and a virtual resynthesis
cell library approach that tricks commercial logic synthesis tools into speeding up an
effective subset of near-critical paths.
The most direct optimization, brute force, is to speed up every combination of
near-critical end-points and pick the one with the best area improvement. However,
as circuit size grows, its run-time complexity becomes too high. Hence, we
introduce the latter three approaches.
In our proposed naive brute-force approach, near-critical paths are sped up one
end-point at a time to find a suitable single candidate end-point to speed up during
logic resynthesis. Naive brute force is much faster than brute force because it does
not explore the benefits of speeding up multiple end-points simultaneously; as a
result, its area improvement may be far from optimal.
Hence, we developed a mathematical model, a mixed integer geometric programming
(MIGP) framework, which includes both the normal combinational/sequential
logic area and the required EDL area overhead. Moreover, it accounts for
the logic shared across paths and can thus guide which particular
near-critical paths to resynthesize more accurately than the naive brute-force method.
However, the run-time of the mathematical model is still quite long as circuit size
grows. Lastly, we therefore introduce the virtual resynthesis cell library approach.
We modify the cell libraries to trick the synthesis tools into accurately and
automatically selecting between normal sequential gates and EDL flops/latches. This
method has low complexity, requiring only one synthesis run, and effectively
minimizes the total area of the circuit, including the area overhead of the
error-detecting logic.
The following sections of this chapter review the brute-force approach with
all combinations of end-points sped up, the naive brute-force approach, area
optimization using our MIGP framework, a low-complexity library-based approach,
and comparisons of the achieved area improvements. Section 3.5.2 then discusses
extensions to support latch-based resilient design.
3.2 Brute-force Approach
Before running our brute-force approach, we have the synthesis tools generate
a report of near-critical paths whose end-points need to be terminated with EDL.
The brute-force approach speeds up every combination of such end-points, one
combination at a time, by constraining each combination of end-points to complete
before the timing resiliency window begins and resynthesizing the design. However,
as the number of near-critical paths grows, the run-time of the brute-force approach
increases exponentially and becomes impractical.
Figure 3.1: Area improvements and error rate changes by brute-force method of s1196
in ISCAS89 benchmark
In Figure 3.1 we show all combinations of speeding up near-critical end-points of
s1196. After synthesis, there are 3 near-critical end-points, which require 7
resynthesis runs, one per non-empty combination ($2^3 - 1 = 7$); however, two
combinations overlap in the plot, so only 6 points are visible. The best point,
highlighted in red in Figure 3.1, yields a 12.49% area improvement and a 0.25%
improvement in error rate. Note that the potential benefits of this resynthesis
approach depend heavily on the initial starting frequency, i.e., a design that is
already heavily constrained cannot easily be constrained further to achieve area and
performance benefits.
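The enumeration behind the brute-force approach can be sketched as a loop over all non-empty subsets of end-points. This is only an illustration: `resynthesize` is a hypothetical callback standing in for one synthesis run, and the area numbers in `toy_area` are made up for the example and are not results from this thesis.

```python
from itertools import combinations

def nonempty_subsets(end_points):
    """Yield every non-empty combination of end-points (2^n - 1 of them)."""
    for size in range(1, len(end_points) + 1):
        yield from combinations(end_points, size)

def brute_force(end_points, resynthesize):
    """Return the combination with the smallest total area.

    resynthesize(subset) -> total area; a hypothetical wrapper around one
    resynthesis run with the given end-points constrained.
    """
    return min(nonempty_subsets(end_points), key=resynthesize)

def toy_area(subset):
    """Made-up area model: constrained end-points cost logic, the rest cost EDL."""
    speedup_cost = {"ep0": 4, "ep1": 9, "ep2": 3}
    edl_cost = 6
    logic = 100 + sum(speedup_cost[ep] for ep in subset)
    edl = edl_cost * (len(speedup_cost) - len(subset))
    return logic + edl

end_points = ["ep0", "ep1", "ep2"]
runs = len(list(nonempty_subsets(end_points)))   # 7 runs for 3 end-points
best = brute_force(end_points, toy_area)
```

Since each subset corresponds to a full synthesis run, the exponential growth in the number of subsets is what makes this approach impractical for larger circuits.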
3.3 The Naive Brute-force Approach
The same as brute-force approach that we have the synthesis tools generated a
report of near-critical paths whose end-points need to be terminated with EDL.
However, the naive brute-force approach is to speed up only one such end-point
at a time by constraining this end-point to complete before the timing resilient
window begins and resynthesizing the design.
Although the combinational area may increase due to tighter constraints on
certain paths, this overhead can be offset if multiple flops/latches that were slated
to become error-detecting are no longer on near-critical paths. Unfortunately, the
high degree of shared paths in the combinational logic makes it challenging to
estimate the reduction in EDL, i.e., constraining one latch/flop may also speed
up shared paths to many other latches/flops. Moreover, the reduction of EDL
combined with faster combinational logic may lead to a reduced frequency of
timing violations during simulation, which improves the average performance of the
circuit.
Without reliable methods of estimating these two effects, it is difficult to know
a priori which near-critical paths to further constrain; therefore, the naive
brute-force approach simply tests all near-critical paths one by one to find a
suitable candidate.
Figure 3.2: Area improvements and error rate changes by naive brute-force method of
s9234 in ISCAS89 benchmark
The best point, highlighted in red in Figure 3.2, yields a 13% area improvement,
and 0.37% improvement in error rate.
3.4 Area Optimization Notation and Terminology
This section explains the notation and terminology used in our MIGP resynthesis
approach. We are given a gate-level VLSI circuit with combinational gates (C)
and sequential gates (S), with gate sizes (z). Each gate has a nominal area (A), a
list of input/output pins (I/O) and a list of fanout gate and pin pairs (FO). Each
input pin of a gate has a nominal resistance (R), a nominal input capacitance
(Cin), and a fanin gate (FI). The delay $D_{i_k}$ of the $k$-th pin of gate $i$ with
size $z_i$ is modeled using an Elmore delay model:
$$D_{i_k} = \alpha \, \frac{R_{i_k}}{z_i} \sum_{j_l \in FO(i)} \mathit{Cin}_{j_l} \, z_j \tag{3.1}$$
in which $z_i = 1$ represents the nominal size of the gate and $\alpha$ is a constant
scaling factor, typically set to 0.69.
Every gate $i$ has an arrival time ($T_i$) at its output. For simplicity, we assume
an arrival time ($T_i$) of 0 for primary inputs and that delay paths start at sequential
gates. We then calculate the arrival time of each combinational gate as follows:
$$T_i \ge \max_{\forall k \in I(i)} \left( D_{i_k} + T_{FI(i_k)} \right) \tag{3.2}$$
As an example, Fig. 3.3 shows an illustration of a simple circuit and the arrival
times of all of its gates. Here, every pin of a gate has the same Cin, R and D.
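To make Equations 3.1 and 3.2 concrete, the following Python sketch propagates arrival times through a tiny three-gate loop. The netlist, the dictionary layout, and the helper names (`make_pin`, `fanout_load`, `arrival_times`) are illustrative assumptions rather than the data structures of the actual tool flow, and the scaling factor is written as `ALPHA`.

```python
ALPHA = 0.69  # constant scaling factor quoted in the text

def make_pin(r, cin, fanin):
    """One input pin: nominal resistance, input capacitance, driving gate."""
    return {"r": r, "cin": cin, "fanin": fanin}

def fanout_load(gates, name):
    """Sum of Cin * z over every input pin driven by gate `name`."""
    return sum(p["cin"] * g["z"]
               for g in gates.values()
               for p in g["pins"]
               if p["fanin"] == name)

def arrival_times(gates, order):
    """Propagate arrival times. `order` must list the sequential gates
    first, followed by the combinational gates in topological order."""
    arrival = {}
    for name in order:
        gate = gates[name]
        load = fanout_load(gates, name)
        # Eq. 3.1: per-pin Elmore delay of this gate.
        delays = [ALPHA * p["r"] / gate["z"] * load for p in gate["pins"]]
        if gate["sequential"]:
            # Delay paths start at sequential gates.
            arrival[name] = max(delays)
        else:
            # Eq. 3.2 taken with equality; primary inputs arrive at 0.
            arrival[name] = max(d + arrival.get(p["fanin"], 0.0)
                                for d, p in zip(delays, gate["pins"]))
    return arrival

# Hypothetical circuit: FF1 -> G1 -> G2 -> FF1, all gates at nominal size.
gates = {
    "FF1": {"z": 1.0, "sequential": True, "pins": [make_pin(1.0, 1.0, "G2")]},
    "G1": {"z": 1.0, "sequential": False, "pins": [make_pin(1.0, 1.0, "FF1")]},
    "G2": {"z": 1.0, "sequential": False, "pins": [make_pin(1.0, 1.0, "G1")]},
}
arrival = arrival_times(gates, ["FF1", "G1", "G2"])
print(arrival)
```

In this model, upsizing a gate divides its own pin delays by its size but increases the load seen by its fanin gate, which is the trade-off the geometric program later exploits.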
The size of the speculative window (W) in resilient designs bounds the maximum
increase in performance they can achieve when there is no timing error. If
the path end-point has an arrival time within the speculative window, it must be
terminated with an error detecting latch/flop. In Fig. 3.3, assuming the synthesis
clock cycle (P) is 8 and we aim for a 30% resiliency window, the timing window
W is 2.4 time units wide, spanning the times 5.6 to 8. Then, the sequential gates
Figure 3.3: An example circuit with arrival times: Combinational gates are in yellow
and sequential gates are in red. T represents the minimum arrival time of output of
gates.
$G_9$ and $G_{10}$ must be error detecting because the arrival times of their inputs
($T_6$, $T_7$) are greater than 5.6.
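The window arithmetic in this example can be checked with a few lines of Python. The function names are illustrative, and the arrival times 6.0 and 5.0 are hypothetical values chosen to fall inside and outside the window; they are not read from Figure 3.3.

```python
def speculative_window(period, fraction):
    """Return (start, end) of a resiliency window covering the last
    `fraction` of the clock period."""
    width = fraction * period
    return period - width, period

def needs_edl(arrival, period, fraction):
    """An end-point must be error detecting when its input arrives
    inside the speculative window."""
    start, _ = speculative_window(period, fraction)
    return arrival >= start

start, end = speculative_window(period=8.0, fraction=0.3)
print(round(start, 2), end)       # window spans 5.6 to 8
print(needs_edl(6.0, 8.0, 0.3))   # hypothetical late arrival
print(needs_edl(5.0, 8.0, 0.3))   # hypothetical early arrival
```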
Our approach assumes the relative additional cost associated with each latch
or flip-flop that must be error detecting is X. In particular, our ideal area model
is:
$$\min \left( \sum_{i \in C,S} (A_i z_i) + X \sum_{j \in S} e_j \right) \tag{3.3}$$
For each gate $i$, the logic area is proportional to its size $z_i$. The first part of the
area model is the sum of the total logic area of all gates assuming all sequential elements
are not error detecting. The second part of the area model is the additional area
associated with the error detecting logic (EDL). The variable $e_j$ is a binary variable
whose value is determined by whether the end-point of the paths ending in the $j$-th
sequential element must be error detecting or not.
In particular, the determination of whether the $j$-th sequential element must be
error-detecting or not is based on the arrival time of its input. If the arrival time
is prior to the speculation window, the paths that end at this sequential element
are not close to critical and the sequential element need not be error-detecting,
i.e., $e_j = 0$. Otherwise, the arrival time must be within the speculation window
$[(P - W), P]$ and the sequential element must be error-detecting, i.e., $e_j = 1$. More
mathematically, for all $j \in S$, we have
$$e_j = \begin{cases} 1, & \text{if } T_i \ge P - W, \ \forall i \in I(j) \\ 0, & \text{otherwise} \end{cases} \tag{3.4}$$
We propose to solve this problem using geometric programming with some
modifications discussed in Section 3.5.
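As a small illustration of the ideal area model, the Python sketch below evaluates the objective of Equation 3.3 for a fixed sizing, deriving each $e_j$ from the input arrival time as in Equation 3.4. The gate names, areas, and arrival times are made-up values, and the function is an evaluation helper only, not an optimizer.

```python
def ideal_area(areas, sizes, input_arrivals, period, window, edl_cost):
    """Evaluate Eq. 3.3 for a fixed sizing.

    areas / sizes  : nominal area A_i and size z_i for every gate
    input_arrivals : arrival time at the data input of each sequential gate
    edl_cost       : the per-element EDL overhead X
    """
    logic_area = sum(areas[g] * sizes[g] for g in areas)
    # Eq. 3.4: e_j = 1 when the input arrives inside the window.
    e = {j: 1 if t >= period - window else 0
         for j, t in input_arrivals.items()}
    return logic_area + edl_cost * sum(e.values()), e

# Hypothetical four-gate design: two combinational, two sequential gates.
areas = {"g1": 2.0, "g2": 3.0, "ff1": 4.0, "ff2": 4.0}
sizes = {g: 1.0 for g in areas}
total, e = ideal_area(areas, sizes, {"ff1": 6.0, "ff2": 3.0},
                      period=8.0, window=2.4, edl_cost=2.0)
print(total, e)
```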
It is also important to emphasize that the specific structure of the error
detecting latches/flops varies among resilient designs and consequently has different
associated overheads. Bowman [1] analyzed the three types shown in Fig. 3.4. a) The
simplified razor flip-flop: a conventional master-slave flip-flop with an additional shadow
latch and XOR that detects the difference between the flop and latch [55]. b)
A transition detector and time borrowing (TDTB) latch: a conventional
time-borrowing latch, an XOR to detect errors, and a C-element to hold the value of
errors [56]. c) A time-borrowing latch with a shadow master-slave flop and an XOR.
Each of these structures yields an error signal. The error signals of multiple error
detecting latches/flops within a pipeline stage must be combined with some type
of OR-gate to produce an error signal for an entire pipeline stage. Moreover, some
type of synchronizer is often needed in the control path to address metastability.
For example, in the asynchronous resilient scheme Blade proposed by Hand et al.,
Q-Flops were used to sample the error signal in a metastability-safe manner and
these also contribute to the EDL overhead [20]. Moreover, techniques to make
the TDTB more sensitive to glitches come at an additional area cost [56]. This
suggests there may be a tradeoff between EDL overhead and robustness. For these
reasons, our experiments are conducted with a range of different values of X
representing low, medium, and high values of the EDL area overhead. Namely, we
choose X to be 0.5, 1, and 2 times the area of a minimum-sized sequential gate.
Figure 3.4: Three kinds of resilient designs in [1]
3.5 Proposed Model-Based Area Optimization Approach
In this section, we solve the problem of minimizing logic and EDL area subject
to a performance constraint using geometric programming. Geometric programming
enables large-scale non-linear mathematical problems to be solved but
requires both the objective function and the inequality constraints to be
posynomials [57, 58]. Section 3.5.1 shows how we formulate the resynthesis problem
described above as a mixed integer geometric program. Section 3.5.3 then explains
how we efficiently solve the mixed integer geometric program by relaxing the
integer variables to be real and using an iterative geometric program to find what
our experiments indicate are close-to-optimal integer solutions.
3.5.1 Mixed Integer Geometric Program Formulation
In Section 3.4 we described a mathematical model that uses binary variables $e_j$ to
determine whether sequential element $j$ needs to be error-detecting, governed by the
constraint:
$$T_i - W e_j \le P - W, \tag{3.5}$$
where $i$ is the data input of the $j$-th sequential element. That is, $e_j$ can be 0 only
if the arrival time at $i$ is prior to the speculative window. Moreover, if $e_j$ is 1, the
arrival time at $i$ must still be before the cycle time $P$. Unfortunately, the subtraction
on the left hand side of the constraint makes the constraint not posynomial [58].
To address this problem we perform a change of variables, introducing a new
variable $\mathit{ne}$ as follows:
$$e_j = 2 - \mathit{ne}_j, \quad 1 \le \mathit{ne}_j \le 2, \quad \forall j \in S \tag{3.6}$$
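The delay constraint now becomes
$$T_j + W \, \mathit{ne}_i \le (P + W), \tag{3.7}$$
which is posynomial. Here $j$ denotes the data input of sequential element $i$,
the indexing also used in Equation 3.10.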
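Substituting $\mathit{ne}_j$ into the EDL portion of the objective described in Equation
3.3 yields:
$$X \sum_{j \in S} (2 - \mathit{ne}_j),$$
which is unfortunately also non-posynomial. We thus make an approximation to
the objective function, creating the complete mixed integer geometric program as
follows:
Minimize
$$\sum_{i \in C,S} (A_i z_i) + X \sum_{j \in S} \left( \frac{2}{\mathit{ne}_j} - 1 \right)$$
Subject to:
$$D_{i_k} = \alpha \, \frac{R_{i_k}}{z_i} \sum_{j_l \in FO(i)} \mathit{Cin}_{j_l} \, z_j, \quad \forall k \in I(i), \ \forall i \in C,S \tag{3.8}$$
$$\begin{cases} T_i \ge \max_{\forall k \in I(i)} \{ D_{i_k} + T_{FI(i_k)} \}, & \forall i \in C \\ T_i = D_i, & \forall i \in S \end{cases} \tag{3.9}$$
$$T_j + W \, \mathit{ne}_i \le (P + W), \quad \forall i \in S, \ j \in FI(i) \tag{3.10}$$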
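Bounds:
$$\begin{cases} LB_i \le z_i \le UB_i, & \forall i \in C \\ z_i = 1, & \forall i \in S \end{cases} \tag{3.11}$$
$$1 \le \mathit{ne}_i \le 2, \quad \forall i \in S, \ \mathit{ne}_i \in \mathbb{Z} \tag{3.12}$$
$$0 \le T_i \le P, \quad \forall i \in C,S \tag{3.13}$$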
More specifically, the EDL part of the modified area cost function for sequential
element $i$ is changed from the non-posynomial form $2 - \mathit{ne}_i$ to the posynomial form
$(2/\mathit{ne}_i - 1)$. This keeps the cost function the same for all possible (integer) values
of $\mathit{ne}_i$.
In particular, when $e_i$ is 1, $\mathit{ne}_i$ will be 1 and $(2/\mathit{ne}_i - 1)$ will be 1, the same as $e_i$.
When $e_i$ is 0, $\mathit{ne}_i$ will be 2 and $(2/\mathit{ne}_i - 1)$ will be 0, the same as $e_i$.
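The agreement of the two cost terms at the integer points, and their difference in between, can be checked numerically. The short Python sketch below is only a sanity check of the substitution; the function names are illustrative.

```python
def exact_edl_term(ne):
    """Original (non-posynomial) EDL term: e = 2 - ne."""
    return 2 - ne

def approx_edl_term(ne):
    """Replacement term used in the MIGP objective: 2/ne - 1."""
    return 2 / ne - 1

for ne in (1, 2):
    # Identical at the only two values an integer ne can take.
    print(ne, exact_edl_term(ne), approx_edl_term(ne))

# For a relaxed (non-integer) ne the two terms differ.
print(exact_edl_term(1.5), approx_edl_term(1.5))
```

The two forms coincide only at $\mathit{ne} = 1$ and $\mathit{ne} = 2$, so the approximation matters only once the integrality constraint is relaxed, as in the iterative algorithm of Section 3.5.3.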
Note that the constraints in Equations 3.8 - 3.9 implement the same Elmore
delay model and arrival time described in Section 3.4. Moreover, Equations 3.11 -
3.12 show the bounds of all variables that can be set to avoid unrealistic changes
in size. Lastly, $T_i$ needs to be less than or equal to $P$ for all combinational elements $i$
to maintain the desired clock period.
3.5.2 Extending to Blade latch-based design
In Section 3.5.1 we described all equations and constraints for flop-based design,
but note that in the Blade flow, each flop is converted to two latches and the slave
latches are re-timed, which partially allows time-borrowing. This enables the control
overhead associated with the Blade template to be hidden from the critical cycle time [59].
Figure 3.5: Clock diagrams of master and slave latches
Unfortunately, re-timing may also change the structure of the gate-level design
and make the optimal flop-based design described above far from an optimal
starting point for latch-based designs. Therefore, we modified the timing constraints
in the MIGP to directly support latch-based design with time-borrowing partially
allowed. This involves new timing constraints to support time-borrowing latches.
Based on the clock timing diagram in Figure 3.5 of master (MS) and slave (SS) latches,
we only need to replace Equation 3.13 with Equation 3.14 for latch-based design, in
which B is the window within which we allow time-borrowing on slave latches.
$$\begin{cases} 0 \le T_i \le P, & \forall FO(i_k) \in C, MS \\ 0 \le T_i \le (B + W), & \forall FO(i_k) \in SS \end{cases} \tag{3.14}$$
iteration = 0; L = 1; H = 2;
while (L < H) {
    if all (ne_j == 1 || ne_j == 2) break;
    L = lth * iteration + 1; H = 2 - hth * iteration;
    foreach j in sequential gates {
        if (ne_j <= L) ne_j = 1;
        else if (ne_j >= H) ne_j = 2;
    }
    iteration++;
}
% thresholds cross each other
L = lth * (iteration - 1) + 1; H = 2 - hth * (iteration - 1);
Middle = (L + H) / 2;
foreach j in sequential gates {
    if (ne_j <= Middle) ne_j = 1;
    else if (ne_j > Middle) ne_j = 2;
}
Figure 3.6: Pseudo-code of relaxation-based iterative algorithm
3.5.3 Geometric Program Iterative Algorithm
The integral constraint on $\mathit{ne}_i$ generally adds significant computational complexity
to the mathematical program because it is handled using computationally
expensive branch-and-bound techniques [57]. To address this, we propose a more
efficient solution that allows these variables to take any real value between 1 and 2
within an iterative outer loop. In particular, after each iteration we use a high
threshold and a low threshold to force some $\mathit{ne}_i$ variables to binary values for future
iterations. After setting some variables to be integral, we squeeze the high and low
thresholds closer together and repeat. We keep running the geometric program
iteratively until the high threshold and low threshold cross or all ne variables are
Figure 3.7: Example of how high and low thresholds vary across iterations
set to integer values. The pseudo-code of this relaxation-based algorithm is shown
in Figure 3.6.
Figure 3.7 shows an example of how the high and low thresholds are varied
across iterations. If the high threshold step (Hth) is 0.1 and the low threshold step
(Lth) is 0.2, then any ne variable whose value after the 1st run is greater than 1.9
(less than 1.2) is fixed to 2 (1) for the next iteration. In the
second iteration, we force ne variables greater (less) than 1.8 (1.4) to be 2 (1).
In this example, there is a maximum of 4 iterations, at which time the high and
low thresholds cross. When the high and low thresholds cross, we find the mid-point by
averaging the high threshold and low threshold of the previous iteration. Then, if variable
ne is greater than the mid-point, ne is set to 2. Otherwise, it is set to 1.
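The threshold schedule described in this example can be sketched in Python as follows. This is a minimal sketch that follows the prose description (thresholds of 1 + n*Lth and 2 - n*Hth after the n-th run, and a mid-point taken from the previous iteration when they cross). The `solve_relaxed` callback is a stand-in for the relaxed geometric program, and `toy_solver` with its values is purely hypothetical.

```python
def iterative_rounding(solve_relaxed, lth, hth):
    """Round relaxed ne values to {1, 2} with shrinking thresholds.

    solve_relaxed(fixed) must return a dict of ne values in [1, 2],
    honoring the already-fixed entries passed in `fixed`.
    """
    fixed = {}
    prev_low, prev_high = 1.0, 2.0
    n = 1
    while True:
        ne = solve_relaxed(dict(fixed))
        low, high = 1 + n * lth, 2 - n * hth
        if low >= high:
            # Thresholds crossed: split at the previous mid-point.
            mid = (prev_low + prev_high) / 2
            result = {j: (2 if v > mid else 1) for j, v in ne.items()}
            result.update(fixed)
            return result
        for j, v in ne.items():
            if v <= low:
                fixed[j] = 1
            elif v >= high:
                fixed[j] = 2
        if len(fixed) == len(ne):
            return fixed  # every variable is already integral
        prev_low, prev_high = low, high
        n += 1

def toy_solver(fixed):
    """Hypothetical relaxed solution that never changes between runs."""
    relaxed = {"ff1": 1.05, "ff2": 1.95, "ff3": 1.66, "ff4": 1.3}
    relaxed.update(fixed)
    return relaxed

result = iterative_rounding(toy_solver, lth=0.2, hth=0.1)
print(result)
```

With Lth = 0.2 and Hth = 0.1 the thresholds are (1.2, 1.9), (1.4, 1.8), (1.6, 1.7) and then cross on the fourth run, so the remaining variable is split at the mid-point 1.65. In a real flow the solver would return different relaxed values each time the fixed set grows.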
The threshold steps play an important role in the quality of the results.
Iterations with small threshold steps tend to keep critical paths more similar to those of
the solution with integer ne variables. However, this comes at the cost of needing
more iterations and thus higher run-times. On the other hand, larger threshold
steps can finish faster but have a higher chance of forcing ne variables to values that
differ greatly from those of the integer program.
To further analyze the impact of different threshold settings, Table 3.1 shows
the area improvement, the difference in area improvement from the integer program, and
the wall-clock run-time of three different threshold settings for the high EDL overhead
case: A (Hth = 0.1, Lth = 0.4), B (Hth = 0.05, Lth = 0.2), C (Hth = 0.01,
Lth = 0.1). In some circuits, the area improvement of the iterative program
is better than that of the integer program. This may be because of differences in logic
synthesis optimizations other than gate sizing, such as restructuring and repeater
insertion.
3.5.4 Calculating Gate Resistance and Capacitance
Note that in the geometric program, we use the same Elmore delay model as
discussed in Section 3.4. For each gate in the original synthesized netlist, we
obtain its nominal area (A) from the synthesis library. For each pin of each gate,
we obtain its nominal input capacitance from the synthesis library and its nominal
pin-to-pin delay from the initial synthesis timing report. We then use our Elmore
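Table 3.1: Area improvement and run-time of different threshold settings for high EDL
overhead (TO: timed out)
| Circuit | Area Improvement % (A) | Area Improvement % (B) | Area Improvement % (C) | Difference from MIGP (A) | Difference from MIGP (B) | Difference from MIGP (C) | Clock Run-time (A) | Clock Run-time (B) | Clock Run-time (C) |
|---|---|---|---|---|---|---|---|---|---|
| s1196 | 11.43 | 11.43 | 11.43 | 0 | 0 | 0 | 29s | 46s | 121s |
| s1238 | 7.83 | 8.8 | 7.83 | -0.97 | 0 | -0.97 | 49s | 99s | 408s |
| s1423 | 23.03 | 22.3 | 23.18 | -2.57 | -3.3 | -2.42 | 41s | 83s | 214s |
| s1488 | -0.21 | -0.21 | -0.21 | 2.87 | 2.87 | 2.87 | 52s | 85s | 197s |
| s5378 | 23.36 | 23.36 | 23.36 | -1.77 | -1.77 | -1.77 | 2.6m | 3.7m | 7.6m |
| s9234 | 31.67 | 31.67 | 33.22 | 0 | 0 | 1.55 | 1.5m | 3.6m | 5.6m |
| s13207 | 7.75 | 8.75 | 7.95 | -1 | 0 | -0.8 | 10m | 12m | 26m |
| s15850 | 16.33 | 16.18 | 13.99 | 0.15 | 0 | -2.19 | 15m | 22m | 61m |
| s35932 | 24.39 | 24.39 | 24.39 | 0 | 0 | 0 | 3hr | 3hr | 3hr |
| s38417 | 23.40 | 23.40 | 21.16 | TO | TO | TO | 3.2hr | 5hr | 9.6hr |
| s38584 | 13.05 | 14.11 | TO | TO | TO | TO | 8.3hr | 14.4hr | TO |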
delay model with z = 1 to back-calculate the nominal resistance (R) of each pin
of each gate. Based on this pin-to-pin model, the geometric program can calculate
the delay with different sizes. Eq. 3.9 is the same as we described in Section 3.4
for both combinational and sequential gates.
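The back-calculation amounts to inverting the Elmore expression at nominal size. A minimal Python sketch, assuming the scaling factor 0.69 from Section 3.4 and hypothetical delay and capacitance values, is:

```python
def nominal_resistance(pin_delay, fanout_cins, alpha=0.69):
    """Invert the Elmore model at z = 1 for every gate involved:
    D = alpha * R * sum(Cin)  =>  R = D / (alpha * sum(Cin))."""
    load = sum(fanout_cins)
    if load <= 0:
        raise ValueError("pin drives no capacitive load")
    return pin_delay / (alpha * load)

def scaled_delay(r, z_gate, fanout, alpha=0.69):
    """Re-evaluate the delay for new sizes; fanout holds (Cin, z) pairs."""
    return alpha * (r / z_gate) * sum(cin * z for cin, z in fanout)

# Hypothetical pin: reported delay 0.69 while driving two 0.5-unit pins.
r = nominal_resistance(0.69, [0.5, 0.5])
print(r)
print(scaled_delay(r, 2.0, [(0.5, 1.0), (0.5, 1.0)]))  # gate upsized 2x
```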
3.6 Virtual Resynthesis Cell Library Method
We propose an alternative optimization approach that involves creating virtual
resynthesis cell libraries that essentially trick the synthesis tool into automatically
selecting between normal sequential gates and EDL flops/latches. In particular, we
duplicate all sequential gates in the cell library. One group is treated as normal
sequential gates with setup time modified to respect a specified timing resiliency
window; the other group represents EDLs and has its area modified to include
the expected EDL area overhead. This method has low complexity, requiring only
one synthesis run, and effectively minimizes the total area of the circuit, including
the area overhead of the error detecting logic.
Because the details of the optimization algorithm of the synthesis tool are
unknown, we tried several different input netlist settings. After regular synthesis,
all end-points are mapped to regular sequential gates with the unmodified cell
library, regardless of whether the end-point should end with an EDL or not. The first
input netlist setting we explored (VLN) is where we do not change the input
netlist before reading in the virtual library. Thus, all end-points are mapped to
nonEDL sequential gates before we re-compile the netlist. Because these nonEDL
gates now have setup times that mimic the TRW, timing at these end-points may
be violated. Hence, the synthesis tool needs to perform timing optimization to fix the
violations. The second input netlist setting (VLE) is where we change all end-points to
end with EDLs. These EDLs have higher area but do not have higher setup times.
The synthesis tool can now optimize area by mapping non-near-critical
end-points back to nonEDL sequential gates. The third input netlist setting (VLR)
is where we map the sequential gates to either nonEDLs or EDLs based on their
timing. If the end-point is near-critical, the sequential gate is mapped to an
EDL; otherwise, it stays as a nonEDL.
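The three starting points can be summarized in a few lines of Python. The function below is an illustrative sketch of the initial mapping only (the subsequent re-compile is left to the synthesis tool), and the end-point names and arrival times are hypothetical.

```python
def initial_mapping(setting, arrivals, period, window):
    """Map each end-point to 'EDL' or 'nonEDL' before re-compiling.

    arrivals maps an end-point name to its input arrival time.
    """
    if setting == "VLN":      # netlist left unchanged
        return {ep: "nonEDL" for ep in arrivals}
    if setting == "VLE":      # every end-point starts as an EDL
        return {ep: "EDL" for ep in arrivals}
    if setting == "VLR":      # mapped according to timing
        return {ep: "EDL" if t >= period - window else "nonEDL"
                for ep, t in arrivals.items()}
    raise ValueError(f"unknown setting: {setting}")

arrivals = {"ff1": 6.0, "ff2": 3.0}   # hypothetical arrival times
for setting in ("VLN", "VLE", "VLR"):
    print(setting, initial_mapping(setting, arrivals, 8.0, 2.4))
```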
Tables 3.2 and 3.3 show our experimental results for flop-based and latch-based
designs, respectively. In both cases, we apply re-synthesis to the three different
input netlist settings for the ISCAS89 benchmark suite. If we examine the average
area improvement across the benchmark suite, we see that the average area
improvement of VLN with both flop-based and latch-based design is larger than that of
the other settings in both the high and medium overhead cases. The difference in the low
overhead case is relatively small. This can be explained by the observation that
the synthesis tool focuses on timing optimization more than area optimization.
Note that, based on these results, we use the VLN setting when comparing the virtual
resynthesis library approach to other resynthesis approaches.
3.7 Experimental Results
We implemented our algorithm using Perl and TCL scripts that interface the
Synopsys Design Compiler framework to the YALMIP MIGP solver [57] for MATLAB
and evaluated it on ISCAS89 benchmark circuits. We compare the area improvement of
resynthesis obtained through the brute-force, naive brute-force, mixed integer
geometric program, iterative algorithm and virtual resynthesis cell library approaches
with three different estimates of EDL overhead.
As mentioned earlier, all our proposed approaches speed up some near-critical
end-points through resynthesis. For brute-force, we speed up each combination of
near-critical end-points and take the best area improvement among all combinations,
while for naive brute-force, we speed up each near-critical end-point independently
and take the best area improvement among all end-points. Moreover, our
mixed integer geometric program and iterative algorithm determine which
near-critical paths should be sped up to minimize area. For these four approaches, we
add set_max_delay timing constraints on those paths/end-points to constrain them
to be non-near-critical after re-synthesis with Design Compiler. However,
since the synthesis tool does not understand EDL area overhead, it might slow
down existing non-near-critical paths to optimize logic area, and these paths might
then require EDL and its associated overhead. Hence, we also force those
non-critical paths to remain non-critical using additional set_max_delay constraints
for brute-force, MIGP and IA. We did not force those non-critical paths to remain
non-critical in the naive brute-force method because we want to keep the same settings
as in [20].
For the virtual resynthesis cell library method, we duplicate all sequential cells
and separate them into two groups. For one group, we add the timing resilient window
W to its setup time table (regular sequential gates), and for the other group we add the
EDL overhead X to its area (EDL sequential gates). We let the synthesis tool
choose between regular and EDL sequential gates. If the synthesis tool decides an
end-point should end with a regular sequential gate, then, because of the modified setup
time, the required arrival time will be $P - W$, which makes the end-point non-critical. On the
other hand, if the synthesis tool decides to end the end-point with an EDL sequential
gate, it automatically adds the EDL area overhead into the total area.
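The library duplication can be sketched in Python as below. The `SeqCell` record, the `_REG`/`_EDL` name suffixes, and the scalar setup time are simplifying assumptions for illustration; a real cell library stores setup time as a lookup table, and this sketch does not read or write any library file format.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SeqCell:
    name: str
    area: float
    setup: float  # simplified to a single number

def virtual_library(cells, window, edl_overhead):
    """Duplicate every sequential cell into a regular and an EDL variant."""
    library = []
    for cell in cells:
        # Regular variant: setup time lengthened by the resiliency window W.
        library.append(replace(cell, name=cell.name + "_REG",
                               setup=cell.setup + window))
        # EDL variant: area increased by the EDL overhead X.
        library.append(replace(cell, name=cell.name + "_EDL",
                               area=cell.area + edl_overhead))
    return library

# Hypothetical flop with area 4.0 and setup time 0.1.
lib = virtual_library([SeqCell("DFF", 4.0, 0.1)], window=2.4, edl_overhead=2.0)
for cell in lib:
    print(cell)
```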
For the geometric programs, we calculate a gate's upper and lower sizing bounds
by comparing the current size to the minimum/maximum size of gates with the
same functionality. For example, let the area of gate $i$ in the gate-level netlist be
$A_i$ and assume all gates with the same functionality have a minimum size with area
($\min A_i$) and a maximum size with area ($\max A_i$). Then, $LB_i$ and $UB_i$ are
calculated as in Eq. 3.15.
$$LB_i = \frac{\min A_i}{A_i}, \quad UB_i = \frac{\max A_i}{A_i} \tag{3.15}$$
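Equation 3.15 can be illustrated with a one-function Python sketch; the cell areas used here are hypothetical drive-strength variants of a single function.

```python
def sizing_bounds(current_area, family_areas):
    """Eq. 3.15: bounds on z_i relative to the gate's current size.

    family_areas lists the areas of all library cells that implement
    the same logic function as the gate.
    """
    return min(family_areas) / current_area, max(family_areas) / current_area

# Hypothetical cell family with four drive strengths; the gate
# currently uses the variant with area 2.0.
lb, ub = sizing_bounds(2.0, [1.0, 2.0, 4.0, 8.0])
print(lb, ub)
```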
Moreover, for the iterative algorithm, we set Hth = 0.05 and Lth = 0.2 because
this setting achieves reasonable run-time and similar results to the optimal integer
program, as shown in Table 3.1.
For all experimental results, we set W = 0.3P, modeling a resiliency window
that is 30% of the original clock period. For naive brute-force and brute-force, we
speed up at least one near-critical end-point, and if the overall area increases over all
runs, we report the improvement as a negative number. For MIGP and IA, if the
near-critical end-point list proposed for resynthesis is different from the list before
resynthesis, we run resynthesis and report the area improvement even if it is negative;
however, if the expected end-point list is the same as before resynthesis, MIGP and IA do
not run resynthesis and report the area improvement as 0%. For the VL approach,
resynthesis is done first, and if the end-point list after resynthesis is different
from before resynthesis, we report the area improvement even if it is negative,
but if the list remains the same, we report 0% improvement.
In the following, we first report the results for flop-based resilient designs and
then for latch-based Blade circuits.
3.7.1 Flop-based Designs
Table 3.4 shows the size of chosen benchmark circuits and how much EDL is needed
after normal logic synthesis, prior to starting our resynthesis algorithm.
In Table 3.5, we show the achieved area improvements from our iterative
algorithm and mixed integer geometric program, the previously-proposed naive brute
force method [20], the brute force method, and the virtual resynthesis cell library method.
The logic synthesis tool reports the logic area and how many paths are near-critical
based on the value of W and therefore need to terminate with EDL. We then calculate the
resulting area by summing the logic area and total EDL area overhead and report the
area improvement by comparing the calculated areas before and after resynthesis for
the geometric programming and brute-force approaches. For the virtual resynthesis
cell library method, since we already added the EDL area overhead into the cell library,
the logic area from the synthesis report already includes the EDL area. For the naive
brute-force approach, we speed up near-critical paths one at a time and we only
show the best area improvement among all of them, as described in [20], while
for the brute-force method, we speed up each combination of near-critical paths one at
a time and report the best area improvement among all combinations. Note that, due
to the complexity of the brute-force method, it completed only on the smallest
circuits in the benchmark suite.
When we compare results from the GP with those obtained via brute-force, the
GP achieved the second best area improvement of all brute-force combinations for
benchmark circuits s1196 and s1238. Moreover, there are originally two critical
end-points (A and B) in s1238, so there are three combinations explored in the
brute-force approach: speed up A, speed up B, and speed up A and B. The results
showed that speeding up A also speeds up B and speeding up B also speeds up A,
but speeding up B yields a higher area improvement than speeding up A or speeding
up both A and B. However, the GP only guides resynthesis by indicating that both
end-points A and B should be sped up. It does not know that speeding up only A
or B can also speed up the other. Hence, when the GP guided resynthesis, we added
set_max_delay constraints on both end-points A and B, which gave a worse area
improvement than speeding up B only. For s1488, the resynthesis tool improved area
even when we added tighter timing constraints, which we did not expect and did not
model in our GP.
When we compare results between the iterative algorithm and naive brute-force,
except for circuit s1488, our iterative algorithm achieves a larger area improvement
than the naive brute-force approach with all different EDL area overheads.
Careful analysis of the s1488 re-synthesized circuits indicates that the synthesis tool
in this case unexpectedly reduced area using non-gate-sizing techniques, including
re-structuring, which is not modeled in our approach. On average, our iterative
algorithm achieves a 9.53% larger area improvement than naive brute-force with high EDL
area overhead, 4.08% larger improvement with medium EDL area overhead, and
1.11% larger improvement with low EDL area overhead. Moreover, the area
improvement of the iterative algorithm is similar to that of the mixed integer
geometric program but is obtained an average of 5 times faster. In fact, for several of the
largest circuits the MIGP timed out (TO) after 24 hours of wall clock time. This
suggests our iterative algorithm is an effective approach to solving the MIGP. Note that
the geometric programming, naive brute-force and brute-force approaches all
use multi-threaded computation, so Table 3.6 reports both the worst wall-clock and CPU
run-time among the three overheads. Although our iterative algorithm is slower
than naive brute-force, the run-times are still reasonable.
Finally, we compare the results between the iterative algorithm and the
virtual cell library method. Although the tradeoff between logic area and EDL area
overhead is made explicit in the virtual cell library method, Design Compiler does
not seem to perform well on sequential cell optimization. For the high EDL area
overhead case, the virtual cell library method achieved on average only 53% of the
area improvement of the iterative algorithm. For the medium EDL area overhead
case, it achieved 63%, and for the low EDL area overhead, it achieved 78% of the
area improvement of the iterative algorithm. Across all EDL area overheads, the
virtual cell library method achieves around 2/3 of the area improvement of the
iterative algorithm. That said, it is an average of 135 times faster than the iterative
algorithm. Thus, it is a run-time efficient way to optimize resilient design.
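The high-overhead averages quoted above can be reproduced from the High Overhead columns of Table 3.5. The short Python check below copies the IA, NBF and VL values in the table's row order (s1196 first); it is offered only as an arithmetic sanity check.

```python
# High-overhead area improvements (%) from Table 3.5, in row order.
ia = [11.43, 8.8, 23.03, 0.39, 25.13, 25.08, 8.75, 17.14, 24.39, 18.29, 16.27]
nbf = [8.05, 8.64, 16.8, 4.58, 7.95, 13.79, 4.83, 3.65, 6.73, 3.34, -4.49]
vl = [0.53, 3.86, 12.54, -7.11, 10.64, 24.92, 9.45, 9.93, 0.17, 13.40, 17.03]

def mean(values):
    return sum(values) / len(values)

ia_minus_nbf = mean(ia) - mean(nbf)
vl_over_ia = mean(vl) / mean(ia)

print(f"IA - NBF: {ia_minus_nbf:.2f} percentage points")
print(f"VL / IA : {100 * vl_over_ia:.0f}%")
```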
Speeding-up near critical paths may not only improve area but also improve
performance by reducing the error-rate. Although our geometric program only
targets minimizing area, we still see an average improvement of 2.8% in error-rate
and 10.4% more than the naive brute-force approach in Table 3.7. Even though
we reduce the error-rate to 0 in some circuits, for other circuits, such as s38584,
the new error-rate remains high. This is interesting as it motivates future work
exploring adding a notion of error-rate into the cost function to simultaneously
target area and performance. In fact, more generally, it would be interesting to
minimize power consumption for a given performance considering voltage scaling.
3.7.2 Blade latch-based design
Table 3.8 shows the size of the chosen benchmark circuits, including the number
of combinational gates, master latches (MS), which are not retimed and do not
time-borrow, and slave latches (SS), which are retimed and do time-borrow, as well as
how much EDL is needed after retiming in the Blade design flow, prior to starting
our resynthesis algorithm on the latch-based design. Except for s1238, the number of
master latches is the same as the number of flops in the flop-based design. In s1238, we
need to add latches at primary outputs to fix timing, and those latches might end
up with EDLs, so the number of master latches is larger.
In Table 3.9, we show the achieved area improvements from our iterative
algorithm and mixed integer geometric program, the previously-proposed naive brute
force method [20], the brute force method, and the virtual resynthesis cell library method
on Blade latch-based designs, in which only master latches might end up with
EDLs and time-borrowing is allowed on slave latches. We achieved similar
improvements to the flop-based design for all circuits that complete within 24 hours.
The iterative algorithm achieves larger improvements than the naive brute-force
and virtual library methods in most cases. With high EDL overhead, IA has a 10.81% larger
area improvement than NBF and 0.84% larger than VL; with medium EDL overhead,
IA has a 6.5% larger area improvement than NBF and 6.31% larger than VL; with
low EDL overhead, IA is 2.93% better than NBF but 9.08% worse than VL. For s1196, the
area improvements with IA and MIGP are far from the optimal
improvement (achieved by brute-force). This may be explained by the fact that
after brute-force resynthesis the regular logic area unexpectedly became smaller than
before resynthesis even though we tightened the timing constraints to reduce the
number of EDLs. In other words, the logic synthesis engine during brute force
resynthesis of this example unexpectedly found a more area-efficient solution that
was also faster. For s1488, the new near-critical end-point list remains the same as
before resynthesis with the MIGP, IA and VL approaches, so their area improvements
are 0%. As we mentioned previously, we force all non-near-critical paths to remain
non-near-critical in the brute-force method but not in naive brute-force. In all other cases,
area improvements with this forcing setting are better than without; however, in
s1488, area grows less in naive brute-force than in brute-force. Nevertheless, speeding
up near-critical end-points in s1488 makes the area cost worse; hence, we do not apply
the optimization to s1488.
The run-time comparison of all approaches on Blade latch-based designs is
reported in Table 3.10. The virtual resynthesis cell library method is still the
fastest because it only requires one synthesis run. For the iterative algorithm and
mixed integer geometric programming, although moving from the flop-based design to
the latch-based design only adds constraints of constant complexity, the average run-time
of the Blade latch-based designs is double that of the flop-based designs. If
we want to run larger circuits than s38584, we may need to either choose different
threshold settings or use the virtual library method.
When we optimize the area of flop-based design using the GP, most of the
error-rates went down to less than 10% (recall Table 3.7). For the Blade
latch-based design using the GP approach, the trend of error-rate reduction follows
that of area improvement; however, most error-rates remain over 20%, as
seen in Table 3.11. Hence, adding a notion of performance into the GP objective
function may be an important area of future work.
Area Improvement %
| Circuit | High VLN | High VLE | High VLR | Medium VLN | Medium VLE | Medium VLR | Low VLN | Low VLE | Low VLR |
|---|---|---|---|---|---|---|---|---|---|
| s1196 | 0.53 | 5.47 | 4.45 | 3.98 | 6.57 | 6.23 | 4.2 | 9.7 | 7.94 |
| s1238 | 3.86 | 5.07 | 3.34 | 4.74 | 4.41 | 3.61 | 3.37 | 5.35 | 1.26 |
| s1423 | 12.54 | 1.93 | 8.56 | 5.94 | 2.98 | 5.08 | 1.99 | 3.36 | 3.24 |
| s1488 | -7.11 | 3.91 | 3.91 | -2.28 | 3.30 | 3.30 | 1.94 | 5.27 | 5.27 |
| s5378 | 10.64 | -0.65 | 1.56 | 5.56 | 0.06 | 1.98 | 2.99 | 0.86 | 1.23 |
| s9234 | 24.92 | 5.91 | 4.78 | 13.17 | 3.91 | 3.84 | 5.43 | 1.77 | 1.84 |
| s13207 | 9.45 | -5.12 | 5.71 | 7.01 | 1.06 | 4.09 | 5.98 | 2.61 | 4.86 |
| s15850 | 9.93 | 0.16 | 4.23 | 5.22 | 0.34 | 3.43 | 2.3 | 0.8 | 3.05 |
| s35932 | 0.17 | 2.22 | 1.35 | -0.7 | -0.34 | -0.03 | -1.92 | -1.91 | -0.96 |
| s38417 | 13.4 | -2.81 | 1.75 | 6.59 | -0.91 | 1.83 | 2.91 | 0.55 | 1.73 |
| s38584 | 17.03 | 8.16 | 6.36 | 9.4 | 4.37 | 4.07 | 5.43 | 1 | 2.6 |
| average | 8.67 | 3.60 | 4.18 | 5.33 | 3.54 | 3.48 | 3.15 | 4.39 | 2.92 |
Table 3.2: Virtual resynthesis cell library flop-based experimental results with different
input netlist settings
Area Improvement %
Circuit High Overhead Medium Overhead Low Overhead
VLN VLE VLR VLN VLE VLR VLN VLE VLR
s1196 1.26 -15.53 7.94 -0.33 0.5 10.26 -2.15 6.33 9.74
s1238 4.58 -32.63 -14.56 -11.29 -22.36 -15.65 -13.44 -22.25 -18.25
s1423 4.05 -12.2 -12.57 -24.83 -3.26 -3.11 -41.68 3.12 -3.51
s1488 0 0 0 0 0 0 0 0 0
s5378 25.27 -31.72 -0.34 11.74 -19.68 -3.54 -1.73 -11.2 -2.34
s9234 -8.5 -17.25 -7.95 -10.21 -11.29 -6.5 -11.1 -7.11 -6.03
s13207 17.37 -54.69 -14.93 5.69 -29.91 -7.77 0.47 -15.23 -7.55
s15850 20.66 -43.21 -0.48 7.46 -26.07 -1.28 -0.63 -14 -0.43
s35932 20.52 -101.37 1.24 11.46 -55.72 1.41 6.06 -28.71 1.6
s38417 34.52 -14.22 -4.79 18.54 -9.03 -7.74 4.8 -5.21 -10.49
s38584 33.21 -25.06 -2.3 17.84 -15.88 -2.83 5.81 -9.22 -2.18
average 13.9 -18.22 -4.43 2.37 -9.35 -3.34 -4.87 -5.19 -3.58
Table 3.3: Virtual resynthesis cell library latch-based experimental results with different
input netlist settings
Table 3.4: Circuit information after initial synthesis
Circuit  Circuit size (# of C, # of S, Total)  Area  # of EDL  Error-rate
s1196 331 18 349 343 3 0.25%
s1238 415 18 433 267 2 0%
s1423 454 74 528 490 34 0%
s1488 318 6 324 248 6 4.35%
s5378 809 164 973 1028 47 0.79%
s9234 530 125 655 789 64 0.52%
s13207 1562 460 2022 2496 27 0.18%
s15850 1935 448 2383 2686 77 3.31%
s35932 5443 1728 7171 9582 390 14.98%
s38417 5752 1490 7242 8719 451 5.59%
s38584 6684 1248 7932 7974 201 69.78%
Table 3.5: Area improvement of flop-based design (%): (IA: Iterative algorithm, MIGP:
Mixed integer geometric program, NBF: Naive brute-force, BF: Brute-force, VL: Virtual
library)
Area Improvement %
Circuit High Overhead Medium Overhead Low Overhead
IA MIGP NBF BF VL IA MIGP NBF BF VL IA MIGP NBF BF VL
s1196 11.43 11.43 8.05 12.49 0.53 9.15 8.44 6.07 9.54 3.98 8.16 8.16 5.03 8.16 4.20
s1238 8.8 8.8 8.64 11.1 3.86 5.16 5.16 6 9.1 4.74 7.7 7.7 4.6 7.7 3.37
s1423 23.03 25.11 16.8 TO 12.54 10.89 10.62 8.96 TO 5.94 2.91 3.56 4.67 TO 1.99
s1488 0.39 2.1 4.58 7.99 -7.11 -2.52 -4.65 3.33 3.33 -2.28 -4.17 -4.17 2.71 2.71 1.94
s5378 25.13 25.13 7.95 TO 10.64 13.56 13.56 4.58 TO 5.56 6.32 6.32 2.47 TO 2.99
s9234 25.08 30.65 13.79 TO 24.92 11.23 TO 8.36 TO 13.17 4.82 TO 4.52 TO 5.43
s13207 8.75 8.75 4.83 TO 9.45 4.75 4.1 4.72 TO 7.01 3.05 3.05 4.66 TO 5.98
s15850 17.14 17.14 3.65 TO 9.93 8.88 TO 2.03 TO 5.22 4.09 TO 1.08 TO 2.30
s35932 24.39 24.39 6.73 TO 0.17 13.81 13.81 3.97 TO -0.7 7.32 TO 2.28 TO -1.92
s38417 18.29 TO 3.34 TO 13.40 9.22 TO 2.2 TO 6.59 3.99 TO 1.49 TO 2.91
s38584 16.27 TO -4.49 TO 17.03 8.76 TO -2.4 TO 9.40 4.35 TO -1.2 TO 5.43
Table 3.6: Comparison of worst run-time over three EDL overheads of flop-based design
Run-time
Circuit IA MIGP NBF BF VL
CLOCK CPU CLOCK CPU CLOCK CPU CLOCK CPU CLOCK
s1196 1.3m 11.6m 1m 11.5m 1m 2.5m 1m 4m 0.5m
s1238 2.1m 19m 4m 55m 0.5m 2m 0.5m 2m 0.5m
s1423 1.7m 14m 19m 4hr 3.5m 28m TO TO 0.5m
s1488 1.7m 14m 1.5m 18m 1.5m 5m 2m 15m 0.5m
s5378 7m 1.3hr 1hr 13hr 5m 39m TO TO 0.5m
s9234 2.2m 17.3m 14m 3hr 6m 53m TO TO 0.5m
s13207 11.5m 1.77hr 2.8hr 32.5hr 3m 25m TO TO 0.7m
s15850 18.5m 2.78hr TO TO 8m 64m TO TO 0.7m
s35932 2.7hr 22.19hr TO TO 45m 5hr TO TO 1.5m
s38417 4.8hr 32hr TO TO 1hr 6hr TO TO 1.6m
s38584 12.3hr 57.7hr TO TO 32m 2.7hr TO TO 1.4m
Table 3.7: Error-rate (%) of flop-based design
Circuit  Old error-rate  Overhead  IA  MIGP  NBF  BF  VL
Change New  Change New  Change New  Change New  Change New
s1196 0.25
H -0.25 0 -0.25 0 -0.08 0.17 -0.25 0 -0.16 0.09
M -0.25 0 -0.25 0 -0.08 0.17 -0.25 0 0.08 0.33
L -0.25 0 -0.25 0 -0.08 0.17 -0.25 0 -0.25 0
s1238 0 H/M/L 0 0 0 0 0 0 0 0 0 0
s1423 0 H/M/L 0 0 0 0 0 0 0 0 0 0
s1488 4.35
H 2.98 7.33 -4.35 0 0.56 4.91 -4.28 0.07 3.15 7.5
M 2.98 7.33 -2.56 1.79 0.56 4.91 3.5 7.85 5.7 10.05
L 2.98 7.33 2.98 7.33 0.56 4.91 3.5 7.85 7.32 11.67
s5378 0.79
H -0.79 0 -0.79 0 -0.29 0.59 TO TO -0.34 0.45
M -0.79 0 -0.79 0 -0.29 0.59 TO TO -0.26 0.53
L -0.79 0 -0.79 0 -0.29 0.59 TO TO -0.08 0.71
s9234 0.52 H/M/L -0.52 0 -0.52 0 -0.37 0.15 TO TO -0.38 0.14
s13207 0.18
H -0.18 0 -0.18 0 0 0.18 TO TO -0.18 0
M -0.18 0 -0.18 0 0 0.18 TO TO -0.15 0.03
L -0.17 0.01 -0.17 0.01 0 0.18 TO TO -0.15 0.03
s15850 3.31
H -1.03 2.28 -1.15 2.16 0.4 3.71 TO TO -1.64 1.67
M/L -1.67 1.64 TO TO 0.4 3.71 TO TO -1.65 1.66
s35932 14.98 H/M/L -14.98 0 -14.98 0 0 14.98 TO TO -14.98 0
s38417 5.59
H -3.89 1.7 TO TO 26.08 31.67 TO TO 38.19 43.78
M -3.64 1.95 TO TO 45.69 51.28 TO TO 40.85 46.44
L -3.81 1.78 TO TO 45.69 51.28 TO TO 41.08 46.67
s38584 69.78
H -10.88 58.9 TO TO 1.11 70.89 TO TO -8.73 61.05
M -14.67 55.11 TO TO 1.11 70.89 TO TO -7.2 62.58
L -7.63 62.15 TO TO 1.11 70.89 TO TO -4.15 65.63
Table 3.8: Circuit information after retiming
Circuit  Circuit size (# of C, # of MS, # of SS, Total)  Area  # of EDL  Error-rate
s1196 331 18 19 368 332 5 18.29%
s1238 415 41 18 474 266 23 19.44%
s1423 454 74 93 621 549 50 14.67%
s1488 318 6 23 347 235 6 23.64%
s5378 809 164 187 1160 959 69 78.87%
s9234 530 125 169 769 690 66 1.63%
s13207 1562 460 520 980 2334 85 25.8%
s15850 1935 448 518 2901 2648 170 9.42%
s35932 5443 1728 689 7860 8933 288 87.49%
s38417 5752 1490 1662 8904 8020 830 89.98%
s38584 6684 1248 1318 9250 7387 675 68.93%
Table 3.9: Area improvement of Blade latch-based design (%): (IA: Iterative algorithm,
MIGP: Mixed integer geometric program, NBF: Naive brute-force, BF: Brute-force, VL:
Virtual resynthesis cell library)
Area Improvement %
Circuit High Overhead Medium Overhead Low Overhead
IA MIGP NBF BF VL IA MIGP NBF BF VL IA MIGP NBF BF VL
s1196 8.21 8.21 7.22 15.97 1.26 7.52 7.52 5.3 13.61 -0.33 7.14 7.14 4.25 12.81 -4.25
s1238 1.49 1.49 3.05 TO 4.58 -2.59 -2.59 2.72 TO -11.29 -5.55 -5.55 2.47 TO -13.44
s1423 15.98 14.86 8.99 TO 4.05 11.7 12.16 6.73 TO -24.83 8.57 10.33 5.92 TO -41.68
s1488 0 0 -1.96 -6.91 0 0 0 -5.23 -10.64 0 0 0 -7.1 -12.78 0
s5378 10.59 5.67 5.71 TO 25.26 5.51 3.38 3.42 TO 11.74 2.01 1.8 1.84 TO 1.73
s9234 19.1 20.81 5.56 TO -8.5 11.58 11.04 1.98 TO -10.22 4.19 0.06 1.16 TO -11.18
s13207 14.97 TO 1.76 TO 17.36 8.34 TO 0.94 TO 5.69 2.93 TO 0.65 TO 0.46
s15850 22.05 TO 7.06 TO 20.66 12.29 TO 4.29 TO 7.46 5.1 TO 2.44 TO -0.63
s35932 14.73 TO 0.36 TO 20.52 8.48 TO 0.32 TO 11.46 4.78 TO 0.3 TO 6.06
s38417 19.51 TO 3.5 TO 34.52 12.36 TO 2.27 TO 18.54 6.3 TO 1.34 TO 4.79
s38584 35.53 TO 2.03 TO 33.2 20.32 TO 1.31 TO 17.84 10.76 TO 0.78 TO 5.81
Table 3.10: Comparison of worst run-time over three EDL overheads of Blade latch-based
design
Run-time
Circuit IA MIGP NBF BF VL
CLOCK CPU CLOCK CPU CLOCK CPU CLOCK CPU CLOCK
s1196 1.5m 6.6m 0.8m 10m 1m 3m 3.6m 33.6m 24.18s
s1238 1.3m 6.5m 0.5m 4.6m 3m 20m TO TO 23.73s
s1423 4.4m 25.48m 28m 5hr 5m 35m TO TO 29.51s
s1488 1.13m 4.98m 0.6m 7.85m 1.5m 5m 6m 1hr 24.52s
s5378 10.8m 50.77m 2hr 22hr 7m 60m TO TO 27.89s
s9234 5.53m 32.93m 41m 8hr 6m 53m TO TO 26.88s
s13207 41.5m 4.2hr TO TO 7m 75m TO TO 47.44s
s15850 24.7m 2.4hr TO TO 8m 2hr TO TO 45.47s
s35932 15hr 64hr TO TO 30m 3hr TO TO 3.2m
s38417 17hr 64hr TO TO 3hr 9hr TO TO 65.83s
s38584 24hr 120hr TO TO 2hr 7hr TO TO 69.78s
Table 3.11: Error-rate (%) of Blade latch-based design
Circuit  Old error-rate  Overhead  IA  MIGP  NBF  BF  VL
Change New  Change New  Change New  Change New  Change New
s1196 18.29
H 0 18.29 0 18.29 -7.48 10.81 -9.99 8.3 -11.01 7.28
M -1.4 16.89 -1.4 16.89 -7.48 10.81 -5.21 13.08 0.88 19.17
L -1.4 16.89 -1.4 16.89 -7.48 10.81 -5.21 13.08 8.94 0
s1238 19.44
H -8.91 10.53 -8.91 10.53 -0.02 19.42 TO TO 56.58 76.02
M -8.91 10.53 -8.91 10.53 -0.02 19.42 TO TO 20.68 40.12
L -8.91 10.53 -8.91 10.53 -0.02 19.42 TO TO -9.57 9.87
s1423 14.67
H -14.67 0 -14.67 0 -14.53 0.14 TO TO -14.67 0
M -14.67 0 -14.67 0 -6.19 8.48 TO TO -14.67 0
L -14.67 0 -14.67 0 -6.19 8.48 TO TO -14.67 0
s1488 23.64
H 0 23.64 0 23.64 -0.59 23.05 -3.31 20.33 0 23.64
M 0 23.64 0 23.64 -0.59 23.05 -3.31 20.33 0 23.64
L 0 23.64 0 23.64 -0.59 23.05 -3.31 20.33 0 23.64
s5378 78.87
H 12.04 90.91 15.02 93.89 -78.86 0.01 TO TO -64.49 14.38
M 12.04 90.91 15.02 93.89 -78.86 0.01 TO TO -62.22 16.65
L 12.04 90.91 15.02 93.89 -78.86 0.01 TO TO -59.11 19.76
s9234 1.63
H 0.04 1.67 0.05 1.68 -1.62 0.01 TO TO -0.18 1.45
M -0.09 1.54 0.05 1.68 -1.62 0.01 TO TO -0.1 1.53
L -0.09 1.54 0.08 1.71 -1.62 0.01 TO TO 0.01 1.64
s13207 25.8
H -25.78 0.02 -25 0.8 -25.79 0.01 TO TO -25.77 0.03
M -25.78 0.02 -24.86 0.94 -25.79 0.01 TO TO -25.77 0.03
L -25.76 0.04 -24.79 1.01 -25.79 0.01 TO TO -25.78 0.02
s15850 9.42
H -3.15 6.27 TO TO -3.14 6.28 TO TO -8.15 1.27
M -5.62 3.8 TO TO -3.14 6.28 TO TO -7.68 1.74
L -6.48 2.94 TO TO -3.14 6.28 TO TO -6.13 3.29
s35932 87.49
H -15.02 72.47 TO TO 0.01 87.5 TO TO -13.02 74.47
M -15.02 72.47 TO TO 0.01 87.5 TO TO -13.08 74.41
L -15.02 72.47 TO TO 0.01 87.5 TO TO -13.01 74.48
s38417 89.98
H 0 89.98 TO TO 0.04 90.02 TO TO -87.85 2.13
M 0 89.98 TO TO 0.04 90.02 TO TO -89.98 0
L 0 89.98 TO TO 0.04 90.02 TO TO -88.64 1.34
s38584 68.93
H -60.21 8.72 TO TO 0 68.93 TO TO -18.46 50.47
M -41.52 27.41 TO TO 0 68.93 TO TO -17.8 51.13
L -53.03 15.9 TO TO 0 68.93 TO TO -29.64 39.29
Chapter 4
Summary and Conclusions
This thesis addresses two asynchronous design challenges: formal verification and
optimization of timing resilient design.
4.1 Formal Verification of Asynchronous Design Flow
We proposed a new formal approach for verifying the correctness of three key steps
in asynchronous design flows with Proteus and Blade: architectural decomposition,
synthesis, and micro-architectural optimizations. We used three-valued logic (3VL)
to model conditional communication, which we translate to a binary representation
so that commercial synchronous equivalence checkers can be employed. Experimental
results demonstrated that the approach is powerful enough to enable push-button
verification of moderate-size computational blocks and flexible enough to verify
more complex decompositions with reasonable manual intervention. However, the
approach has limitations that present interesting topics for future work.
First, our current verification approach does not test whether the gate-level
asynchronous control logic implements the specified handshaking and conditional
communication primitives. Instead, our 3VL model of the circuit abstracts these
details away. This is reasonable because many other researchers have focused on
gate-level verification of asynchronous control, proving hazard-freedom and
conformance to a specification [38, 39]. However, coupling the two verification tasks
into a coherent formal statement of the correctness of the circuit remains an open
task. Ideally, this notion of correctness would extend our notion of equivalence
from the gate-level to the CSP level. It should consider not only safety properties
such as hazard-freedom but also liveness properties such as deadlock and livelock
freedom. Ensuring liveness may be complicated as it can involve the interaction
of controllers across pipeline stages and depends on maintaining enough dynamic
slack in the system for tokens to continually move forward. Figure 4.1 shows a
ring example in SVC and at the gate level with three token buffers, one slack-less copy,
and a SEND. Our current verification model will identify these two designs as
equivalent; however, all tokens in the gate-level design are blocked because of the lack
of slack, and the design deadlocks. Hence, liveness verification is an important
area of future work for asynchronous designs.
always begin
  a = 3'b110;
  forever begin
    // rotate the three token values around the ring
    temp = a[0];
    a[0] = a[2];
    a[2] = a[1];
    a[1] = temp;
    A.Send(a[1]);
  end
end
(a) Ring in SVC
(b) Ring in gate-level
Figure 4.1: Ring implementation in SVC and gate-level
The second limitation of our approach involves the restrictions on the class of
CSP processes we can support. In particular, we currently limit the CSP processes
to communicating through a channel at most once per system iteration. This
limitation facilitates our decomposed SVC approach and the subsequent usage of
synchronous logical equivalence checking tools. In particular, it guarantees that
the decision to communicate on a channel is a combinational function of the
already specified state in the system. However, in asynchronous design, channels
communicate on demand and, depending on how we define an iteration, an
asynchronous channel can send tokens zero, one, or multiple times. One possible solution is
to introduce state variables to break the iterations into sub-iterations, each of which
contains at most one communication. However, when we compare
two designs, this solution may lose the one-to-one correspondence of state variables
needed for logical equivalence checking. To see this, consider Figure 4.2a, which
includes a channel C that communicates twice in one iteration with different values
of variable c. If we directly synthesize this SVC code, the result of the addition
assigned to c will be overwritten by the multiplication, and the result will differ from what we
intend. Hence, we need to introduce a state variable to separate the calculations
and communications into iterations, as shown in Figure 4.2b. The challenge is that
two CSP processes should be considered equivalent if there exists any extended
state encoding for which they are logically equivalent. Moreover, when comparing
CSP versus a gate-level design that implements the state encoding, the verification
tool should first identify the extended state encoding used in the gate-level design
and then somehow apply the same encoding to the CSP model before applying
the logical equivalence test. Figure 4.2c shows that if the SVC and gate-level designs
use different extended state encodings, with addition mapped to
state = 0 in SVC but to state = 1 at the gate level, the verification tool will mistakenly
report that the designs are not equivalent.
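The encoding mismatch described above can be illustrated with a small Python sketch. It is not part of the thesis flow: the two step functions are hypothetical stand-ins for the SVC model and the gate-level design, and the check simply compares outputs and next states under a candidate state mapping. With the identity mapping the comparison fails, while the swapped mapping shows the two models agree.

```python
from itertools import product

# Hypothetical two-state controllers; each maps (state, a, b) to
# (next_state, value sent on channel C). These are illustrative only.
def svc_step(state, a, b):
    # SVC model: addition is encoded as state 0, multiplication as state 1.
    if state == 0:
        return 1, a + b
    return 0, a * b

def gate_step(state, a, b):
    # Gate-level model: the opposite encoding (addition happens in state 1).
    if state == 1:
        return 0, a + b
    return 1, a * b

def equivalent_under(mapping, inputs):
    """Compare the models when SVC state s corresponds to gate state mapping[s]."""
    for s, (a, b) in product(mapping, inputs):
        ns_svc, out_svc = svc_step(s, a, b)
        ns_gate, out_gate = gate_step(mapping[s], a, b)
        if out_svc != out_gate or mapping[ns_svc] != ns_gate:
            return False
    return True

inputs = list(product(range(4), repeat=2))
identity = {0: 0, 1: 1}   # naive one-to-one state correspondence
swapped = {0: 1, 1: 0}    # the encoding actually used by the gate-level model

naive_result = equivalent_under(identity, inputs)
mapped_result = equivalent_under(swapped, inputs)
```

Searching over candidate mappings like this is only practical for tiny state spaces; it is meant to show why the tool must recover the encoding before running the equivalence test.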
always begin
  forever begin
    fork
      A.Receive(a);
      B.Receive(b);
    join
    c = a + b;
    C.Send(c);   // first communication on C
    c = a * b;
    C.Send(c);   // second communication on C in the same iteration
  end
end
(a) Two communications in one iteration
always begin
  state = 1'b0;
  forever begin
    if (state == 1'b0) begin
      fork
        A.Receive(a);
        B.Receive(b);
      join
      c = a + b;
      C.Send(c);
      state = 1'b1;
    end
    else begin
      c = a * b;
      C.Send(c);
      state = 1'b0;
    end
  end
end
(b) One communication per iteration in SVC
(c) One communication per iteration in gate-level
Figure 4.2: Multiple communications in one iteration and solution in SVC
As described in Section 2.2.5, a third challenge is that logical equivalence checking
cannot handle the situation in which input tokens can be stalled across iterations.
Moving to sequential equivalence checking would support stalled tokens at inputs,
but this significantly adds to the complexity of the problem, and
complexity already remains a limitation. In fact, although the Conformal LEC tool provides
automatic partitioning when comparing larger designs, the run-time complexity of
logical equivalence checking is still an issue that must be addressed
if this approach is to be used generally on large-scale designs.
4.2 Optimization of Timing Resilient Design
We presented four different approaches to optimize timing resilient design through
resynthesis. The first is brute-force, which can obtain optimal results but is so complex
that it can only be used on very simple examples. The second is naive brute-force, which has
lower time complexity because it only considers one near-critical end-point at a
time but loses any benefit of optimizing multiple end-points simultaneously. The third
is mixed integer geometric programming using a gate-sizing based model. It models both
logic area and EDL overhead area, understands all the shared paths, and can
thus guide resynthesis to achieve close to optimal results in most cases tested.
The last one is a virtual resynthesis cell library method, which has the lowest time
complexity among the four approaches. It manipulates the synthesis cell library
to have duplicate sequential cells: one group has its setup time increased by the
TRW window, and the other group has its area increased by the EDL area
overhead.
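The cell duplication behind the virtual library method can be sketched in a few lines of Python. This is a simplified illustration, not the thesis's library-generation script: the `SeqCell` record, the `_NOEDL` and `_EDL` suffixes, and the numeric values are all hypothetical, and a real flow would edit timing and area entries in the synthesis library itself.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SeqCell:
    """Hypothetical, simplified view of a sequential library cell."""
    name: str
    setup_ns: float
    area: float

def make_virtual_variants(cell, trw_ns, edl_area):
    """Return the two virtual copies of one sequential cell.

    The first copy has no error detection, so its setup time is extended by
    the timing resilience window (TRW). The second copy is error-detecting,
    so it keeps the original setup time but pays the EDL area overhead.
    """
    plain = replace(cell, name=cell.name + "_NOEDL",
                    setup_ns=cell.setup_ns + trw_ns)
    with_edl = replace(cell, name=cell.name + "_EDL",
                       area=cell.area + edl_area)
    return plain, with_edl

# Example with made-up numbers.
latch = SeqCell(name="LATCH_X1", setup_ns=0.05, area=2.0)
plain, with_edl = make_virtual_variants(latch, trw_ns=0.10, edl_area=1.5)
```

With both variants in the library, a single synthesis run can trade off tighter timing against EDL area on a per-end-point basis.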
From the experimental results in Section 3.7, and specifically the area improvements in
Table 3.5 and the run-times in Table 3.6, our proposed iterative algorithm is the best
optimization method among these approaches. All circuits finish
within 24 hours and have close to optimal improvements. These results suggest
that a gate-sizing based model can indeed be used to effectively guide resynthesis.
However, for the larger circuits, the run-time of IA is still high, and if multi-threaded
computation is not available, the virtual resynthesis cell library is a second
option that gives good area improvements with the fast run-time of
only a single synthesis run.
Timing resilient design uses relatively expensive error detecting logic to improve
performance over regular design when variation increases. It is important to
emphasize that resilient designs can exhibit better average performance only when
few or no timing errors occur. Currently, our proposed approaches focus only on
reducing the amount of error detecting logic to improve area cost; however, these
optimizations cannot guarantee that the error-rate is also reduced. Towards this goal, we
propose to model an approximate measure of error-rate and/or performance in
the MIGP objective function. In particular, error-rate is use-case dependent and
defined as the portion of simulation cycles in which timing errors occur. Errors
can be triggered from different sets of error-detecting sequential gates, and the
number of such sets is exponential in the number of error-detecting sequentials.
Thus, modeling all sets of error-detecting sequential gates within the MIGP would
be unrealistic. An interesting area of future work is to develop heuristic algorithms
for error-rate reduction based on a modified MIGP that targets the reduction of
specific instances of errors.
Although we do not know how paths are shared before naive re-synthesis,
from the results of naive brute-force re-synthesis we can deduce which other
flops/latches share paths with the currently sped-up paths and then create error-detecting
sequential correlation tables (EDST). We can also record the sets of error-triggering
error-detecting sequentials for each simulation cycle of typical use cases. Based
on the EDST and the sets of error-detecting sequentials per simulation cycle, we can
build error correlation tables that estimate the error-rate reduction obtained by
speeding up all paths terminating at a particular set of error-detecting sequentials.
In Table 4.1, we show the error-detecting sequential correlation table for the
s1196 latch-based design. After the first synthesis, five end-points (Latch 4, Latch 5,
Latch 12, Latch 16 and Latch 17) in s1196 need to have EDLs. We create this
correlation table from the results of naive brute-force. If we run resynthesis with
a max-delay constraint set only on Latch 5, the remaining near-critical end-point list contains
only Latch 4, Latch 12 and Latch 16. This means speeding up Latch 5 also speeds
up Latch 17. Similarly, speeding up Latch 12 will also speed up Latch 17.
Table 4.2 presents how the error-rate is affected by different error types, which
we define as sets of near-critical sequential gates that cause errors. Note that the
number of error types is bounded by $\max(2^{\#\mathrm{EDLs}}, \text{simulation cycles})$, so given limited
memory, one should record and keep track of only the most frequent error types.
We then combine Table 4.1 and Table 4.2 to create an error correlation table, Table
4.3, to guide modification of the geometric program's objective function. In Table
4.3, if we reduce error type T1, we speed up Latch 5, which will also speed up
Latch 17 according to Table 4.1; hence, the estimated total error-rate reduction
increases from 10.51% to 11.32%.
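The combination of Table 4.1 and Table 4.2 into Table 4.3 can be reproduced with a short Python sketch. The data below is copied from those two tables; the function name and the dictionary layout are illustrative assumptions, and the simple summation assumes each error type involves a single sequential, as in this s1196 example.

```python
# EDST from Table 4.1: sequential -> sequentials that are sped up with it.
edst = {
    "Latch 4": [],
    "Latch 5": ["Latch 17"],
    "Latch 12": ["Latch 17"],
    "Latch 16": [],
    "Latch 17": [],
}

# Error types from Table 4.2: type -> (sequential causing errors, error-rate %).
error_types = {
    "T1": ("Latch 5", 10.51),
    "T2": ("Latch 16", 4.38),
    "T3": ("Latch 12", 1.77),
    "T4": ("Latch 4", 0.82),
    "T5": ("Latch 17", 0.81),
}

def build_error_correlation(edst, error_types):
    """Estimate the total error-rate removed when each error type is fixed."""
    type_of_latch = {latch: name for name, (latch, _) in error_types.items()}
    table = {}
    for name, (latch, rate) in error_types.items():
        # Error types whose sequential is sped up as a side effect.
        related = [type_of_latch[l] for l in edst[latch] if l in type_of_latch]
        total = rate + sum(error_types[r][1] for r in related)
        table[name] = (related, round(total, 2))
    return table

correlation = build_error_correlation(edst, error_types)
```

For T1 this yields the related type T5 and a total of 11.32%, and for T3 a total of 2.58%, matching Table 4.3.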
Error-detecting sequential correlation
Error-detecting sequentials  Related error-detecting sequentials
Latch 4  None
Latch 5  Latch 17
Latch 12  Latch 17
Latch 16  None
Latch 17  None
Table 4.1: EDST of s1196 running at 2.85 GHz in a 28nm technology
Error Type
Type Name  Sequentials causing errors  Error-rate
T1  Latch 5  10.51%
T2  Latch 16  4.38%
T3  Latch 12  1.77%
T4  Latch 4  0.82%
T5  Latch 17  0.81%
Table 4.2: Error type table of s1196 running at 2.85 GHz in a 28nm technology
Error correlation
Error type  Related type  Total error-rate
T1  T5  11.32%
T2  None  4.38%
T3  T5  2.58%
T4  None  0.82%
T5  None  0.81%
Table 4.3: Error correlation table of s1196 running at 2.85 GHz in a 28nm technology
In a heuristic algorithm, we can choose a few high-potential error types for
error-rate reduction from the error correlation table and run our modified MIGP
to approximate the area cost associated with the specific error-rate reduction.
Then, we can select the subsets that give us the best gain in both area and
error-rate and use these results to drive re-synthesis. Equation 4.1 shows the new
cost function, where we can adjust the weights $\alpha$ and $\beta$ to trade off area and error
rate and, if desired, model true power saving.
$\alpha \cdot \text{Total area} + \beta \cdot \text{Error Rate Reduction}$ (4.1)
It would also be useful to model the energy consumption of the resilient system,
which can be used as an objective function or constraint. Towards that end, we may
model energy as a function of a combination of the area and error rate. Note that the
error rate term reflects the fact that errors cause additional switching in the control
circuits responsible for recovering from the error, which expends more energy. Thus,
while increasing the area will increase energy due to higher switched capacitance,
the energy increase may be mitigated by the fact that the lower error rate acts to
reduce energy.
$\text{Energy} = f_{A2E}(\text{Total area}) + f_{E2E}(\text{Error Rate Reduction} + \text{Original Error Rate})$ (4.2)
where $f_{A2E}$ is an area-to-energy conversion function and $f_{E2E}$ is an error-rate-to-energy
conversion function. To use this model of energy as a constraint, we may wish to constrain
the energy to be within some factor of the original energy, as follows.
$\text{Energy} \le \gamma \cdot \text{Original Energy}$ (4.3)
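The cost function and the energy model can be prototyped as plain functions before they are encoded in a geometric program. The Python sketch below is only an illustration under stated assumptions: the names `alpha`, `beta` and `factor` stand for the two weights and the energy bound, the linear conversion functions are made up, and the error-rate reduction is taken to be negative when the error rate drops, so that adding it to the original error rate gives the new error rate.

```python
def cost(total_area, error_rate_reduction, alpha=1.0, beta=1.0):
    """Weighted objective in the spirit of Equation 4.1."""
    return alpha * total_area + beta * error_rate_reduction

def energy(total_area, error_rate_reduction, original_error_rate, f_a2e, f_e2e):
    """Energy model in the spirit of Equation 4.2."""
    return f_a2e(total_area) + f_e2e(error_rate_reduction + original_error_rate)

def within_energy_budget(new_energy, original_energy, factor):
    """Constraint in the spirit of Equation 4.3."""
    return new_energy <= factor * original_energy

# Hypothetical linear conversion functions, chosen only for illustration.
f_a2e = lambda area: 0.5 * area
f_e2e = lambda error_rate: 2.0 * error_rate

original = energy(110.0, 0.0, 20.0, f_a2e, f_e2e)    # before optimization
optimized = energy(100.0, -5.0, 20.0, f_a2e, f_e2e)  # smaller area, fewer errors
objective = cost(100.0, -5.0, alpha=1.0, beta=2.0)
feasible = within_energy_budget(optimized, original, factor=1.0)
```

In an actual MIGP the conversion functions would need to be posynomial-compatible; the lambdas here merely make the arithmetic concrete.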
We should note, however, that optimizing performance using error-rate is somewhat
complicated by the fact that the performance of a resilient circuit with high
error rates can be improved by simply reducing the size of the resiliency
window [60]. Similarly, a design with very low error-rates can be made faster by
increasing the size of the resiliency window, but at the cost of possibly requiring
more EDL. The optimal error rate depends on the error-rate penalty of the design
and the distribution of delays, but our previous work has shown it to be around
23%-44% for Blade designs [60].
For the current proposed error-rate implementation, the error type table assumes
a fixed resiliency window. When the size of the window is fixed, the best
performance achievable is at an error-rate of 0%. However, according to [60],
this performance might be worse than that of other combinations of window size and
error-rate. Hence, modeling error-rate with a non-fixed window W is also useful future work.
Modeling error-rate with a non-fixed W, however, is complex because the error-rate
depends on dynamic timing determined by simulation, which is not easily modelled
in our geometric program. We can generate the histogram of the dynamic delay of
cycles of each end-point as in Figure 4.3, where the total number of simulation cycles is 10000
with a clock period of 0.35 ns. When we sweep the resiliency window, it is easy
to know how many cycles will cause errors at one end-point. For example, if
W = 0.15 ns, there are 913 cycles causing errors at latch 14. From the histogram,
we can know how many errors this end-point produces. But we cannot know which
end-points need to be sped up to reduce the error rate, because the errors are often
caused by combinations of near-critical end-points. Each combination of end-points
is an error type, and once the resiliency window changes, the combination
of errors might also change. Thus, while optimal performance optimization is an
important area of future work for resilient design, it is likely to be very challenging.
Figure 4.3: Histogram of slack and cycles of latch 14 in s1196
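The dependence of error types on the resiliency window can be made concrete with a small Python sketch. It is an illustration under a loud assumption: an end-point is treated as erring in a cycle when its dynamic slack is below the window W, which is one plausible reading of the sweep described above and not a definition taken from the thesis. The end-point names and slack values are made up.

```python
from collections import Counter

def error_types_for_window(slack_per_cycle, window):
    """Count error types (sets of end-points erring together) for one window.

    slack_per_cycle: one dict per simulation cycle, end-point -> slack in ns.
    Assumption: an end-point errs in a cycle when its slack is below `window`.
    """
    counts = Counter()
    for slacks in slack_per_cycle:
        erring = frozenset(ep for ep, s in slacks.items() if s < window)
        if erring:
            counts[erring] += 1
    return counts

# Four hypothetical simulation cycles with two end-points.
cycles = [
    {"latch14": 0.10, "latch5": 0.30},
    {"latch14": 0.20, "latch5": 0.12},
    {"latch14": 0.05, "latch5": 0.08},
    {"latch14": 0.40, "latch5": 0.35},
]

narrow = error_types_for_window(cycles, window=0.15)
wide = error_types_for_window(cycles, window=0.25)
```

With the narrower window, three distinct error types each occur once; with the wider window, the latch5-only type disappears and the combined type occurs twice, which is the shift in error combinations that makes the non-fixed-window problem hard to model.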
Bibliography
[1] K. Bowman, J. Tschanz, N. S. Kim, J. Lee, C. Wilkerson, S. Lu, T. Karnik,
and V. De, "Energy-efficient and metastability-immune resilient circuits for
dynamic variation tolerance," IEEE JSSC, vol. 44, no. 1, pp. 49–63, Jan 2009.
[2] H. P. Hofstee, "Future microprocessors and off-chip SOP interconnect," IEEE
Transactions on Advanced Packaging, vol. 27, no. 2, pp. 301–303, 2004.
[3] V. Tiwari, D. Singh, S. Rajgopal, G. Mehta, R. Patel, and F. Baez, "Reducing
power in high-performance microprocessors," in Design Automation
Conference, 1998. Proceedings, June 1998, pp. 732–737.
[4] M. Laurence, "Low-power high-performance asynchronous general purpose
ARMv7 processor for multi-core applications," pp. 304–314, 2013.
[5] A. Ghiribaldi, D. Bertozzi, and S. M. Nowick, "A transition-signaling bundled
data NoC switch architecture for cost-effective GALS multicore systems," in
Design, Automation Test in Europe Conference Exhibition (DATE), 2013,
March 2013, pp. 332–337.
[6] E. Kasapaki and J. Sparsø, "Argo: A time-elastic time-division-multiplexed
NoC using asynchronous routers," in Asynchronous Circuits and Systems
(ASYNC), 2014 20th IEEE International Symposium on, May 2014, pp. 45–52.
[7] W. Jiang, K. Bhardwaj, G. Lacourba, and S. Nowick, "A lightweight early
arbitration method for low-latency asynchronous 2D-mesh NoC's," in Design
Automation Conference (DAC), 2015 52nd ACM/EDAC/IEEE, June 2015,
pp. 1–6.
[8] Y. Thonnart, P. Vivet, and F. Clermidy, "A fully-asynchronous low-power
framework for GALS NoC integration," in Design, Automation Test in Europe
Conference Exhibition (DATE), 2010, March 2010, pp. 33–38.
[9] R. Soares, N. Calazans, F. Moraes, P. Maurine, and L. Torres, "A robust
architectural approach for cryptographic algorithms using GALS pipelines,"
Design Test of Computers, IEEE, vol. 28, no. 5, pp. 62–71, Sept 2011.
[10] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada,
F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., "A million
spiking-neuron integrated circuit with a scalable communication network and
interface," Science, vol. 345, no. 6197, pp. 668–673, 2014.
[11] B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chandrasekaran,
J.-M. Bussat, R. Alvarez-Icaza, J. V. Arthur, P. Merolla, K. Boahen et al.,
"Neurogrid: A mixed-analog-digital multichip system for large-scale neural
simulations," Proceedings of the IEEE, vol. 102, no. 5, pp. 699–716, 2014.
[12] S. B. Furber, D. R. Lester, L. Plana, J. D. Garside, E. Painkras, S. Temple,
A. D. Brown et al., "Overview of the SpiNNaker system architecture,"
Computers, IEEE Transactions on, vol. 62, no. 12, pp. 2454–2467, 2013.
[13] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge,
"Near-threshold computing: Reclaiming Moore's law through energy efficient
integrated circuits," Proceedings of the IEEE, vol. 98, no. 2, pp. 253–266, Feb
2010.
[14] I. Sutherland and S. Fairbanks, "GasP: A minimal FIFO control," in
Asynchronous Circuits and Systems, 2001. ASYNC 2001. Seventh International
Symposium on, 2001, pp. 46–53.
[15] M. Ferretti, "Single-track asynchronous pipeline template," Ph.D. dissertation,
University of Southern California, 2004.
[16] M. Davies, A. Lines, J. Dama, A. Gravel, R. Southworth, G. Dimou, and
P. Beerel, "A 72-port 10G ethernet switch/router using quasi-delay-insensitive
asynchronous design," in Asynchronous Circuits and Systems (ASYNC), 2014
20th IEEE International Symposium on. IEEE, 2014, pp. 103–104.
[17] J. Teifel and R. Manohar, "Highly pipelined asynchronous FPGAs," in
Proceedings of the 2004 ACM/SIGDA 12th international symposium on Field
programmable gate arrays. ACM, 2004, pp. 133–142.
[18] A. Martin and M. Nystrom, "Asynchronous techniques for system-on-chip
design," Proceedings of the IEEE, vol. 94, no. 6, pp. 1089–1120, June 2006.
[19] J. Cortadella, A. Kondratyev, L. Lavagno, and C. P. Sotiriou,
"Desynchronization: Synthesis of asynchronous circuits from synchronous specifications,"
Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on, vol. 25, no. 10, pp. 1904–1921, 2006.
[20] D. Hand, M. Trevisan Moreira, H.-H. Huang, D. Chen, F. Butzke, Z. Li,
M. Gibiluka, M. Breuer, N. Vilar Calazans, and P. Beerel, "Blade – a timing
violation resilient asynchronous template," in Asynchronous Circuits and
Systems (ASYNC), 2015 21st IEEE International Symposium on, May 2015,
pp. 21–28.
[21] R. Diamant, R. Ginosar, and C. Sotiriou, "Asynchronous sub-threshold
ultra-low power processor," in Proceedings of PATMOS, June 2015.
[22] A. M. Lines, "Pipelined asynchronous circuits," California Institute of
Technology, Tech. Rep. CaltechCSTR:1998.cs-tr-95-21, 1995 (Revised 1998).
[23] P. A. Beerel, G. D. Dimou, and A. M. Lines, "Proteus: An ASIC flow for GHz
asynchronous designs," IEEE Design and Test of Computers, vol. 28, no. 5,
pp. 36–51, 2011.
[24] R. O. Ozdag and P. A. Beerel, "A channel based asynchronous low power high
performance standard-cell based sequential decoder implemented with QDI
templates," in Proceedings of 10th International Symposium on Asynchronous
Circuits and Systems, 2004, pp. 187–197.
[25] R. Ozdag and P. Beerel, "An asynchronous low-power high-performance
sequential decoder implemented with QDI templates," Very Large Scale
Integration (VLSI) Systems, IEEE Transactions on, vol. 14, no. 9, pp. 975–985,
Sept 2006.
[26] P. Beerel, R. Ozdag, and M. Ferretti, A Designer's Guide to Asynchronous
VLSI. Cambridge University Press, 2010.
[27] A. Peeters, F. te Beest, M. de Wit, and W. Mallon, "Click elements: An
implementation style for data-driven compilation," in Asynchronous Circuits
and Systems (ASYNC), 2010 IEEE Symposium on, May 2010, pp. 3–14.
[28] C. Hoare, Communicating Sequential Processes. Prentice Hall, 1985.
[29] A. Martin, "Synthesis of asynchronous VLSI circuits," California Institute
of Technology, Department of Computer Science, Tech. Rep. CS-TR-93-28,
March 1991.
[30] D. Edwards and A. Bardsley, "Balsa: An asynchronous hardware synthesis
language," The Computer Journal, vol. 45, no. 1, pp. 12–18, 2002.
[31] C. G. Wong and A. J. Martin, "High-level synthesis of asynchronous systems
by data-driven decomposition," in Proceedings of 40th Design Automation
Conference, 2003, pp. 508–513.
[32] Y. Thonnart, E. Beigne, and P. Vivet, "A pseudo-synchronous implementation
flow for WCHB QDI asynchronous circuits," in ASYNC, 2012, pp. 73–80.
[33] A. Saifhashemi and P. A. Beerel, "SystemVerilogCSP: Modeling digital
asynchronous circuits using SystemVerilog interfaces," in Proceedings of
Communicating Process Architectures - WoTUG-33, 2011, pp. 287–302.
[34] A. Yakovlev, P. Vivet, and M. Renaudin, "Advances in asynchronous logic:
From principles to GALS & NoC, recent industry applications, and commercial
CAD tools," in Design, Automation Test in Europe Conference Exhibition
(DATE), 2013, 2013, pp. 1715–1724.
[35] I. J. Chang, S. P. Park, and K. Roy, "Exploring asynchronous design
techniques for process-tolerant and energy-efficient subthreshold operation," IEEE
JSSC, vol. 45, no. 2, pp. 401–410, Feb 2010.
[36] I. Kwon, S. Kim, D. Fick, M. Kim, Y.-P. Chen, and D. Sylvester, "Razor-lite:
A light-weight register for error detection by observing virtual supply rails,"
Solid-State Circuits, IEEE Journal of, vol. 49, no. 9, pp. 2054–2066, 2014.
[37] M. Choudhury, V. Chandra, K. Mohanram, and R. Aitken, "Timber: Time
borrowing and error relaying for online timing error resilience," in DATE,
March 2010, pp. 1554–1559.
[38] C. Nelson, C. Myers, and T. Yoneda, "Efficient verification of hazard-freedom
in gate-level timed asynchronous circuits," IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 3, pp. 592–605,
2007.
[39] S. Longfield and R. Manohar, "Inverting Martin synthesis for verification," in
ASYNC, 2013, pp. 150–157.
[40] R. Negulescu, "Process spaces and formal verification of asynchronous
circuits," Ph.D. dissertation, 1998.
[41] P. A. Cunningham, "Verification of asynchronous circuits," University of
Cambridge, Tech. Rep. UCAM-CL-TR-587, 2004.
[42] H. K. Kapoor, "Delay-insensitive processes: A formal approach to the design
of asynchronous circuits," Ph.D. dissertation, 2004.
[43] T. Bui, T. Nguyen, and A.-V. Dinh-Duc, "Experiences with representations
and verification for asynchronous circuits," in Fourth International Conference
on Communications and Electronics (ICCE), 2012, 2012, pp. 459–464.
[44] D. Borrione, M. Boubekeur, E. Dumitrescu, M. Renaudin, J.-B. Rigaud, and
S. Sirianni, \An approach to the introduction of formal validation in an asyn-
chronous circuit design
ow," in Proceedings of the 36th Annual Hawaii In-
ternational Conference on System Sciences, 2003, 2003.
[45] R. K. Brayton and S. P. Khatri, "Multi-valued logic synthesis," in Proceedings
of 12th International Conference On VLSI Design, 1999, pp. 196–205.
[46] K. Gupta, Discrete Mathematics, 10th ed. Krishna Prakashan, 2009.
[47] D. K. Pradhan and I. G. Harris, Practical Design Verification. Cambridge
University Press, 2009.
[48] A. Saifhashemi, "Power optimization of asynchronous pipelines using
conditioning and re-conditioning based on a three-valued logic model," Ph.D.
dissertation, University of Southern California, 2012.
[49] A. Saifhashemi and P. Beerel, "Observability conditions and automatic
operand-isolation in high-throughput asynchronous pipelines," in Integrated
Circuit and System Design. Power and Timing Modeling, Optimization and
Simulation, ser. Lecture Notes in Computer Science. Springer Berlin
Heidelberg, 2013, vol. 7606, pp. 205–214.
[50] E. Seligman and I. Yarom, "Best known methods for using Cadence Conformal
LEC at Intel," in Proceedings of CDNLive! Cadence User Conference, 2006.
[51] Y. Liu, R. Ye, F. Yuan, R. Kumar, and Q. Xu, "On logic synthesis for timing
speculation," in ICCAD. IEEE, 2012, pp. 591–596.
[52] M. Nakai, S. Akui, K. Seno, T. Meguro, T. Seki, T. Kondo, A. Hashiguchi,
H. Kawahara, K. Kumano, and M. Shimura, "Dynamic voltage and frequency
management for a low-power embedded microprocessor," IEEE JSSC, vol. 40,
no. 1, pp. 28–35, 2005.
[53] R. Ye, F. Yuan, H. Zhou, and Q. Xu, "Clock skew scheduling for timing
speculation," in DATE. IEEE, 2012, pp. 929–934.
[54] A. B. Kahng, S. Kang, J. Li, and J. Pineda De Gyvez, "An improved
methodology for resilient design implementation," TODAES, vol. 20, no. 4, p. 66,
2015.
[55] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw,
T. Austin, K. Flautner, and T. Mudge, "Razor: A low-power pipeline based
on circuit-level timing speculation," in Microarchitecture, 2003. MICRO-36.
Proceedings. 36th Annual IEEE/ACM International Symposium on, Dec 2003,
pp. 7–18.
[56] M. T. Moreira, D. Hand, N. L. V. Calazans, and P. A. Beerel, "TDTB error
detecting latches: Timing violation sensitivity analysis and optimization," in
Quality Electronic Design, 2015. ISQED '15. International Symposium on,
2015.
[57] YALMIP: Modelling language for advanced modeling and solution
of convex and nonconvex optimization problems. Available:
http://users.isy.liu.se/johanl/yalmip/.
[58] S. Boyd, S.-J. Kim, L. Vandenberghe, and A. Hassibi, "A tutorial on geometric
programming," Optimization and Engineering, vol. 8, no. 1, pp. 67–127, 2007.
[59] A. Saifhashemi, D. Hand, P. Beerel, W. Koven, and H. Wang, "Performance
and area optimization of a bundled-data Intel processor through resynthesis,"
in Asynchronous Circuits and Systems (ASYNC), 2014 20th IEEE International
Symposium on, May 2014, pp. 110–111.
[60] D. Hand, H.-H. Huang, B. Cheng, Y. Zhang, M. Trevisan Moreira, M. Breuer,
N. Vilar Calazans, and P. Beerel, "Performance optimization and analysis of
Blade designs under delay variability," in Asynchronous Circuits and Systems
(ASYNC), 2015 21st IEEE International Symposium on, May 2015, pp. 61–68.
Abstract
Asynchronous circuit design has long been considered a promising alternative to synchronous design due to its potential for achieving lower power consumption, higher robustness to process variations, and higher throughput. The lack of commercial computer-aided design (CAD) tools, however, has been a major obstacle to its widespread adoption. This thesis addresses two important CAD sub-problems for asynchronous design: logical equivalence checking and re-synthesis.
Although some well-developed asynchronous design flows exist, they rely on extensive dynamic simulation with unique test-bench and coverage tools to show functional correctness, increasing risk and hampering the widespread adoption of this technology. To address this issue, we propose a method for logical equivalence checking (LEC) of asynchronous circuits using commercial synchronous tools. In particular, we verify the equivalence of asynchronous circuits modeled either with Communicating Sequential Processes (CSP) in SystemVerilog or at a micro-architectural level using conditional communication library primitives. Our approach is based on a novel three-valued logic model that abstracts the detailed handshaking protocol and is thus agnostic to different gate-level implementations, making it applicable to both Quasi Delay Insensitive (QDI) and bundled-data design styles.
Process, voltage, and temperature variations force margins in synchronous design, particularly at low and near-threshold voltages. Bundled-data resilient designs promise to remove these large margins while at the same time taking advantage of average-case path activity, at the cost of adding error detecting logic (EDL) to near-critical paths to detect timing errors within a resiliency window. Hence, we propose a logic optimization strategy called resynthesis in which near-critical paths are sped up with a tighter max_delay constraint to reduce the amount of EDL needed and lower error rates at the cost of increasing logic area. We propose several optimization approaches to solve this problem, including synthesis heuristics and mixed integer geometric programming (MIGP) formulations targeting area cost.
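The area trade-off behind resynthesis can be illustrated with a small toy model. The sketch below is not the thesis's synthesis heuristic or its MIGP formulation; it is a hypothetical greedy rule, and the names `Endpoint`, `plan_resynthesis`, `speedup_area`, and `edl_area`, as well as all numeric values, are assumptions made only for illustration. Each endpoint whose worst-case arrival time falls inside the resiliency window is either sped up to meet the tighter max_delay constraint or given EDL, whichever costs less area.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    arrival_ns: float    # worst-case arrival time at this endpoint
    speedup_area: float  # hypothetical area cost of meeting the tighter max_delay


def plan_resynthesis(endpoints, period_ns, window_ns, edl_area):
    """Greedy toy model: per endpoint, pick the cheaper of speed-up or EDL."""
    sped_up, with_edl = [], []
    total_area = 0.0
    for ep in endpoints:
        if ep.arrival_ns <= period_ns:
            continue  # not near-critical: needs neither EDL nor a speed-up
        if ep.arrival_ns > period_ns + window_ns:
            raise ValueError(f"{ep.name} falls outside the resiliency window")
        if ep.speedup_area < edl_area:
            sped_up.append(ep.name)   # tighten max_delay, no EDL needed
            total_area += ep.speedup_area
        else:
            with_edl.append(ep.name)  # keep the slow path, pay for EDL
            total_area += edl_area
    return sped_up, with_edl, total_area


# Illustrative numbers only (assumed, not taken from the thesis).
endpoints = [
    Endpoint("a", 0.8, 0.0),
    Endpoint("b", 1.1, 2.0),
    Endpoint("c", 1.3, 9.0),
]
sped_up, with_edl, area = plan_resynthesis(
    endpoints, period_ns=1.0, window_ns=0.5, edl_area=5.0
)
print(sped_up, with_edl, area)
```

A per-endpoint greedy choice like this ignores logic shared between paths and the effect on error rates, which is presumably why the abstract mentions formulating the problem as an optimization over the whole design.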
Asset Metadata
Creator
Huang, Hsin-Ho
(author)
Core Title
Formal equivalence checking and logic re-synthesis for asynchronous VLSI designs
School
Viterbi School of Engineering
Degree
Doctor of Philosophy
Degree Program
Electrical Engineering
Publication Date
09/26/2016
Defense Date
08/16/2016
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
asynchronous VLSI designs,CAD,OAI-PMH Harvest,resilient designs
Format
application/pdf
(imt)
Language
English
Contributor
Electronically uploaded by the author
(provenance)
Advisor
Beerel, Peter A. (
committee chair
), Chu, Chris (
committee member
), Golubchik, Leana (
committee member
), Nazarian, Shahin (
committee member
)
Creator Email
feilanmi@gmail.com,hsinhohu@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-c40-305755
Unique identifier
UC11281606
Identifier
etd-HuangHsinH-4809.pdf (filename),usctheses-c40-305755 (legacy record id)
Legacy Identifier
etd-HuangHsinH-4809.pdf
Dmrecord
305755
Document Type
Dissertation
Format
application/pdf (imt)
Rights
Huang, Hsin-Ho
Type
texts
Source
University of Southern California
(contributing entity),
University of Southern California Dissertations and Theses
(collection)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the a...
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA