GATED MULTI-LEVEL DOMINO: A HIGH-SPEED, LOW POWER
ASYNCHRONOUS CIRCUIT TEMPLATE
by
Kenneth J Shiring
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(COMPUTER ENGINEERING)
May 2008
Copyright 2008 Kenneth J Shiring
Acknowledgements
I would like to extend a special thank-you to my graduate advisor, Dr.
Peter Beerel. His tireless patience and high academic standards made it
possible for me to achieve my longtime goal of completing high quality
Masters research. Peter has played no small part in taking asynchronous
circuits into the 21st century, and demonstrating time and time again that
they are ready for common usage.
I thank my research teammates Georgios Dimou and Arash Saifhashemi,
for their patience with me as I mastered the numerous details of designing
asynchronous circuits. Both have a high commitment to excellence, and
this is obvious in the quality of their work. I hope they continue to
influence the art of VLSI design in the same spirit as they conduct their
research.
I especially thank Andrew Lines, founder of Fulcrum Microsystems. He has
long been an advocate for USC asynchronous research, and was a
significant contributor to all of my efforts. His invention of the PCHB
template, and the many innovations which followed, created a
foundation upon which many engineers stand today.
Lastly, but in no way least, I acknowledge my family, Paul, Jackie, Jon,
and Christina. They always believed in me, even when I didn’t believe in
myself. I owe the best parts of my life to them.
Table of Contents
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT
1 INTRODUCTION
1.1 SYNCHRONOUS DESIGN
1.2 ASYNCHRONOUS DESIGN
1.3 THESIS CONTRIBUTION
1.3.1 Automated De-Synchronization
1.3.2 GMLD features
2 PRIOR RESEARCH
2.1 SYNCHRONOUS CIRCUITS AND CLOCK GATING
2.1.1 High Throughput
2.1.2 Clock Gating
2.1.3 Intelligent Gating
2.2 AUTOMATED DE-SYNCHRONIZATION
2.2.1 De-synchronization Goals
2.2.2 De-synchronization Techniques
2.3 ASYNCHRONOUS DESIGN
2.3.1 Motivation
2.3.2 Fundamental Concepts
2.3.3 Asynchronous Features
2.3.4 Summary
3 GATED MULTI LEVEL DOMINO CIRCUIT
3.1 MLD
3.1.1 Data Path
3.1.2 Control Path
3.1.3 Timing and Throughput
3.2 GMLD
3.2.1 Data Path
3.2.2 Control Path
3.2.3 FBIG Design
3.2.4 Forks and Joins
3.2.5 Enables
3.2.6 Correctness
3.2.7 Cycle Time
4 DE-SYNCHRONIZATION DESIGN FLOW
4.1 INTRODUCTION
4.1.1 EDA Integration
4.1.2 Automatic De-Synchronization
4.2 DESIGN ENTRY AND SYNTHESIS
4.3 CLUSTERING ALGORITHM
4.3.1 Algorithm Details
4.3.2 Generality
4.4 SLACK MATCHING
4.5 TEMPLATE MAPPING
4.6 GMLD ENHANCEMENTS
4.7 CONCLUSION
5 EXPERIMENTAL RESULTS
5.1 THROUGHPUT
5.1.1 Linear
5.1.2 Fork and Join
5.1.3 Ring
5.2 MICROPROCESSOR TEST DESIGN
5.2.1 The tv80 Design
5.2.2 Design Synthesis
5.2.3 Test Case Stimulus
5.2.4 Energy Analysis
5.2.5 Proteus Tool Flow
5.2.6 Measuring Results
5.2.7 Simulation Data
5.3 DATA ANALYSIS
5.3.1 Delays
5.3.2 Energy
5.3.3 Cluster Impact Affecting Energy
5.3.4 Netlist Properties Affecting Energy
6 CONCLUSION
6.1 THROUGHPUT
6.2 ENERGY RESULTS
6.3 FUTURE RESEARCH
6.3.1 Throughput
6.3.2 Joins
APPENDIX A: PETRIFY FBIG CONTROLLER STG
APPENDIX B: GENERALIZED TIMING EQUATIONS
6.4 DEFINITIONS
6.5 EQUATIONS
APPENDIX C: SIMULATION DATA
REFERENCES
List of Tables
Table 1: MLD Pipeline Signals
Table 2: Dual Rail GMLD Signals
Table 3: Join Request Behavior
Table 4: JOING Request Behavior
Table 5: Latencies of MLD and GMLD Templates
List of Figures
Figure 1: Full Buffer Channel Net example
Figure 2: 4-Input Domino Cell
Figure 3: An example V_LOGIC cell
Figure 4: MLD Stage Block Diagram
Figure 5: HSE FBI Controller Description
Figure 6: FBI Controller STG
Figure 7: MLD Implied Neutrality Constraint
Figure 8: FBIG Block Diagram
Figure 9: FBIG Controller STG
Figure 10: FBIG Gate Equations
Figure 11: JOIN2 and JOING2 block diagrams
Figure 12: Control Token Regime Delays
Figure 13: Data Token Regime Delays
Figure 14: Proteus Design Flow
Figure 15: Linear Pipeline
Figure 16: Fork/Join Test Pipeline
Figure 17: FBCN Model of Figure 16
Figure 18: Ring Test Pipeline
Figure 19: FBCN Model of Figure 18
Figure 20: Number of Clusters Affecting Energy Consumption
Figure 21: Netlist Parameters Affecting Energy
Figure 22: Energy Improvement by Total Clusters
Abstract
Existing techniques that translate synchronous gate-level circuits into
asynchronous counterparts do not adequately support gated clocks and
consequently can incur unnecessary switching activity. This thesis
proposes to address this limitation by translating the clock-gated
structures into control circuits that trigger evaluation of the datapath
only when necessary. In particular, we propose a new design
template called Gated Multi-Level Domino (GMLD) and a corresponding
de-synchronization design flow that supports the automatic translation of
a clock-gated synchronous netlist to a high-performance power-efficient
asynchronous circuit. We demonstrate that this new approach reduces
dynamic switching power with limited impact on area and
maximum-achievable throughput.
1 Introduction
1.1 Synchronous Design
High speed synchronous design has been the subject of research and
development for nearly four decades. Techniques for maximizing
throughput are well known across both industry and academia. However,
since the turn of the century, it has become apparent that power
consumption is becoming a limiting attribute of modern VLSI design.
Indeed, several recent high performance microprocessor design efforts,
including IBM’s Power5 microprocessor [29], have determined that the
limiting factor for throughput is the dynamic power consumption of the
device. Accordingly, there is tremendous industry effort to minimize
dynamic power consumption.
(Footnote 1: Leakage power is also extremely important, particularly in
nano-scale processes for which on/off transistor current ratios have
dropped [13]. Many of the techniques to address leakage power are common
to both synchronous and asynchronous designs [7], but these are outside
the scope of this thesis.)
An established technique for reducing dynamic power is the application
of clock gating, which disables the clock to a register for a given
cycle. When wisely applied, it can significantly reduce the switching
activity in the circuit [53]. This technique has been used for many years,
and is now available as an integrated feature in many Electronic Design
Automation (EDA) tools (e.g., [49][12]). When a synchronous netlist
employs clock-gating, it implicitly provides extra information about the
conditions under which data must propagate through the circuit. The
principal question addressed in this thesis is whether we can use this information
to enable conditional dataflow in an automatically generated
asynchronous equivalent circuit, thereby achieving less switching activity
in the data path.
1.2 Asynchronous Design
Asynchronous design is also concerned with high performance and
power efficiency. There have been promising measured results from
existing design styles [10][26][47][40], but challenges remain. High
throughput circuits have been explored in prior research, which also
benefit from average-case performance, a welcome improvement over
worst-case synchronous timing [10]. Techniques to lessen static and
dynamic power have attracted attention more recently [7].
Asynchronous circuits’ self-timed behavior and data-driven evaluation
make them well positioned for low power [27][7], and we propose to
enhance this further with a mapping of clock-gating logic.
A wide variety of approaches to designing asynchronous circuits has been
proposed, and each one has its own throughput, latency, dynamic power,
and area characteristics.
Template-based asynchronous design (defined in Section 2.3.2.2) is
well-suited to high-throughput circuits. Most high-throughput styles are very
successful, with a great deal of supporting results [26][47][40][45][34].
However, techniques to automate the design of high-performance
asynchronous circuits involving the translation of synchronous circuits into
high-performance, domino-logic based templates do not take full
advantage of the data-driven nature of asynchronous circuits, and have
unnecessarily high switching activity in the presence of gated clocks. In
particular, in existing implementations the data path re-evaluates even
when the clocks are disabled, unnecessarily increasing switching activity.
The increasing ubiquity of clock gating techniques motivates a new
de-synchronization flow and asynchronous template design that properly
support these constructs. This research addresses this limitation.
1.3 Thesis Contribution
More specifically, this research contributes to the subjects of automated
de-synchronization of synchronous circuits, asynchronous dual-rail circuits,
and clock gating. Automatic de-synchronization of synchronous logic is a
successful technique, but it has lacked intrinsic support for clock-gated
elements. Support for clock gating is an important feature to promote
general purpose usage in modern logic designs. Our intent is to design an
automatic approach to translate a synchronous clock gated design into
asynchronous domino logic. We show a general-purpose solution to
support transaction-level clock gating, which is the most difficult case to
translate. Bulk gating of logic translates directly to no data
evaluation, and is therefore not covered in this research.
1.3.1 Automated De-Synchronization
Our research is centered on an automatic de-synchronization of
synchronous logic. This process consists of a re-mapping of synchronous
logic onto a set of asynchronous cells, and removal of the clock network
while maintaining all causality relationships. We introduce new processing
steps in the de-synchronization process which translate clock-gated
elements, and implement these in new asynchronous cells. Furthermore,
we build upon an existing de-synchronization design flow, described in
Chapter 4, and augment it with our clock-gating features.
We intend the use of de-synchronization tools to enable the usage of a
high-throughput energy efficient asynchronous design style. We address
the issue of general design specification with a de-synchronization
process, and we intend this process to be well supported with a Computer
Aided Design (CAD) flow. A well-designed CAD flow allows the engineer
to create a design in minimal time, using skills already known from the
synchronous world. Furthermore, the added requirement of design-flow
transparency must not significantly harm an asynchronous circuit’s
optimality.
Our flow is adapted from Cortadella et al. [11], whose flow is described in
detail in Section 2.2.2.4. We outline a form of de-synchronization which
begins with a traditional Register Transfer Level (RTL) description, and
proceeds through a logic synthesis tool. We provide a special
technology mapping process which transforms this synchronous design
into an asynchronous circuit, using custom tools to enable automation.
When complete, the circuit proceeds to a standard-cell-based
Place-and-Route (PAR) step, where it shares the same remaining steps to
manufacturing as a standard synchronous design.
1.3.2 GMLD features
To take best advantage of the new de-synchronization techniques
presented here, we propose a new asynchronous circuit style which aims
to maintain high throughput and minimize dynamic power. This template
is called Gated Multi-Level Domino (GMLD). It is an evolution of an
existing circuit style, Multi-Level Domino (MLD). It features most of the
original speed characteristics of MLD, and adds an awareness of gated
clocks, which we use to reduce the switching behavior of each stage.
GMLD adds no additional timing assumptions to the original MLD timing
model (described in Chapter 3). In an ideal case, the local cycle time
would be unaffected, although this is not achieved in this work. We do add the
ability to disable the data path in cases where it is known that no input
data values will change. Furthermore, a decrease in local cycle
time is possible in cases where the data path is not required to evaluate,
potentially increasing the overall speed of the circuit.
The GMLD circuit style implements a fine-grained gating structure, where
individual flip-flops may be selectively gated inside a design. It
furthermore uses this information to avoid switching data path domino
logic, thereby reducing dynamic switching power. This approach is
effective as a mechanism to enable a reduction in dynamic switching
power, especially in domino dual-rail circuits, which have a high
pre-charge and discharge energy cost.
In particular, our intention is to exploit the clock gating control logic to
create conditional data-flow in the corresponding asynchronous design,
preventing the evaluation of data path elements when inputs do not
change. The result is a new approach that combines the
high-performance benefits of domino-based asynchronous templates and the
low-power benefits of clock-gating in a de-synchronization-like flow that
easily integrates into otherwise synchronous CAD frameworks.
2 Prior Research
2.1 Synchronous Circuits and Clock Gating
High performance synchronous design is, and will continue to be, an area
of primary interest in both academia and industry. High
throughput tends to be the attribute of greatest value, in part because
consumers often make decisions based on this criterion. However, it is
now generally acknowledged that power consumption follows closely
behind throughput as a design goal, and is a limiting factor in many
designs. Indeed, more VLSI products are being used in portable and
hand-held devices, which necessitate battery-friendly power
consumption. In this thesis, we specifically address dynamic power.
Leakage (static) power consumption is a serious issue and an active
area of research, but it is not addressed here.
2.1.1 High Throughput
A high throughput synchronous design generally achieves its goal through
parallelism and pipelining. Fine-grain pipelining is often used, where
operations are divided into small pieces with flip-flops separating each
piece. This allows a smaller cycle time, and throughput improves linearly
with frequency. This model assumes that the pipeline is not data-limited,
and can be filled near capacity. When data limits the occupancy of the
pipeline, the benefits of fine-grain pipelining diminish. Fine-grain
pipelining also leads to large loading on the clock signal, and can
cause a super-linear increase in power consumption.
Of course, whenever the number of cells in a pipeline increases, dynamic
power will increase as well. In the case of fine-grain pipelining, more flip-
flops increase the load on the clock, and the extra capacitive load will
lead to increases in dynamic power. To combat this trend, the well-known
technique of clock gating becomes invaluable as the level of pipelining
increases. Designers now recognize that stage-specific clock
gating is an effective method to reduce the switching load on the clock
and to prevent register elements from updating and thereby causing additional
switching activity in the register’s fanout logic cone [29]. Some designs
generate clock gating signals for every stage in the design, aiming to
maximally disable each register. This is a useful strategy as long as the
additional logic does not consume more power than is saved by
switching off registers.
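The break-even condition in the last sentence can be made concrete with the standard first-order CMOS dynamic power model, P = alpha * C * Vdd^2 * f. The sketch below is illustrative only: the function names and every numeric value are assumptions chosen for the example, not measurements from this thesis.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order CMOS dynamic power: P = alpha * C * Vdd^2 * f (watts)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def gating_net_saving(p_ungated, enable_probability, p_gating_logic):
    """Power saved by gating a stage, minus the cost of the gating logic.

    enable_probability is the fraction of cycles in which the stage must
    still be clocked. A negative result means the gating logic consumes
    more power than it saves.
    """
    p_gated = p_ungated * enable_probability
    return (p_ungated - p_gated) - p_gating_logic


# Hypothetical stage: 50 fF of switched clock/register load at 1.0 V, 1 GHz.
p_stage = dynamic_power(activity=1.0, capacitance_f=50e-15, vdd_v=1.0, freq_hz=1e9)

# Rarely enabled stage: gating pays for itself.
worthwhile = gating_net_saving(p_stage, enable_probability=0.25, p_gating_logic=5e-6)

# Almost always enabled stage: the gating logic costs more than it saves.
wasteful = gating_net_saving(p_stage, enable_probability=0.95, p_gating_logic=5e-6)
```

Under these assumed numbers the first case yields a positive net saving and the second a negative one, which is the tradeoff a per-stage gating strategy has to evaluate.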
2.1.2 Clock Gating
There are two primary methods by which flip-flops are gated [49]. The first
technique requires placing combinational logic in series with the global
clock signal. The simplest realization of this idea is to use an arbitrarily
complex gating function, and place an AND gate just before the clock
pin on a flip-flop. This idea has merit when large blocks of logic can be
disabled under the same conditions. Placing an AND gate on the clock
signal prevents the flip-flop cell from “seeing” the clock, which prevents
the flip-flop from updating state. This also prevents any
switching activity from taking place on downstream loads of the flip-flop output.
The second technique is a fine-grained approach, which uses a specific
clock-enable function for every stage of logic to be gated. This
technique potentially saves switching power during normal operation of
the circuit, and can be used in parallel with the coarse-grain technique.
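A minimal behavioural sketch of the first technique, an AND gate in series with the clock, is shown below. It is a cycle-level model with hypothetical names, and it assumes the enable is stable while the clock is high; it deliberately ignores the glitch hazards a real gating cell must guard against.

```python
def gated_clock(clk, enable):
    """AND gate in series with the clock: the flop sees clk only when enabled."""
    return clk & enable


def simulate(data, enables):
    """Clock a rising-edge D flip-flop once per cycle through an AND-gated clock.

    Returns the flop output after each cycle and the number of rising clock
    edges that actually reached the flop's clock pin.
    """
    q = 0
    outputs = []
    edges_seen = 0
    for d, en in zip(data, enables):
        # One cycle: the clock goes low, then high. The flop samples d only
        # if the gated clock shows a rising edge.
        low, high = gated_clock(0, en), gated_clock(1, en)
        if low == 0 and high == 1:
            q = d
            edges_seen += 1
        outputs.append(q)
    return outputs, edges_seen


outputs, edges_seen = simulate(data=[1, 0, 1, 1, 0], enables=[1, 0, 0, 1, 1])
```

With the enable low in two of the five cycles, the flop holds its state in those cycles and sees only three clock edges, so neither the flop nor its downstream loads switch in the gated cycles.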
2.1.2.1 Coarse Grain Gating
Sometimes it is desirable to disable large blocks of sequential logic
simultaneously. In this case, the clock signal is modified with combinational
logic. This allows the clock to be gated at a single point, and all
sequential elements inside the affected block share the same gated
clock signal. This technique is applicable to large System-on-Chip designs,
where blocks may have highly disjoint functions [35]. This technique is also
a natural fit with software-controlled power-down modes, because the
gating function can be the output of a software configured register. It
works especially well when portions of the chip perform orthogonal
functions, such as a processing unit for floating-point arithmetic. For
example, well-designed software can detect when floating-point code is
needed or not, and correspondingly power up/down the FPU logic.
As with the per-stage technique described above, this method also has
implementation risks. When the clock is re-enabled after being
disabled, it may create large current draws if many logic cells must be
charged simultaneously. This is known as the step-power problem. The
step-power issue may in turn cause ground bounce issues, noise, and may
possibly affect the integrity of the clock net itself [23]. Care must be taken
in both logic design and software control of any large clock gating
domains, so as not to violate electrical integrity of the chip.
2.1.2.2 Fine Grain Gating
The second method by which flip-flops can be gated is a technique
known as a synchronous load-enable [49]. This load-enable is
implemented as a discrete pin, with a specific function on each flip-flop
cell. Informally, such cells can be called EDFFs (enable D flip-flops). This
technique does not modify the clock net itself; rather, it assumes that the
flip-flop element is architected to gate the clock internally. This can
be implemented either by preventing the sequential output from
changing, or by preventing the transistors inside the EDFF from seeing the
clock. From a functional standpoint, both are equivalent.
Traditionally, logic synthesis tools will insert a feedback multiplexer just
before the input to the DFF as a mechanism for maintaining state.
Functionally, this achieves the goal of preserving state, but does so in an
inefficient way. The EDFF implementation, as described above, serves as
a general replacement for this scheme. It has many advantages,
particularly setup time, area, load, and energy efficiency. In cases where
EDFFs are available in a synthesis technology library, logic synthesis tools
will use these cells instead of feedback multiplexers.
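The functional equivalence of the two schemes, and the difference in clocking activity, can be sketched with a small cycle-level model. The function names are hypothetical, and the clock-event counts assume the EDFF suppresses its internal clock whenever the enable is low.

```python
def dff_with_feedback_mux(q, d, enable):
    """Plain DFF clocked every cycle; a mux recirculates q when enable is 0."""
    next_d = d if enable else q
    return next_d  # value captured at the clock edge


def edff(q, d, enable):
    """Enable flip-flop: the internal clock is suppressed when enable is 0."""
    if enable:
        return d  # captured at the clock edge
    return q  # no internal clock event, so the state is held


def run(step, data, enables, q0=0):
    """Apply one flip-flop model over a sequence of cycles."""
    q, trace = q0, []
    for d, en in zip(data, enables):
        q = step(q, d, en)
        trace.append(q)
    return trace


data = [1, 0, 1, 1, 0, 0]
enables = [1, 0, 0, 1, 0, 1]
mux_trace = run(dff_with_feedback_mux, data, enables)
edff_trace = run(edff, data, enables)

# The mux-based flop is clocked on every cycle; the EDFF only when enabled.
mux_clock_events = len(data)
edff_clock_events = sum(enables)
```

Both models produce the same output trace, but in this example the mux-based flop is clocked six times while the EDFF is clocked three times, which is the source of the energy advantage described above.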
2.1.3 Intelligent Gating
Of primary interest is how we can select or generate a signal which will
successfully gate the clock to a flip-flop, while maintaining sequential
correctness. There have been several research efforts on this front,
notably [1] and [8]. Benini et al. [8] define the idea of an activation
function to describe the update conditions under which a flip-flop
changes state. However, Benini et al. deal only with Finite State
Machines, where self-loops are more evident. A broader strategy can be
implemented by a technique known as precomputation [1].
Precomputation is achieved by analyzing the combinational logic of a
particular stage, and then computing a set of Observability Don’t Cares
(ODC) for inputs, driven from preceding flip-flops. These ODC values are
then used to generate an activation function for the preceding stage.
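As a small illustration of the idea, the sketch below enumerates the observability don't cares of one input of a 2:1 multiplexer and derives an activation function for the register driving that input. It is a brute-force sketch for a toy stage, not the algorithm of [1]; the helper names are hypothetical, and the model ignores when the other inputs become available relative to the clock edge.

```python
from itertools import product


def observability_dont_cares(f, names, target):
    """Return assignments of the other inputs under which `target` cannot
    affect f, i.e. where f with target=0 equals f with target=1."""
    others = [n for n in names if n != target]
    odc = []
    for values in product([0, 1], repeat=len(others)):
        env = dict(zip(others, values))
        if f(**env, **{target: 0}) == f(**env, **{target: 1}):
            odc.append(env)
    return odc


def mux(a, b, s):
    """Example stage logic: the output follows a when s is 1, otherwise b."""
    return a if s else b


names = ["a", "b", "s"]
odc_b = observability_dont_cares(mux, names, "b")


def activation_b(a, s):
    """The register driving b needs to load only outside b's ODC set."""
    return 0 if {"a": a, "s": s} in odc_b else 1
```

For this stage the ODC set of b is exactly the set of assignments with s = 1, so the derived activation function reduces to NOT s: the register feeding b can be disabled whenever the multiplexer selects a.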
A key choice to make when generating an activation function is to
decide whether to use only pre-existing wires in a design, or to generate
new logic, which can possibly increase the probability that a DFF
becomes disabled. This becomes a tradeoff in area, which potentially
affects power as well. The approach of Alidina et al. [1] always generates
new logic to implement this activation function. However, this is clearly
sub-optimal from an area perspective. Tiwari et al. [51] choose a
different approach, strictly re-using already existing signals in a given
design. However, the strategy of Tiwari et al. fundamentally differs in the sense
that they attempt to gate all elements in a design, regardless of whether
they are combinational or sequential. In a purely synchronous design, this
does have an advantage in reducing the switching frequency of all
elements, not just outputs from EDFFs.
Several companies in the EDA industry are now emerging that
aim to exploit these techniques. For example, Calypto Design Systems
sells a tool called PowerPro which can automatically add activation
function logic to an existing RTL design [14]. Calypto uses proprietary
sequential analysis methods, similar in nature to the techniques presented
in [1][8] and [51]. This provides a valuable resource for logic designers,
resulting in potentially better clock gating performance than
engineer-specified gating alone.
2.2 Automated De-Synchronization
Efforts at automating asynchronous circuit design can be classified into
two major groups. The first group uses a high-level synthesis technique, in
which a special-purpose language is used to generate an asynchronous
circuit. These languages are usually designed to encapsulate
asynchronous design concepts in the way least intrusive to the
digital designer. The second group uses a de-synchronization approach,
in which a synchronous design is translated to an asynchronous form. The
value of the second approach is that it is compatible with existing EDA
flows and does not require a new design language to gain the benefits of
asynchronous circuits.
2.2.1 De-synchronization Goals
Three common challenges have emerged from efforts to create a viable
de-synchronization flow. Correctness, which represents the equivalence
of a synchronous circuit to an asynchronous one, is a fundamental
requirement. Area consumption is also an issue, since most asynchronous
circuits have a larger area footprint than synchronous ones. Lastly, CAD
integration is a key element, since industry acceptance often is based on
simplicity of design flow.
2.2.1.1 Correctness
The first and most basic challenge is one of correctness. Correctness
means that the asynchronous circuit must produce the same sequential
results as the synchronous version. One of the earliest proofs of
correctness for de-synchronization was done by Linder and Harden [33].
The proof is based on the assertion that if the asynchronous system is
properly initialized with tokens on all signals corresponding to sequential
elements in the synchronous circuit, then the circuit will operate correctly.
Both Linder and Harden [33] and Cortadella [19] show proofs of correctness for
de-synchronized circuits. Cortadella’s proof is referred to as
flow-equivalence, and uses a marked-graph model as a formalism upon which
the proof is based.
2.2.1.2 Area
The second challenge is area consumption. Many asynchronous design
styles result in larger area consumption compared to a synchronous
implementation. There are two major causes for this: handshaking control
and channel signaling style. Since asynchronous systems must preserve
their own local firing order, circuitry must be present to handshake
properly with all input blocks and all output blocks. This circuitry is usually
implemented in parallel with the regular datapath logic. In addition to
this, binary values are usually encoded differently from the purely 2-state
representation of synchronous circuits. Values must often encode
additional timing information, and this encoding may require more wires
than the synchronous version. These additional overheads of
handshaking control and wiring often yield worse area consumption.
2.2.1.3 CAD Flow
The last challenge, as summarized above, is CAD integration. In order to
advance the idea of using asynchronous logic as a viable design choice,
asynchronous logic design must be compatible with the industry EDA
design flows. Too many special asynchronous-specific design steps will
serve as a deterrent to general adoption. A robust asynchronous design
flow will be EDA tool aware, and EDA tool friendly.
2.2.2 De-synchronization Techniques
There have been notable efforts in circuit de-synchronization. We present
a summary of previous results here, and highlight differences with our
presented de-synchronization approach.
2.2.2.1 High Level Synthesis
High-level circuit synthesis has been implemented in several novel
ways. One of the first is based on Communicating Sequential Processes
(CSP), and a simple refinement of this idea called Communicating
Hardware Processes (CHP). Martin [37] proposed a straightforward
method for translating a CHP description into self-timed circuits,
through a refinement process that generates circuits from the CHP
description. Other significant tools for synthesizing asynchronous
circuits directly are Tangram [30][9] and Balsa [3][4][15][16].
These two languages take essentially the same approach of syntax-
directed translation to generate handshaking circuits.
2.2.2.2 Phased Logic
For the de-synchronization approach, there have been several distinct
methods proposed. One of the first and most successful efforts is Linder
et al.’s Phased Logic [33]. Phased Logic is notable because of its successful
generation of delay insensitive circuits, and wide array of published
example designs [44][43]. Phased Logic uses Level Encoded Dual Rail
(LEDR) signaling between gates, and has a notion of Even and Odd
“phases” of computation. These concepts are used to preserve the
correct sequential evaluation of gates. This de-synchronization uses
transparent latches as state-holding elements, and uses standard gates
present in most technology libraries. There also exist tools which provide
debug and simulation support to this de-synchronization methodology
[25]. Phased Logic circuits have been successfully implemented in FPGAs
in [44][43], and also support the concept of early evaluation, where a
gate can fire once enough inputs have arrived to compute a known output
value [50].
2.2.2.3 Theseus NCL
A second major effort at de-synchronization is Theseus Logic’s Null
Convention Logic (NCL) [24]. NCL’s strategy is to convert all gate to gate
signaling into dual-rail channels, which include a null state to indicate
incompleteness of inputs. This null state captures timing information, and
gates with at least one input channel in the null state will not evaluate. A
gate produces an output when all inputs transition to a data state,
preserving sequential dependence. NCL uses VHDL as the design input
format, and specifies a particular style of RTL coding to map to NCL gates.
NCL designs undergo a two-step translation process: the first step
translates NCL to a 3NCL representation used for simulation, and the
second translates 3NCL to 2NCL for final implementation. In an NCL design
flow, custom logic cells are used for the final implementation, which are
optimized for null signaling. NCL is notable in the sense that it is
compatible with, and in some cases friendly to, much of the existing
synchronous design flow [32][31].
2.2.2.4 Handshake Circuits
Cortadella et al. proposed another style of de-synchronization using
specialized handshake circuits [11][19][20][21]. This method directly
replaces the clock network with specialized delay-insensitive
asynchronous controllers. These controllers are designed by hand, using a
special synthesis tool such as Petrify [18]. Cortadella et al. also
demonstrate several different controllers, each with varying degrees of
concurrency. In the papers presented, the controllers are represented as
Signal Transition Graphs (STGs) [38]. The design flow for using these
asynchronous controllers also greatly resembles a standard synchronous
design flow. Synchronous circuits are written at the RTL level, and passed
through a logic synthesis tool. The synchronous netlist is then passed to a
custom translation tool, which removes the clock network and flip-flops,
replacing them with latches and asynchronous controllers. Timing analysis
is performed on the synchronous netlist, to determine the maximum delay
values between successive flip-flops. These delays are then preserved on
the new asynchronous controller handshake signals.
Andrikos et al. expanded this research and performed a detailed power
and throughput analysis on a DLX microprocessor [2]. The results were
positive. Although the worst-case throughput decreased by 20%, there was
a 90% probability that the asynchronous circuit would outperform the
synchronous one, based on environmental operating conditions. This
de-synchronization incurred a 13.5% area penalty and approximately a
17.5% power penalty. Given the superior average-case
performance of this circuit, the costs presented are quite reasonable.
2.2.2.5 Gate Transfer Level Mapping
Taubin et al. [47] proposed an approach that bears substantial similarity to
our proposed flow. Taubin et al. introduce the term “Gate Transfer Level”
(GTL) to describe pipeline stages. This term is essentially equivalent
to our notion of template-based asynchronous design, where each leaf-
cell in the circuit implements self-timed behavior, and consistent
communications channel signaling. Taubin et al. use dual-rail channel
signaling, with a QDI behavior for all leaf nodes. The first step of the
design process is a synthesis pass, where a GTL description of a circuit is
generated by a commercial synthesis tool (Synopsys Design Compiler™).
The second step is a custom translation process implemented by a special
purpose CAD tool, called Weaver [47]. This step translates the single-rail
GTL circuit into a dual-rail QDI circuit. Lastly, the same commercial
synthesis tool is again used to translate the QDI dual-rail netlist into an
optimal technology cell implementation.
The Weaver tool performs several critical transformations which mimic the
process we present here. All signals between successive stages are
bundled into channels, and all channels use a four-phase handshaking
protocol, implementing QDI behavior. Smirnov et al. [46] also
introduce the ability to alter the slack of the asynchronous circuit,
removing and adding pipeline stages to optimize total system throughput.
However, no quantifiable results from slack matching are presented in
[46], and thus that work is not compared to the results herein.
2.2.2.6 Summary
All de-synchronization strategies described above were in many ways
evolutionary steps beyond each previous effort. The trend is toward
increasing automation, and increasing ability to use existing EDA tools to
perform the circuit implementation workload. We assert that the
Proteus tool is an incremental improvement over the previous de-
synchronization techniques. We present in Chapter 4 our de-
synchronization methodology, which uses a pipeline stage mapping to
custom technology cells. This process can be generalized to nearly all
known asynchronous templates.
2.3 Asynchronous Design
2.3.1 Motivation
Asynchronous circuits are beginning to attract attention as a viable
implementation for VLSI circuits. Many studies have been done on the
benefits of using asynchronous circuits, and collectively they make a
powerful argument for using asynchronous design [10][40][5]. Modern
applications require high throughput, low latency, and reasonable power
consumption. Asynchronous design is uniquely positioned to provide
those benefits, with some added flexibility over synchronous flows.
High data throughput is certainly a dominant criterion when selecting a
circuit implementation. Asynchronous circuits provide a very attractive
average-case performance in non-extreme operating conditions.
Furthermore, strategic selection of the asynchronous design style will result
in a very high throughput circuit. For example, very high throughputs can
be achieved with the Single Track Full-Buffer (STFB) design style [26]. At the
same time, this high performance can be achieved without the tedious
and time consuming timing-closure cycle of synchronous design flows.
While cycle time is ultimately dependent on logic depth and wire delay,
asynchronous design does not require multiple iterations to eliminate
setup and hold time violations for a fixed cycle time [10][5].
Another motivating factor in adopting an asynchronous design style is
CMOS feature variability. It is becoming increasingly obvious to VLSI
engineers that process variability in small geometries is making the
synchronous design process quite difficult. As a result, the timing
margin in synchronous design increases by large increments with each new
process generation [42]. This margin can go largely unused, creating
great sub-optimality in circuit timing. Asynchronous design neatly
addresses this problem due to its self-timed nature, inherently running at
actual silicon-dictated delays. In many cases this can almost double the
effective frequency [10][5].
2.3.2 Fundamental Concepts
The design and optimization of asynchronous circuits takes advantage of
special abstractions, in order to aid the designer. These abstractions are
key to understanding the behavior of existing circuits, and form a basis on
which to compare new circuit styles. A summary of these concepts is
described here, and the included references can be consulted for further
detail.
2.3.2.1 Token-Flow
Asynchronous computation is represented via token-flow, in which active
stages of logic are said to have a token. Tokens propagate from one
stage to another as logic evaluates, with the token marking the stage
currently evaluating. This token flow is implemented by local
handshaking, to properly enforce data dependencies between adjacent
stages. This handshaking usually takes one of two forms, either 2-phase or
4-phase handshaking [48]. This handshaking style is nearly always a
product of the design style used in the entire circuit. Handshaking signals
are bundled along with the data signals, and collectively they form a
channel. Channels are communications links which span different
pipeline stages [5].
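As an illustration of the 4-phase (return-to-zero) handshake mentioned
above, the following minimal sketch steps a sender and a receiver through
the four phases for each data item. The names req and ack and the function
four_phase_transfer are generic placeholders chosen for this sketch; they
are not the channel signals of any particular template.

```python
def four_phase_transfer(values):
    """Transfer each value with one full 4-phase handshake.

    Returns the values seen by the receiver and the (req, ack, data) trace.
    """
    req, ack = 0, 0
    received = []
    trace = []
    for value in values:
        req = 1                          # phase 1: sender drives data, raises request
        trace.append((req, ack, value))
        received.append(value)           # receiver consumes data while request is high
        ack = 1                          # phase 2: receiver raises acknowledge
        trace.append((req, ack, value))
        req = 0                          # phase 3: sender withdraws request and data
        trace.append((req, ack, None))
        ack = 0                          # phase 4: receiver lowers acknowledge
        trace.append((req, ack, None))
    return received, trace


received, trace = four_phase_transfer([3, 7])
for step in trace:
    print(step)
```

Each item costs four signal transitions on the control wires, and both
wires return to zero before the next item is sent. A 2-phase protocol would
instead treat every transition of req or ack as an event.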
2.3.2.2 Templates
In order to evaluate the effectiveness of any given asynchronous circuit
structure, it is important to characterize what kind of logic is used for
computation, and how different pipeline stages interact. The scheme
used in this research is a template-based design style, in which a
particular circuit is implemented in discrete stages, analogous to
synchronous design [39]. Each pipeline stage has a regular, repeating
structure, which communicates using a template-specific signaling
scheme. Template stages can implement either a full-buffer or half-buffer
capacity model [17]. A full-buffer can hold tokens on both its inputs and
outputs simultaneously, whereas a half-buffer cannot. Template stages which
generate tokens on reset are known as TOKBUFs.
2.3.2.3 Timing Model
Different asynchronous design styles also have different timing models.
These timing models describe what assumptions can be made about the
ordering of various signal transitions. There are many models in theory;
however, in practice most asynchronous circuits use either Quasi Delay
Insensitive (QDI) or Bounded-Delay (BD) timing models [48]. The QDI
model assumes that wires and gates may have arbitrary delays, except in
cases where a wire forks to more than one load. In this case, an
assumption called an isochronic fork is made, which bounds the arrival
time to all fanouts within one gate delay of each other. By comparison,
the BD model assumes that the circuit will function within a minimum and
maximum delay. Sometimes this model bounds only one side, requiring
for example only a minimum delay. In many cases, the circuit designer
may choose templates with less timing robustness for greater throughput,
or choose a QDI template for greater ease in timing closure with a small
tradeoff in throughput [5].
2.3.2.4 Delay Model
Like synchronous circuits, asynchronous circuits have a cycle time, and it
is an important attribute to measure.
The local cycle time (LCT) is defined as the time needed for a single
pipeline stage to receive a token on its inputs, evaluate the logic, and
reset itself to process another input token [48]. It represents the delay
required to process one data unit, and ready itself for the next unit. This
delay is also subdivided into two components, a forward latency (FL) and
a backward latency (BL) [6]. FL is equivalent to the delay required for a
data token to propagate from a stage input to the next stage input. BL is
the delay from the end of the FL to the time the next input can be
processed. These delays have the relationship LCT = FL + BL. Since the
structure of these pipeline stages is often the same across technology
generations, it becomes useful to measure time as a function of transistor
switches. The term transitions is used to describe the time between one
signal change and another, with 1 transition equal to the switch time of
one transistor.
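The relationship above can be worked through with a small sketch. The
latency values (2 and 8 transitions) and the 50 ps transition delay are
hypothetical numbers chosen only for illustration; they are not
measurements from this work.

```python
def local_cycle_time(forward_latency, backward_latency):
    """LCT = FL + BL, with all quantities measured in transitions."""
    return forward_latency + backward_latency


fl, bl = 2, 8                      # hypothetical latencies, in transitions
lct = local_cycle_time(fl, bl)     # transitions per token for one stage

transition_delay_ps = 50           # hypothetical delay of one transition
lct_ps = lct * transition_delay_ps
print(lct, "transitions =", lct_ps, "ps per token")
```

Measuring in transitions keeps the comparison between templates independent
of the process, and the conversion to absolute time is a single
multiplication once a transition delay is known.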
2.3.3 Asynchronous Features
Asynchronous circuits feature system-level behaviors which can differ
substantially from synchronous logic. It is important to highlight those
features which can affect the overall throughput, power, and area of
asynchronous design. In so doing, a relationship can be established to
existing synchronous behaviors, and hence form a good basis for
comparison.
2.3.3.1 Data-Driven Evaluation
Asynchronous circuit styles lend themselves to good power and
energy efficiency, by virtue of their data-driven nature. Asynchronous
circuits only switch when new data is present, and data propagates to the
next stage(s) when evaluation is complete. When no data is present, an
asynchronous logic stage will not switch, and will only switch again when a
new request is presented by a driving cell. In most cases, this behavior
leads to switching activity linearly following utilization [10]. The lower
the utilization of the total token capacity of the circuit, the lower the
dynamic power
consumption. There have been references to this behavior as “perfect
clock gating” [52], although we show in Chapter 3 that this is not
necessarily true.
2.3.3.2 Implementation Choice
A useful feature of asynchronous circuits is the availability of various circuit
styles from which to choose. In addition to providing a single timing
model for implementing asynchronous circuits, template-based designs
offer regular structures and consistent signaling methods. Using a
template-based design style, several different circuit templates have
been designed and proven in recent years [40]. An optimal circuit
style can be chosen from these templates to best suit the design goals of
the target circuit. Each template has different throughput, latency, area,
and power attributes which can be considered when selecting the
optimal mapping [5]. For example, the Single Track Full Buffer (STFB)
template offers very high throughput, but has poor area consumption
[26]. By comparison, the Bundled-Data template offers high area
efficiency, but at a cost of low throughput.
2.3.3.3 Dynamic Logic
Another important property of asynchronous circuits is the ease of using
dynamic logic. Dynamic logic offers some excellent advantages over
static logic, and offers superior speed with potentially less area and power
consumption [27]. Asynchronous circuits can efficiently exploit this logic
style, due to the extra state information encoded in the four-phase
handshake. The pre-charge and evaluate phases of a dynamic circuit
can be designed to match the evaluate and reset phases of an
asynchronous cell with virtually no difficulty. This confers the benefits of
dynamic logic without extraneous pre-charge and evaluate events that
can occur in a synchronous circuit.
2.3.3.4 Global Cycle Time
An important measurement for whole circuits is the global cycle time
(GCT) [6]. GCT represents the effective cycle time of an entire circuit. The
GCT is determined by all pipeline stages’ LCT delays, and how they
interact across communications channels. The GCT is the correct cycle
time to compare against a synchronous implementation, since it
represents the aggregate throughput of the whole circuit. GCT can be
optimized by rearranging the pipeline stages, and by distributing logic
gates across several stages. This introduces the concept of slack, which
represents the token capacity of a circuit. Two circuit concepts which
affect GCT are slack-elasticity and slack-matching, described below.
Most asynchronous circuits also have the very important property of being
slack-elastic. Presuming that the circuit to be modified does not use
arbiters (which will not occur in de-synchronized circuits), asynchronous
circuits can have their slack increased by an arbitrary amount [36].
Having a slack-elastic circuit allows an optimization called slack-
matching, which can be used to decrease the GCT of a circuit. It works
by adding buffer stages in strategic places to increase throughput, and
was formalized by Beerel et al. in [6] by formulating the problem as a
mixed-integer linear program (MILP). This ability can be exploited during
the design implementation phase by selectively adding buffers in certain
sections of the circuit, which yields a near-ideal global cycle time. It
also makes it possible to effectively re-pipeline a given design to achieve
a wide range of final global cycle times, and frees the design engineer
from making large tradeoffs in pipelining granularity during the RTL design
phase.
2.3.3.5 FBCN Model
The performance of an asynchronous circuit can be characterized using
a special marked-graph model, called a Full-Buffer Channel Net (FBCN)
model [6]. This model allows asynchronous designers to find critical cycles
in the design, and is the formalism upon which the slack-matching
optimization described above is based. FBCN models are a useful tool for
visualizing pipeline stage delays, and their interaction with neighboring
stages. An example of an FBCN model is shown in Figure 1. In this model,
a timed marked graph is drawn, which is a specialized Petri Net [38].
Transitions represent pipeline buffers, arcs represent forward latency or
backward latency, and places represent token presence. The reader is
referred to Beerel et al. [6] for greater detail regarding the symbology
and interpretation of the FBCN model.
Figure 1 : Full Buffer Channel Net example
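To make the idea of a critical cycle concrete, the sketch below computes
the cycle time of a small timed marked graph as the largest ratio of total
delay to token count over all simple cycles, which is the standard
performance bound for marked graphs. The ring topology, the latencies
FL = 2 and BL = 8, and the helper names cycle_time and ring are assumptions
made for this illustration; the ring is not a reconstruction of Figure 1.
Each channel is modeled with a forward arc that is marked when the channel
holds a token and a backward arc that is marked when the channel is empty.

```python
from fractions import Fraction


def cycle_time(num_nodes, arcs):
    """Return the largest delay/token ratio over all simple cycles.

    arcs is a list of (src, dst, delay, tokens) tuples.
    """
    out = {node: [] for node in range(num_nodes)}
    for src, dst, delay, tokens in arcs:
        out[src].append((dst, delay, tokens))

    best = Fraction(0)

    def walk(start, node, delay, tokens, visited):
        nonlocal best
        for nxt, d, t in out[node]:
            if nxt == start:
                if tokens + t == 0:
                    raise ValueError("cycle without tokens: the graph deadlocks")
                best = max(best, Fraction(delay + d, tokens + t))
            elif nxt > start and nxt not in visited:
                walk(start, nxt, delay + d, tokens + t, visited | {nxt})

    # Each simple cycle is enumerated once, from its lowest-numbered node.
    for start in range(num_nodes):
        walk(start, start, 0, 0, {start})
    return best


def ring(num_stages, tokens_at, fl=2, bl=8):
    """Build a ring of full-buffer stages; tokens_at lists occupied channels."""
    arcs = []
    for i in range(num_stages):
        j = (i + 1) % num_stages
        full = 1 if i in tokens_at else 0
        arcs.append((i, j, fl, full))        # forward arc: data token
        arcs.append((j, i, bl, 1 - full))    # backward arc: empty slot (bubble)
    return arcs


print(cycle_time(4, ring(4, {0})))       # one token in a four-stage ring
print(cycle_time(4, ring(4, {0, 2})))    # two tokens in a four-stage ring
```

With these illustrative values the one-token ring is limited by the
backward cycle (32 transitions shared by 3 empty slots), and adding a second
token makes the ring slower (32 transitions shared by 2 empty slots) rather
than faster. This is the kind of imbalance that slack matching addresses.
Exhaustive cycle enumeration is only practical for small graphs.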
2.3.4 Summary
Asynchronous circuits can be compared to synchronous circuits by the
same area, performance, and power metrics, despite their fundamental
differences. Powerful abstractions have been created for asynchronous
logic, which allow designers to focus on circuit function rather than
implementation. The use of template-based circuit design, the FBCN
model, and pipeline optimization make designing and optimizing
asynchronous circuits very straightforward. Indeed, many design choices
forced on the synchronous circuit designer, particularly pipeline
granularity, are removed in the asynchronous domain. Asynchronous
circuits exhibit environment-driven performance, eliminating the tedious
burden of fixed timing closure. With these advantages in mind, we
propose another design template for the circuit designer to choose from:
GMLD.
3 Gated Multi Level Domino Circuit
This thesis presents a new asynchronous template design. This template is
intended to provide high throughput, low latency, and low power. The
low power goal is intentionally subordinate to high throughput. Circuit
designs with high throughput requirements are the best candidates for
profiting from this technology, and thus high throughput remains the
primary research goal. A variety of asynchronous templates have
already been proposed addressing high throughput, and it is sensible to derive
a lower power template from one of these. With these guidelines in mind,
the Gated Multi-Level Domino (GMLD) template is described here in
detail, which is a derivation of the Multi-Level Domino (MLD) asynchronous
template.
Broadly speaking, this template is composed of two parts: a data path
and a control path. The data path implements the computational logic,
both combinational and sequential, represented in the original circuit.
The control path implements a 4-phase handshake to ensure correctness
and the preservation of logical dependencies. The two interact through
a small number of key control signals, described in detail below. The data
path is implemented using efficient dual-rail domino logic [27], and
channels are represented in dual-rail form. The control path uses a unique
dual-rail request scheme, described herein. A channel is defined by one
or more dual-rail signals, coupled with a dual-rail request and an
acknowledge.
3.1 MLD
The structure and behavior of a GMLD pipeline stage is substantially
similar to MLD. It is most useful then to first describe all the properties of
MLD, and proceed to describe the power saving refinements
implemented in GMLD.
3.1.1 Data Path
Data path logic inside an MLD stage has only a few hard requirements to
conform to MLD template rules. Logic in a given stage may have a fairly
arbitrary structure, and will likely be determined by a logic clustering
algorithm (presented in Chapter 4) without regard to logic function. It is
required that each bit be carried on 2 wires in dual-rail form, and that all
logic cells be implemented as domino logic. Each logic cell is enumerated in a
special domino cell library, which can be selected as a technology library
of any commercial logic synthesis tool.
3.1.1.1 Dual-Rail Domino Cells
Each domino logic cell primitive has a similar structure. Since these cells
are dual-rail, two circuit structures exist inside each cell, one for the true
rail and another for the false rail. A diagram of one domino logic cell is
shown below in Figure 2.
Figure 2 : 4-Input Domino Cell
Inputs to each leaf cell drive only NMOS transistors. There is one “level” of
NMOS transistors, in series with only one PMOS transistor. The PMOS
transistor, as in all implementations of domino logic, is driven by a special
pre-charge signal, driven in this case by the MLD control logic. The node
where the PMOS and NMOS stacks intersect is the internal node, which is
pulled high during pre-charge, and pulled low during evaluation. This
internal node then drives an inverter, which allows the output to drive
other cells. Along with the inverter, there is another small staticizer
inverter, used to keep the internal node stable when pulled high or low.
This inverter prevents electrical noise from erroneously switching the
internal node state.
3.1.1.2 V_LOGIC Domino Cells
Cells which drive bits which fan out to other MLD stages must be special
cells, known as V_LOGIC. V_LOGIC cells are nearly the same as regular
domino cells. In addition to some other control signal inputs, an additional
output signal is generated. This signal is the valid net, which indicates
when the dual-rail output has been driven to a valid, non-neutral value. It is
logically equivalent to an OR of each dual-rail output signal. A diagram
of one V_LOGIC logic cell is shown below in Figure 3. In addition to
various domino logic cells, there are equivalent V_LOGIC versions of each
cell in the MLD technology library. These V_LOGIC cells are not needed
during logic synthesis, because they are determined only after each MLD
stage is clustered. The final drive cell can be converted to its V_LOGIC
counterpart after each stage has been defined. It will be common for an
MLD stage to have many output bits, and thus an aggregate valid signal
is generated for all outputs. This signal is a logical AND of all V_LOGIC
valid signals. There is a special cell designed to perform this function,
called a COMPLETE cell.
Figure 3 : An example V_LOGIC cell
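The valid net and the COMPLETE cell described above can be modeled
behaviorally as follows. This is a logic-level sketch only: the tuple order
(false rail, true rail) and the function names are assumptions made for the
example, and no transistor-level behavior is represented.

```python
NEUTRAL = (0, 0)   # both rails low: no data present


def encode(bit):
    """Encode one bit as (false_rail, true_rail); exactly one rail is high."""
    return (0, 1) if bit else (1, 0)


def valid(rails):
    """Valid net of one V_LOGIC output: the OR of its two rails."""
    false_rail, true_rail = rails
    if false_rail and true_rail:
        raise ValueError("both rails high is not a legal dual-rail code")
    return bool(false_rail or true_rail)


def complete(outputs):
    """COMPLETE cell: the AND of the valid nets of all stage outputs."""
    return all(valid(rails) for rails in outputs)


print(complete([encode(1), encode(0)]))   # every output carries data
print(complete([encode(1), NEUTRAL]))     # one output is still neutral
```

The aggregate valid signal only rises once every output has left the
neutral state, which is what lets a single signal stand in for the whole
output channel during handshaking.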
An important distinction to make is the extra control signals on the
V_LOGIC cells. Unlike the regular domino cells, V_LOGIC’s pre-charge
and evaluate transistors are driven independently. Using this scheme,
V_LOGIC cells can be pre-charged and kept in that state indefinitely until
the control circuit asserts the evaluate signal, regardless of whether the
inputs have already arrived. This is useful to the control circuit, which uses
this property to correctly sequence the data path with preceding and
succeeding stages of logic.
3.1.1.3 TOKBUF Stages
There will be some MLD stages in a design which implement TOKBUF
functionality. For stages which fit this criterion, the data path will have
special cells which are substituted for V_LOGIC cells. All cells which drive stage
outputs will be TOK_BUF or TOK_EDFF leaf cells. These cells are
behaviorally simple, and implement nothing more than a logic buffer cell.
They exist to preserve a one to one correspondence with cells in a
synchronous netlist. They also have the functionality that a V_LOGIC cell
has; they have the valid signal, pc, eval, and reset. TOK_EDFF cells are
similar to TOK_BUFs, but have an additional input mimicking the Enable pin
of a flip-flop. When this pin is logic high, the TOK_EDFF updates state.
When logic low, the cell will drive the previous data. There is no restriction
on the amount or type of domino logic preceding the TOK_BUF or
TOK_EDFF cells in a TOKBUF stage.
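The enable behavior of a TOK_EDFF cell can be summarized with a short
behavioral sketch. The class name, the fire method, and the reset value of
0 are assumptions made for illustration; handshaking and the valid, pc,
eval, and reset signals are deliberately left out.

```python
class TokEdff:
    """Behavioral model of a token buffer with a flip-flop style enable."""

    def __init__(self, reset_value=0):
        self.state = reset_value       # value of the token generated on reset

    def fire(self, data, enable):
        """Process one input token and return the value driven on the output."""
        if enable:
            self.state = data          # enable high: update state
        return self.state              # enable low: drive the previous data


cell = TokEdff()
print(cell.fire(1, enable=True))
print(cell.fire(0, enable=False))
print(cell.fire(0, enable=True))
```

A TOK_BUF corresponds to the same model with the enable permanently high.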
3.1.1.4 Restrictions
We impose the following restrictions on data path logic
in order to achieve the desired high performance:
• There must not be more than 64 wires as inputs to an MLD stage.
This is to prevent excessive loading on the domino cell pre-charge
signal.
• There must not be more than 64 wires as outputs to an MLD stage.
This is to prevent excessive loading on the V_LOGIC control signals.
• Stages which contain flip-flops must have all flip-flops be the final
driving cell of all outputs. No MLD stage may have combinational
outputs mixed with flip-flop outputs.
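The restrictions above lend themselves to a simple automated check on each
clustered stage. The sketch below is one possible form of such a check; the
Stage record and its field names are hypothetical and do not correspond to
any tool described in this work. It only detects mixed output types, and
does not verify that every flip-flop is the final driving cell.

```python
from dataclasses import dataclass

MAX_STAGE_WIRES = 64   # limit on input wires and on output wires per stage


@dataclass
class Stage:
    input_wires: int
    output_wires: int
    flop_outputs: int     # outputs driven by flip-flop (TOKBUF) cells
    comb_outputs: int     # outputs driven by combinational cells


def check_stage(stage):
    """Return a list of violated restrictions (empty if the stage is legal)."""
    problems = []
    if stage.input_wires > MAX_STAGE_WIRES:
        problems.append("too many input wires")
    if stage.output_wires > MAX_STAGE_WIRES:
        problems.append("too many output wires")
    if stage.flop_outputs > 0 and stage.comb_outputs > 0:
        problems.append("combinational outputs mixed with flip-flop outputs")
    return problems


print(check_stage(Stage(64, 64, 0, 8)))
print(check_stage(Stage(65, 10, 2, 3)))
```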
3.1.2 Control Path
The controller of each MLD stage is the more complex component.
Accompanying the logic cells for each stage is one control
circuit. This circuit is implemented as a speed-independent block of logic,
which is custom-designed and used repeatedly for every stage in the
asynchronous circuit. This circuit will perform a 4-phase handshake with all
fanin and fanout MLD stages, and drive the pre-charge and evaluate
signals which govern the data path logic. By design, it will correctly
ensure all data dependencies are respected, under all conditions. There
are two versions of this controller, one for combinational stages, and
another for sequential stages. Both are described in detail below.
3.1.2.1 Full-Buffer Isolate Controller
The control circuit is called the FBI controller, after the Full-Buffer Isolate
(FBI) control scheme proposed by Singh and Nowick [45]. This scheme
describes the necessary behavior to implement an asynchronous full
buffer pipeline cell, where a token may be present on both the input
channel(s) and the output channel(s) concurrently. The FBI scheme
describes correct sequencing to ensure this data flow case behaves
correctly, and neither deadlocks nor loses safeness.
Figure 4 : MLD Stage Block Diagram
MLD Signal    Meaning
L.0           Left request
L.e           Left acknowledge
In[1:0]       Left data input rails
en            Enable
pc            Pre-charge
eval          Evaluate
V             Valid
R.0           Right request
R.e           Right acknowledge
Out[1:0]      Right data output rails
The FBI block diagram is shown in Figure 4. A table showing the relevant
control signals and their meaning is shown in Table 1. There is a
request/acknowledge signal pair on the left side to handshake with fanin
stages. Likewise, there is another request/acknowledge signal pair on the
right side to handshake with fanout stages. At various times during
handshaking, the FBI controller will drive the pre-charge and enable lines
to the data path to sequence computation. Control over the data path
evaluation can be thought of in two parts. The first part, the domino
logic, is controlled with the en signal. A single en signal is used to both
pre-charge and evaluate this logic. When this signal is low, the domino
logic is in pre-charge mode. When high, the domino logic is in evaluate
mode. The second part contains the V_LOGIC, which are the set of cells
which drive outputs from the stage. V_LOGIC uses a separate pre-charge
signal, pc. When pc is low and eval is low, the V_LOGIC pre-charges.
When pc is high and eval is high, V_LOGIC evaluates. pc must never be
low when eval is high, or else the PMOS and NMOS transistors will both be
conducting at the same time, resulting in a short from Vdd to Gnd.
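The legal combinations of pc and eval for a V_LOGIC cell can be tabulated
with a small sketch. The function name and the mode label "hold" are
assumptions introduced for this example; "hold" stands for the case where
neither the pre-charge nor the evaluate transistor conducts, so the cell
keeps its current state until the controller acts.

```python
def v_logic_mode(pc, eval_):
    """Classify the V_LOGIC operating mode from the pc and eval levels."""
    if not pc and not eval_:
        return "precharge"     # PMOS on, NMOS foot off
    if pc and eval_:
        return "evaluate"      # PMOS off, NMOS foot on
    if pc and not eval_:
        return "hold"          # both off: state is kept by the staticizer
    # pc low with eval high turns both transistors on at once.
    raise ValueError("pc low with eval high shorts Vdd to Gnd")


for pc, ev in [(0, 0), (1, 0), (1, 1)]:
    print(pc, ev, v_logic_mode(pc, ev))
```

Moving between pre-charge and evaluate through the hold combination means
pc and eval never need to switch at the same instant, so the forbidden
combination can be avoided by ordering the two transitions.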
The behavior of this system is best described using two representations: a
Handshaking Expansion (HSE) description, and a Signal Transition Graph
(STG) description. HSE is described by Martin in [37]. An STG is described
by Chu in [17], and is simply a Petri Net which describes signal transitions.
Both are sufficient to completely describe, and implement, this controller.
We present both here for clarity.
3.1.2.2 FBI Controller Handshaking Expansion
[~_RESET]; L.e+, en-, eval-, _pc-; V-; [_RESET];
(en+, _pc+); eval+;
*([[V & L.0]; L.e-; en-; eval-;
   ([~L.0]; L.e+; en+), ([~R.e]; _pc-; [~V]; _pc+);
   [R.e]; eval+]) ||
*( [L.e]; L.0+; [~L.e]; L.0- ) ||
*( [V]; R.e-; [~V]; R.e+ ) ||
*( [en & eval]; V+; [~_pc]; V-)
Figure 5 : HSE FBI Controller Description
Shown in Figure 5 is the HSE description of the controller, along with the
HSE describing the behavior of the environment. It is important to note
that V and R.0 are the same signal in the control circuit. This HSE system
can be understood in 5 separate parts:
• First 2 lines: Defines the reset behavior of the FBI block.
• First parenthesis: Defines the response of the FBI control
block to environmental signals. In this block, R.0 does
not appear. V is considered equivalent to R.0.
• Second parenthesis: Defines the 4-phase handshake on
the Left Channel.
• Third parenthesis: Defines the 4-phase handshake on
the Right Channel.
• Last parenthesis: Defines the behavior of the V input.
The transitions of the V signal depend on both the pc
and eval signals.
3.1.2.3 FBI Controller Signal Transition Graph
A visual representation of this controller is shown in Figure 6. This diagram is
a Signal Transition Graph (STG) which describes the FBI controller. This STG
represents both the TOKBUF and non-TOKBUF controllers, although in
practice they are implemented separately. The reset sequence is not
shown for clarity. The initial marking of this STG corresponds to the point
after which the reset sequence has completed, with no tokens present on
either the right or left channel. This may be considered an “idle” state.
When an MLD stage has completed its pre-charge and is waiting for a
new data token to arrive, both the domino logic and the V_LOGIC are in
the evaluate state. This allows data to propagate immediately from an
input through the data path, and arrive at any fanouts. The domino and
V_LOGIC are then pre-charged when the handshaking allows it. The
domino logic will pre-charge when data on the left side is acknowledged,
and the V_LOGIC will pre-charge as soon as the current stage is
acknowledged by all fanouts.
Figure 6 : FBI Controller STG (transitions shown in the figure: L.0+, L.e-, en-, L.0-, L.e+, en+, eval-, eval+, V+, R.e-, pc-, V-, pc+, R.e+)
It is instructive to notice that R.0+ may occur before L.0+. This is a
desirable property, and implements the notion of early evaluation [50].
The FBI controller will ensure that the fanin and fanout handshakes
proceed correctly, and causality and safeness are preserved. This feature
decreases the GCT of the circuit, and reduces the burden of the FBI
control on LCT.
3.1.3 Timing and Throughput
An important attribute of any asynchronous template is the timing model
under which it operates. The MLD template is nearly QDI, with one
exception. It has one additional timing assumption, which is referred to as
implied neutrality. A diagram depicting this timing constraint is shown in
Figure 7.
Figure 7 : MLD Implied Neutrality Constraint
In this diagram, two timing paths are shown. One of the paths must be
faster than the other path at all times. This case will occur when pre-
charge transitions low (pc-). The transition to neutrality on all outputs (low)
must propagate to all fanouts before the R.0+ transition causes en+ to
occur in the fanout stages. In effect, this means that fanout stages must
not capture previous data on the data rails before they have been reset
to neutral. This requirement describes a one-sided timing assumption,
which should in practice be easy to meet. There are at least 4 transitions
from when the data rails go neutral at the source stage to when the
fanout stages will assert en+. This means that the wire delay on all data
rails must not exceed 4 transitions.
The local cycle time of an MLD stage is 18 transitions for a logic depth
(and width) of one. The forward latency is determined solely by the depth
of the data path logic. Each domino cell has a latency of 2 transitions, so
the FL becomes two times the logic depth. The backward latency of a
stage is 16 transitions. The presence of join C-elements and acknowledge
C-elements will increase the BL of a stage, by 2 transitions each.
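The transition counts above can be checked with a small calculation. The following Python sketch is only an illustration: the function name and parameters are hypothetical, and it assumes that the local cycle time is the sum of the forward and backward latencies, as in the delay equations given later for GMLD.

```python
def mld_stage_timing(logic_depth, join_c_elements=0, ack_c_elements=0):
    """Return (FL, BL, LCT) in transitions for one MLD stage.

    Assumptions for illustration: LCT = FL + BL, and every join or
    acknowledge C-element adds 2 transitions to the backward latency.
    """
    forward_latency = 2 * logic_depth  # 2 transitions per domino cell
    backward_latency = 16 + 2 * (join_c_elements + ack_c_elements)
    return forward_latency, backward_latency, forward_latency + backward_latency


# A stage of logic depth one reproduces the 18-transition cycle quoted above.
fl, bl, lct = mld_stage_timing(logic_depth=1)
print(fl, bl, lct)
```

For a logic depth of one with no extra C-elements, the sketch gives FL = 2, BL = 16, and LCT = 18, matching the figures stated in the text.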
3.2 GMLD
The central proposition of this research is to introduce a refinement to the
MLD asynchronous template. This new refinement is called GMLD, which
is a gated version of MLD. GMLD borrows most of its functionality from
MLD. The data path is largely unchanged, and the prime differences lie in
the control path. GMLD seeks to exploit the availability of the enable pin
on EDFFs in a synchronous circuit. This enable signal is used to disable
affected GMLD stages, causing them not to evaluate if the data inputs do
not change. This effect reduces dynamic switching power, and
potentially can reduce the forward latency to a constant value. The
representation of the enables and their behavior in a GMLD design is
described in detail later in this Chapter.
GMLD introduces an important distinction to the token-flow model of
asynchronous computation: two varieties of tokens. One kind of token is a
control token, which represents data flow without a re-evaluation of the
data elements. The other is a data token, which is equivalent to a
traditional asynchronous token. Control tokens preserve liveness and
safeness of an asynchronous system, allowing GMLD stages to fire in the
correct sequence. The fundamental difference is that a control token
always skips the evaluation phase of the data logic. Data tokens always
require the evaluation of the data logic.
3.2.1 Data Path
No modification of any data path element is required to implement
the GMLD template. One behavior is different, however; it
affects the data path but is not necessarily caused by it. GMLD
introduces the notion of persistent data. Persistent data means that when
a GMLD stage has finished evaluating, all output data rails remain in a
valid state until that stage re-evaluates. This is distinct from MLD, where
the data rails are guaranteed to go neutral after fanout stages signal an
acknowledge. Persistent data is required for GMLD to support the correct
function of control tokens. When a control token passes through a stage,
the previously computed data value will need to be driven to all
fanouts of this stage. By keeping the previous value intact on the data
output rails, correct data is implicitly stored and driven to all dependent
loads.
3.2.2 Control Path
The control path for GMLD uses a control circuit similar to that of the MLD
template. To reflect the enhanced functionality, the GMLD circuit is
called the FBIG controller, for Full-Buffer Isolate Gated. This controller
performs the same general functions as the FBI controller, but with
enhancements to support the idea of control and data tokens.
The biggest difference between the two FBI designs is the FBIG controller’s
use of dual-rail request bits. The FBIG controller uses a 1-of-2 encoding on
the request lines to indicate two different types of requests. These
requests correspond exactly to the two token types. Table 2 below shows
the encoding on these signals.
Signal    Meaning
L.0       Request with no data (control)
L.1       Request with update (data)
L.e       Acknowledge
Table 2 : Dual Rail GMLD Signals
Both the left and right channels use the identical encoding. The FBIG
controller drives all other data path control signals similarly to the FBI
controller. A block diagram of the FBIG controller and all its inputs and
outputs is shown below in Figure 8.
Figure 8 : FBIG Block Diagram
3.2.3 FBIG Design
The most critical component to the GMLD template is the design of the
FBIG controller. At a high level, it must:
• Implement the same speed-independent timing as the FBI controller
• Correctly distinguish between a control request and a data request
• Not introduce any burdensome timing assumptions on the GMLD
template
• Prevent as much data path evaluation as possible when
receiving a control request
• Correctly require evaluation of the data path when at least one
fanin stage indicates an update request
• Exhibit a forward latency competitive with MLD
• Exhibit a local cycle time competitive with MLD
3.2.3.1 FBIG Controller Signal Transition Graph
The STG for the FBIG controller is shown below in Figure 9. This STG shows
only the repeating transitions of the FBIG controller, and excludes the reset
sequence. Places are shown as circles, and places not representing
choice or joins have no label. This STG represents a pure unique-choice
Petri Net [38]. The initial marking of the STG shown in Figure 9 represents
the post-reset state of a TOKBUF stage. For a non-TOKBUF pipeline stage,
the token marking the edge from R.1+ to R.e-/2 must be moved to p4.
Since the STG is the same between the two (excepting initial marking),
only one version is shown for clarity.
Figure 9 : FBIG Controller STG
3.2.3.2 L.0 Path
The FBIG STG can be understood in two parts. The first part comprises the
transitions caused by a control token, or L.0+ transition. In this case, all
fanins are indicating that no new data is being sent, so the current GMLD
stage does not need to re-evaluate. In this case, the FBIG controller will
quickly assert R.0+ afterwards. This is to propagate a control request on to
the fanout stages, promoting a low forward latency in this regime. Once
an R.0+ request is asserted to succeeding stages, the controller will
continue the handshake on the left side. Concurrently, the controller will
also handshake on the right, until the R.e+ transition occurs, and the right
channel becomes empty.
This control token operating regime is very simple and has a fast LCT,
described in greater detail below. The FBIG controller does not drive any
of the data path pre-charge or enable lines, and all data path logic
remains static during this handshake. Consequently, it is most desirable to
have GMLD stages process control tokens instead of data tokens as
frequently as possible. Not only does processing a control token save
switching power in the current stage, but this control token will propagate
to all fanout stages as well. Having an intelligently driven enable bit to
promote frequent control tokens potentially improves many stages of
GMLD logic.
3.2.3.3 L.1 Path
The second and more sophisticated part of the FBIG behavior is the data
token regime. This sequence begins with a data update request, or L.1+
transition. L.1+ immediately causes two transitions to fire: en+ and pc-.
The en+ transition is critical for fast GMLD operation. It is imperative that
the en+ transition take place as soon as the current stage knows that at
least one fanin stage requires an update. If one or more fanins sends a
new data token, then the current stage must be recalculated (by
asserting en+). Data cannot propagate through the data path logic
unless the en signal is high. Consequently, this signal must be driven as
early as possible to avoid introducing additional delay in the forward
latency of the stage.
Additionally, the current stage must immediately pre-charge the V_LOGIC
as soon as it is known that a data token is being requested. Speed is
important in this case, because data may already be available for
evaluation in the data path. The intent is to pre-charge the V_LOGIC,
removing the existing, persistent data, and transition into an evaluate
state as fast as the FBIG controller will allow. The STG for the controller
shows this behavior; first by cycling pc from low to high, next by driving
eval low to enable evaluation. Because these transitions must occur in a
sequence, beginning this sequence early is important to obtain fast
forward latency.
The FBIG controller should also assert R.1+ with minimal delay. The
persistent data feature of GMLD, however, requires rigor to ensure QDI
timing requirements are enforced. With persistent data behavior, there is
still valid data on the data rails of the stage outputs when the FBIG
controller sees L.1+. It is important not to assert R.1+ too early, or else
fanout stages will potentially consume data having the previous incorrect
value. This can be prevented by waiting until the V output of the
V_LOGIC transitions low, indicating the data rails have been reset to
neutral. R.1+ can then be asserted, and fanout stages will capture the
next valid data driven by the current stage. It is important to note that this
behavior also stresses the implied neutrality constraint of the original MLD
design. If this simple timing assumption is met, it will prevent fanouts from
capturing stale data.
3.2.3.4 Gate Model
The FBIG controller is implemented using a tool called Petrify. Petrify is an
asynchronous synthesis tool which uses Signal Transition Graphs as a circuit
specification [18]. Petrify then produces several possible implementations
of this STG, depending on user preferences. It is capable of generating
either a generalized C-element netlist or a primitive gate model. The
model used to simulate GMLD to generate results in Chapter 5 is the
primitive gate model. However, while this output is simple to implement in
gates, it is neither timing nor area competitive. Consequently,
transition timings consistent with a transistor level implementation
were assigned to each equation. Each equation could be re-implemented as a
custom transistor model; the MLD FBI circuit is implemented in this fashion.
To form a reasonable comparison between the two controllers, similar
reasonable delays are assigned to each gate level equation. An efficient
transistor FBIG model was desired for the simulations performed in this
research, but insufficient time was available to implement and test this
version of the FBIG controller.
During Petrify synthesis, additional states were required to create a
Complete State Coding (CSC) [18]. These states were added during
Petrify’s automatic synthesis. These extra states add to the GMLD local
cycle time (as tested), and in theory are candidates for optimization. It is
possible that a human optimized version of the FBIG controller may
remove one or all of these states, improving the measured LCT of a GMLD
stage. For the purposes of the measurements in this research, Petrify’s
output will be considered the reference implementation. Appendix A
contains an STG that has been annotated with the extra CSC transitions
required by Petrify.
The gate logic equations used to represent the FBIG controller are shown
in Figure 10.
[Le] = en' (csc1 + L0) (csc0' + V');
[en] = en csc1 + L1;
[eval] = csc0 V';
[pc] = L1' en' + V' + csc0 + R1 + R0 + Re';
[R0] = csc1 csc2' (Re en' R1' + R0) + Re R0;
[R1] = Re R1 + V' + eval;
[csc0] = V' pc R1 + csc0 (csc1 + en) + L0;
[csc1] = en' R0' csc1 + csc2 + V';
[csc2] = csc0' L0' + R0 csc2;
Figure 10 : FBIG Gate Equations
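To make the notation of Figure 10 concrete, the equations can be transcribed into executable form. The Python sketch below is an illustration only. It assumes the usual reading of Petrify's equation output (juxtaposition is AND, + is OR, a trailing ' is complement, and the bracketed name is the signal being driven), and it transcribes the equations literally as printed. The function name and the dictionary representation of the state are hypothetical.

```python
def fbig_next_values(s):
    """Evaluate each Figure 10 equation once for the state `s`.

    `s` maps every signal name to a bool. The result maps each driven
    signal to the value its equation produces in that state.
    """
    L0, L1, Re, V = s["L0"], s["L1"], s["Re"], s["V"]
    en, ev, pc = s["en"], s["eval"], s["pc"]
    R0, R1 = s["R0"], s["R1"]
    c0, c1, c2 = s["csc0"], s["csc1"], s["csc2"]
    return {
        "Le": (not en) and (c1 or L0) and ((not c0) or (not V)),
        "en": (en and c1) or L1,
        "eval": c0 and (not V),
        "pc": ((not L1) and (not en)) or (not V) or c0 or R1 or R0 or (not Re),
        "R0": (c1 and (not c2) and ((Re and (not en) and (not R1)) or R0))
              or (Re and R0),
        "R1": (Re and R1) or (not V) or ev,
        "csc0": ((not V) and pc and R1) or (c0 and (c1 or en)) or L0,
        "csc1": ((not en) and (not R0) and c1) or c2 or (not V),
        "csc2": ((not c0) and (not L0)) or (R0 and c2),
    }


SIGNALS = ["L0", "L1", "Re", "V", "Le", "en", "eval", "pc",
           "R0", "R1", "csc0", "csc1", "csc2"]
# An arbitrary state chosen only to exercise the equations; it is not
# claimed to be a reachable state of the controller.
state = dict.fromkeys(SIGNALS, False)
print(fbig_next_values(state))
```

A signal whose computed value differs from its current value is excited in that state. Stepping such a model is one simple way to cross-check a hand-optimized controller against the Petrify reference output.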
3.2.4 Forks and Joins
Supporting forks is very straightforward for both MLD and GMLD. A fork
occurs when the data rails of one stage terminate in at least two separate
fanout stages. When this case occurs, the acknowledgement signal
feeding the driving stage’s R.e port must reflect the acknowledgement
status of all fanouts. Aggregating the acknowledgements using a C-
element achieves this. Both MLD and GMLD have a single
acknowledgement signal, so a C-element is sufficient for both templates.
Joins, where a stage has multiple fanins, require special handling in the
GMLD template. The request is
implemented by a 1-of-2 channel, so aggregating multiple requests cannot
be accomplished using simple C-elements. Joins are
implemented using a special cell, simply called a JOIN. A JOIN element
uses the 1-of-2 inputs from all fanins, and generates a 1-of-2 request for a
single GMLD stage. The behavior of the JOIN is best illustrated using a
table of values, shown in Table 3.
Fanin Description          Request Lines   Fanins   Output Request
All control tokens         L[1:0] = 01     All      R[1:0] = 01
At least one data token    L[1:0] = 10     >= 1     R[1:0] = 10
All data tokens            L[1:0] = 10     All      R[1:0] = 10
none                       L[1:0] = 00     All      R[1:0] = 00
Table 3 : Join Request Behavior
The JOIN cell outputs are driven by an OR-functionality rule. If all inputs
are control tokens, the output will signal a control token. If one or more
inputs are data tokens, then the output is forced to signal a data
request. A JOIN cell will be customized according to the number of fanins
supported. In the Proteus tool flow described in Chapter 5, JOIN cells from
2 to 4 fanins were used. GMLD stages requiring more than 4 fanin ports
used cascaded JOIN cells. As a rule, cells which have large fanin widths
will incur higher delays.
A diagram of a JOIN2 cell is shown in the next section below, in Figure 11.
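The rows of Table 3 can be expressed as a small function. The Python sketch below is a behavioral illustration, not the JOIN circuit: the names are hypothetical, each request is modeled as a (bit 1, bit 0) pair in the L[1:0] order of the table, and only the steady states listed in the table are covered. States in which some fanins have arrived and others are still neutral are not described by the table, so the sketch rejects them rather than guessing.

```python
CONTROL = (0, 1)  # L[1:0] = 01, request with no data
DATA = (1, 0)     # L[1:0] = 10, request with update
NEUTRAL = (0, 0)  # L[1:0] = 00, no request


def join(fanins):
    """Combine the 1-of-2 requests of all fanins into one output request."""
    if any(f not in (CONTROL, DATA, NEUTRAL) for f in fanins):
        raise ValueError("illegal 1-of-2 code")
    if all(f == NEUTRAL for f in fanins):
        return NEUTRAL
    if NEUTRAL in fanins:
        raise ValueError("partial arrival is not covered by Table 3")
    # OR-functionality: a single data token forces a data request.
    return DATA if DATA in fanins else CONTROL


print(join([CONTROL, CONTROL]))  # all control tokens
print(join([CONTROL, DATA]))     # at least one data token
```

A cell for a fixed number of fanins, such as JOIN2, corresponds to calling the function with a list of that length.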
3.2.5 Enables
GMLD requires the integration of a data path signal into the control path
functionality. Signals which are designated as enables for particular
stages must have their values introduced into the control path requests.
We call these special enables G Enables, or gating signals. G Enables
are computed no differently from any other data path signal – the data
rails are driven as a stage output like any other output. Additionally, G
Enables can drive any number of regular data path logic cells. The one
requirement is that they drive a special cell used to factor the G
Enable value into the control path. This cell is part of the GMLD
primitive library, and is called a JOING cell.
3.2.5.1 JOING Cells
GMLD drives a control request according to the value of a G Enable bit.
To facilitate this, GMLD uses a special cell to merge enable bit signals into
the control path. These cells are called JOING cells, for join-gated. These
cells have all the functionality of a JOIN cell, but also have two
additional input wires for a dual-rail enable bit. The circuit designer, or the
clustering tool (such as Proteus), must assign this signal specifically to drive
the JOING cell. This signal may have other fanouts in addition to the
JOING. The behavior of the JOING cell is best illustrated in Table 4, in a
format similar to the JOIN.
Fanin Description          G Enable value   Inputs Value   Fanins   Outputs
All control tokens         G[1:0] = 10      L[1:0] = 01    All      R[1:0] = 01
At least one data token    G[1:0] = 10      L[1:0] = 10    >= 1     R[1:0] = 10
All data tokens            G[1:0] = 10      L[1:0] = 10    All      R[1:0] = 10
All control tokens         G[1:0] = 01      L[1:0] = 01    All      R[1:0] = 01
At least one data token    G[1:0] = 01      L[1:0] = 10    >= 1     R[1:0] = 01
All data tokens            G[1:0] = 01      L[1:0] = 10    All      R[1:0] = 01
none                       G[1:0] = 00      L[1:0] = 00    All      R[1:0] = 00
Table 4 : JOING Request Behavior
A JOING cell maintains the OR-functionality of all fanin request inputs, but
adds another layer of AND-functionality afterwards. The logical OR of all
fanin requests is then ANDed with the G bit value. A G bit value of zero
will always result in a control request output, while a G bit value of one
passes the ORed fanin request through unchanged, as the rows of Table 4
show. Since the G bit input is driven from
a data path cell, its data rails will not immediately return to neutral after
an acknowledge. This is the result of the persistent data property.
Accordingly, the JOING cell may drive its output requests to neutral only
after all request bits have transitioned to neutral. It must not wait for
neutrality on the G bit input. Figure 11 below shows the block diagram of
a JOING2 element.
Figure 11 : JOIN2 and JOING2 block diagrams (JOIN2: inputs L1[1:0] and L0[1:0], output R[1:0]; JOING2: inputs L1[1:0], L0[1:0] and G[1:0], output R[1:0])
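The JOING behavior in Table 4, together with the neutrality rule stated above, can be sketched in the same style. This Python fragment is illustrative only: the names are hypothetical, requests and the G bit are modeled as (bit 1, bit 0) pairs in the order used by the table, and states not listed in the table raise an error instead of being guessed.

```python
CONTROL, DATA, NEUTRAL = (0, 1), (1, 0), (0, 0)  # request codes, [1:0] order
G_ZERO, G_ONE = (0, 1), (1, 0)                   # dual-rail G bit values


def joing(g, fanins):
    """OR the fanin requests, then AND the result with the G bit."""
    if any(f not in (CONTROL, DATA, NEUTRAL) for f in fanins):
        raise ValueError("illegal 1-of-2 code")
    if all(f == NEUTRAL for f in fanins):
        # The output returns to neutral as soon as all requests are neutral.
        # It does not wait for G, whose persistent data may still be valid.
        return NEUTRAL
    if NEUTRAL in fanins:
        raise ValueError("partial arrival is not covered by Table 4")
    if g == G_ZERO:
        return CONTROL  # gated: always a control request
    if g == G_ONE:
        return DATA if DATA in fanins else CONTROL
    raise ValueError("G must be valid when requests are present")


print(joing(G_ZERO, [DATA, DATA]))        # gating suppresses the update
print(joing(G_ONE, [CONTROL, DATA]))      # update passes through
print(joing(G_ONE, [NEUTRAL, NEUTRAL]))   # neutral despite persistent G
```

The third call shows the consequence of persistent data: the G rails still hold a valid value, yet the output request is already neutral.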
The effect of a G Enable propagates beyond the stage it directly gates,
affecting subsequent stages. When a stage’s data path is disabled with a
control token, this control token will propagate to all fanouts. This is a
desirable property, as it allows a single gating event to affect switching
activity for many successive stages. The limitation on this behavior is
caused by the JOIN cells. When multiple fanins drive a GMLD stage, the
stage must re-evaluate its data path when one or more inputs signal an
update (data token). In this case, control tokens may cease to
propagate when they must be joined with other data tokens. For this
reason, it is desirable to limit the number of fanins in each stage. The
more fanins a stage has, the greater the probability that the stage will
be forced to process a data token, and kill any control token requests.
This becomes an important constraint during automated pipeline
optimization.
3.2.6 Correctness
This thesis does not offer a proof of correctness for the GMLD template.
Nonetheless, several important criteria of usefulness can be observed from
the simulation results. Correctness can be inferred from successful long
term simulation. A proof of flow-equivalence, similar to Linder and
Cortadella et al.’s approach [33][19], is being composed through
separate research. The Full-Buffer Isolate nature of the FBIG controller
enforces safeness, which prevents multiple tokens from occupying a single
channel. Liveness is enforced by an automated de-synchronization tool,
and essentially means that there exists a sequence of evaluations which
allows any other circuit stage to fire again. Deadlock can occur when the
circuit reaches a state in which no token may flow any further. A proof of
deadlock freedom is likewise not shown in this research, although one is
being established in a separate research effort at USC.
3.2.7 Cycle Time
GMLD has a cycle time similar to that of the MLD template in a linear pipeline.
The cycle time becomes more divergent when a larger number of fanins
are used per stage, which will be explained in greater detail below. The
measured cycle times are also a product of the automatic Petrify
synthesis of the STG specification. It is possible with greater human
optimization to reduce both the forward and backward latencies,
improving the cycle time. There are also two operating regimes for the
GMLD template, one for control tokens and another for data tokens.
Each will be described separately.
3.2.7.1 Control Tokens
The local cycle time (LCT) for a linear GMLD pipeline during a control
request is 16 transitions. The forward latency (FL) is 4 transitions, and the
backward latency (BL) is 12 transitions. When control tokens propagate
through the stage, the forward latency is entirely determined by the
control path. This behavior offers an excellent opportunity to potentially
improve the evaluation speed of a circuit, if control tokens are issued
suitably often. In this regime, the presence of JOIN/G cells will increase
the FL, while the presence of merge C-elements will increase the BL. The
equations governing FL and BL are shown below in Figure 12.
$FL_C = 4 + 2\lceil \log_2 I \rceil + 2G$
$BL_C = 12 + 2\lceil \log_2 O \rceil$
$LCT_C = FL_C + BL_C$
Legend: I = Number of fanin channels
O = Number of fanout channels
G = 0 if no gating signal, 1 if stage is gated
Figure 12 : Control Token Regime Delays
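As a worked check of the Figure 12 equations, the short Python sketch below evaluates them for a given stage. The function names are hypothetical. The ceiling of the logarithm is computed with integer arithmetic so that exact powers of the base are not disturbed by floating point rounding.

```python
def ceil_log(x, base):
    """Smallest integer k such that base**k >= x, for x >= 1."""
    if x < 1:
        raise ValueError("x must be at least 1")
    k, power = 0, 1
    while power < x:
        power *= base
        k += 1
    return k


def control_token_delays(fanins, fanouts, gated):
    """Return (FL_C, BL_C, LCT_C) in transitions, following Figure 12."""
    fl = 4 + 2 * ceil_log(fanins, 2) + 2 * (1 if gated else 0)
    bl = 12 + 2 * ceil_log(fanouts, 2)
    return fl, bl, fl + bl


print(control_token_delays(fanins=1, fanouts=1, gated=False))
print(control_token_delays(fanins=4, fanouts=2, gated=True))
```

For a linear pipeline (one fanin, one fanout, no gating) the result is FL = 4, BL = 12 and LCT = 16, in agreement with the values quoted in the text. A gated stage with four fanins and two fanouts evaluates to FL = 10, BL = 14 and LCT = 24.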
3.2.7.2 Data Tokens
The local cycle time for a linear GMLD pipeline during a data request is 32 transitions. It has a
forward latency of 6 for a minimum depth data path, and a backward
latency of 26 transitions. The data path does have a timing dependency
on the control path, which must be considered when estimating LCT. As
detailed in the FBIG controller section, the en+ transition must occur
before data path evaluation can begin. This means that the data path
may potentially wait for all request signals to arrive, if the delays on the
request paths are longer than the data rails. The FL is thus a maximum of
either the data path delays or the control path delays, including any join
elements. The equations measuring FL and BL are described in Figure 13
below. Note that these equations are simplified using time measurements
approximated in simulation (Chapter 4). Fully generalized versions of
these equations are listed in Appendix B.
$D_{data} = 2(n-1) + 2$
$D_{ctrl} = (2\lceil \log_2 I \rceil + 4G) + 6 + 2\lceil \log_8 Z \rceil$
$FL_D = \max(D_{data}, D_{ctrl})$
$BL_D = 2\lceil \log_2 O \rceil + 26$
$LCT_D = FL_D + BL_D$
Legend: I = Number of fanin channels
O = Number of fanout channels
G = 0 if no gating signal, 1 if stage is gated
Z = Number of dual-rail outputs
n = Depth of domino logic
Figure 13 : Data Token Regime Delays
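The Figure 13 equations can be evaluated in the same way. The Python sketch below is an illustration with hypothetical function names; it uses an integer ceiling-of-logarithm helper so that the base 8 term is exact.

```python
def ceil_log(x, base):
    """Smallest integer k such that base**k >= x, for x >= 1."""
    if x < 1:
        raise ValueError("x must be at least 1")
    k, power = 0, 1
    while power < x:
        power *= base
        k += 1
    return k


def data_token_delays(depth, fanins, fanouts, gated, outputs):
    """Return (FL_D, BL_D, LCT_D) in transitions, following Figure 13."""
    g = 1 if gated else 0
    d_data = 2 * (depth - 1) + 2
    d_ctrl = (2 * ceil_log(fanins, 2) + 4 * g) + 6 + 2 * ceil_log(outputs, 8)
    fl = max(d_data, d_ctrl)  # the data path may have to wait for en+
    bl = 2 * ceil_log(fanouts, 2) + 26
    return fl, bl, fl + bl


print(data_token_delays(depth=1, fanins=1, fanouts=1, gated=False, outputs=1))
print(data_token_delays(depth=5, fanins=1, fanouts=1, gated=False, outputs=1))
```

The first call reproduces the linear pipeline figures from the text (FL = 6, BL = 26, LCT = 32). In that case the control path delay of 6 dominates the data path delay of 2. The second call shows the data path taking over once the logic depth reaches 5, giving FL = 10 and LCT = 36.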
3.2.7.3 Limitations
There exists a limitation in the throughput of the GMLD template. This
throughput limitation is caused by the forward latency of the control path.
By inspecting the FL defined in Figure 13, we can see that FL is dependent
on the number of fanins to a stage. For cases where stages have a large
number of fanins, and thus a large input join delay, this can
significantly degrade FL and consequently the LCT. This problem is
exacerbated when the arrival times of the fanin stages are disjoint. Cases
where one fanin channel has a late arrival time will withhold data path
evaluation for a significant amount of time, compared to the delays
exhibited by the MLD template. This LCT problem is explored in detail in
Chapter 5.
4 De-Synchronization Design Flow
4.1 Introduction
An active area of research in the asynchronous design community is a
robust form of synchronous circuit de-synchronization. We believe that
using de-synchronization over high level synthesis allows access to a vast
set of existing designs, as well as simpler integration with EDA tools. The
critical step in this process is de-synchronization itself, performed by an
automated tool. It is our goal to clearly demonstrate the benefits of
asynchronous design over an equivalent synchronous one, and eventually
achieve widespread usage of asynchronous circuits.
4.1.1 EDA Integration
VLSI engineers primarily specify designs at the Register Transfer Level (RTL).
This usually involves using Hardware Description Languages (HDLs) like
Verilog or VHDL to describe the design. Therefore, in order to have a
reasonable chance at introducing an asynchronous methodology to VLSI
engineers, maximum compatibility with this input format is required. It
follows that a de-synchronization process which preserves an HDL level RTL
description of the design is the most likely to enjoy good commercial
success.
Our objective is to provide compatibility with existing EDA tools, and use a
process familiar to VLSI engineers. Our design flow follows the established
synchronous flow of a high level description, technology implementation,
place and route, and timing closure. We would like engineers to be able
to build on their knowledge of synchronous circuits, and apply the same
basic techniques in an asynchronous design flow. Additionally, we avoid
transformations which are practically sub-optimal, unfriendly to
optimization, or un-repeatable. Indeed, a good criterion for our success
would be to disturb the existing synchronous design flow as little as
possible.
4.1.2 Automatic De-Synchronization
The GMLD and MLD templates introduced in this research are intended for
automatic de-synchronization. This property does not preclude the
designer from creating pipeline stages manually. Nevertheless, the local
cycle time associated with these templates means a significant number of
primitive gates can be clustered into a single stage, without adversely
impacting local throughput. With this affinity in mind, we endeavor to
create an automatic gate clustering tool which is capable of high quality
automatic pipeline optimization.
We develop a tool called Proteus for exactly this purpose. Proteus is
designed to perform a complete synthesis of a regular gate level
synchronous netlist, and produce a gate level asynchronous netlist.
Proteus currently supports three target asynchronous template designs:
Pre-charge Half Buffer (PCHB) [34], MLD, and GMLD. Proteus depends on
a standard RTL synthesis step, using any commercial logic synthesis tool.
At present, it will handle designs with only one clock domain, and one
reset, either synchronous or asynchronous. The tool can be conceived as
running in two phases. The first phase analyzes all gate primitives, and
determines a nearly-optimal clustering for the entire design. The second
phase takes the clustered design and generates an asynchronous
mapping.
The block diagram shown in Figure 14 shows the flow we propose to use
for de-synchronizing regular synchronous designs.
Figure 14 : Proteus Design Flow
4.2 Design Entry and Synthesis
A candidate design begins as an RTL description of the circuit. Since the
first tool flow step is synthesis using a standard EDA tool, either Verilog or
VHDL is supported. During synthesis, a special technology library is used as
the synthesis target. This technology library is a single-rail image library,
that has a matching asynchronous dual-rail implementation library. The
image library uses regular binary inputs and outputs on all cells, and
represents the combinational function for every leaf cell available in the
asynchronous library. Additionally, accurate timing and capacitance
information is annotated to the image library, so the synthesis tool can
perform load and timing based tradeoffs and improvements.
When the synthesis tool has completed, a Verilog gate level netlist is
produced. This gate level netlist will be hierarchically flat, and contain
single-rail instantiations of cells which are in the asynchronous library.
Behavioral constructs are not permitted in this file, because they cannot
be implemented by Proteus. Wires cannot have more than one driver, and
no tri-state nodes are supported.
4.3 Clustering Algorithm
To enable the clustering algorithm to have good quality results, a cycle
time analysis (CTA) is first performed on the single-rail netlist. CTA
annotates the arrival times of all logic gates with an integer number,
equal to the number of primitive gates between that cell and its first
flip-flop fanin. For gates with many inputs, the lowest arrival time is used. This
process will allow Proteus to evaluate if two clusters are good candidates
for merging, if their arrival times are similar. During cluster formation,
Proteus uses a range of arrival times for each cluster, which spans the
earliest to the latest arrival time for gates in the cluster data path.
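The annotation step described above can be sketched as follows. This is a minimal Python illustration and not Proteus code. The netlist representation (a dictionary mapping each gate to the names of its drivers), the function names, and the counting convention (a flip-flop output has arrival time 0 and each gate adds 1) are all assumptions made for the example. The netlist is assumed to be acyclic between flip-flops, with every gate having at least one driver.

```python
def arrival_times(fanins, flops):
    """Annotate each gate with an integer arrival time.

    fanins: dict mapping gate name -> list of driver names
    flops:  set of flip-flop names, treated as arrival time 0
    """
    memo = {}

    def level(node):
        if node in flops:
            return 0
        if node not in memo:
            # For gates with many inputs, the lowest arrival time is used.
            memo[node] = 1 + min(level(driver) for driver in fanins[node])
        return memo[node]

    return {gate: level(gate) for gate in fanins}


def cluster_range(cluster, arrivals):
    """Earliest and latest arrival time among the gates of a cluster."""
    times = [arrivals[gate] for gate in cluster]
    return min(times), max(times)


netlist = {"g1": ["ff1"], "g2": ["g1", "ff2"], "g3": ["g1"]}
arrivals = arrival_times(netlist, flops={"ff1", "ff2"})
print(arrivals)
print(cluster_range({"g1", "g3"}, arrivals))
```

Two clusters whose ranges overlap or lie close together would then be reasonable merge candidates in the sense described above.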
4.3.1 Algorithm Details
At the heart of the Proteus tool is the logic clustering algorithm. This
algorithm is the product of research by doctoral candidate Georgios
Dimou at USC. The clustering algorithm is fundamentally an iterative
greedy algorithm. The clustering algorithm begins by placing
every logic cell in its own pipeline stage. It then computes the result of
merging each pipeline stage with every other stage, for all legal merges.
Constraints are placed on allowable merges by setting limits on design
attributes. Valid attributes to constrain include :
• Local cycle time
• Maximum channel fanin
• Maximum channel fanout
• Maximum data path logic depth
• Maximum data path logic width
• Delays of C-elements
The clustering algorithm continues to evaluate the benefits of merging
stages, until no valid merges are left.
The algorithm is currently structured to use an exhaustive search. For
every cluster in the circuit, Proteus will evaluate the effect of merging that
cluster with all other adjacent clusters. During each evaluation, all
user-directed attributes shown above are tested. If any one constraint
violates a user specified limit, that possible merge is not considered. When
all possible valid merges which obey user constraints are accumulated at
the end of one iteration, the merge with the most favorable speed and
area characteristics is chosen. This move is chosen even if the
speed or area improvement is negative, as long as it yields better results
than its competitors. The move is then implemented, and the circuit
structure updated. A new iteration will begin, using the same strategy.
When all possible merges are exhausted, the terminal condition is met.
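The overall shape of this loop can be sketched in a few lines. The Python fragment below is a simplified illustration under stated assumptions, not the Proteus algorithm itself. A single size limit stands in for the full list of user constraints, the scoring function is supplied by the caller, and all names are hypothetical.

```python
from itertools import combinations


def greedy_cluster(cells, edges, max_size, score):
    """Repeatedly apply the best legal merge until none remains.

    cells:    iterable of cell names, each starting in its own cluster
    edges:    list of (driver, load) pairs defining adjacency
    max_size: stand-in for the user constraints (hypothetical)
    score:    function(merged_cluster) -> number, higher is better
    """
    clusters = [frozenset([c]) for c in cells]

    def adjacent(a, b):
        return any((x in a and y in b) or (x in b and y in a) for x, y in edges)

    while True:
        candidates = [
            (score(a | b), i, j)
            for (i, a), (j, b) in combinations(enumerate(clusters), 2)
            if adjacent(a, b) and len(a | b) <= max_size
        ]
        if not candidates:
            return clusters  # terminal condition: no legal merge is left
        # The best move is taken even if its score is poor, as long as it
        # beats the other candidates of this iteration.
        _, i, j = max(candidates)
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)


result = greedy_cluster(
    cells=["a", "b", "c", "d"],
    edges=[("a", "b"), ("b", "c"), ("c", "d")],
    max_size=2,
    score=len,
)
print(sorted(sorted(c) for c in result))
```

Each iteration reduces the number of clusters by one, so the loop always terminates. In a real flow the legality test would check every constrained attribute and the score would reflect speed and area.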
4.3.2 Generality
The clustering algorithm uses user-selectable parameters in order to better
guide its clustering for specific asynchronous templates. Pre-charge Half
Buffer, for example, requires very small clusters to be effective at high-
throughput. MLD and GMLD require relatively large clusters to best exploit
their local cycle time. Templates with lower (faster) cycle times should
ideally need small clusters, and thus require different constraints on cluster
sizing. Proteus will endeavor to match the data path logic depth with the
ideal forward latency of each asynchronous template. Proteus is
designed to generate high quality results regardless of the target
template FL delay.
4.4 Slack Matching
The slack matching process is explained in Chapter 3. Slack matching is a
very important part of the asynchronous design flow, because it ensures
the resulting circuit has high throughput. Even with a nearly-optimal
pipeline optimization, there could be troublesome slack mismatches in the
circuit. Slack matching will correct any slack imbalances, and create a
circuit capable of running at a user-specified global cycle time. Proteus
performs a slack matching pass on the clustered netlist, and will use the
results from the slack matching algorithm to insert buffer clusters wherever
necessary. Proteus will add these buffer clusters to the circuit, and ensure
all wires are connected properly.
4.5 Template Mapping
When Proteus has optimized the clustering and performed slack matching
on the circuit, the final step of template mapping will occur. The circuit is
represented as a single-rail circuit until this point. Proteus will take this
single-rail representation and convert it to the channel style dictated by
the asynchronous template type. At present, only dual-rail templates are
supported, although in theory others can be used. Each wire is
encapsulated in a channel; these channels are created during clustering
to represent connections among logic clusters. Each cluster will be
implemented as a template stage, and all wires needed for handshaking
are added. In some cases, additional modifications are required by a
particular template style. Proteus will perform these modifications during
the template mapping stage. For example, the last level of domino logic
driving cluster outputs must be V_LOGIC, so Proteus will convert the last
domino cells to V_LOGIC variants.
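As a rough illustration of the last-level conversion, the sketch below swaps every cell that drives a cluster output for a V_LOGIC variant. The `Cell` record, the library cell names, and the `_VLOGIC` suffix are all hypothetical; the actual Proteus data model and cell naming are not given here.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Cell:
    name: str
    cell_type: str               # hypothetical library name, e.g. "DOMINO_AND2"
    drives_cluster_output: bool


def map_output_cells(cells: List[Cell], suffix: str = "_VLOGIC") -> List[Cell]:
    """Replace cells that drive cluster outputs with their V_LOGIC variant."""
    return [
        replace(c, cell_type=c.cell_type + suffix)
        if c.drives_cluster_output and not c.cell_type.endswith(suffix)
        else c
        for c in cells
    ]


cluster = [
    Cell("u1", "DOMINO_AND2", drives_cluster_output=False),
    Cell("u2", "DOMINO_OR2", drives_cluster_output=True),
]
mapped = map_output_cells(cluster)
```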
4.6 GMLD Enhancements
GMLD required two specific enhancements to the Proteus tool. The first
was an additional constraint to the clustering algorithm, affecting legal
merge candidates. The generation and usage of the G Enable signal
requires that the driving cluster, which sources the G Enable, be disjoint
from the load cluster, which sinks this signal. For every cluster which is
gated by a G Enable, the generation of this signal must not be inside the
data path logic of that stage. This constraint is accomplished by tracking
which clusters are gated by G Enables, and preventing any move from
being considered which violates this separation.
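A minimal sketch of this legality check follows. It assumes a hypothetical bookkeeping structure, `g_enable_driver`, that maps each gated gate to the gate generating its G Enable; the names and data layout are illustrative and not taken from Proteus.

```python
from typing import Dict, FrozenSet

Cluster = FrozenSet[str]


def merge_is_legal(a: Cluster, b: Cluster, g_enable_driver: Dict[str, str]) -> bool:
    """Reject a merge that would place a G Enable's generating gate in the
    same cluster as a gate that this G Enable gates."""
    merged = a | b
    for gated_gate, driver_gate in g_enable_driver.items():
        if gated_gate in merged and driver_gate in merged:
            return False
    return True


# Hypothetical example: gate "reg_q" is gated by a G Enable produced by "en_logic".
drivers = {"reg_q": "en_logic"}
legal_move = merge_is_legal(frozenset({"reg_q"}), frozenset({"add0"}), drivers)
illegal_move = merge_is_legal(frozenset({"reg_q"}), frozenset({"en_logic"}), drivers)
```

A check of this kind would be applied to each candidate move before it is scored, so violating moves are never considered.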
Second, Proteus must emit a GMLD template circuit. GMLD is
substantially similar in structure to MLD, so much of the template mapping
code is reused. Additional subtleties such as instantiation of JOIN/G
blocks were relatively straightforward to implement.
4.7 Conclusion
Proteus is a powerful tool which enables automatic de-synchronization of
synchronous circuits. It achieves the goal of re-using existing, verified
synchronous designs, and exploiting the throughput and switching
benefits of asynchronous templates. It is compatible with many
commercial synthesis tools, and can be integrated without difficulty into a
production circuit design flow. The clustering algorithm achieves good
results, subject to user supplied constraints. Proteus’ design is easily
extended to support the GMLD template.
5 Experimental Results
5.1 Throughput
To measure the local cycle time of the GMLD template, several test
benches were used. In all test benches, the environment must be infinitely
fast (zero delay model), so that all delays occur from the transitions
associated with each GMLD stage. Each of the following test benches
was constructed by hand, rather than using Proteus. Since we know that
data path depth plays a large role in local cycle time, it is more
informative to use minimum depth data paths, to better gauge the
impact of the control path. Additionally, we know from inspection that a
control request is significantly faster than a data request. Accordingly, it is
more useful to measure data requests for each test bench, to determine if
GMLD is a competitive template choice in its slowest regime.
5.1.1 Linear
To measure the base LCT, a linear pipeline was constructed. The linear
pipeline measures the minimum LCT of a template, with no potential join
or fork penalties. A diagram of the linear pipeline test bench is shown in
Figure 15. GMLD operates in two regimes, a control token request and a
data token request. Each regime is tested separately. Control tokens
propagate more quickly than data tokens, due to their independence
from the data path logic. The measured behavior of a linear GMLD
pipeline is shown below in Table 5, along with the latencies of the MLD
template, for reference.
Figure 15 : Linear Pipeline
Table 5 : Latencies of MLD and GMLD Templates
Delay (transitions)   MLD   GMLD (Control Token)   GMLD (Data Token)
Forward Latency         2                      4                   6
Backward Latency       16                     12                  26
Local Cycle Time       18                     16                  32
5.1.2 Fork and Join
Another useful structure for testing template behavior is a pipeline fork
and join. This structure is defined by a stage having outputs which fork to
two branches, and then re-converge some number of stages later. For
the purposes of testing, it is useful to create an imbalance in the depth of
each branch. This imbalance will stress the penalties incurred by a large
forward latency of the asynchronous template used. The test bench
structure used for this test is shown below in Figure 16. From this fork/join
structure, we can generate a Full-Buffer Channel Net (FBCN) model of this
circuit [35]. Figure 17 below shows the FBCN model.
Figure 16 : Fork/Join Test Pipeline
Figure 17 : FBCN Model of Figure 16
The FBCN model shows us what the global cycle time of the circuit should
be. Forward places are denoted as circles, and backward places are
denoted with squares. For the fork node, 2 transitions are added to the
backward latency, because the acknowledges must propagate through
a C-element before reaching the fork node. For the join node, 2 is added
to the forward latencies of the fanins, because a request must propagate
through a JOIN element. The FBCN model indicates we should have a
critical cycle of 6 + 6 + 6 + 8 + 28 = 54. Simulation agrees with this model.
By comparison, MLD has a critical cycle of 2 + 2 + 2 + 4 + 18 = 28. This
represents a significant global cycle time penalty caused by GMLD. This
behavior is clearly undesirable, but appears difficult to avoid. GMLD has
the unfortunate restriction of having the control path affect the forward
latency, a trait that does not restrict MLD. GMLD fundamentally requires
an interaction between the data path (a G signal) and the control path.
This timing penalty is the result of this interaction, however unintentional.
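The critical-cycle arithmetic can be reproduced from the Table 5 latencies. The sketch below assumes the cycle decomposes into three plain forward latencies, one join-adjusted forward latency (FL + 2), and one fork-adjusted backward latency (BL + 2); this decomposition is our reading of the terms in the sums above, and the function name is illustrative.

```python
def fork_join_critical_cycle(fl: int, bl: int, penalty: int = 2,
                             plain_stages: int = 3) -> int:
    """Critical cycle, in transitions, for the fork/join test pipeline.

    plain_stages * FL   forward latencies of the ordinary stages
    FL + penalty        fanin of the join node (request passes a JOIN element)
    BL + penalty        fork node (acknowledges pass a C-element)
    """
    return plain_stages * fl + (fl + penalty) + (bl + penalty)


# Data-token latencies from Table 5.
gmld_cycle = fork_join_critical_cycle(fl=6, bl=26)   # 6 + 6 + 6 + 8 + 28
mld_cycle = fork_join_critical_cycle(fl=2, bl=16)    # 2 + 2 + 2 + 4 + 18
```

Every forward term triples when moving from MLD to GMLD, which is why the forward latency dominates the penalty.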
5.1.3 Ring
To further illustrate the cycle time limitations of the GMLD control path, a
ring structure can be investigated. A test bench was constructed which
implements the ring pipeline shown in Figure 18. This ring contains 7 buffer
stages, and 1 TOKBUF stage. Only one token circulates the ring. The
global cycle time is the time it takes the token to begin at any stage, and
propagate until it next reaches the same point again. The FBCN model of
this ring is shown in Figure 19. The FBCN model indicates a GCT of 6 * 8 =
48 transitions for GMLD. The simulation model matches this value. MLD
has a GCT of 2 * 8 = 16 transitions. In this case, GMLD is 3 times slower than
MLD. This is a clear performance deficiency in the current GMLD implementation.
Figure 18 : Ring Test Pipeline
Figure 19 : FBCN Model of Figure 18
The linear pipeline, fork/join, and ring structures allow us to accurately
measure the latencies of the GMLD asynchronous template. Given the
interactive nature of the data and control paths, it is important to fully
characterize the timing behavior of this template first, so we can
reason about its limitations in full scale circuits.
5.2 Microprocessor Test Design
In order to verify the functionality, performance, and power dissipation of
the GMLD template, we will choose a robust testbench. We prefer to
choose a pre-existing, verified, and understandable circuit on which to
evaluate Proteus, and the GMLD template. Since GMLD is intended as a
high-throughput asynchronous template, a processor core makes an
excellent candidate design. There are several such designs available for
non-commercial use on the OpenCores site [39]. Most of the OpenCores
CPU designs use existing instruction set architectures, and can be
reasonably well understood.
5.2.1 The tv80 Design
The design used for all GMLD testing is the “tv80” Verilog Z80 core. The
tv80 design is a Verilog port of an original VHDL Z80 design “t80” [28]. The
t80 design has been FPGA proven in several real applications, which
makes it an ideal base. The tv80 core is chosen because of its ideal size,
robust simulation tools, well-known instruction set architecture, and HDL
coding style capable of generating EDFF registers. Selecting a test design
among the OpenCores designs was based, in part, on the unusual
difficulty of finding a design which implies EDFFs during synthesis. Most
commercial synthesis tools require a specific construct in Verilog to imply
an EDFF, and surprisingly few designs are written to exploit this feature.
The tv80 is an 8-bit CPU, with some 16-bit instruction support. It contains 16
general purpose registers, and another 6 special purpose registers. It has
a 16-bit external address bus, and an 8-bit data bus. The tv80 project
source comes with a behavioral testbench which makes it easy to
simulate the tv80 design. The testbench will open a file with assembled
binary instructions, and load it into a simulation RAM accessible by the
processor core. This allows the testbench to easily run arbitrarily complex
assembled tv80 code. This feature is invaluable for testing a variety of
workloads on the tv80 implementation.
5.2.2 Design Synthesis
The tv80 design followed the tool flow presented in Chapter 4. The design
was synthesized using Synopsys Design Compiler™ (DC), against the MLD
technology image library. DC was given a frequency target to meet,
because gate delays annotated in the image library are realistic. This
ensures optimization on the critical path. Critical paths do become
performance bottlenecks in de-synchronized circuits, so effort invested on
this step will pay dividends during asynchronous optimization. The default
wire load model was used during synthesis. The resulting output
contained 3232 wires, and 3548 MLD gates.
The synthesized tv80 design was successfully de-synchronized using the
Proteus tool. The original tv80 design used synchronous reset, but was
converted to asynchronous reset in the source RTL to eliminate
combinational dependencies on the reset signal. Furthermore, the Design
Compiler’s buffer insertion did not perform well using image library load
information, so all buffer cells were removed before Proteus began its
clustering analysis. After buffer removal, cycle time analysis shows a
critical cycle of 25 transitions. This means there is a loop inside the design,
which is composed of 25 gates. Efforts to reduce the GCT lower than 25
will not succeed, because throughput will be determined by the length of
this loop, plus any control overhead.
5.2.3 Test Case Stimulus
We can take advantage of the robust testbench supplied with the tv80
design, and generate various tests to run on the tv80 design. Another
powerful tool in this process is the Small Device C Compiler (SDCC) [22].
This compiler supports the compilation of standard C code on small
microcontrollers, such as a Z80 compliant core. Using SDCC to compile
code for the tv80 was very straightforward; the compiled output is
translated into a text file format which the tv80 testbench can read.
Because these tests must be simulated at the gate level using a
Verilog simulator, they must be very compact.
Three different test cases were compiled using the SDCC, reflecting very
different types of workload. The first testcase is a CRC16 calculator. It
computes the 16-bit Cyclic Redundancy Check for an array of numbers,
and then checks this computation against a known solution as a self-
check. The second testcase is a very simple floating point calculation, with
a simple test before program exit. The last testcase is a selection sort
algorithm, which sorts several numbers. An initial version of this test
included a self-check before termination, but the version compiled for
energy estimation in this chapter ran without it to cut down on simulation
time.
5.2.4 Energy Analysis
The GMLD template is designed with the intention of improving the
dynamic switching characteristics of the MLD template. Therefore, we
would like to measure the energy improvement the GMLD features offer
over an MLD implementation. To accomplish this, we can compare the
MLD and GMLD energy numbers over a variety of testcases and design
clusterings. In order to establish an equivalence in the final clustering
between the two templates, the Proteus clustering algorithm was
modified to use identical constraints for both. In practice, MLD has one
less restriction on legal clusters, based on the absence of the G Enable
partitioning rule (Chapter 3).
Many different power simulations were run on several variations of the
tv80 design, using the three synthetic testcases. In order to speed up
simulation and determine maximum design throughput, a special
asynchronous testbench was crafted to stimulate the asynchronous tv80
core. For each testcase, a simulation was first conducted on the
behavioral synchronous RTL, using the original behavioral testbench.
During simulation, cycle-accurate vectors of stimulus and expects were
recorded into a text file at the tv80 core I/O boundary. Once these
stimulation vectors and expected outputs are recorded, they can be
replayed in the asynchronous testbench. The asynchronous testbench
reads each vector, one at a time, and drives all tv80 asynchronous inputs
to the correct vector value, and observes the resulting output. A
testcase must sequentially match all outputs in order to pass. This allows
the asynchronous tv80 design to run at the fastest possible speed, without
having synchronization overhead at the core boundary. This also provides
a best-case measure of core throughput, assuming an infinitely fast
environment.
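The record-and-replay scheme can be sketched as follows. The vector file layout (one `stimulus expected` pair per line, with `#` comments) and the `dut` callable are hypothetical; the actual testbench is written for a Verilog simulator and its file format is not described here.

```python
from typing import Callable, Iterable, List, Tuple

Vector = Tuple[str, str]  # (stimulus bits, expected output bits)


def parse_vectors(lines: Iterable[str]) -> List[Vector]:
    """Parse 'stimulus expected' pairs, one per line; '#' starts a comment."""
    vectors = []
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if line:
            stimulus, expected = line.split()
            vectors.append((stimulus, expected))
    return vectors


def replay(vectors: List[Vector], dut: Callable[[str], str]) -> int:
    """Drive each stimulus in order; return the index of the first
    mismatching output, or -1 if every output matches."""
    for i, (stimulus, expected) in enumerate(vectors):
        if dut(stimulus) != expected:
            return i
    return -1


# Hypothetical stand-in for the design under test: a bitwise inverter.
def invert(bits: str) -> str:
    return "".join("1" if b == "0" else "0" for b in bits)


recorded = ["0011 1100  # first recorded vector", "1010 0101", ""]
first_mismatch = replay(parse_vectors(recorded), invert)
```

Because each vector is applied as soon as the previous output is observed, nothing in the loop imposes a clock period on the design under test.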
5.2.5 Proteus Tool Flow
To create different sample points for throughput and power, many
different Proteus netlists were generated. Two local cycle time points
were generated, one for a target LCT of 80, and another at 36. An LCT of
80 is very similar to leaving the LCT unbounded – it is highly unlikely that this
limit would constrain the clustering algorithm. An LCT of 36 is more
reasonable, preventing cluster data path depths from becoming too
large for an efficient forward latency match. The other variable
parameter was the fanin limit. Values of 0 (unbounded), 4, and 8 were
used. This parameter should have the largest effect on the GMLD power
consumption, because the probability of a cluster being disabled by a
control token drops linearly as the number of fanins grows. This principle
tells us that high fanin widths will defeat potential cluster gating, and
should be avoided if possible.
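The sample points described above form a small parameter grid, which can be enumerated as in the sketch below. It assumes the target LCT and the fanin limit were the only parameters varied; the testcase labels are hypothetical names for the three programs of Section 5.2.3.

```python
from itertools import product

LCT_TARGETS = (80, 36)                   # target local cycle times
FANIN_LIMITS = (0, 4, 8)                 # 0 means "unbounded"
TEMPLATES = ("MLD", "GMLD")
TESTCASES = ("crc16", "float", "sort")   # hypothetical labels


def describe_fanin(limit: int) -> str:
    """Render the fanin limit, treating 0 as unbounded."""
    return "unbounded" if limit == 0 else str(limit)


# One clustering constraint set per (LCT, fanin) pair.
constraint_sets = list(product(LCT_TARGETS, FANIN_LIMITS))
# One simulation per constraint set, template, and testcase.
simulations = list(product(constraint_sets, TEMPLATES, TESTCASES))

for lct, fanin in constraint_sets:
    print(f"LCT={lct}  fanin={describe_fanin(fanin)}")
```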
Simulation is done on the output of the asynchronous Proteus netlist. For
the results presented here, we use the Cadence NC-Verilog™ simulation tool.
When simulation of the design is run, waves of all signals are captured to a
Value-Change Dump (VCD) file. This wave file is then used as an input to
the final tool, Synopsys PrimeTime PX™. PrimeTime reads a
VCD simulation file, the Verilog gate netlist, and a technology library, and
computes an estimate of the average and peak power consumption.
This power estimate uses a library-derived wire load model, and switching
activity captured in the VCD file combined with the pin capacitances in
the technology library.
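To make the role of switching activity concrete, the sketch below applies the standard first-order model of dynamic energy, 0.5 * C * Vdd^2 per transition, to toggle counts taken from sampled waveforms. It is a simplification and not a description of how PrimeTime PX computes power; the net names, capacitances, and supply voltage are made-up values.

```python
from typing import Dict


def count_toggles(samples: str) -> int:
    """Count value changes between consecutive samples of one net."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if prev != cur)


def switching_energy(waves: Dict[str, str], cap_farads: Dict[str, float],
                     vdd: float) -> float:
    """First-order dynamic energy in joules: 0.5 * C * Vdd^2 per transition."""
    return sum(
        0.5 * cap_farads[net] * vdd ** 2 * count_toggles(samples)
        for net, samples in waves.items()
    )


# Made-up waveforms and pin capacitances, for illustration only.
waves = {"a": "01010", "b": "00110"}
caps = {"a": 2e-15, "b": 1e-15}
total_energy = switching_energy(waves, caps, vdd=1.0)
```

Under this model, a net that never toggles contributes nothing, which is the effect conditional evaluation is meant to exploit.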
5.2.6 Measuring Results
In this case, we are specifically concerned with energy dissipation rather
than average power. While average power is an important metric, we
would like to focus on the reduction in signal switching caused by
conditional evaluation. Therefore, energy will be the criterion by which
we measure the power improvement of the GMLD template over MLD.
There is a significant consideration to bear in mind for the power estimates
shown in this research. At the time of this writing, the GMLD technology
library is not production-ready. This means that all cells have definitions,
but none of them are characterized. All cells do in fact have pin
capacitances, based on transistor gate capacitance with no wire
load. Cells which have no transistor model, such as the JOING cells, use
best-estimate pin capacitances. These cells have no internal
capacitances, and thus cannot yield any internal power measurements.
This is a shortcoming of the results presented, but it is our belief that more
accurate measurements, such as post-place-and-route measurements,
will be similar to those presented herein.
5.2.7 Simulation Data
The full results from these simulations are shown in Appendix C. For every
combination of Proteus clustering constraints, all three synthetic testcases
are run. The results of these simulations are shown for both MLD and
GMLD, with an energy comparison between the two.
5.3 Data Analysis
We can learn many things about the behavior of GMLD from the tv80
simulations. The most noticeable effects of simulating this design are the
difference in delay and in energy consumption. There is a large
difference in average power, but we know already that any difference in
delay between the two template styles will yield similar differences in
average power.
5.3.1 Delays
It is clear from the simulation data that GMLD suffers from poor
throughput. GMLD simulations take between 5.5 and 8 times as long as
MLD, which is an alarming result. However, we do know from the ring
experiment in Section 5.1.3 that GMLD can be considerably slower in an
algorithmic loop. The tv80 does have an algorithmic loop, so it is likely that
a combination of poor clustering with poor control path overhead is
causing this large difference. This delay problem must be
adequately addressed before GMLD can be considered a viable alternative
high-performance template.
5.3.2 Energy
What is more encouraging is the energy improvement. By computing the
average power and knowing the total testcase runtime, we compute the
energy consumed for each testcase. We can then compare the energy
required for a test using an MLD circuit against the GMLD circuit. The
simulation data shows us that on average, GMLD uses 16.7% less energy
than an equivalent MLD implementation. This ratio will vary according to
the actual clustering produced by Proteus. Remarkably, the energy
improvement is almost entirely independent of workload. For each testcase,
the average energy improvement across all circuit topologies was nearly
identical. This indicates that energy consumption is largely independent
of instruction mix, and might depend more heavily on required micro-
architectural resources in the CPU, such as the program counter or
address generator.
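The energy comparison reduces to two small formulas, sketched below. The power and runtime values are made-up numbers chosen only to show the calculation; they are not measurements from Appendix C.

```python
def energy_joules(avg_power_w: float, runtime_s: float) -> float:
    """Energy consumed by a testcase: average power times total runtime."""
    return avg_power_w * runtime_s


def improvement_percent(e_mld: float, e_gmld: float) -> float:
    """Positive when the GMLD circuit uses less energy than the MLD circuit."""
    return 100.0 * (e_mld - e_gmld) / e_mld


# Illustrative values only, NOT thesis data.
e_mld = energy_joules(avg_power_w=12e-3, runtime_s=1.0e-6)
e_gmld = energy_joules(avg_power_w=1.5e-3, runtime_s=6.5e-6)
saving = improvement_percent(e_mld, e_gmld)
```

The example shows how a circuit that runs several times longer can still consume less energy, provided its average power falls by a larger factor than its runtime grows.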
5.3.3 Cluster Impact Affecting Energy
It is clear from the simulation data that the greater the number of logic
clusters, the greater the energy requirement. This relationship is intuitive.
For each new cluster, handshaking to all fanins and fanouts must occur.
This will directly increase the amount of switching activity, and directly
impact energy. It is illustrative to compare the increase in energy
consumption to the increase in clusters. If we set the lowest energy
consumption of all circuit topologies to be equal to a 1.0 baseline, we can
compare the relative energy increase of all other topologies. Figure 20
shows that as cluster count increases, so does energy. This relationship
appears, to the first order, to be sub-linear. There is an anomaly for one of
the Proteus generated netlists, which appears similarly in other
comparisons presented here. Remarkably, the energy increase was
identical for each testcase, across all circuit topologies. This graph
represents identical data for all three testcases.
[Chart: Number of Clusters on Energy Consumption. X axis: Clusters (0 to 1200); Y axis: Energy Increase Factor (1.00 to 1.50).]
Figure 20 : Number of Clusters Affecting Energy Consumption
5.3.4 Netlist Properties Affecting Energy
It is useful to view the effect of Proteus clustering on energy improvement.
Logic clustering produces three parameters of interest: cluster fanin,
cluster depth, and number of clusters. All three values are related, but
each property might affect energy consumption differently. The effects of
cluster fanin and cluster depth are graphed in Figure 21. Again we see an
anomaly corresponding with the previous graph, which is the same netlist
that produces the other deviant data samples. This graph shows us that
average fanin and depth do not significantly affect the energy
reduction ratio. There is a mild reduction in energy ratio as fanin increases
and depth increases; however, this effect is very small. This is likely the
result of the energy-unaware clustering algorithm Proteus uses.
[Chart: Netlist Parameter Effect on Energy. X axis: Fanin/Depth (2 to 8); Y axis: Energy Improvement (%) (0 to 20); series: Avg. Fanin, Avg. Depth.]
Figure 21 : Netlist Parameters Affecting Energy
Figure 22 supports this conclusion as well. The number of clusters increases
the energy reduction ratio, but by a very small amount. Having a greater
number of clusters has similar implications to having lower fanin and
smaller cluster depth. When the number of clusters increases, it is more
likely the average fanin width drops, along with cluster depth.
[Chart: Total Clusters Effect on Energy. X axis: Clusters (0 to 1200); Y axis: Energy Improvement (%) (0 to 20).]
Figure 22 : Energy Improvement by Total Clusters
When these parameters and their effects on energy improvement are
graphed separately, we can see that none of the three have a significant
impact on energy improvement. It is likely that a power-aware clustering
algorithm would be needed to positively affect total circuit energy.
Again, we see an oddity in energy improvement for one of the data
points, which corresponds to a particular clustering topology produced by
Proteus.
6 Conclusion
GMLD is a partially successful refinement of the MLD asynchronous
template. We show that it is possible to exploit the availability of
synchronous flip-flop enable signals, and map these over to the
asynchronous domain. In doing so, we essentially create a conditional-
evaluation asynchronous template. Furthermore, we can generate this
template style automatically, using an automated de-synchronization tool
currently under development at USC.
6.1 Throughput
One of the most important design criteria for the GMLD template was high
throughput. Of the many candidate synchronous designs for automatic
de-synchronization, high throughput designs are the most likely to have
interest in an asynchronous design flow. Accordingly, we place an
emphasis on high throughput asynchronous template styles.
Unfortunately, there exists a performance limitation in the current
implementation of GMLD which prevents it from being competitive with
MLD.
The throughput dependency on the control path is not immediately
obvious from the original design. Furthermore, GMLD relies on at least
three levels of domino logic inside the data path, to offset the longer
control path delays. Proteus was not able to generate clusters with
sufficiently long data paths, as can be seen from the table in Appendix C.
The data path depths reported in that table are also maximum data
depths – Proteus does not distinguish individual circuit paths in the data
logic. It is likely that the average depth is shorter than the number shown
here, further exacerbating GMLD’s poor performance.
6.2 Energy Results
Power consumption has become a major issue for high performance
circuits. The VLSI industry increasingly finds itself limited by power
consumption, rather than area or external I/O. It is therefore important to
apply the many valuable lessons learned during synchronous design to
asynchronous circuits, to capitalize on prior research. Clearly, using a
synchronous load enable signal as an indicator for data path evaluation is
a viable idea for asynchronous circuits. This choice does not preclude the
use of other techniques, such as power gating, high VT transistors, or
voltage scaling in the asynchronous domain [7].
We are able to successfully use the notion of a gating update signal to
prevent the GMLD clusters from evaluating their data path. The
effectiveness of this feature is challenged by circuits with high fanin
widths, due to the OR-like nature of data path re-evaluation. This
technique does become more effective when coupled with an
aggressively fine-grained clock gating strategy in the synchronous
domain. While the tv80 design possessed some amount of gated flip-
flops, the enable logic was unsophisticated, and was not maximally
optimized. Aggressive manual optimization, or use of powerful tools such
as Calypto Systems’ PowerPro [14], may greatly improve the energy
efficiency of a de-synchronized circuit.
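The OR-like behavior mentioned above can be illustrated with a small sketch: a cluster must re-evaluate whenever any one of its fanins delivers a data token, so it is gated only when every fanin delivers a control token. The per-cycle trace format (one boolean per fanin, True meaning a data token) is a hypothetical representation chosen for this example.

```python
from typing import Iterable, List


def cluster_evaluates(fanin_is_data: Iterable[bool]) -> bool:
    """A cluster re-evaluates if ANY fanin delivers a data token."""
    return any(fanin_is_data)


def gated_fraction(trace: List[List[bool]]) -> float:
    """Fraction of cycles in which every fanin carried a control token."""
    gated = sum(1 for tokens in trace if not cluster_evaluates(tokens))
    return gated / len(trace)


# Made-up four-cycle trace for a cluster with two fanins.
two_fanins = [[False, False], [True, False], [False, True], [False, False]]
# The same cycles if the cluster only had the first fanin.
one_fanin = [[tokens[0]] for tokens in two_fanins]

wide = gated_fraction(two_fanins)
narrow = gated_fraction(one_fanin)
```

In this made-up trace the wider cluster is gated less often than the narrow one, because the second fanin supplies a data token in a cycle where the first does not.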
6.3 Future Research
We have identified two major areas in which GMLD can be improved:
throughput and join structures. Throughput remains the most
urgently needed improvement. In an ideal case, we would like to remove
the impact of the control path on latency, as is the case with MLD.
This is a difficult problem, because the evaluation of the data path
depends on a G Enable state driven by a fanin stage. This inherent
dependency of the data path on a control element makes the
orthogonal separation of the two difficult.
6.3.1 Throughput
There are two potential solutions which emerge to this problem. One is to
sufficiently separate the G Enable generation from its eventual use,
eliminating a problematic timing arc between the two. The enforcement
of this becomes troublesome, placing special burdens on the clustering
tool to move around logic. Furthermore, if the pipeline distance between
the G Enable and the target cluster becomes greater than one, extra
buffers must then be implemented to transport this bit forward, adding
extra area. Clearly, more analysis of the alternatives is needed for this
solution.
Another possible solution is to employ a strategy used by Ozdag and
Beerel in [41]. In [41], Ozdag and Beerel noted that loop structures could
cause performance bottlenecks, due to an imbalance in the speed at
which the request and data propagate. The solution was to insert special
loop-breaker elements in the loop cycles, which would propagate the
request signal at the same speed as the data. This concept may also be
applied to the GMLD circuits, which might be useful in speeding up critical
cycles.
6.3.2 Joins
In the ideal case, we would like to have simple join structures which do
not negate the effectiveness of control tokens. This is indeed a difficult
problem, due to the dependency of a single cluster on all of its fanins.
One possible strategy is to allow the G Enable to drive specific transistors
inside the domino data logic, acting as a more precise indicator for which
parts of the data path to evaluate. This will cause a higher area penalty,
and it may be difficult to determine which parts of the logic to gate with
this bit. Furthermore, this idea may invalidate the special ability of the
control path to assert a fast request to all fanouts, based on a control
token request type.
One useful idea is to have a notion of switching probability associated
with each G Enable bit. Each G Enable can be observed during
synchronous functional simulation, and an aggregate switching
probability can be computed over the lifetime of the simulation. The
clustering tool can then use these probabilities to determine how effective
using certain combinations of fanins will be with certain G Enables.
Extending this concept, correlations can be computed between sets of G
Enables. If certain portions of logic have high switching correlations, it
would then make sense to partition them together. This scheme allows
clusters with high fanins to have a better chance of having concurrent
control tokens on the fanin inputs, through more intelligent clustering. This
idea would require extensive experimentation to prove its usefulness.
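A minimal sketch of the proposed statistics is given below. It interprets "switching probability" as the fraction of simulated cycles in which a G Enable is asserted, and uses the Pearson coefficient as the correlation measure; both choices are assumptions, as are the signal names and traces.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def enable_probability(trace: List[int]) -> float:
    """Fraction of cycles in which the G Enable is asserted (1)."""
    return sum(trace) / len(trace)


def correlation(x: List[int], y: List[int]) -> float:
    """Pearson correlation of two equal-length traces (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def pairwise_correlations(traces: Dict[str, List[int]]) -> Dict[Tuple[str, str], float]:
    """Correlation for every unordered pair of G Enable signals."""
    return {
        (p, q): correlation(traces[p], traces[q])
        for p, q in combinations(sorted(traces), 2)
    }


# Made-up per-cycle enable values from a hypothetical synchronous simulation.
traces = {"en_a": [1, 0, 1, 0], "en_b": [1, 0, 1, 0], "en_c": [0, 1, 0, 1]}
probabilities = {name: enable_probability(t) for name, t in traces.items()}
correlations = pairwise_correlations(traces)
```

A clustering tool could then prefer to group logic gated by highly correlated enables, such as `en_a` and `en_b` in this example, and keep anti-correlated ones apart.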
The GMLD design presented here lays a solid foundation for further
refinement. We can say that synchronous clock enables can be
used successfully by asynchronous circuits, and we are optimistic that
future implementations of energy-efficient design templates can avoid the
excessive delay penalties we incur in GMLD.
Appendix A: Petrify FBIG Controller STG
Appendix B: Generalized Timing Equations
Definitions
$I_i$ : Number of fanin channels to stage $i$
$O_i$ : Number of fanout channels from stage $i$
$Z_i$ : Number of dual-rail outputs from stage $i$
$G_i$ : 0 if stage $i$ has no gating signal, 1 if stage $i$ has G-enable
$n$ : Maximum number of sequential domino cells in stage $i$ data path
$FL_i$ : Forward latency of stage $i$
$BL_i$ : Backward latency of stage $i$
$D_{join,i}$ : delay of stage $i$ join network
$D_{fork,i}$ : delay of stage $i$ fork network
$D_{data,i}$ : forward delay of stage $i$ data path
$D_{ctrl,i}$ : forward delay of stage $i$ control path
$l_{VL}$ : V_LOGIC delay
$l_{C}$ : 2-input C-element delay
$l_{FL}$ : FBIG control path FL for data update
$l_{BL}$ : FBIG control path BL
$l_{gate}$ : single gate delay
$l_{dom}$ : delay of a single domino cell
$l_{join}$ : 2-input JOIN delay
$l_{joing}$ : 2-input JOING delay
$l_{comp}$ : completion tree element delay
Equations
$D_{join,i} = l_{join} \lceil \log_2 I_i \rceil + l_{joing} G_i$
$D_{data,i} = (n - 1)\, l_{dom} + l_{VL}$
$D_{VL,i} = \lceil \log_8 Z_i \rceil \, l_{comp}$
$D_{ctrl,i} = D_{join,i} + l_{FL} + D_{VL,i}$
$FL_i = \max(D_{data,i}, D_{ctrl,i})$
$D_{fork,i} = \lceil \log_2 O_i \rceil \, l_{C}$
$BL_i = D_{fork,i} + l_{BL}$
$LCT_i = FL_i + BL_i$
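The equations above translate directly into code. In the sketch below, every element delay defaults to one unit purely as a placeholder; these are not the delay values used in the thesis, and the function and field names are illustrative.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Delays:
    """Element delays; the unit defaults are placeholders, not thesis values."""
    l_vl: float = 1.0      # V_LOGIC delay
    l_c: float = 1.0       # 2-input C-element delay
    l_fl: float = 1.0      # FBIG control path FL for data update
    l_bl: float = 1.0      # FBIG control path BL
    l_dom: float = 1.0     # single domino cell delay
    l_join: float = 1.0    # 2-input JOIN delay
    l_joing: float = 1.0   # 2-input JOING delay
    l_comp: float = 1.0    # completion tree element delay


def ceil_log(x: int, base: int) -> int:
    """ceil(log_base(x)) for x >= 1, computed with integer arithmetic."""
    levels, capacity = 0, 1
    while capacity < x:
        capacity *= base
        levels += 1
    return levels


def stage_timing(fanins: int, fanouts: int, outputs: int, gated: bool,
                 depth: int, d: Delays) -> Dict[str, float]:
    """Evaluate FL, BL and LCT for one stage from the Appendix B equations."""
    d_join = d.l_join * ceil_log(fanins, 2) + d.l_joing * (1 if gated else 0)
    d_data = (depth - 1) * d.l_dom + d.l_vl
    d_vl = ceil_log(outputs, 8) * d.l_comp
    d_ctrl = d_join + d.l_fl + d_vl
    fl = max(d_data, d_ctrl)
    d_fork = ceil_log(fanouts, 2) * d.l_c
    bl = d_fork + d.l_bl
    return {"FL": fl, "BL": bl, "LCT": fl + bl}


# Hypothetical gated stage: 2 fanins, 2 fanouts, 8 outputs, 3 domino levels.
timing = stage_timing(fanins=2, fanouts=2, outputs=8, gated=True, depth=3, d=Delays())
```

In this example the control path (four units) is longer than the data path (three units), so the control path sets the forward latency.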
Appendix C : Simulation Data
References
1. Alidina, M.; Monteiro, J.; Devadas, S.; Ghosh, A.; Papaefthymiou, M.
“Precomputation-Based Sequential Logic Optimization for Low
Power.” Proceedings of the 1994 IEEE/ACM international
conference on Computer-aided design. p. 74-81. 1994.
2. Andrikos, N.; Lavagno, L.; Pandini, D.; Sotiriou, C. “Fully-Automated
Desynchronization Flow for Synchronous Circuits.” Proceedings of
the 44th annual conference on Design automation, pp. 982 – 985,
2007.
3. Bardsley, A.; Edwards, D. A. “The Balsa Asynchronous Circuit
Synthesis System.” Proc. Forum on Design Languages, Sept. 2000.
4. Bardsley, A.; Edwards, D. A. “Synthesizing an asynchronous DMA
controller with Balsa,” Journal of Systems Architecture, Vol. 46, pp.
1309–1319, 2000.
5. Beerel, P. “Asynchronous Circuits: An Increasingly Practical Design
Solution.” Proc. International Symposium on Quality Electronic
Design, pp. 367 – 372, 2002.
6. Beerel, P.; Lines, A.; Davies, M.; Kim, N. H. “Slack Matching
Asynchronous Designs.” IEEE International Symposium on
Asynchronous Circuits and Systems, p. 184, 2006.
7. Beerel, P.; Roncken, M. “Low Power and Energy Efficient
Asynchronous Design.” Journal on Low Power Electronics, Invited
Paper. Nov. 2007.
8. Benini, L.; Siegel, P.; De Micheli, G. “Automatic Synthesis of Gated
Clocks for Power Reduction in Sequential Circuits.” IEEE Design and
Test of Computers, pp. 32–40, 1994.
9. van Berkel, K.; Kessels, J.; Roncken, M.; Saeijs, R.; Schalij, F. “The
VLSI-Programming Language Tangram and Its Translation into
Handshake Circuits.” Proc. European Conference on Design
Automation (EDAC), pp. 384–389, 1991.
10. van Berkel, K.; Josephs, M.; Nowick, S. “Scanning the Technology:
Applications of Asynchronous Circuits” Proceedings of the IEEE
Volume 87, Issue 2, page(s):223 – 233, Feb 1999.
11. Blunno, I.; Cortadella, J.; Kondratyev, A.; Lavagno, L.; Lwin, K.;
Sotiriou, C. “Handshake Protocols for De-Synchronization.” Proc.
International Symposium on Advanced Research in Asynchronous
Circuits and Systems, pp. 149–158, Apr. 2004.
12. Cadence Design Systems. “Low Power in Encounter RTL Compiler.”
Product Version 7.1.2
13. Calhoun, B. H.; Verma, N.; Chandrakasan, A. “Sub-Threshold Design:
the Challenges of Minimizing Circuit Energy.” Proceedings of the
2006 International Symposium on Low Power Electronics and Design,
pp. 366 – 368, 2006.
14. Calypto Design Systems. “Calypto Systems PowerPro™.” Oct 28,
2007. http://www.calypto.com/products/PowerPro_CG.html
15. Chelcea, T.; Bardsley, A.; Edwards, D.; Nowick, S. M. “A Burst-Mode
Oriented Back-End for the Balsa Synthesis System.” Proc. Design,
Automation and Test in Europe (DATE), pp. 330–337, Mar. 2002.
16. Chelcea, T.; Nowick, S. M. “Resynthesis and Peephole
Transformations for the Optimization of Large-Scale Asynchronous
Systems.” Proceedings of the 39th conference on Design
Automation, pp. 405-410, 2002
17. Chu, T. A. “Synthesis of Self-timed VLSI Circuits from Graph-theoretic
Specifications.” Doctoral Thesis, Massachusetts Institute of
Technology, June 1987.
18. Cortadella, J.; Kishinevsky, M.; Kondratyev, A.; Lavagno, L.;
Yakovlev, A. “Petrify: A Tool for Manipulating Concurrent
Specifications and Synthesis of Asynchronous Controllers.” Proc. of
the 11th Conf. Design of Integrated Circuits and Systems, Nov. 1996.
19. Cortadella, J.; Kondratyev, A.; Lavagno, L.; Sotiriou, C. “A
Concurrent Model for De-synchronization.” Handouts of the
International Workshop on Logic Synthesis, pp. 294-301, 2003.
20. Cortadella, J.; Kondratyev, A.; Lavagno, L.; Lwin, K.; Sotiriou, C.
“From Synchronous to Asynchronous: an Automatic Approach.”
Proc. of Design, Automation and Test in Europe Conference and
Exhibition, pp. 1368 – 1369, 2004.
21. Cortadella, J.; Kondratyev, A.; Lavagno, L.; Sotiriou, C. “De-
synchronization: Synthesis of Asynchronous Circuits from
Synchronous Specification.” IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, pp.
1904–1921, Oct. 2006.
22. Dutta, Sandeep. “Small Device C Compiler.” Oct. 28, 2007. SDCC is
a retargettable, optimizing ANSI - C compiler that targets the Intel
8051, Maxim 80DS390, Zilog Z80 and the Motorola 68HC08 based
MCUs. http://sdcc.sourceforge.net/
23. El-Essawy, W.; Albonesi, D.H.; Sinharoy, B. “A Microarchitectural
Level Step-Power Analysis Tool.” Proceedings of the International
Symposium on Low Power Electronics and Design, pp. 263–266, 2002.
24. Fant, K.M.; Brandt, S.A. “NULL Convention Logic™: A Complete and
Consistent Logic for Asynchronous Digital Circuit Synthesis.”
Proceedings of International Conference on Application Specific
Systems, Architectures and Processors, pp. 261-273, Aug. 1996.
25. Fazel, K.; Thornton, M. A.; Reese, R. B. “PLFire: A Visualization Tool for
Asynchronous Phased Logic Designs.” IEEE/ACM Conference on
Design, Automation, and Test in Europe (DATE), pp. 1096–1097,
March 2003.
26. Ferretti, M. “Single-track Asynchronous Pipeline Template.” Ph.D.
Thesis, University of Southern California, Aug. 2004.
27. Harris, D.; Horowitz, M. A. “Skew-Tolerant Domino Circuits.” IEEE
Journal of Solid-State Circuits, 32(11), pp. 1702–1711, November
1997.
28. Hutchinson, G.; Wallner, D. Nov. 6, 2007. “The TV80 is an 8-bit Z80-
compatible microprocessor core, written in Verilog.”
http://ghutchis.googlepages.com/tv80homepage
29. Jacobson, H.; Bose, P.; Zhigang Hu; Buyuktosunoglu, A.; Zyuban, V.;
Eickemeyer, R.; Eisen, L.; Griswell, J.; Logan, D.; Sinharoy, B.; Tendler,
J. “Stretching the limits of clock-gating efficiency in server-class
processors.” 11th International Symposium on High-Performance
Computer Architecture, pp. 238–242, Feb. 2005.
30. Kessels, J.; Peeters, A. “The Tangram Framework: Asynchronous
Circuits for Low Power.” Proc. of Asia and South Pacific Design
Automation Conference, pp. 255–260. Feb. 2001.
31. Kondratyev, A.; Lwin, K. “Design of asynchronous circuits by
synchronous CAD tools.” IEEE Design and Test of Computers, vol. 19,
no. 4, pp. 107-117, July/August 2002.
32. Ligthart, M.; Fant, K.; Smith, R.; Taubin, A.; Kondratyev, A.
“Asynchronous Design Using Commercial HDL Synthesis Tools.” Proc.
International Symposium on Advanced Research in Asynchronous
Circuits and Systems, pp. 114–125, Apr. 2000.
33. Linder, D.; Harden, J. “Phased Logic: Supporting the Synchronous
Design Paradigm with Delay-Insensitive Circuitry.” IEEE Transactions on
Computers, Volume 45, Issue 9, pp. 1031–1044, Sep. 1996.
34. Lines, A. “Pipelined Asynchronous Circuits.” Master's Thesis,
California Institute of Technology, June 1995.
35. Luo, Y.; Yu, J.; Yang, J.; Bhuyan, L. “Low Power Network Processor
Design Using Clock Gating.” Proc. of the 42nd annual Conference
on Design Automation, pp. 712 – 715, 2005.
36. Manohar, R.; Martin, A. “Slack Elasticity in Concurrent Computing.”
Lecture Notes in Computer Science, Vol 1422, pp. 272 – 285, 1998.
37. Martin, A. “Compiling Communicating Processes into Delay-Insensitive
VLSI Circuits.” Distributed Computing 1(4):226–234, 1986.
38. Murata, T. “Petri Nets: Properties, Analysis and Applications.”
Proceedings of the IEEE Volume 77, Issue 4, pp. 541 – 580, Apr 1989.
39. OpenCores.org “A collection of RTL hardware cores licensed under
the LGPL license.” Oct. 28, 2007. http://www.opencores.org
40. Ozdag, R. “Template Based Asynchronous Design.” Doctoral Thesis,
the University of Southern California. Nov. 2003.
41. Ozdag, R.; Beerel, P. “Technical Report: High-Speed QDI
Asynchronous Pipelines.” General Technical Report. University of
Southern California. 74 p.
42. Papanikolaou, A.; Miranda, M.; Wang, H.; Catthoor, F.; Satyakiran,
M.; Marchal, P.; Kaczer, B.; Bruynseraede, C.; Tokei, Z. “Reliability
Issues in Deep Deep Sub-micron Technologies: Time-Dependent
Variability and its Impact on Embedded System Design.” 13th IEEE
International On-Line Testing Symposium, p. 121, 2007.
43. Reese, R.; Thornton, M.; Traver, C. “A Fine-Grain Phased Logic CPU.”
Proceedings of the IEEE Computer Society Annual Symposium on
VLSI, pp. 70-79, Feb. 2003.
44. Reese, R.; Thornton, M.; Traver, C. “A Coarse-Grain Phased Logic
CPU.” IEEE Transactions on Computers, Volume 54 , Issue 7, pp. 788
– 799, July 2005.
45. Singh, M.; Nowick, S. M. “Fine-Grain Pipelined Asynchronous Adders
for High-Speed DSP Applications.” Proceedings of the IEEE
Computer Society Annual Workshop on VLSI, p. 111, 2000.
46. Smirnov, A.; Taubin, A.; Karpovsky, M.; Rozenblyum, L. “Gate Transfer
Level Synthesis as an Automated Approach to Fine-Grain
Pipelining.” Workshop on Token Based Computing (ToBaCo), pp. 67–77,
June 22, 2004.
47. Smirnov, A.; Taubin, A.; Su, M.; Karpovsky, M. “An Automated Fine-
Grain Pipelining Using Domino Style Asynchronous Library.” Fifth
International Conference on Application of Concurrency to System
Design, pp. 68 – 76, June 2005.
48. Sparso, J.; Furber, S. “Principles of Asynchronous Circuit Design: A
Systems Perspective.” Boston: Kluwer Academic Publishers, 2001.
49. Synopsys Corporation. “Power Compiler™ User Guide.” Version Z-
2007.03, June 2007.
50. Thornton, M. A.; Fazel, K.; Reese, R. B.; Traver, C. “Generalized Early
Evaluation in Self-timed Circuits.” Proceedings of the Design,
Automation and Test in Europe, pp. 255–259, 2002.
51. Tiwari, V.; Malik, S.; Ashar, P. “Guarded Evaluation: Pushing Power
Management to Logic Synthesis/Design.” International Symposium
on Low Power Design, pp. 221–226, April 1995.
52. Wikipedia, the Free Encyclopedia. “Clock Gating.” Oct. 28, 2007.
http://en.wikipedia.org/wiki/Clock_gating
53. Yeap, G. “Practical Low Power Digital VLSI Design.” Massachusetts:
Kluwer Academic Publishers, 1998.
Abstract
Existing techniques that translate synchronous gate-level circuits into asynchronous counterparts do not adequately support gated clocks and consequently can incur unnecessary switching activity. This thesis addresses this limitation by translating clock-gated structures into control circuits that trigger datapath evaluation only when necessary. In particular, we propose a new design template called Gated Multi-Level Domino (GMLD) and a corresponding de-synchronization design flow that supports the automatic translation of a clock-gated synchronous netlist into a high-performance, power-efficient asynchronous circuit. We demonstrate that this new approach reduces dynamic switching power with limited impact on area and maximum achievable throughput.
Asset Metadata
Creator
Shiring, Kenneth J.
(author)
Core Title
Gated Multi-Level Domino: a high-speed, low power asynchronous circuit template
School
Viterbi School of Engineering
Degree
Master of Science
Degree Program
Computer Engineering
Publication Date
03/10/2010
Defense Date
11/26/2007
Publisher
University of Southern California (original), University of Southern California. Libraries (digital)
Tag
asynchronous,de-synchronization,low power,OAI-PMH Harvest
Advisor
Beerel, Peter A. (committee chair), Gupta, Sandeep K. (committee member), Parker, Alice C. (committee member)
Creator Email
shiring@usc.edu
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-m1042
Unique identifier
UC1197596
Identifier
etd-Shiring-20080310 (filename),usctheses-m40 (legacy collection record id),usctheses-c127-23581 (legacy record id),usctheses-m1042 (legacy record id)
Legacy Identifier
etd-Shiring-20080310.pdf
Dmrecord
23581
Document Type
Dissertation
Rights
Shiring, Kenneth J.
Type
texts
Source
University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Repository Name
Libraries, University of Southern California
Repository Location
Los Angeles, California
Repository Email
cisadmin@lib.usc.edu