POWER OPTIMIZATION OF ASYNCHRONOUS PIPELINES USING
CONDITIONING AND RECONDITIONING BASED ON A THREE-VALUED
LOGIC MODEL
by
Arash Saifhashemi
A Dissertation Presented to the
FACULTY OF THE USC GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
December 2012
Copyright 2012 Arash Saifhashemi
Dedication
To my parents and my family
for their continuous love and support.
Acknowledgements
I would like to offer my sincere gratitude to my advisor, Professor Peter A. Beerel,
for his guidance and support, and for all the invaluable material that I learned
from him. I especially appreciate his enthusiastic dedication to my research and
his availability even during weekends and while traveling. I also thank Professor
Massoud Pedram, my co-advisor, for his technical guidance. In addition, I extend
my appreciation to the rest of my defense committee members, Professor Melvin A.
Breuer, Professor Alice C. Parker, and Professor David Kempe for their valuable
technical feedback on my work. In particular, I appreciate Dr. Kempe's thorough
review of my dissertation draft. I am also very thankful to my master's degree
advisor, Professor Hossein Pedram, who introduced me to the asynchronous world.
Also, I acknowledge the industry and government support for my research. I
was very lucky to regularly present my progress to Andrew M. Lines, Georgios
Dimou, Jon Dama, and Prasad Joshi at Intel Corporation. Also, I am indebted to
my manager, Hong Wang at Intel Labs, for providing the opportunity for exploring
the industrial applications of this thesis. This work was funded by a grant from
Intel and a grant from NSF.
Next, I would like to thank my colleague and officemate, Mehrdad Najibi, for
his insightful feedback on my work. Many ideas in this thesis are the result of hours
of rewarding discussions with him. Moreover, I wish to thank other students in
our research group who have helped me implement my ideas. Specifically, Hsin-Ho
Huang, who developed and implemented the reconditioning heuristic algorithm and
also helped me with experimental results. Gang Wu, Yujun Cao, Chen Qian, Boyi
Huang, and Roberto Martin del Campo Vera, helped me implement the SVC2RTL
algorithm. I also appreciate the help of the USC Ming Hsieh Department of
Electrical Engineering staff: Annie Yu, Diane Demetras, Shane Goodof, and Estela
Lopez for their support and help, and also Professor Gandhi Puvvada, whom I had
the pleasure to work with as a teaching assistant.
Finally, I am very thankful to everyone who supported me during graduate
school: my parents and my family, and especially my wonderful friends who made
graduate life at USC a joyful and unforgettable experience. I especially thank
Mahmood Shirooyeh, Ehsan Barjasteh, Sarah Armand, Azi Raou, Payam Bozorgi,
Vahid Arbab, Ali Kamranzadeh, Pouria Mojabi, Hamid R. Chabok, Mahdi
Youzbashi, and Ali Kazazi.
Table of Contents
Dedication ii
Acknowledgements iii
List of Figures vii
List of Tables ix
Abstract x
Chapter 1: Introduction 1
1.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter 2: Background: Proteus Flow 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Asynchronous Gates and Netlists . . . . . . . . . . . . . . . . . . . 10
2.3 SVC2RTL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Synthesis, Clustering, and Physical Design . . . . . . . . . . . . . . 18
Chapter 3: Three-Valued Logic 20
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Basic Definitions and Properties . . . . . . . . . . . . . . . . . . . . 22
3.2.1 3V Functions . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 3V Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 3V Model of an Asynchronous Netlist . . . . . . . . . . . . . . . . . 31
3.4.1 3VL Model Limitations and Iteration Stall-Freedom . . . . . 31
3.5 SEND Reconditioning . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.6 RECEIVE Reconditioning . . . . . . . . . . . . . . . . . . . . . . . 41
3.7 Observability Condition . . . . . . . . . . . . . . . . . . . . . . . . 48
3.7.1 Local Observability Partial Care . . . . . . . . . . . . . . . . 48
3.7.2 Global Observability Partial Care . . . . . . . . . . . . . . . 54
3.8 GOPC-Based Conditioning . . . . . . . . . . . . . . . . . . . . . . . 55
3.9 GOPC-Based RECEIVE Reconditioning . . . . . . . . . . . . . . . 62
Chapter 4: Operand Isolation Conditioning 64
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.1.1 A Motivating Example . . . . . . . . . . . . . . . . . . . . . 65
4.2 Problem Statement and Algorithm . . . . . . . . . . . . . . . . . . 68
4.2.1 An Algorithm for Finding OIEDs . . . . . . . . . . . . . . . 69
4.3 Pre-Layout Power and Cost Evaluation . . . . . . . . . . . . . . . . 71
4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Summary and Conclusions . . . . . . . . . . . . . . . . . . . . . . . 75
Chapter 5: Reconditioning 76
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Reconditioning Problem . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3 Reconditioning Network Model . . . . . . . . . . . . . . . . . . . . 81
5.3.1 An Intuitive Example . . . . . . . . . . . . . . . . . . . . . . 82
5.3.2 The Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.4 Creating Possible Conditional Vectors . . . . . . . . . . . . . . . . . 89
5.4.1 An Algorithm for Finding the Longest PCSV . . . . . . . . 91
5.4.2 Correctness of the Algorithm . . . . . . . . . . . . . . . . . 93
5.4.3 Complexity of the Algorithm . . . . . . . . . . . . . . . . . . 97
5.4.4 Range of Motion . . . . . . . . . . . . . . . . . . . . . . . . 97
5.5 Finding Optimal Distances of Nodes . . . . . . . . . . . . . . . . . 98
5.6 Integer Linear Program . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.7 A Heuristic Approach for Reconditioning . . . . . . . . . . . . . . . 102
5.8 Sharing Conditional Communication Primitives . . . . . . . . . . . 110
5.9 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.9.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Chapter 6: Another Application of 3VL Model: Formal Verification 123
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2 Formal Verication . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.2.1 Binary Coded Three Valued Logic (BC3VL) . . . . . . . . . 126
Chapter 7: Conclusion and Future Work 132
7.0.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Bibliography 135
List of Figures
Figure 2.1 Proteus flow . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Figure 2.2 Asynchronous gates described in SVC . . . . . . . . . . . . 12
Figure 2.3 DataGenerator and DataBucket described in SVC . . . . . 13
Figure 2.4 Conditional Accumulator in SVC . . . . . . . . . . . . . . . 14
Figure 2.5 1-of-N channels and four-phase handshaking . . . . . . . . 15
Figure 2.6 Description of rtl interface tasks . . . . . . . . . . . . . . . 16
Figure 2.7 Generated SystemVerilog description of the RTL-BODY for
the Conditional Accumulator example in Figure 2.4 . . . . 17
Figure 2.8 WRAPPER for CondAccumulator . . . . . . . . . . . . . . 18
Figure 3.1 A sample conditional SVC description . . . . . . . . . . . . 21
Figure 3.2 3VL model of an asynchronous system . . . . . . . . . . . . 22
Figure 3.3 Primitive functions in three-value logic . . . . . . . . . . . 24
Figure 3.4 An asynchronous netlist communicating with environment
modules S and R . . . . . . . . . . . . . . . . . . . . . . . 32
Figure 3.5 The 3V network model of the asynchronous netlist . . . . . 32
Figure 3.6 Right distributivity of SEND (SEND Reconditioning) . . . 36
Figure 3.7 When e = 0: f(x_1, x_2) r e = 0 ≠ f(x_1 r e, x_2 r e) = 1 . . . 41
Figure 3.8 Function RECEIVE_1, denoted r_1, is similar to r except for when e = 0 . . . 42
Figure 3.9 RECEIVE reconditioning . . . . . . . . . . . . . . . . . . . 42
Figure 3.10 Local OPC when the output is never N . . . . . . . . . . . 54
Figure 3.11 Peephole Optimization (e = 0 → GOPC(B, y_1) = 1) . . . 56
Figure 3.12 GOPC conditioning . . . . . . . . . . . . . . . . . . . . . . 60
Figure 3.13 RECEIVE reconditioning . . . . . . . . . . . . . . . . . . . 62
Figure 4.1 A 32-bit ALU in SVC . . . . . . . . . . . . . . . . . . . . . 66
Figure 4.2 A 32-bit ALU before (left) and after (right) proposed opti-
mization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Figure 4.3 An example of OIED . . . . . . . . . . . . . . . . . . . . . 69
Figure 4.4 Pre-layout estimation of the relative benefit (P_f / P_total) vs. activity ratio (r_f) . . . 73
Figure 5.1 Reconditioning can reduce switching activity . . . . . . . . 77
Figure 5.2 A sample asynchronous sub-netlist . . . . . . . . . . . . . . 82
Figure 5.3 Reconditioning network and possible moves involving node 3 84
Figure 5.4 A sample asynchronous netlist and its reconditioning network 88
Figure 5.5 PUSH and PULL operations using HEAD and TAIL . . . . 91
Figure 5.6 Committing a move m = (v; d) . . . . . . . . . . . . . . . . 103
Figure 5.7 Invalid and new moves as a result of committing a move . . 104
Figure 5.8 Algorithm 5.2 gets trapped in a local minimum . . . . . . 107
Figure 5.9 Replicating versus sharing conditional nodes on multiple
fanouts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
Figure 5.10 SVC description of RECON1: Dual mode arithmetic unit . 113
Figure 5.11 RECON2: A multiplier with enable and default value . . . 114
Figure 5.12 Comparison of greedy heuristics with S_c = 0, 0.5, and 1 . . . 115
Figure 5.13 Comparison of power consumption for RECON1 . . . . . . 118
Figure 5.14 Comparison of power consumption for RECON2 . . . . . . 119
Figure 5.15 Comparison of power consumption for ALU-OI . . . . . . . 120
Figure 5.16 Running time comparison for RECON1, RECON2, ALU-
OI for 16, 32, 64, and 128 bit datapaths. The ILP results are
not shown for 128 bit datapaths due to very long running
times. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
Figure 5.17 Comparison of the predicted versus the post-layout power
improvements . . . . . . . . . . . . . . . . . . . . . . . . . 122
Figure 6.1 Desired comparison scenarios in Proteus flow . . . . . . . . 126
Figure 6.2 Equivalence check using BC3VL transformation . . . . . . 127
Figure 6.3 Examples of BC3VL encoded cells . . . . . . . . . . . . . . 127
Figure 6.4 The valid bit of the primary output of the RTL BODY is calculated from the valid bit of the primary input of the RTL BODY . . . 129
Figure 6.5 Decomposed ALU. Each sub-block is described in SVC . . 131
Figure 6.6 Screenshot of Conformal tool declaring two binary coded
3V networks are equivalent . . . . . . . . . . . . . . . . . . 131
List of Tables
4.1 Post-layout total switching power measurements (mW ). . . . . . . 74
4.2 Area cost measurements (µm²) . . . . . . . . . . . . . . . . . . . . 75
6.1 BC3VL (Binary Coded 3VL) . . . . . . . . . . . . . . . . . . . . . . 127
Abstract
Asynchronous circuit design has long been considered a suitable alternative to syn-
chronous design due to its potential for achieving lower power consumption, higher
robustness to process variations, and faster throughput. The lack of commercial
CAD tools, however, has been a major obstacle for its wide-spread adoption. Al-
though there is no central clock, the use of handshaking protocols in asynchronous
circuits often introduces excessive switching activity which then translates to high
power consumption. This work is about reducing unnecessary switching activity
and automatically optimizing power consumption of asynchronous circuits. Our
focus is on circuits synthesized by a recently commercialized high-throughput asyn-
chronous ASIC CAD flow called Proteus.
We propose a formal framework based on three-valued logic in which we model
the conditional communication primitives of asynchronous circuits as three-valued
operators. Using this framework, we introduce two systematic power reduction
techniques for asynchronous circuits: conditioning (adding conditional communi-
cation) and reconditioning (moving conditional communication primitives).
To demonstrate an application of conditioning, an automatic method is intro-
duced for the adoption of operand-isolation in asynchronous circuits using com-
mercial synchronous CAD tools. Our experimental results show that for a 32-bit
ALU, we achieve an average of 53% power reduction for about a 4% increase in
area with no impact in performance.
An integer linear program (ILP) formulation is presented for the reconditioning
problem. Our experimental results show that our ILP can be solved in reasonable
time for medium size circuits and can achieve up to 80% power improvement. For
larger circuits when the ILP formulation is not tractable, a fast heuristic algorithm
is provided. Our experimental results show that our heuristic algorithm can still
significantly reduce power and can achieve close-to-optimal results.
Finally, a method for formal verification of asynchronous circuits based on
the three-valued logic model is presented. In particular, we show how our three-
valued logic model can enable the use of powerful commercial synchronous formal
verification tools for equivalence checking of asynchronous circuits.
Chapter 1
Introduction
As the demand for faster and more integrated circuits (IC) rises, low power con-
sumption becomes a more important design paradigm. With the high popularity
and versatility of mobile electronic devices, lower battery usage has become a key
requirement. In addition to the battery usage, higher power consumption gener-
ates more heat, which is detrimental to performance and reliability of the system.
The cost of packaging and cooling increases rapidly with power dissipation. In
today's electronic VLSI technology, low power consumption and energy efficiency
are the most dominant constraints for designing integrated circuits (IC) [BR07].
In synchronous design, there are numerous power optimization techniques, such
as clock-gating, operand-isolation, multiple voltage supplies, dynamic voltage scal-
ing, low-swing signaling on long wires, multi- and dynamic-threshold CMOS via
power gating and back-body biasing. In addition, power-aware synthesis and
place-and-route algorithms addressing both static and dynamic power have been
integrated into commercial CAD tool suites [BR07]. Among these methods, clock-
gating and operand-isolation have been successfully and widely used for many years
now, since automation and verification of these techniques are generally straightforward,
and they are not process specific. Most commercial RTL synthesizers
support these techniques, which are largely based on observability don't cares
(ODCs) [HS96], where switching activity is avoided by preventing the propagation
of non-observed data.
The clock network itself can contribute up to 40 % of the total dynamic power
of a chip [MKWS04]. Asynchronous circuits, in which there is no central clock
signal for synchronization, have been demonstrated to be able to achieve lower
power consumption, higher throughput, and higher tolerance of process variability
[BOF10]. If not careful, however, the energy overhead associated with handshaking
protocols and local controllers in an asynchronous circuit may exceed the benets.
Asynchronous methodologies have been successfully applied in large designs,
such as System-On-Chips [SMPG07, RVFG05, Lin04], and in low-power applica-
tions such as DCC error correction [VBBK+94], an asynchronous 80C51 microcontroller
[VGVBP+98], an asynchronous Fano channel decoder [OB04], a seven-band
IFIR filter bank [NS99], and three generations of asynchronous microprocessors
from Caltech [MNW03].
More recently, an asynchronous floating-point multiplier has been demonstrated
to consume three times less energy per operation while operating at 2.3 times
higher throughput compared to the synchronous baseline implementation [SM12].
The inherent flow control in asynchronous designs and the use of local controllers
enable two possibilities: first, different portions of the design can operate
at their individual ideal frequencies; and second, each portion can operate and
idle as needed, consuming energy only when and where needed. The latter is often
called conditional communication [BOF10], where data is communicated only
between blocks that are performing useful calculations.
Conditional communication is an important technique in low-power asyn-
chronous design and in essence is very similar to operand-isolation and clock-gating
techniques in synchronous circuits. In contrast to synchronous design, however,
CAD tool support for automatically adding conditionality is very limited. Cur-
rently, architectural decomposition of the high-level specication is mostly done
manually. This work develops techniques for automating this task. Some asyn-
chronous
ows translate only the conditionality that is explicit in the high-level
specication into conditional communication [WM03]. Others translate clock-
gating structures derived from an RTL description to conditional asynchronous
split-merge architectures [Man07, ST06].
The focus of this work is a recently commercialized ASIC flow, Proteus, for
high-performance asynchronous circuits [BDL11]. The power of Proteus comes
from reusing mature industrial synchronous synthesis and place-and-route tools.
The motivation of this work is to adopt the power reduction techniques available
in commercial CAD tools for asynchronous circuits. In particular, we focus on
power optimization of an asynchronous gate-level netlist, which is automatically
generated in the Proteus flow.
1.1 Contributions
We introduce two power optimization techniques for asynchronous circuits: first,
conditioning, which is automatically adding conditional communication primi-
tives between gates in order to reduce switching activity. As an example imple-
mentation, we adopt the operand isolation technique from synchronous design and
automatically add conditional communication to an unconditional asynchronous
netlist. Secondly, reconditioning, which is manipulating and moving the exist-
ing conditional communication primitives from one section of the netlist to another
section. In both cases, the goal is to optimize power consumption while preserving
equivalence.
In order to formally define the notion of equivalence, we will use the theory of
multi-valued logic [BK99] and in particular three-valued (3V) logic [WF80, FB96,
MRST97, FF03, CCL06]. We define the conditional communication primitives as
operators in the three-valued logic model and introduce the logical equivalence
of asynchronous circuits with conditional communication primitives and use this
model as a formal framework to prove the correctness of our transformations.
By mapping an asynchronous netlist to a 3V network, we are able to use multi-
valued observability don't care (ODC) [YB00] theory to formally reason about the
validity of our proposed power optimization techniques as well as leveraging off of
synchronous automatic power optimization methodologies and CAD tools that are
based on this mature theory.
The three-valued logic model forms the foundation of all of our contributions,
in that it allows us to rigorously prove equivalence under our proposed optimizations.
In addition, we show that the notion of equivalence in the three-valued logic
model can be used to address the longstanding open problem of formal verification
of asynchronous circuits. In particular, we show a method for using sophisticated
industry-proven synchronous formal verification tools to verify the logic equivalence
of asynchronous circuits with conditional communication primitives.
The detailed description of our contributions is as follows:
We model the conditional communication primitives of asynchronous circuits
as three-valued logic operators. This enables the use of three-valued logic al-
gebra for formally evaluating the correctness of adding or moving conditional
communication primitives in a given network for power optimization.
We adopt observability don't care theory in three-valued logic for asyn-
chronous circuits and in particular in the Proteus flow. Using this framework,
we introduce two effective systematic power reduction techniques for
asynchronous circuits: conditioning (adding conditional communication) and
reconditioning (moving conditional communication primitives). While lim-
ited forms of conditioning of asynchronous circuits existed in previous re-
search [Man07, ST06, WM03], to the best of our knowledge, the notion of
reconditioning is new.
To demonstrate applications of automatic conditioning, an automatic method
is introduced for the adoption of operand isolation in asynchronous circuits
using commercial synchronous CAD tools. Our experimental results show
that for a 32-bit ALU, we achieve an average of 53% power reduction for
about a 4% increase in area with no impact in performance.
We formulate the reconditioning problem as an integer linear program (ILP)
by extending the classic retiming problem in synchronous circuits [LS91].
Our experimental results show that our ILP can be solved in reasonable time
for medium size circuits and we can achieve up to 80% power improvement.
A heuristic-based algorithm for the reconditioning problem for the cases when
the ILP solution is not tractable is presented. We show that our heuristic
approach can achieve close-to-optimal results in a very short time (about 11
seconds for a circuit with about 70,000 gates).
We present a method for formal verification of asynchronous circuits using
commercial synchronous tools. Previous work in formal verification of asynchronous
circuits is mostly based on process and concurrent system modeling.
For example, [Roi97] is based on Petri Nets [Mur89], [Kon01, Neg98] are based
on Process Spaces, and [Cun04] is based on Proposition Automata. These
approaches require new and often non-standard languages, have signicant
computational complexity challenges, and call for the development of new
tools. Our three-valued logic model and the proposed binary-coded three-
valued logic model, on the other hand, is unique in the sense that it enables
modeling an asynchronous circuit using a two-valued logic network, which,
rather than creating new tools, allows the use of industrial synchronous formal
verification tools for the equivalence checking of asynchronous circuits.
We developed the SystemVerilogCSP (SVC) [SB11] front-end of the Proteus
flow, which has been used in many examples throughout this thesis. SVC is
a package based on the standard IEEE SystemVerilog [IEE09] for modeling
asynchronous circuits at a high level of abstraction. This package is currently
used in the Asynchronous VLSI course EE-552 at University of Southern
California and also at Technion University.
1.2 Organization
This thesis is organized as follows:
In Chapter 2 we explain the details of the Proteus flow, including the SVC
front-end.
Chapter 3 includes the formal definitions of three-valued logic, 3V networks,
conditioning, and reconditioning.
Chapter 4 presents an automatic method for adding conditional communications
based on operand-isolation in the Proteus flow.
Our main contribution is in Chapter 5, which includes the mixed integer
linear formulation and a heuristic algorithm for solving the reconditioning
problem.
Chapter 6 includes our secondary contribution, formal verification of asyn-
chronous circuits.
Chapter 2
Background: Proteus Flow
While we believe most of the ideas in this work can be applied to other asyn-
chronous circuit design methodologies, the focus of this work is an industrialized
asynchronous ASIC CAD flow, called Proteus. This chapter provides some background
information about this flow.
The industrial version of Proteus uses a proprietary language called CAST
[BDL11] at its front-end. Besides the background information about Proteus in
this chapter, Section 2.3 describes the SVC2RTL package (one of our minor contri-
butions) that replaces the proprietary CAST front-end by standard SystemVerilog.
2.1 Introduction
Proteus is a complete and commercially proven asynchronous ASIC flow [BDL11].
Its current academic version takes a high-level description in a communicating se-
quential processes (CSP)-based [Hoa85] language called SystemVerilogCSP (SVC)
[SB11] at its front-end and generates a netlist of asynchronous gates, which are
then automatically placed and routed using commercial physical design CAD tools.
Compared to other asynchronous flows, such as [KL02, LGC11, CKLS06], Proteus
is unique because it not only adopts commercial synchronous CAD tools for
both synthesis and physical design, but is also performance-driven, i.e., it automatically
re-pipelines the circuit to meet a given throughput constraint. In particular,
using a domino dual-rail PCHB-based¹ 65 nm TSMC library [Lin98], it generates
circuits that operate at 1.1 GHz, significantly faster than what is achievable in a
comparable synchronous ASIC flow.
The power of the Proteus flow, as illustrated in Figure 2.1, comes from the re-use
of mature industrial synchronous synthesis and place-and-route tools. The
asynchronous-specific aspects of this flow are the SVC-to-RTL translation, conversion
into an asynchronous pipeline template, clustering [DBL11], and pipeline
optimization (a.k.a. slack-matching [BLDK06]).
2.2 Asynchronous Gates and Netlists
Proteus supports a class of SVC descriptions suitable for high-performance designs
that consist of an initialization block followed by a forever loop.
The synthesized output is an asynchronous netlist, which is a set of communicating
asynchronous gates that each can be described in SVC. Modules described in
SVC communicate through blocking CSP-like communication actions called Send
and Receive [SB11]. When module m_1 sends a value to module m_2, we say m_1
sends a token to m_2.

¹Proteus has also been successfully applied to other types of back-end asynchronous handshaking templates, such as Multi-Level Domino [DBL11] and Multi-Level Single-Track Full-Buffer [GB11].

Figure 2.1: Proteus flow
There are four types of asynchronous gates in the Proteus flow, whose SVC
descriptions are shown in Figure 2.2:
1. RECEIVE gates, for conditionally receiving data
2. SEND gates, for conditionally sending data
3. Unconditional gates implementing logic functions
4. TOKBUF, which sends an initial token at its output upon reset and after
that behaves like a buffer.
always begin
forever begin
E.Receive(e);
if (e==1) L.Receive(d);
else d=0;
R.Send(d);
end
end
(a) RECEIVE
always begin
forever begin
E.Receive(e);
L.Receive(d);
if (e==1)
R.Send(d);
end
end
(b) SEND
always begin
forever begin
fork
I1.Receive(i1);
I2.Receive(i2);
//...
Im.Receive(im);
join
o = f (i1, i2, ..., im);
O.Send(o);
end
end
(c) Unconditional gate
always begin
I.Send(initValue);
forever begin
I.Receive(i);
O.Send(i);
end
end
(d) TOKBUF
Figure 2.2: Asynchronous gates described in SVC
12
As mentioned in Chapter 1, conditional communication is an effective technique
for reducing unnecessary activity in an asynchronous circuit. The idea is to only
communicate with the section of the circuit that is performing useful calculation.
In Proteus, this technique is implemented by instantiating RECEIVEs at primary
inputs and instantiating SENDs at primary outputs of the given circuit. More
details on how these gates are instantiated is provided in Section 2.3.
Figure 2.3 shows two special cases of unconditional gates: a DataGenerator,
which repeatedly generates data and sends it to its output, and a DataBucket,
which repeatedly receives data from its input and consumes it.
always begin
forever begin
//Generate data
O.Send(d);
end
end
(a) DataGenerator
always begin
forever begin
I.Receive(d);
//Consume data
end
end
(b) DataBucket
Figure 2.3: DataGenerator and DataBucket described in SVC
An asynchronous netlist together with an environment form an asynchronous
system. The asynchronous netlist communicates with the environment through
primary input and primary output channels.
Figure 2.4 shows a sample SVC specification of a conditional accumulator module,
CondAccumulator, that has C1, C2, C3, I1, and I2 as input channels and O
as an output channel.
module CondAccumulator (e1of2_1.In C1,C2,C3,
e1of2_16.In I1,I2,
e1of2_16.Out O);
logic [15:0] x1, x2, s;
logic c1, c2, c3;
always begin
s = 0; //Initialization block
forever begin //Forever block
x1=0; x2=0;
fork
C1.Receive(c1);
C2.Receive(c2);
C3.Receive(c3);
join
if (c1) I1.Receive(x1);
if (c2) I2.Receive(x2);
s= s + x1 + x2;
if (c3) O.Send(s);
end
end
endmodule
Figure 2.4: Conditional Accumulator in SVC
The SystemVerilog interface [IEE09] type e1of2 in Figure 2.4 specifies that the
channel should be implemented using a four-phase 1-of-2 handshaking protocol
with an inverted acknowledgment signal e [Lin98]. In the four-phase 1-of-N hand-
shaking protocol, as shown in Figure 2.5, N wires are used to encode log(N ) bits
of data.
In our example, I1, I2, and O are each an array of 16 e1of2 channels. The
CondAccumulator module first initializes variables x1 and x2 to zero, then
concurrently receives the values of c1, c2, and c3 from channels C1, C2, and C3.
If c1 (c2) is one, the module receives the value of x1 (x2) from channel I1 (I2),
and then it calculates the value of s as the sum of s, x1, and x2. Finally, if c3
is one, the value of s is sent to the output channel O.

(a) 1-of-N data encoding
(b) Four-phase handshaking
Figure 2.5: 1-of-N channels and four-phase handshaking
2.3 SVC2RTL
The industrial version of Proteus uses a proprietary language called CAST [BDL11]
at its front-end. In order to use a standard front-end language, which enables
leveraging off of synchronous CAD tools and the adoption of Proteus by other
researchers, we developed a program called SVC2RTL, which replaces the CAST
front-end with the SVC front-end, based on standard SystemVerilog.
SVC2RTL converts the SVC specication into a synthesizable single-rail block
called the WRAPPER as shown in Figure 2.8. In order to simplify the SVC2RTL
algorithm and keep the RTL-BODY concise and close to the original SVC module,
task Receive (output
logic [W-1:0] data);
do = 1; data = value;
endtask
task Send (input
logic [W-1:0] data);
do = 1; value = data;
endtask
task InitDo;
do = 0;
endtask
task InitValue;
value = 0;
endtask
Figure 2.6: Description of rtl_interface tasks
instead of expanding e1ofN_M channels into their individual handshaking signals,
SVC2RTL keeps them as e1ofN_M interfaces, but uses SystemVerilog's compiler
directives to ensure the non-synthesizable parts of the tasks defined in this interface
are ignored by the RTL synthesizer. Moreover, for connecting RECEIVE/SEND
cells to the RTL-BODY, we define a new synthesizable interface, rtl_interface, to
encapsulate the do and value signals, and define new synthesizable Receive and Send
tasks, as shown in Figure 2.6. Furthermore, as shown in Figure 2.6, two initialization
tasks are called at the beginning of the forever loop of the RTL-BODY: InitDo,
which initializes the do signal to 0, and InitValue, which initializes the value signal
to 0. These tasks prevent the RTL synthesizer from creating unwanted latches.
The WRAPPER module instantiates special RECEIVE/SEND library cells
to implement conditional communication and a synthesizable block called RTL-
BODY that implements the core logic of the process. Figure 2.7 shows the RTL-
BODY description of the CondAccumulator example of the previous section.
module CondAccumulator_RTL (
interface C1 ,C2 ,C3 ,I1 ,I2, O,
input CLK, input _RESET );
logic [I1.W-1:0] x1, ff$x1;
logic [I2.W-1:0] x2, ff$x2;
logic [O.W-1:0] s, ff$s;
logic c1, ff$c1, c2, ff$c2, c3, ff$c3;
//The initialization part is converted to always_ff
always_ff @ (posedge CLK, negedge _RESET)
if (!_RESET) begin
ff$s <= '0;
end
else begin
ff$x1 <= x1; ff$x2 <= x2; ff$s <= s;
ff$c1 <= c1; ff$c2 <= c2; ff$c3 <= c3;
end
//The forever loop is converted to always_comb
always_comb begin
C1.InitDo; C2.InitDo; C3.InitDo;
I1.InitDo; I2.InitDo; O.InitDo;
O.InitValue;
x1 = ff$x1; x2 = ff$x2; s = ff$s;
c1 = ff$c1; c2 = ff$c2; c3 = ff$c3;
x1 = 0; x2 = 0;
C1.Receive(c1); C2.Receive(c2);
C3.Receive(c3);
if (c1) I1.Receive(x1);
if (c2) I2.Receive(x2);
s = s + x1 + x2;
if (c3) O.Send(s);
end
endmodule
Figure 2.7: Generated SystemVerilog description of the RTL-BODY for the Conditional
Accumulator example in Figure 2.4
Each iteration of the forever loop in the SVC specification is mapped into
one clock cycle of the RTL-BODY. In the case of conditional inputs, the RTL-BODY
asserts the condition of the communication on each RECEIVE cell's E input (the
do signals in Figure 2.8). Only if the condition is 1 does it consume the RECEIVE
cell's output port R's value (the value signal).
Figure 2.8: WRAPPER for CondAccumulator
In the example of Figure 2.4, since C1, C2, and C3 are unconditionally received,
the values of C1.do, C2.do, and C3.do are always 1. On the other hand, the values
of I1.do and I2.do are c1.value and c2.value, respectively. The RTL-BODY then
calculates and asserts the output condition (do) and output data (value) on each
SEND cell's E and L ports, respectively.
2.4 Synthesis, Clustering, and Physical Design
The RTL synthesis tool recognizes RECEIVE and SEND cells as hard macros
and only synthesizes the RTL-BODY into a network of image library cells, i.e.,
single-rail cells that do not physically exist, but will be replaced by their real
asynchronous equivalents after synthesis. The RECEIVE and SEND cells are
also replaced by their real asynchronous equivalents. Power optimization of the
synthesized WRAPPER is the main contribution of this work.
The next step, clustering [DBL11], i.e., associating several real asynchronous
cells with an asynchronous controller, yields a complex non-linear pipeline. Asyn-
chronous pipeline optimization algorithms, such as slack-matching [BLDK06], are
then applied to ensure the pipelined design meets the target throughput constraint.
Finally, the asynchronous netlist is instantiated and given to a commercial place-
and-route tool for physical design.
Chapter 3
Three-Valued Logic
3.1 Introduction
This chapter analyzes the behavior of an asynchronous system by decomposing its
trace of behavior into sub-traces called system iterations. In each system iteration,
each channel either communicates a data value 0, or 1, or has no communication
action. In other words, a channel cannot communicate more than one token per
system iteration. For SVC descriptions synthesizable by Proteus, each iteration of
the forever loop satisfies this criterion [BDL11].
As an example, Figure 3.1b shows the SVC description of a module S that
communicates with two DataBuckets (described in Figure 2.3b). If each system
iteration corresponds to a single iteration of the forever loop of S, then S sends a
token on each channel C_1 and C_2 only at every other system iteration, as shown
in Figure 3.2a.
To analyze such behaviors, in this chapter we use a model based on Three-
Valued Logic (3VL) [WF80, FB96, MRST97, FF03, CCL06], in which each system
iteration is effectively modeled by one clock cycle and in each cycle variables can
always begin
s = 1; d1 = 1; d2 = 0;
forever begin
if (s == 1) begin
C1.Send(d1);
d1 = ~d1;
end
else
C2.Send(d2);
s = ~s;
end
end
(a) SVC description of S (b) S communicating with two data buckets
Figure 3.1: A sample conditional SVC description
take one of the three values in the set T = {0, 1, N}. In particular, the absence
of communication on a channel is modeled by the value N. If there is no commu-
nication action during system iteration i on channel c in the asynchronous netlist,
the value of variable c in clock cycle i in the 3V model would be N. This property
is illustrated in Figure 3.2b.
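To make this mapping concrete, the following small Python sketch (an illustration added here, not part of the thesis tooling; the function name s_trace and the string "N" encoding are assumptions of this sketch) replays the behavior of module S from Figure 3.1a and prints the per-iteration 3V values of C_1 and C_2, reproducing the pattern of Figure 3.2.

# An illustrative recreation (not from the thesis) of the trace in Figure 3.2:
# per-iteration 3V values on channels C1 and C2 produced by module S of
# Figure 3.1a; the string "N" stands for "no communication".
N = "N"

def s_trace(num_iterations):
    s, d1, d2 = 1, 1, 0              # initialization block of S
    trace = []
    for _ in range(num_iterations):
        if s == 1:
            c1, c2 = d1, N           # S sends d1 on C1; C2 stays silent
            d1 = 1 - d1              # d1 = ~d1
        else:
            c1, c2 = N, d2           # S sends d2 on C2; C1 stays silent
        s = 1 - s                    # s = ~s
        trace.append((c1, c2))
    return trace

print(s_trace(4))                    # [(1, 'N'), ('N', 0), (0, 'N'), ('N', 0)]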
By mapping an asynchronous netlist to a 3V network, we are able to use multi-
valued observability don't care [YB00] theory to formally reason about the validity
of our proposed power optimization techniques as well as leveraging off of syn-
chronous automatic power optimization methodologies and CAD tools.
(a) Values of tokens on C_1 and C_2 in each system iteration
(b) 3V values of C_1 and C_2 in each clock cycle
Figure 3.2: 3VL model of an asynchronous system
3.2 Basic Definitions and Properties
We define three-valued logic (3VL) based on the more general notion of multi-valued
logic [BK99]. In the following, let T = {0, 1, N} and B = {0, 1}.
A three-valued (3V) variable x can take a value from the set T = {0, 1, N}.
Values 0 and 1 are the two common Boolean logic values, whereas the value N
models the condition of no communication action (a.k.a. a spacer [SF01]) on a
channel of an asynchronous gate during a system iteration, as shown in Figure 3.2.
Definition 3.1 (Minterm). A three-valued minterm (a.k.a. vertex or point) of n
three-valued variables x_1, x_2, ..., x_n is a point in the space T^n = T × T × ... × T (n times).
3.2.1 3V Functions
Definition 3.2 (3V Function). A 3V function f is a function which maps minterms
in T^n to T, written f : T^n → T.

T^n is called the domain of f: Domain(f) = T^n, and T is called the co-domain
or range of f: Range(f) = T. The set of variables {x_1, x_2, ..., x_n} is called the
support of f: Support(f) = {x_1, x_2, ..., x_n}.

Often, functions are defined with their correspondence rule. We use the notation
y = f(x) : T^n → T, where x ∈ T^n, y ∈ T, and (x, y) ∈ f. Basic 3V functions can
be defined based on a set of 3V tables as shown in Figure 3.3.

The ∧ and ∨ operators represent logical functions similar to the Boolean AND
and OR functions as long as none of the inputs is N; otherwise, the output of
these functions is N. The inverting operator ¬ is the same as the Boolean inverter
when the input is 0 or 1, but its output is N when its input is N. The output of
the equivalence operator ≡ is always 0 or 1 and never N.¹

The RECEIVE operator (denoted r) behaves like a buffer when the enable input
E is 1, whereas when E is 0, the output R is 0 irrespective of the value of L.

The SEND operator (denoted s) also behaves like a buffer when E is 1, whereas
when E is 0 or N, its output is N.

¹Therefore ≡ represents a function with 3V inputs and a 2V output: T^2 → B.
(a) A ∧ B (AND)
         A=0  A=1  A=N
  B=0     0    0    N
  B=1     0    1    N
  B=N     N    N    N

(b) A ∨ B (OR)
         A=0  A=1  A=N
  B=0     0    1    N
  B=1     1    1    N
  B=N     N    N    N

(c) A ≡ B (EQUIVALENCE)
         A=0  A=1  A=N
  B=0     1    0    0
  B=1     0    1    0
  B=N     0    0    1

(d) ¬A (NOT)
  A=0 → 1,  A=1 → 0,  A=N → N

(e) L r E (RECEIVE)
         E=0  E=1  E=N
  L=0     0    0    N
  L=1     0    1    N
  L=N     0    N    N

(f) L s E (SEND)
         E=0  E=1  E=N
  L=0     N    0    N
  L=1     N    1    N
  L=N     N    N    N

Figure 3.3: Primitive functions in three-value logic
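To make the tables concrete, here is a minimal Python encoding of the six primitive operators (an illustrative sketch only; the function names and the "N" string encoding are choices of this sketch, not part of the Proteus flow).

# A minimal Python encoding (illustrative only) of the primitive 3V operators
# of Figure 3.3; 0 and 1 are Python ints and the value N is the string "N".
N = "N"

def and3(a, b):
    # AND: Boolean AND unless either input is N (Figure 3.3a)
    return N if N in (a, b) else a & b

def or3(a, b):
    # OR: Boolean OR unless either input is N (Figure 3.3b)
    return N if N in (a, b) else a | b

def equiv3(a, b):
    # EQUIVALENCE: always 0 or 1, never N (Figure 3.3c)
    return 1 if a == b else 0

def not3(a):
    # NOT: Boolean inverter; N maps to N (Figure 3.3d)
    return N if a == N else 1 - a

def receive3(l, e):
    # RECEIVE (L r E): buffer when E = 1, forced to 0 when E = 0 (Figure 3.3e)
    return 0 if e == 0 else (l if e == 1 else N)

def send3(l, e):
    # SEND (L s E): buffer when E = 1, otherwise N (Figure 3.3f)
    return l if e == 1 else N

if __name__ == "__main__":
    assert and3(1, N) == N and or3(0, N) == N and not3(N) == N
    assert equiv3(N, N) == 1 and receive3(N, 0) == 0 and send3(1, 0) == N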
We use the term unconditional function to distinguish between logic functions
and SEND/RECEIVE functions:
Definition 3.3 (Unconditional function). A function f is called unconditional if
there exists a representation of f which is obtained only by composition of ∧, ∨,
and ¬ functions.

For example, f_1 = (x_1 ∧ ¬x_2) ∨ (¬x_1 ∧ x_2) and f_2 = ¬((x_1 ∧ x_2) ∨ (x_1 ∧ x_3))
are unconditional functions, whereas f_3 = x_1 r x_2 and f_4 = x_1 s x_2 are conditional.
An unconditional function f has an important property: the output of f is N
if and only if at least one of its input variables is N. This can be directly concluded
from the tables of Figure 3.3a and 3.3b. It is important to observe that RECEIVE,
SEND, and EQUIVALENCE do not have the above property.
Formally, we state the following lemma:
Lemma 3.1 (Support synchrony). Assuming that y = f(x_1, x_2, ..., x_n) is an
unconditional 3V function:

  (∃x_i ∈ {x_1, x_2, ..., x_n} : x_i = N) ↔ (y = N).

Proof (both directions): Direct corollary of the tables of Figures 3.3a, 3.3b, and 3.3d.
Definition 3.4 (3V literals). A 3V literal x^C is a 2V function of the form:

  x^C : T → B,   x^C = ∨_{i ∈ C} (x ≡ i),

where C ⊆ T.

Below are some example literals for a 3V variable x:

  x^{0} = (x ≡ 0),   x^{0,N} = (x ≡ 0) ∨ (x ≡ N),   x^{B} = (x ≡ 0) ∨ (x ≡ 1)

Obviously, for a 3V variable x: x^{T} = x^{0,1,N} = 1.
Definition 3.5 (3V cube). A cube C = C_1 × C_2 × ... × C_n is a product of 3V
literals of the form:

  x_1^{C_1} x_2^{C_2} ... x_n^{C_n}

A 3V function f can be written in the form of a sum-of-cubes for each of
its values, where each value f^{i}, i ∈ T, is represented as a Boolean function.
Examples of such representations are as follows:

Example 3.1
For the 3V function RECEIVE in the form r(l, e) = l r e, using the table in Figure
3.3e we can write:

  r^{0} = e^{0} + e^{1} l^{0}
  r^{1} = e^{1} l^{1}
  r^{N} = e^{N} + e^{1} l^{N}
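As a quick consistency check of Example 3.1, the following self-contained Python sketch (illustrative only; lit and receive3 are hypothetical helper names) enumerates all nine (l, e) combinations and verifies each sum-of-cubes expression against the RECEIVE table.

# Checks the sum-of-cubes of Example 3.1 against the RECEIVE table of Fig. 3.3e.
from itertools import product

N = "N"

def receive3(l, e):              # the RECEIVE table of Figure 3.3e
    return 0 if e == 0 else (l if e == 1 else N)

def lit(x, C):                   # 3V literal x^C: 1 iff the value of x lies in C
    return 1 if x in C else 0

for l, e in product((0, 1, N), repeat=2):
    r = receive3(l, e)
    assert (r == 0) == bool(lit(e, {0}) or (lit(e, {1}) and lit(l, {0})))   # r^{0}
    assert (r == 1) == bool(lit(e, {1}) and lit(l, {1}))                    # r^{1}
    assert (r == N) == bool(lit(e, {N}) or (lit(e, {1}) and lit(l, {N})))   # r^{N}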
Example 3.2
A 3V multiplexer with data inputs i_0 and i_1, select input s, and output o can be
described as the following unconditional function:

  o^{0} = s^{0} i_0^{0} i_1^{0,1} + s^{1} i_0^{0,1} i_1^{0}
  o^{1} = s^{0} i_0^{1} i_1^{0,1} + s^{1} i_0^{0,1} i_1^{1}
  o^{N} = s^{N} + i_0^{N} + i_1^{N}

Notice that for o^{0} and o^{1}, it is required that all input variables have a 0
or 1 value. Otherwise, based on Lemma 3.1, the value of o would be N.
3.3 3V Networks
A 3V network is a directed acyclic graph G(V,E), where V is the set of nodes and
E is the set of edges. Each node represents a 3V function with a single 3V output
and several 3V inputs. As such, a 3V variable is associated with each node. A
3V network has a set of primary input and a set of primary output nodes. A 3V
variable is also associated with each primary input node.
In order to distinguish the type of nodes in the circuit, we define
P = {I, R, S, U, O} as a partition of V consisting of I (primary input nodes),
R (RECEIVE nodes), S (SEND nodes), U (unconditional nodes), and O (primary
output nodes).
There is a directed edge (u, v) ∈ E from the node u to the node v if a function
represented by the node v explicitly depends on the output variable at the node
u. The node v is called the fanout or the successor of the node u, and the node u
is called the fanin or the predecessor of the node v.
The output variable y of a node v ∈ V with input variables x = (x_1, x_2, ..., x_n)
implementing the function f is y = f(x). Alternatively, the output of a node can
be written as a function of primary inputs.

A path p in a 3V network is a sequence of nodes and edges. We assume that
all paths are simple paths, i.e., they contain no node more than once. If a path
starts at a node u and ends at a node v, we use the notation p_{u⇝v} or u ⇝ v.

The transitive fanin (TFI) of a node v is the set of nodes u from which there
exists a path p_{u⇝v}. The transitive fanout (TFO) of a node v is the set of nodes u
to which there exists a path p_{v⇝u}.

The 3V behavior of an n-input, m-output 3V network is a 3V function
F : T^n → T^m. Two networks are equivalent if there is a one-to-one correspondence
between their respective primary inputs and primary outputs, and if their
corresponding 3V behaviors are equivalent.
A node representing an unconditional function is called an unconditional node,
otherwise it is called a conditional node.
Based on Lemma 3.1, an unconditional node performs useful calculation only
when all of its inputs are simultaneously non-N . We shall call variables whose
values are simultaneously N or simultaneously non-N synchronized variables:
Definition 3.6 (Synchronized variables). In a 3V network, two variables x_1 and
x_2 are synchronized if the following expression evaluates to 1:

  (x_1^{0,1} ∧ x_2^{0,1}) ∨ (x_1^{N} ∧ x_2^{N})   (3.1)

We denote synchronized variables by writing x_1 ≍ x_2.

Notice that our definition implies that the synchronization relationship is an
equivalence relation.

Lemma 3.2. If y = f(x_1, x_2, ..., x_n) is an unconditional function whose inputs
are synchronized (x_1 ≍ x_2 ≍ ... ≍ x_n), then ∀x_i : x_i ≍ y.

Proof: We prove ∀x_i : y^{0,1} → x_i^{0,1}, which, due to equivalence, implies that
y^{N} → x_i^{N}:

  y^{0,1}  →(Lem. 3.1)  ∀x_i ∈ {x_1, x_2, ..., x_n} : x_i^{0,1}.

Definition 3.7 (3V Finite state machine). A 3V finite state machine is a 6-tuple
(Σ, S, δ, S_0, Λ, λ), where:

  Σ is a finite non-empty set of input minterms.
  S is a finite non-empty set of states.
  δ : Σ × S → S is the next state function.
  S_0 ⊆ S is the set of initial states.
  Λ is a finite non-empty set of output minterms.
  λ : Σ × S → Λ is the output function.

Two FSMs are equivalent if, starting from their respective initial states, they will
produce the same output sequence when they are given the same input sequence
[Gup09]. In Boolean logic, if two sequential circuits share the same set of inputs,
outputs, and state-holding elements (e.g., flip-flops), then it can be shown that
it is sufficient to check their combinational portions for equivalence [PH09]. This
observation extends to three-valued FSMs. We will use this notion of equivalence
for formal verification of asynchronous circuits in Chapter 6.
3.4 3V Model of an Asynchronous Netlist
In this work we model an acyclic asynchronous netlist without any TOKBUF
(described in Figure 2.2d) by a 3V network. Also, we model an asynchronous
netlist that has TOKBUFs by a 3V finite state machine whose trace is divided
into clock cycles. Finally, we model an asynchronous system (consisting of an
asynchronous netlist and its environment) also by a 3V finite state machine. In
this model, each iteration of the asynchronous system is modeled by one clock
cycle. In the remainder of this chapter, we focus on the power optimization of
acyclic asynchronous netlists without TOKBUFs, which, as described, are modeled
by 3V networks.
3.4.1 3VL Model Limitations and Iteration Stall-Freedom
The 3VL model does not capture all behaviors of an asynchronous system.
Firstly, it does not model the flow-control nature of asynchronous gates and
processes; therefore, it cannot capture behaviors in which tokens are stalled
[BLDK06, JKB+02] at the inputs of asynchronous gates across system iterations.
Consider, for example, the simple asynchronous netlist of Figure 3.4, which
communicates with environment modules S and R. The SVC description of the S
module is shown in Figure 3.1. As explained in Section 3.1, S alternates between
sending on channels C_1 and C_2 in each iteration, whereas the gate R is a
DataBucket (described in Figure 2.3b). Figure 3.4a shows the first iteration of the
(a) Iteration 1: the token on C_1 can move no further than C_3
(b) Iteration 2: both tokens have arrived. AND can complete its iteration
Figure 3.4: An asynchronous netlist communicating with environment modules S and R
(a) First clock cycle (b) Second clock cycle
Figure 3.5: The 3V network model of the asynchronous netlist
S module, in which it sends a token to the BUF gate. The BUF gate receives a
token from its input and sends it to its output to be received by the AND gate.
But the AND gate needs to stall and delay the completion of the receive of the
first token from the BUF gate until the S module starts its second iteration, in
which it sends another token on channel C_2, as shown in Figure 3.4b. Once the
AND receives both tokens, it can send a token with value 0 on channel C_4. It is
important to notice that in order for the AND gate to complete one iteration, S
should complete two iterations.
The corresponding 3VL model, however, does not reflect this behavior: Figures
3.5a and 3.5b show the first and the second clock cycles in the 3VL model,
respectively. In both clock cycles, the value of C_4 is N, rather than 0.
The reason for this difference is that in an asynchronous netlist, asynchronous
gates can create stalls during which tokens can stay on channels. In our example,
the AND gate can delay the Receive on C_3 (and hence stall the BUF gate) until
the second token arrives on C_2, which does not take place until the second
iteration of the S module. During this stall, the BUF keeps its output token on C_3.

In the 3VL model, however, unconditional nodes cannot stall other nodes, nor
can they keep a value on their outputs if the value of their inputs changes. Therefore,
the value 1 on C_3 in Figure 3.5 in the first clock cycle will not be stored and cannot
be reused in the second clock cycle.
The 3VL model as defined in Section 3.4 is thus a simplified model, and we shall
use it only for modeling a subclass of asynchronous systems that have a special
property called iteration-stall-freedom.

Intuitively, in an iteration-stall-free asynchronous system, if a module P completes
one iteration, any asynchronous gate A in the system either does not start
its iteration at all, or if it starts its iteration, it can complete the iteration without
being stalled until P's next iteration. In other words, A can complete the receive
of all of its input tokens within one iteration of P. Therefore, no token needs to be
kept on a channel from the first iteration of P to the second iteration of P. That is
to say, the completion of one iteration of a module does not require the completion
of more than one iteration of other modules, and hence we can divide the system's
behavior into iterations in which every module either does not start its iteration
or it can start and complete exactly one iteration.
In this work, we assume that Proteus is only applied on a single non-hierarchical
SVC description, i.e., an SVC description that only has an initial block followed by
a forever loop. Further, we assume that the environment is constrained, such that
the asynchronous system defined by the asynchronous netlist generated by Proteus
and the environment is iteration-stall-free. In our example, if S was constrained
such that it would send a token on both channels in each iteration, BUF and AND
would not have been required to stall until S's second iteration.
In practice, a large class of asynchronous circuits and their environments are
iteration-stall-free. It is very common to constrain the environment to provide all
of the inputs in only one iteration, and also receive all of the outputs in only one
iteration.
It is worth mentioning that under such a constraint, the communication actions
on C_1 and C_2 are called synchronized communication actions [Mar81], and
variables C_1 and C_2 in the 3VL model are synchronized variables (Definition 3.6).
Therefore, in an iteration-stall-free 3VL model, certain variables are constrained
to be synchronized. In the following, we formally define an iteration-stall-free 3V
network. An asynchronous netlist is iteration-stall-free if its corresponding 3VL
model is iteration-stall-free.
Definition 3.8 (Iteration-stall-free 3V network). A 3V network G = (V, E) in
which V is partitioned by P = {I, R, S, U, O} is called iteration-stall-free if:

  ∀u ∈ U : i_1^u ≍ i_2^u ≍ ... ≍ i_n^u   (Unconditional gates)

  ∀r ∈ R : l_r^{N} if e_r^{0,N}, and l_r^{0,1} if e_r^{1}   (RECEIVE gates)

  ∀s ∈ S : l_s ≍ e_s   (SEND gates)

where i_1^u, i_2^u, ..., i_n^u are the input variables of the unconditional gate u ∈ U; l_r and
e_r are the left and enable input variables of a RECEIVE gate r ∈ R; and l_s and e_s are
the left and enable variables of a SEND gate s ∈ S.

Using the 3VL model, we can now define a notion of equivalence for two asynchronous
netlists:

Definition 3.9 (Logical equivalence). Two acyclic iteration-stall-free asynchronous
netlists without TOKBUFs are logically equivalent iff their corresponding
3V networks are equivalent.

In the next sections, we show that our proposed power-optimization modification
methods, namely conditioning and reconditioning, preserve logical equivalence
and iteration-stall-freedom.
3.5 SEND Reconditioning
An important property of the s operator is its right distributivity over unconditional
functions. If we start with the network on the left-hand side of Figure 3.6 and move
the SEND node from the output of f to its inputs (Figure 3.6, right), then whenever
e^{0}, every input of f becomes N (x_1 s e = N, ..., x_n s e = N). Since N represents
no communication, depending on how often e^{0}, this may lead to less switching
activity in f in the corresponding asynchronous netlist. On the other hand, if we
start from the network on the right-hand side and e is often 1, it makes sense to
merge the multiple SEND nodes into a single SEND at the output of f and save
area and switching activity in the SEND nodes.

Notice that for this transformation to be correct, not only should node f have
a SEND on all of its inputs in the right-hand side network, but the enable
variables of all SEND nodes should also be equal. Also, in order to keep the network
acyclic, we should be careful that such a transformation does not create a cycle.
Therefore, we require the node with output e not to be in the transitive fanout
(TFO) of f, as defined in Section 3.3.
Figure 3.6: Right distributivity of SEND (SEND Reconditioning)
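Before the formal statement below, the identity illustrated in Figure 3.6 can be checked exhaustively for one small unconditional function; the following Python sketch (illustrative only, not the thesis tooling, with the 3V AND standing in for f) does so over all 3V inputs.

# Exhaustive check of the right distributivity of SEND (Figure 3.6, Lemma 3.3)
# for f = 3V AND, over every combination of 3V values of x1, x2, and e.
from itertools import product

N = "N"

def and3(a, b):                  # unconditional function f (Figure 3.3a)
    return N if N in (a, b) else a & b

def send3(l, e):                 # SEND operator (Figure 3.3f)
    return l if e == 1 else N

for x1, x2, e in product((0, 1, N), repeat=3):
    # f(x1, x2) s e  ==  f(x1 s e, x2 s e)
    assert send3(and3(x1, x2), e) == and3(send3(x1, e), send3(x2, e))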
Lemma 3.3 (SEND Reconditioning). In a 3V network, let node u represent an
unconditional function y = f(x) = f(x_1, x_2, ..., x_n) and let e be an output variable
of a node v such that v ∉ TFO(u). Then

  f(x_1, x_2, ..., x_n) s e = f(x_1 s e, x_2 s e, ..., x_n s e).

Proof (by case analysis on e):

  e = 0 : f(x) s 0 = N =(Lem. 3.1) f(N, ..., N) = f(x_1 s 0, x_2 s 0, ..., x_n s 0).
  e = 1 : f(x) s 1 = f(x) = f(x_1 s 1, x_2 s 1, ..., x_n s 1).
  e = N : f(x) s N = N =(Lem. 3.1) f(N, ..., N) = f(x_1 s N, x_2 s N, ..., x_n s N).

Notice that in order to avoid creating cycles, we required v ∉ TFO(u). The
result of Lemma 3.3 is that if a node u in a 3V network implements the function
y_1 = f(x) s e, replacing u with the node u_c implementing the function
y_2 = f(x_1 s e, x_2 s e, ..., x_n s e) preserves equivalence. On the other hand, the
reverse transformation states that if a node u_c in a 3V network implements the
function y_2 = f(x_1 s e, x_2 s e, ..., x_n s e), replacing u_c with the node u implementing
the function y_1 = f(x) s e preserves equivalence.
Next we prove that SEND reconditioning preserves iteration-stall-freedom.
Consider a 3V network G_1(V_1, E_1) which is transformed to G_2(V_2, E_2) after
moving a SEND cell from the output of a node u corresponding to an unconditional
function f to all of its incoming edges. The input variables of u are
x = (x_1, x_2, ..., x_n), and the output variable is o_f, as shown in Figure 3.6 (left).
This transformation creates a set of new variables s = (s_1, s_2, ..., s_n), as shown
in Figure 3.6 (right). In the following theorem, we prove that if G_1 is iteration-stall-free,
the transformed graph G_2 is also iteration-stall-free.

For the reverse transformation, assume that we start from a 3V iteration-stall-free
network G_2 and transform it to G_1. The node u in G_2 should have a SEND
node on all of its inputs, where the enables of all SEND nodes are equal. We prove
that if G_2 is iteration-stall-free, G_1 is also iteration-stall-free.
Theorem 3.1 (SEND reconditioning iteration-stall-freedom preservation). SEND
reconditioning preserves iteration-stall-freedom.
Proof: (G
1
) G
2
)
The nodes in the transitive fanin of u are not affected by this transformation.
The only possible nodes that might be affected are in the transitive fanout of u. If
we prove that a node v whose direct input is y
1
preserves the iteration-stall-freedom
condition, then since synchronization is an equivalence relation, we can state that
the iteration-stall-freedom condition for the nodes in the transitive fanout of v will
not be affected either.
Also, we add new SEND nodes at the input of u for which we should verify the
iteration-stall-freedom condition. Therefore, assuming G
1
is iteration-stall-free, we
first show:
s
1
c
s
2
c
::: c
s
n
(Unconditional node u)
8x
i
2fx
1
;:::;x
n
g :x
i
c
e (SEND nodes)
Since G
1
is iteration-stall-free, based on Denition 3.8: x
1
c
x
2
c
c
x
n
. There-
fore a case analysis on e results:
e
f0;Ng
:s
N
1
;s
N
2
;:::;s
N
n
e
f1g
: (s
1
=x
1
); (s
2
=x
2
);:::; (s
n
=x
n
)
9
>
>
=
>
>
;
x
1
c
x
2
c
c
xn
!s
1
c
s
2
c
c
s
n
For the unconditional node u we can write:

    y ~_c e  ⟹ (by Lemma 3.2)  ∀ x_i ∈ {x_1, ..., x_n} : x_i ~_c e.

Finally, based on Lemma 3.3, y_1 = y_2, which also implies y_1 ~_c y_2. Therefore, after
changing the input of a node v that is a successor of the SEND node in G_1 from
y_1 to y_2, the iteration-stall-freedom condition for v does not change.
Reverse transformation: (G_2 ⇒ G_1)

Assuming G_2 is iteration-stall-free, similar to the direct transformation, the only
nodes whose input variables change are the unconditional node u and any node v
whose direct input is y_2. Also, we add one new SEND node at the output of u, for
which we should verify the iteration-stall-freedom condition. Therefore, we first show:

    x_1 ~_c x_2 ~_c ... ~_c x_n    (logic node u)
    y ~_c e                        (the SEND node)

Since G_2 is iteration-stall-free, based on Definition 3.8, it is true that
∀ x_i ∈ {x_1, ..., x_n} : x_i ~_c e. Therefore:

    x_1 ~_c x_2 ~_c ... ~_c x_n.

Also, based on Lemma 3.2, ∀ x_i ∈ {x_1, ..., x_n} : x_i ~_c e implies that:

    y ~_c e.

Finally, similar to the direct transformation, we can write y_1 ~_c y_2. Therefore,
after changing the input of a node v that is a successor of the node u in G_2 from
y_2 to y_1, the iteration-stall-freedom condition for v does not change.
3.6 RECEIVE Reconditioning

In general, r is not right-distributive over unconditional functions. For example,
consider f(x_1, x_2) = ¬x_1 ∧ ¬x_2, as shown in Figure 3.7. It is easy to verify that
when e = 0:

    y_1 = f(x_1, x_2) r 0 = 0  ≠  y_2 = f(x_1 r 0, x_2 r 0) = 1.

Figure 3.7: When e = 0: f(x_1, x_2) r e = 0 ≠ f(x_1 r e, x_2 r e) = 1

The difference arises because in the table of Figure 3.3e we defined ∀x : x r 0 = 0.
We define a new function RECEIVE_1, denoted by r_1, such that ∀x : x r_1 0 = 1,
as shown in Figure 3.8. Now we can write:

    y_1 = f(x_1, x_2) r e = f(x_1 r_1 e, x_2 r_1 e) = y_2        (3.2)
    y_2 = f(x_1 r e, x_2 r e) = f(x_1, x_2) r_1 e = y_1          (3.3)
Figure 3.8: Function RECEIVE_1, denoted r_1, is similar to r except for when e = 0:

    l r_1 e     e = 0   e = 1   e = N
    l = 0         1       0       N
    l = 1         1       1       N
    l = N         1       N       N
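To make the three-valued operators concrete, the following is a minimal Python sketch of SEND, RECEIVE, and RECEIVE_1 as reconstructed from the tables of Figures 3.3e, 3.3f, and 3.8. The value N is represented by the string 'N', and the function names are illustrative only, not part of any tool flow.

    # Three-valued domain T = {0, 1, N}; 'N' marks "no communication".
    N = 'N'
    T = (0, 1, N)

    def send(x, e):
        """SEND (s): pass x when e = 1, otherwise produce N (no token)."""
        return x if e == 1 else N

    def receive(x, e):
        """RECEIVE (r): pass x when e = 1, emit the dummy value 0 when e = 0,
        and produce N when e = N."""
        if e == 1:
            return x
        return 0 if e == 0 else N

    def receive1(x, e):
        """RECEIVE_1 (r_1): identical to RECEIVE except that x r_1 0 = 1."""
        if e == 1:
            return x
        return 1 if e == 0 else N

    # Quick check of the non-distributivity example of Figure 3.7, with
    # f(x1, x2) = NOT x1 AND NOT x2 evaluated on Boolean inputs only.
    f = lambda x1, x2: int((not x1) and (not x2))
    assert receive(f(0, 0), 0) == 0                 # f(x) r 0 = 0
    assert f(receive(0, 0), receive(0, 0)) == 1     # f(x1 r 0, x2 r 0) = 1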
Figure 3.9: RECEIVE reconditioning
Equation 3.2 states that if we start from the left network in Figure 3.7, in order
to obtain an equivalent network, we need to replace the RECEIVE node with a
RECEIVE_1 node as it is moved from the output of f to its inputs. Similarly,
Equation 3.3 states that if we start from the right network in Figure 3.7, in order
to obtain an equivalent network, we need to replace the RECEIVE nodes with a
RECEIVE_1 node as they are moved from the inputs of f to its output.
In general, we can state the following:
Lemma 3.4 (RECEIVE reconditioning). In a 3V network, let node u represent an
unconditional function y = f(x) = f(x_1, x_2, ..., x_n) and let e be an output variable
of a node v such that v ∉ TFO(u). Then:

    f(x_1, x_2, ..., x_n) r_l e = f(x_1 r_1 e, x_2 r_2 e, ..., x_n r_n e),

where each r_i denotes either r or r_1 and is chosen such that:

    f(x) r_l 0 = f(x_1 r_1 0, x_2 r_2 0, ..., x_n r_n 0).

Proof (by case analysis on e):

e = 0: holds because we choose the r_i such that f(x) r_l 0 = f(x_1 r_1 0, x_2 r_2 0, ..., x_n r_n 0).

e = 1: f(x) r_l 1 = f(x) = f(x_1 r_1 1, x_2 r_2 1, ..., x_n r_n 1), by Figures 3.3e and 3.8.

e = N: f(x) r_l N = N = f(N, ..., N) = f(x_1 r_1 N, x_2 r_2 N, ..., x_n r_n N), by Figures 3.3e and 3.8.
The result of Lemma 3.4 is that if a node u in a 3V network implements the
function y_1 = f(x) r_l e, replacing u with the node u_c implementing the function
y_2 = f(x_1 r_1 e, x_2 r_2 e, ..., x_n r_n e), and vice versa, preserves equivalence
(assuming each r_i is chosen as r or r_1 such that
f(x_1 r_1 0, x_2 r_2 0, ..., x_n r_n 0) = f(x) r_l 0).
The application of Lemma 3.4 is shown in Figure 3.9. Notice that for the right-to-left
transformation, one only needs to find the value of f(0, ..., 0). If this value is 0, a
RECEIVE should be placed at the output of f; otherwise, a RECEIVE_1. For the
left-to-right transformation, however, one needs to find values for x_1, x_2, ..., x_n
such that f(x) = 0, which may not have a unique answer. The general case of this
problem requires a SAT solver and is known to be NP-hard. However, if the number
of node types in the 3V network is limited and each node has fewer than 20 inputs,
which is often the case, one can pre-compute these values in advance for all practical cases.
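As an illustration only, the following Python sketch brute-forces both directions of this choice for a small node: it evaluates f(0, ..., 0) to pick between RECEIVE and RECEIVE_1 for the right-to-left move, and searches for a Boolean input assignment matching the target value for the left-to-right move. The helper names are mine, not the tool's, and the exhaustive search is only practical for the small fan-ins mentioned above.

    from itertools import product

    def pick_output_receive(f, n):
        """Right-to-left move: the RECEIVE placed at the output of an n-input
        Boolean node f should be r if f(0,...,0) = 0, and r_1 otherwise."""
        return 'r' if f(*([0] * n)) == 0 else 'r_1'

    def pick_input_receives(f, n, target):
        """Left-to-right move: find one choice of r / r_1 per input such that
        f(x_1 r_1 0, ..., x_n r_n 0) equals `target` (= f(x) r_l 0).
        Since x r 0 = 0 and x r_1 0 = 1, this is a search over Boolean vectors."""
        for bits in product((0, 1), repeat=n):
            if f(*bits) == target:
                # bit 0 -> plain RECEIVE, bit 1 -> RECEIVE_1
                return ['r' if b == 0 else 'r_1' for b in bits]
        return None  # no assignment exists for this target value

    # Example: f(x1, x2) = NOT x1 AND NOT x2 from Figure 3.7.
    f = lambda x1, x2: int((not x1) and (not x2))
    print(pick_output_receive(f, 2))       # 'r_1', since f(0, 0) = 1
    print(pick_input_receives(f, 2, 0))    # ['r', 'r_1'], since f(0, 1) = 0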
Next we show that RECEIVE reconditioning preserves iteration-stall-freedom.
Without loss of generality, we assume that all the RECEIVE cells are of type r.
Consider a 3V network G_1(V_1, E_1) which is transformed into G_2(V_2, E_2) after
moving a RECEIVE cell from the output of a node u corresponding to an unconditional
function f to all of its incoming edges. The input variables of u are
x = (x_1, x_2, ..., x_n), and the output variable is o_f, as shown in Figure 3.9 (left).
This transformation creates a set of new variables r = (r_1, r_2, ..., r_n), as shown in
Figure 3.9 (right). In the following theorem, we prove that if G_1 is iteration-stall-free,
the transformed graph G_2 is also iteration-stall-free.

For the reverse transformation, assume that we start from a 3V iteration-stall-free
network G_2 and transform it into G_1. The node u in G_2 should have a RECEIVE
node on all of its inputs, where the enables of all RECEIVE nodes are equal. We
prove that if G_2 is iteration-stall-free, G_1 is also iteration-stall-free.
Theorem 3.2 (RECEIVE reconditioning iteration-stall-freedom preservation).
RECEIVE reconditioning preserves iteration-stall-freedom.

Proof: (G_1 ⇒ G_2)

The nodes in the transitive fanin of u are not affected by this transformation.
The only nodes that might be affected are in the transitive fanout of u. If we prove
that a node v whose direct input is y_1 preserves the iteration-stall-freedom condition,
then, due to the transitivity of the synchronization relationship, the
iteration-stall-freedom condition for the nodes in the transitive fanout of v is not
affected either.

Also, we add new RECEIVE nodes at the inputs of u, for which we should verify
the iteration-stall-freedom condition. Therefore, assuming G_1 is iteration-stall-free,
we first show:

    r_1 ~_c r_2 ~_c ... ~_c r_n                                  (unconditional node)
    ∀ x_i ∈ {x_1, ..., x_n} : x_i = N       (if e ∈ {0, N})
                              x_i ∈ {0, 1}  (if e = 1)            (RECEIVE nodes)

Since G_1 is iteration-stall-free, based on Definition 3.8: x_1 ~_c x_2 ~_c ... ~_c x_n.
Therefore, a case analysis on e yields:

    e = 0 implies: r_1 = r_2 = ... = r_n = 0
    e = N implies: r_1 = r_2 = ... = r_n = N
    e = 1 implies: (r_1 = x_1), (r_2 = x_2), ..., (r_n = x_n)

and, together with x_1 ~_c x_2 ~_c ... ~_c x_n, each case implies r_1 ~_c r_2 ~_c ... ~_c r_n.
For the unconditional node u, we can write:

    y = N (if e ∈ {0, N}),  y ∈ {0, 1} (if e = 1)
        ⟹ (by Lemma 3.2)
    ∀ x_i ∈ {x_1, ..., x_n} : x_i = N (if e ∈ {0, N}),  x_i ∈ {0, 1} (if e = 1).

Finally, based on Lemma 3.4, y_1 = y_2, which also implies y_1 ~_c y_2. Therefore,
after changing the input of a node v that is a successor of the RECEIVE node in
G_1 from y_1 to y_2, the iteration-stall-freedom condition for v does not change.
Reverse transformation: (G_2 ⇒ G_1)

Assuming G_2 is iteration-stall-free, similar to the direct transformation, the only
nodes whose input variables change are the unconditional node u and any node v
whose direct input is y_2. Also, we add one new RECEIVE node at the output of u,
for which we should verify the iteration-stall-freedom condition. Therefore, we show:

    x_1 ~_c x_2 ~_c ... ~_c x_n                            (unconditional node)
    y = N (if e ∈ {0, N}),  y ∈ {0, 1} (if e = 1)          (the RECEIVE node)

Since G_2 is iteration-stall-free, based on Definition 3.8:

    ∀ x_i ∈ {x_1, ..., x_n} : x_i = N (if e ∈ {0, N}),  x_i ∈ {0, 1} (if e = 1)
        ⟹  x_1 ~_c x_2 ~_c ... ~_c x_n.

Also:

    ∀ x_i ∈ {x_1, ..., x_n} : x_i = N (if e ∈ {0, N}),  x_i ∈ {0, 1} (if e = 1)
        ⟹ (by Lemma 3.2)  y = N (if e ∈ {0, N}),  y ∈ {0, 1} (if e = 1).

Finally, similar to the direct transformation, we can write y_1 ~_c y_2. Therefore, after
changing the input of a node v that is a successor of the node u in G_2 from y_2 to
y_1, the iteration-stall-freedom condition for v does not change.
3.7 Observability Condition
In Boolean logic, the observability don't care of a variable x is the condition under
which its value is not observed by the environment at any primary output [HS96].
We adopt the definition of the observability condition of 3V variables from [YB00]. We
first define the notion of local observability at each node, and subsequently define
the global observability of each variable.
3.7.1 Local Observability Partial Care
Let y be the output variable of a node u implementing a 3V function f, and let
(x_1, x_2, ..., x_n) be the input variables of u, where y = f(x_1, x_2, ..., x_n). The local
observability partial care of the input x_j is the set of minterms in the local space
of y such that a subset of the values of x_j are indistinguishable at y.
Definition 3.10 (Local observability partial care). In a 3V network, the local
observability partial care of a variable x_j on the set C = {c_1, ..., c_t} ⊆ T at a
node implementing the function y = f(x_1, x_2, ..., x_n) is:

    OPC(f, C, x_j) = { m ∈ T^n | f_{m[x_j = c_1]} = ... = f_{m[x_j = c_t]} },

where m[x_j = k], k ∈ C, denotes setting the value of x_j in minterm m to k.

Definition 3.10 defines the set of minterms in the local space of f for which a
subset C of the values of x_j are indistinguishable at y, i.e., the value of y does not
change if we arbitrarily flip the value of x_j within the values in C and keep the other
parts of the minterm fixed.
It can be shown that OPC(f, C, x_j) is the set of minterms for which the following
expression evaluates to 1 [YB00]:

    f^0_{x_j=c_1} f^0_{x_j=c_2} ⋯ f^0_{x_j=c_t}
  + f^1_{x_j=c_1} f^1_{x_j=c_2} ⋯ f^1_{x_j=c_t}
  + f^N_{x_j=c_1} f^N_{x_j=c_2} ⋯ f^N_{x_j=c_t},        (3.4)
where f^{l}, l ∈ T, is a two-valued function of n three-valued variables. The value
of f^{l} is 1 for all the minterms in T^n for which the value of f is l, as exemplified in
Examples 3.1 and 3.2. The function f^{l}_{x_j=k} is the cofactor [HS96] of f^{l} with respect
to the literal x_j^{k} and is independent of x_j. The value of f^{l}_{x_j=k} is 1 for all the
minterms m in T^n for which f^{l} = 1 once the value of x_j in m is replaced by k.

The value of f^{l}_{x_j=c_1} f^{l}_{x_j=c_2} ⋯ f^{l}_{x_j=c_t} is 1 for all the minterms m in T^n
such that for all the values of x_j in the range C, we have f(x_1, x_2, ..., x_n) = l.
In the rest of this thesis, in order to specify OPC(f, C, x_j), we abuse notation
as follows [YB00]:

    OPC(f, C, x_j) = f^0_{x_j=c_1} ⋯ f^0_{x_j=c_t} + f^1_{x_j=c_1} ⋯ f^1_{x_j=c_t}
                   + f^N_{x_j=c_1} ⋯ f^N_{x_j=c_t},        (3.5)

by which we mean the set of minterms for which the right-hand side of the equation
evaluates to 1.
Example 3.3

For a RECEIVE node with inputs l and e and output r = l r e, using Example
3.1, OPC(r, T, l) can be calculated as follows:

    OPC(r, T, l) = r^0_{l=0} r^0_{l=1} r^0_{l=N}      [= e^{0}]
                 + r^1_{l=0} r^1_{l=1} r^1_{l=N}      [= 0]
                 + r^N_{l=0} r^N_{l=1} r^N_{l=N}      [= e^{N}]
                 = e^{0,N}.                            (3.6)

This means that when e is 0 or N, the output r of the RECEIVE is not affected
as l's value changes and takes values in T.
Example 3.4

For a multiplexer node with data inputs i_0 and i_1, select input s, and output o,
described in Example 3.2, we can calculate OPC(o, T, i_0) as follows:

    OPC(o, T, i_0) = o^0_{i_0=0} o^0_{i_0=1} o^0_{i_0=N}      [= 0]
                   + o^1_{i_0=0} o^1_{i_0=1} o^1_{i_0=N}      [= 0]
                   + o^N_{i_0=0} o^N_{i_0=1} o^N_{i_0=N}      [= s^{N} + i_1^{N}]
                   = s^{N} + i_1^{N}.                          (3.7)

This means that when s is N or i_1 is N, the output o of the multiplexer is not
affected as i_0's value changes and takes values in T.
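A small brute-force check of these OPC computations can be done by enumerating all minterms of T^n directly from Definition 3.10. The sketch below is illustrative only (the function and variable names are mine, not the tool's); it treats a node as a Python function over T = {0, 1, 'N'} and returns the set of minterms in which the listed values of x_j are indistinguishable at the output.

    from itertools import product

    N = 'N'
    T = (0, 1, N)

    def opc(f, n, j, C):
        """Local observability partial care OPC(f, C, x_j) per Definition 3.10:
        the minterms m in T^n for which f is unchanged as x_j ranges over C."""
        care = set()
        for m in product(T, repeat=n):
            outs = {f(*[c if i == j else v for i, v in enumerate(m)]) for c in C}
            if len(outs) == 1:      # all values in C indistinguishable at y
                care.add(m)
        return care

    # RECEIVE node of Example 3.3: r = l r e, inputs ordered as (l, e).
    def receive(l, e):
        if e == 1:
            return l
        return 0 if e == 0 else N

    # OPC(r, T, l) should be exactly the minterms with e in {0, N} (Eq. 3.6).
    assert opc(receive, 2, 0, T) == {(l, e) for l in T for e in (0, N)}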
Notice that the first two terms in OPC(o, T, i_0) in Example 3.4 are zero. In
fact, this is the case for any unconditional node:
Lemma 3.5. The local observability partial care of the variable x_j at an unconditional
node implementing an unconditional function y = f(x_1, x_2, ..., x_r) is:

    OPC(f, T, x_j) = f^N_{x_j=0} f^N_{x_j=1} f^N_{x_j=N}.        (3.8)

Moreover, if x_1 ~_c x_2 ~_c ... ~_c x_r, then OPC(f, T, x_j) = 0.

Proof: Since f is unconditional, based on Lemma 3.1, the output of f cannot be
0 or 1 while x_j is N. As a result:

    f^0_{x_j=N} = f^1_{x_j=N} = 0.

Therefore:

    OPC(f, T, x_j) = 0 + 0 + f^N_{x_j=0} f^N_{x_j=1} f^N_{x_j=N}.

Moreover, since x_1 ~_c x_2 ~_c ... ~_c x_r:

    if x_1 ∈ {0, 1} then ∀ x_k ∈ {x_1, x_2, ..., x_r} : x_k ∈ {0, 1}.

Therefore:

    f^{N} = 0.
Thus, in Equation 3.8:

    f^N_{x_j=0} = f^N_{x_j=1} = 0, which implies that OPC(f, T, x_j) = 0.
Therefore, although local observability partial care on T is suitable for specifying
the conditions under which changes on the input of a RECEIVE node are not
observable, it does not seem to be useful for unconditional nodes. In particular, as
will be discussed in more detail in Section 3.4.1, we are only interested in analyzing
and optimizing 3V networks in which all input variables x_j of any unconditional
node f are synchronized. In that case, based on Lemma 3.5, OPC(f, T, x_j) = 0.
However, in such circuits we are still interested in the conditions under which the
output of an unconditional node is never N and is insensitive to only a subset
of T, such as B. In that case, OPC(f, B, x_j) can be computed using the following
equation [YB00]:

    OPC(f, B, x_j) = f^0_{x_j=0} f^0_{x_j=1} + f^1_{x_j=0} f^1_{x_j=1} + f^N_{x_j=0} f^N_{x_j=1}        (3.9)
Example 3.5

For a multiplexer node with data inputs i_0 and i_1, select input s, and output o,
described in Example 3.2, we can calculate OPC(o, B, i_0) as follows:

    OPC(o, B, i_0) = o^0_{i_0=0} o^0_{i_0=1}      [= s^{1} i_1^{0}]
                   + o^1_{i_0=0} o^1_{i_0=1}      [= s^{1} i_1^{1}]
                   + o^N_{i_0=0} o^N_{i_0=1}      [= s^{N} + i_1^{N}]
                   = s^{1} i_1^{0,1} + s^{N} + i_1^{N}.        (3.10)
If we know that the output value of a node is in B, we can simplify Equations
3.5 and 3.9. We state this as the following lemma:

Lemma 3.6. In a 3V network, for a node implementing the function
y = f(x_1, x_2, ..., x_r), if for all values of the vector x = (x_1, x_2, ..., x_r) we have y ≠ N,
then:

    OPC(f, T, x_j) = f^0_{x_j=0} f^0_{x_j=1} f^0_{x_j=N} + f^1_{x_j=0} f^1_{x_j=1} f^1_{x_j=N}        (3.11)
    OPC(f, B, x_j) = f^0_{x_j=0} f^0_{x_j=1} + f^1_{x_j=0} f^1_{x_j=1}                                (3.12)

Proof: y ≠ N implies f^N = f^N_{x_j=0} = f^N_{x_j=1} = f^N_{x_j=N} = 0. Therefore, the third term
in Equations 3.5 and 3.9 is 0.
Thus, when the output of a node is not N, as illustrated in Figure 3.10a, we can
rewrite Equations 3.6 and 3.10 from Example 3.3 (RECEIVE) and Example 3.5
(multiplexer) as follows:

    OPC(r, T, l)   = r^0_{l=0} r^0_{l=1} r^0_{l=N} + r^1_{l=0} r^1_{l=1} r^1_{l=N} = e^{0},     (r ≠ N)   (3.13)
    OPC(o, B, i_0) = o^0_{i_0=0} o^0_{i_0=1} + o^1_{i_0=0} o^1_{i_0=1} = s^{1} i_1^{0,1},       (o ≠ N)   (3.14)

(a) OPC(r, T, l) = e^{0}    (b) OPC(o, B, x_1) = s^{1} x_2^{0,1}
Figure 3.10: Local OPC when the output is never N
3.7.2 Global Observability Partial Care
In a 3V network, the global observability partial care of a variable x is the condition
under which x is not observable at any primary output. We formally define it as follows:

Definition 3.11 (Global observability partial care). In a 3V network, the global
observability partial care of a variable x on the set C = {c_1, ..., c_t} ⊆ T is:

    GOPC(C, x) = { m ∈ T^n | ∀ o_i ∈ O : (o_i)_{m[x = c_1]} = ... = (o_i)_{m[x = c_t]} },

where m[x = k], k ∈ C, denotes setting the value of x in minterm m ∈ T^n to k.

The obvious consequence of Definition 3.11 for primary outputs is:

    ∀ o_i ∈ O : GOPC(C, o_i) = 0.        (3.15)

Notice that, similar to Equation 3.5, we abuse the notation GOPC(C, o_i) in
Equation 3.15: by it we mean the set of minterms for which the right-hand side of
Equation 3.15 evaluates to 1, which in this case is empty.
3.8 GOPC-Based Conditioning
In this section, we describe how we can add conditional nodes (RECEIVE and
SEND) to a given 3V network such that the resulting asynchronous circuit has
less switching activity, while preserving equivalence.

Intuitively, since N represents no communication, we can change the value of
non-observable variables from {0, 1} to N as long as we preserve equivalence and
stall-freedom. Also, our transformation should not create a cycle in the network.
We present this in the following lemma and theorems.
Lemma 3.7 (SEND-RECEIVE peephole optimization). Let u be an unconditional
node with inputs (x_1, x_2, ..., x_n) and output y_1, implementing the function
y_1 = f(x). For a variable e ∉ TFO(y_1), if e = 0 implies GOPC(B, y_1) = 1, and if
∀ x_i ∈ {x_1, ..., x_n} : e ~_c x_i, then replacing u with u_c implementing the function
y_2 = (f(x) s e) r e preserves equivalence.

Figure 3.11: Peephole optimization (e = 0 ⟹ GOPC(B, y_1) = 1)
Proof (by case analysis on e): In the case e = 0, it is given that GOPC(B, y_1) = 1.
Therefore, replacing y_1 = f(x) with y_2 = (f(x) s e) r e does not change the value
of any primary output variable, provided we show that when y_1 takes values in B,
y_2 also takes values only in B.

Since ∀ x_i ∈ {x_1, ..., x_n} : e ~_c x_i, based on Lemma 3.2:

    y_1 ~_c e.

Also, y_1 ∈ B and y_1 ~_c e imply that e ∈ B. Therefore, based on the tables of
Figures 3.3e and 3.3f:

    y_2 ∈ B.

For the other two cases, we prove y_2 = (f(x) s e) r e = f(x) = y_1:

e = 1: y_2 = (f(x) s 1) r 1 = f(x) r 1 = f(x) = y_1.

e = N: y_2 = (f(x) s N) r N = N r N = N, and by Lemma 3.1, N = f(N, ..., N) = y_1.
Next we prove that the transformation of Lemma 3.7 also preserves iteration-stall-freedom.
Consider a 3V network G_1(V_1, E_1) which is transformed into G_2(V_2, E_2)
after placing a SEND followed by a RECEIVE cell at the output of a node u
corresponding to an unconditional function f, as shown in Figure 3.11. The input
variables of u are x = (x_1, x_2, ..., x_n), and the output variable is y_1. This
transformation creates two new variables, y_c and y_2, as shown in Figure 3.11 (right). In
the following theorem, we prove that if G_1 is iteration-stall-free, the transformed
graph G_2 is also iteration-stall-free.

For the reverse transformation, assume that we start from a 3V iteration-stall-free
network G_2 and transform it into G_1. We prove that if G_2 is iteration-stall-free,
G_1 is also iteration-stall-free.
Theorem 3.3 (SEND-RECEIVE peephole optimization iteration-stall-freedom
preservation). SEND-RECEIVE peephole optimization preserves iteration-stall-freedom.
Proof: (G_1 ⇒ G_2)

The nodes in the transitive fanin of u are not affected by this transformation.
The only nodes that might be affected are in the transitive fanout of u. If we prove
that a node v whose direct input is y_1 preserves the iteration-stall-freedom condition,
then, since the synchronization relationship is an equivalence relation, the
iteration-stall-freedom condition for the nodes in the transitive fanout of v is not
affected either.

Also, we add a new SEND and a new RECEIVE node at the output of the node u,
for which we should verify the iteration-stall-freedom condition. Therefore, assuming
G_1 is iteration-stall-free, we first show:

    y_1 ~_c e                                           (the SEND node)
    y_c = N (if e ∈ {0, N}),  y_c ∈ B (if e = 1)        (the RECEIVE node)

For the SEND node, since ∀ x_i ∈ {x_1, ..., x_n} : x_i ~_c e, based on Lemma 3.2:

    y_1 ~_c e.

For the RECEIVE node, if e ∈ {0, N}, based on the table of Figure 3.3f:

    y_c = N.

Also, e = 1 implies that y_c = y_1. Therefore, y_1 ~_c e implies that

    y_c ∈ B.
Finally, based on Lemma 3.7, y_1 = y_2, which also implies y_1 ~_c y_2. Therefore,
after changing the input of a node v that is a successor of the node u in G_1 from
y_1 to y_2, the iteration-stall-freedom condition for v does not change.
Reverse transformation: (G_2 ⇒ G_1)

Assuming G_2 is iteration-stall-free, similar to the direct transformation, the
only nodes whose input variables change are the nodes v whose direct input is y_2.

Similar to the direct transformation, we can write y_1 ~_c y_2. Therefore, after
changing the input of a node v that is a successor of the node u in G_2 from y_2 to
y_1, the iteration-stall-freedom condition for v does not change.
Based on Lemma 3.7, when e = 0, the value of y_c in Figure 3.11 is N. The more
often e = 0, the more often the value of y_c is N, and hence the fewer communication
actions occur on the channel y_c in the equivalent asynchronous netlist, which can
translate to less switching activity.
In order to increase the number of variables with value N, using Lemma 3.7,
one can move the SEND cell in Figure 3.11 through f and place it at the inputs
of f, as shown in Figure 3.12. Then, when e = 0, all signals between the SEND
and RECEIVE cells have the value N, and hence no switching activity occurs in f. We
formally prove this transformation as the following theorem:
Figure 3.12: GOPC conditioning
Theorem 3.4 (GOPC Conditioning). Let u be an unconditional node with inputs
(x_1, x_2, ..., x_n) and output y_1, implementing the function y_1 = f(x). For a variable
e ∉ TFO(y_1), if e = 0 implies GOPC(B, y_1) = 1, and ∀ x_i ∈ {x_1, ..., x_n} :
e ~_c x_i, then replacing u with u_c implementing the function y_2 = f(x_1 s e, ..., x_n s e) r e
preserves equivalence.

Proof: Based on Lemma 3.7, since e = 0 implies GOPC(B, y_1) = 1, and
∀ x_i ∈ {x_1, ..., x_n} : e ~_c x_i, transforming f(x) into (f(x) s e) r e preserves equivalence.
Using SEND reconditioning (Lemma 3.3):

    (f(x) s e) r e = f(x_1 s e, ..., x_n s e) r e = y_2.
The significance of Theorem 3.4 is that, knowing the observability of a function
y = f(x_1, ..., x_n), one can isolate f by placing a SEND before each of its inputs
and a RECEIVE after its output, as shown in Figure 3.12, without affecting the
values of the primary outputs. When e = 1, the SEND and RECEIVE cells act like
buffers. When e = 0, the SEND cells generate an N value at the inputs of f, and
the RECEIVE cell generates a 0 at the output; but based on Theorem 3.4, since
the output of f is not observable, the primary output values remain unaffected.
In particular, note that the RECEIVE node is necessary to preserve stall-freedom.
If it were absent, in Figure 3.12 (right) we would have y_2 = y_c, but y_c may not necessarily
be synchronized with any variable that is currently synchronized with y_1 in Figure
3.12 (left).
Next, we state that GOPC conditioning preserves iteration-stall-freedom in the
following theorem:

Theorem 3.5 (GOPC conditioning stall-freedom preservation). GOPC conditioning
preserves iteration-stall-freedom.

Proof: GOPC conditioning is the successive application of the SEND-RECEIVE
peephole optimization (Lemma 3.7) and SEND reconditioning (Lemma 3.3)
transformations. Since both transformations preserve iteration-stall-freedom, GOPC
conditioning preserves iteration-stall-freedom.
3.9 GOPC-Based RECEIVE Reconditioning

We presented RECEIVE reconditioning in Section 3.6. In this section, we show
that if a variable is not globally observable, we can perform RECEIVE reconditioning
without worrying about the decision between using RECEIVE and RECEIVE_1 nodes.
Figure 3.13: RECEIVE reconditioning
Theorem 3.6 (GOPC-based RECEIVE reconditioning). Let u be an unconditional
node with inputs x = (x_1, x_2, ..., x_n) and output y_1, implementing the
function y_1 = f(x) r_l e. If e = 0 implies GOPC(B, y_1) = 1, and if
∀ x_i ∈ {x_1, ..., x_n} : e ~_c x_i, then replacing u with u_c implementing the function
y_2 = f(x_1 r_1 e, x_2 r_2 e, ..., x_n r_n e) preserves equivalence, where each r_i can
arbitrarily be a RECEIVE or a RECEIVE_1 function, and the choice between
RECEIVE and RECEIVE_1 for each r_i does not affect the equivalence.
Proof (by case analysis on e):

In the case e = 0, it is given that GOPC(B, y_1) = 1. Therefore, replacing y_1
with f(x_1 r_1 0, x_2 r_2 0, ..., x_n r_n 0), where each r_i is arbitrarily chosen to be a
RECEIVE or a RECEIVE_1, does not change the value of any primary output
variable. Notice that in this case the value of both y_1 and y_2 is either 0 or
1. More precisely, when e = 0, based on the table of Figure 3.3e, y_1 ∈ B, and since
each x_i r_i 0 is in B, it is true that y_2 ∈ B.

For the other two cases, the outputs of RECEIVE and RECEIVE_1 are the same.
Therefore, the proof is the same as the proof of Lemma 3.4:

e = 1: f(x) r_l 1 = f(x) = f(x_1 r_1 1, x_2 r_2 1, ..., x_n r_n 1), by Figures 3.3e and 3.8.

e = N: f(x) r_l N = N = f(N, ..., N) = f(x_1 r_1 N, x_2 r_2 N, ..., x_n r_n N), by Figures 3.3e and 3.8.
Chapter 4
Operand Isolation Conditioning
4.1 Introduction
Operand isolation is a technique for minimizing the dynamic energy overhead associated
with redundant operations by selectively blocking the propagation of switching
activity through the part of the circuit that is about to perform a redundant operation
[CJ95]. This technique prevents data from propagating through sections of the
circuit whose outputs are not observable. In synchronous circuits, latches or,
more commonly, AND gates are used as isolation cells.

Operand isolation has been incorporated at the gate level [TMA98], at the RTL level
[MWM+00], and also during high-level synthesis [CK97, CGK+06]. Most commercial
synchronous synthesis tools are able to perform operand isolation while
synthesizing an RTL description. Our proposed method in this section is applied
to the output netlist of the synchronous synthesis tool, which both performs the
synthesis and adds AND-based isolation gates. As explained in Chapter 2, in the Proteus
flow this netlist is called the image netlist. The commercial synthesis tool
performs operand isolation on the combinational section of a circuit. Therefore,
our proposed method is also applied to the same combinational section of the circuit.
In this section, we assume that we are given a combinational image netlist
in which each gate is either a logic gate or an isolation gate (i.e., an AND gate).

As will be explained shortly, the image netlist does not save any power when
the nodes are replaced by asynchronous gates in the Proteus flow. In response, we
propose a method that takes the synthesized image netlist, which has isolation gates,
and modifies it so that the resulting asynchronous netlist consumes less power.
4.1.1 A Motivating Example

Operand isolation is the direct application of observability don't cares, where the
propagation of switching activity through mathematical operators is blocked by placing
isolation cells (such as AND gates) at the inputs of operators whose outputs are often
not observable [CJ95]. Consider the 32-bit ALU described in SystemVerilogCSP
(SVC) [SB11] in Figure 4.1. (In SVC, M 1-of-N channels are modeled by an interface
[IEE09] called e1ofN_M.) The synthesized WRAPPER, in which AND-based isolation
cells are automatically inserted by the RTL synthesis tool, is shown in Figure
4.2a. Notice that since in this example all inputs and outputs are unconditional,
the enable inputs of the RECEIVE/SEND cells are connected to 1.

Although AND-based isolation cells are effective in synchronous design, they
will not block the propagation of switching activity in the resulting asynchronous
circuit.
module ALU( e1of2_32.In  I1,
            e1of2_32.In  I2,
            e1of2_4.In   OP,
            e1of2_32.Out O);
  logic [4-1:0]  op;
  logic [32-1:0] i1, i2, o;
  always begin
    forever begin
      I1.Receive(i1);
      I2.Receive(i2);
      OP.Receive(op);
      unique case (op)
        4'b0001: o = i1 & i2 ;
        4'b0010: o = i1 + i2 ;
        4'b0100: o = i1 - i2 ;
        4'b1000: o = i1 * i2 ;
      endcase
      O.Send(o);
    end //forever
  end //always
endmodule
Figure 4.1: A 32-bit ALU in SVC
In particular, after replacing the image nodes with their asynchronous
counterparts, the AND gates associated with an isolated operator still unconditionally
handshake, sending the value 0 to downstream stages. Moreover, unlike in
synchronous netlists, using dual-rail/1-of-N handshaking forces switching activity
on data wires even if the same value is sent on a channel over and over.

Our proposed solution, illustrated in Figure 4.2b, is to use Theorem 3.4 and
automatically translate these isolation gates into additional SEND/RECEIVE gates
that truly isolate the operands by introducing conditional communication and thus
reduce switching activity.
(a) WRAPPER with AND-based isolation cells
(b) Replacing AND gates with conditional communication
Figure 4.2: A 32-bit ALU before (left) and after (right) proposed optimization
4.2 Problem Statement and Algorithm

We assume that we are given an iteration-stall-free unconditional combinational
3V network G(V, E), which is partitioned by P = {I, U, A, O}, where the sets I, U,
A, and O represent primary inputs, combinational nodes, isolating nodes, and
primary outputs, respectively. Further, we assume that Ω ⊆ V is a set of enable
sources, i.e., the output of each ω_i ∈ Ω is the enable input of an isolating node.
The set A is partitioned into sets A_{ω_i} ⊆ A, where ω_i ∈ Ω.

We would like to replace the isolating nodes with SEND nodes and find the
appropriate places to insert RECEIVE nodes, while preserving equivalence and
iteration-stall-freedom.
Intuitively, for a set of isolating nodes A_{ω_i}, we need to find the largest set of
nodes in the transitive fanout of the nodes a_j ∈ A_{ω_i}: the nodes whose outputs are not
observable when the value of the enable toggles between 0 and 1. As shown in Figure
4.2, we need to find the nodes between the isolating nodes and the set of nodes
implementing the multiplexing functionality. Formally, we define the isolated nodes
to be in the enable domain of an enable source ω ∈ Ω as follows.

Definition 4.1 (Operand isolation enable domain: OIED). For an enable source
ω ∈ Ω with isolating nodes A_ω ⊆ A, OIED(ω) is defined as:

    OIED(ω) = { u ∈ U | ∀ p_{iu}, i ∈ I : ∃ k, p_{iu}[k] ∈ A_ω },        (4.1)
where p_{iu} is a path from a primary input i ∈ I to u, and p_{iu}[k] is the k-th node
along this path.

Definition 4.1 says that if a node u is in OIED(ω), then every path starting from a
primary input and ending at u must pass through an isolating node with enable
source ω. As an example, Figure 4.3 shows a network in which the OIED of an
enable source e is highlighted.

Based on Theorem 3.4, we replace the isolating nodes with SENDs and place a
RECEIVE node between each v and u on boundary edges (v, u) where v ∈ OIED(ω_i)
but u ∉ OIED(ω_i).
Figure 4.3: An example of OIED
4.2.1 An Algorithm for Finding OIEDs

In order to detect all OIEDs, in Algorithm 4.1 we visit nodes starting from the
immediate fanouts of the isolating nodes a_j ∈ A_{ω_i} with enable source ω_i. For each node u
that we visit, if we have previously explored all of u's incoming edges, we add u to
OIED(ω_i) and recursively visit u's fanouts. This ensures that all paths starting
from a primary input and ending at u pass through an isolating node in A_{ω_i}.
Algorithm 4.1 Finding OIED
 1: procedure Explore(u, R)
 2:   if (∀ e = (v, u) ∈ E: e is "Explored") then    ▷ (all incoming edges of u)
 3:     insert u into R
 4:     for all (u, t) ∈ E do                        ▷ (all outgoing edges of u)
 5:       mark (u, t) as "Explored"
 6:       recursively call Explore(t, R)
 7:     end for
 8:   end if
 9: end procedure
10:
11: procedure Main
12:   for all A_{ω_i} do
13:     O_i ← ∪_{a_j ∈ A_{ω_i}} { o_k | (a_j, o_k) ∈ E }    ▷ (fanouts of A_{ω_i} members)
14:
15:     ∀ e = (a_j ∈ A_{ω_i}, o_k ∈ O_i) ∈ E, mark e as "Explored"
16:     for all o_k ∈ O_i do
17:       R ← {}
18:       Explore(o_k, R)
19:       OIED(ω_i) ← OIED(ω_i) ∪ R
20:     end for
21:   end for
22: end procedure
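For readers who prefer executable pseudocode, the following Python sketch mirrors Algorithm 4.1 on a small adjacency-list representation. The data-structure layout (dictionaries keyed by node names, an explored set of edges) is an illustrative choice of mine, not the representation used in the actual tool flow.

    def find_oieds(fanout, isolating):
        """fanout: dict node -> list of successor nodes (the 3V network edges).
        isolating: dict enable source w -> list of isolating nodes A_w.
        Returns dict w -> OIED(w), following Algorithm 4.1."""
        # Pre-compute incoming edges so Explore can test "all incoming explored".
        fanin = {}
        for u, succs in fanout.items():
            for t in succs:
                fanin.setdefault(t, []).append(u)

        def explore(u, R, explored):
            if all((v, u) in explored for v in fanin.get(u, [])):
                R.add(u)
                for t in fanout.get(u, []):
                    explored.add((u, t))
                    explore(t, R, explored)

        oied = {}
        for w, A_w in isolating.items():
            explored, frontier = set(), set()
            for a in A_w:                       # fanouts of the isolating nodes
                for o in fanout.get(a, []):
                    explored.add((a, o))        # edges out of A_w start "Explored"
                    frontier.add(o)
            oied[w] = set()
            for o in frontier:
                R = set()
                explore(o, R, explored)
                oied[w] |= R                    # combinational nodes only, in practice
        return oied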
Complexity of the Algorithm

Let D_max = max_{v ∈ V} InDegree(v), n = |V|, and m = |E|. In Algorithm 4.1, we may
visit each node v ∈ V as many times as it has incoming edges: once when we
examine the fanin of a node in line 2, and once when we explore the fanout of a node in line 4.
We visit each edge, on the other hand, at most twice; therefore, the running time is:

    O(2m + 2 D_max n) = O(m + n).
4.3 Pre-Layout Power and Cost Evaluation

Adding extra RECEIVE and SEND nodes to a circuit has performance, area, and
power overhead. In order to satisfy performance constraints, we assume that
operand isolation is performed only on non-critical paths, and in this section
we focus on the area and power overheads.

In particular, it is often desirable to decide, before physical design, whether the extra
switching power and area associated with these nodes is justified by the amount of saved
power. Here we present a pre-layout cost and benefit estimation
function to evaluate the cost and benefit of isolating each operator and commit
only the ones whose costs are justified by their benefits.

Let V be the set of all nodes in the given network and V_f ⊆ V be the set of
nodes implementing a function f(x_1, ..., x_n). Let S_u be the switching power of each
node u ∈ V when it is active, i.e., the amount of switching power necessary for
communicating on all input channels, calculating the output, and communicating
on the output channel. Let the total switching power of the circuit be P_total, and
let the switching power consumption in f without operand isolation be P^o_f:
    P_total = Σ_{u ∈ V} S_u ,        P^o_f = Σ_{u ∈ V_f} S_u        (4.2)

Assuming that we know the operational factor O_f of each operator f (the
fraction of iterations in which f is executing useful operations), the switching power of
f after isolation is:

    P^i_f = O_f Σ_{u ∈ V_f} S_u + S_c (n + 1),    0 ≤ O_f ≤ 1        (4.3)

where the second term accounts for the switching activity of the n isolating SEND nodes
at the inputs and one RECEIVE node at the output. For simplicity, we assume that
the switching power of every RECEIVE/SEND node is S_c. We define the relative
benefit B_f of isolating operator f to be:

    B_f = ΔP_f / P_total = (P^o_f − P^i_f) / P_total        (4.4)

To estimate the area cost, let L_c be the area cost of a SEND or a RECEIVE cell.
The area cost of isolating f with n inputs and one output is Cost_f = L_c (n + 1).
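The estimate in Equations 4.2-4.4 is straightforward to compute from per-node switching-power numbers; the sketch below is a minimal illustration of that arithmetic (the function and argument names are mine, and the uniform S_u = S_c = L_c = 1 model matches the simplification used in the experiments of Section 4.4). The example network sizes are made up for illustration.

    def isolation_benefit(S, nodes_f, O_f, n_inputs, S_c=1.0, L_c=1.0):
        """Relative benefit B_f (Eq. 4.4) and area cost of isolating operator f.
        S: dict node -> switching power S_u for every node in the network.
        nodes_f: the nodes V_f implementing f; O_f: its operational factor."""
        P_total = sum(S.values())                          # Eq. 4.2
        P_o_f = sum(S[u] for u in nodes_f)                 # Eq. 4.2
        P_i_f = O_f * P_o_f + S_c * (n_inputs + 1)         # Eq. 4.3
        B_f = (P_o_f - P_i_f) / P_total                    # Eq. 4.4
        cost_f = L_c * (n_inputs + 1)
        return B_f, cost_f

    # Toy example: a hypothetical operator occupying 60 of 110 unit-power nodes.
    S = {f"n{i}": 1.0 for i in range(110)}
    f_nodes = [f"n{i}" for i in range(60)]
    print(isolation_benefit(S, f_nodes, O_f=0.25, n_inputs=2))   # B_f ~ 0.38, cost 3.0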
4.4 Experimental Results
In this section we focus on a case study of an ALU, a classic example of where
operand isolation can be beneficial.

Assuming L_c = 1, the area cost of isolating an operator is 96 units: 64 SEND
and 32 RECEIVE cells. To estimate the switching power reduction, we used
Equation 4.4 and a simple model in which the switching power of each cell is
normalized and equal to 1 unit, i.e., ∀ u ∈ V : S_u = 1, and S_c = 1. Figure 4.4
shows that the relative benefit of isolating ADD and SUB is either negligible or
negative. However, the benefit of isolating MUL is significant and remains positive until
the fraction of MUL operations exceeds about 0.9.
Figure 4.4: Pre-layout estimation of the relative benefit (ΔP_f / P_total) vs. activity ratio (r_f), for ADD/SUB and MUL
Next, we implemented the proposed operand-isolation flow within Proteus and
used a commercial power analysis tool to estimate the post-layout power values of
the final asynchronous netlist. We used a proprietary TSMC 65nm PCHB-based
cell library for which, unfortunately, internal power was not available and was thus
ignored. To measure switching power, we used a Value Change Dump [IEE09]
file generated from simulating the post-layout asynchronous netlist at maximum
throughput (1.1 GHz) with different mixes of op-codes.
Table 4.1: Post-layout total switching power measurements (mW).
P^o: no operand isolation. P^i_M: only MUL is isolated.
P^i_ASM: ADD, SUB, and MUL are isolated. P^m: manual decomposition.

    Activity         P^o   P^i_M  P^i_ASM  P^m   (P^o-P^i_M)/P^o  (P^o-P^i_ASM)/P^o  (P^m-P^i_ASM)/P^m
    O_AND = 1        110    34      34      39        69%              69%                13%
    O_ADD = 1        110    34      38      41        69%              66%                 8%
    O_SUB = 1        110    34      38      42        69%              65%                 9%
    O_MUL = 1        110   106     106     108         4%               3%                 2%
    O_{f_i} = 0.25   110    52      54      56        53%              51%                 3%
The results are shown in Table 4.1. In rows 2 to 5, the ALU executed only one
type of operation. The last row shows the results when the ALU performed
uniformly distributed random operations, i.e., for each operator f, O_f = 0.25. In
this setting, if we isolate only MUL (P^i_M), the total switching power is reduced by
53%. The fourth column (P^i_ASM) shows that if we simultaneously isolate ADD,
SUB, and MUL, the power savings drops to 51%. Note that this small drop in
power savings is consistent with the pre-layout estimations of Figure 4.4. Lastly,
the last column shows that the power savings of our proposed method are better
than those of a design in which the ADD, SUB, and MUL are manually isolated
via CSP decomposition. This suggests not only that the automatic approach is
efficient, but also that manual decomposition suffers from the fact that the synthesis
tool cannot optimize across CSP boundaries.
Table 4.2 shows the area of the final layouts. The cost of isolating only MUL is
4%, whereas if we simultaneously isolate ADD, SUB, and MUL, the cost increases
to 13%. Thus, from both an area and a power perspective, if the four operations are
equally likely, it makes sense to isolate only the multiplier. In addition, the last
column shows that the area cost of our proposed method is lower than that of
manual decomposition by 8%, a further indication of the efficiency of our approach.
Table 4.2: Area cost measurements (μm²).
A^o: no operand isolation. A^i_M: only MUL isolated.
A^i_ASM: ADD, SUB, and MUL isolated. A^m: manual decomposition.

          A^o     A^i_M    A^i_ASM   A^m      (A^o-A^i_M)/A^o  (A^o-A^i_ASM)/A^o  (A^m-A^i_ASM)/A^m
    Area  98564   102701   111513    121541        -4%              -13%                 8%
Finally, it is important to note that the throughput of the circuits was not
impacted by the introduction of the SEND/RECEIVE, as the pipeline optimization
steps within Proteus automatically compensated for their existence.
4.5 Summary and Conclusions

In this section we presented an automatic method to reduce the power consumption
of high-performance asynchronous ASICs using operand isolation. We showed that
for a classic ALU executing uniformly distributed operations, a 53% power saving
can be achieved for only a 4% area cost, with no impact on performance. Our results
are better than manual decomposition in both power and area.
Chapter 5
Reconditioning
5.1 Introduction
While explicitly surrounding unconditional gates with conditional gates is an effective
method to marry unconditional logic with conditional communication, the
power consumption of the resulting netlist is often far from optimal.

Consider the example in Figure 5.1a, where the value of the enable input e_1 is
often 0. As a result, the RECEIVE gates often generate a dummy token with value
0. As explained in Chapter 2, in circuits synthesized by Proteus, these dummy
tokens might pass through several unconditional gates before being ignored (for
example, by a multiplexer). Assume that F_1 is a gate that calculates the value
of its output using the dummy input values, only for that value to be ignored by the next stage.
The energy used for calculating the output of F_1 when e_1 is 0 is therefore wasted.
An improvement, illustrated in Figure 5.1b, is to move the RECEIVE gates from the
inputs of F_1 to its output. In this case, when e_1 is 0, the output of F_1 stays at the RECEIVE
gate's input and passes only when it is actually used, i.e., when the enable value
of the RECEIVE gate becomes 1. During this time, there is no activity in F_1,
since F_1 is blocked until its output can be received by the RECEIVE gate.
(a) Original netlist
(b) Improved netlist
Figure 5.1: Reconditioning can reduce switching activity
Similarly, consider the logic block F_3 illustrated in Figure 5.1a, driving a SEND
gate whose enable is often 0. In this case, F_3 wastes power calculating a value that
is often blocked by the SEND gate. An improvement, illustrated in Figure
5.1b, is to move the SEND cell to the inputs of F_3.

In Sections 3.6, 3.9, and 3.5, we showed that such transformations convert an
asynchronous netlist into a new netlist that is logically equivalent to the original
netlist, i.e., their 3VL models are equivalent. Moreover, we showed that such
transformations preserve iteration-stall-freedom.
Moving logic gates through SEND and RECEIVE gates is reminiscent of retiming
[LS91] in synchronous design, in which logic is passed through state-holding
elements. Therefore, we use the term reconditioning to denote moving conditional
gates through logic gates.

Notice that the number of conditional gates before and after reconditioning
is different. Therefore, there is a trade-off between the amount of power saved
and the cost of the extra RECEIVE/SEND gates. In the next sections, we present an
objective function to capture this trade-off and perform reconditioning only if the
extra cost is justified by the power saved and the net saving is positive.
5.2 Reconditioning Problem
We perform reconditioning using the 3VL model of an asynchronous netlist, and
we focus only on iteration-stall-free 3V networks, which, as discussed in Chapter
2, include a large class of practical asynchronous circuits. It should be noted,
however, that reconditioning can be applied to sequential circuits by reconditioning
their combinational sub-networks, i.e., the next-state and output-function logic.

The given 3VL model is a graph G = (V, E), in which V is partitioned into
a set P = {I, R, S, U, O}, where I is the set of primary inputs, R is the set of
RECEIVE nodes, S is the set of SEND nodes, U is the set of unconditional logic
nodes, and O is the set of primary outputs.
Let P_v be the switching power of the node v ∈ V when it is active. This number
represents the amount of switching power necessary for communicating on all input
channels, calculating the output value, and communicating on the output channel
of the corresponding asynchronous gate in the asynchronous netlist modeled by the
3V network. For primary inputs and primary outputs, we assume that P_v = 0.

Also, in order to capture how often an unconditional node is active, we assign a
value called the operational factor, 0 ≤ O_v ≤ 1, to each node v ∈ V. The circuit is
first simulated with an environment that generates practical input values. Then the
simulation trace is divided into system iterations, as described in Chapter 3. The
operational factor of the node v is the fraction of the system iterations in which v
is active, i.e., in which it receives a value on all of its inputs and sends a value on its output.
Depending on the arrangement of nodes with respect to the RECEIVE/SEND
nodes, the operational factor of nodes changes.
nodes, the operational factor of nodes changes.
Alternatively, one can calculate the operational factors if only the probability
of the enable values of the conditional nodes are given. Let the probability a
fg
j
of
the enable variable e
j
of a conditional node be the fraction of system iterations in
which e
j
=.
We calculate the operational factors for all nodes as follows:
For conditional nodes v2fR; Sg:
O
v
= 1: (5.1)
79
For a node v2 U , such that the rst conditional node in the paths from v
to primary outputs is a RECEIVE node with enable probability a
f1g
i
:
O
v
=a
f1g
i
: (5.2)
For a node v2 U , such that the last conditional node in the paths from
primary inputs to v is a RECEIVE node with enable probability a
fNg
i
:
O
v
= 1a
fNg
i
: (5.3)
For a node v2 U , such that the last conditional node in the paths from
primary inputs to v is a SEND node with enable probability a
f1g
j
:
O
v
=a
f1g
j
: (5.4)
For a node v2 U , such that the rst conditional node in the paths from v
to primary outputs is a SEND node with enable probability a
fNg
j
:
O
v
= 1a
fNg
j
: (5.5)
80
Notice that iteration-stall-freedom requires all the operational-factor values
calculated for an unconditional node v using Equations 5.2 to 5.5, over any i ⇝ v
path (i ∈ I) or any v ⇝ o path (o ∈ O), to be equal.

The total power consumption of the network, P_Total, is:

    P_Total = Σ_{u ∈ U} O_u P_u + Σ_{c ∈ R ∪ S} P_c        (5.6)
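As a small illustration of Equation 5.6, the sketch below evaluates the network power from per-node switching powers and operational factors. The dictionaries and their keys are hypothetical; how the operational factors are obtained (simulation trace or Equations 5.1-5.5) is left to the caller.

    def total_power(P, O, unconditional, conditional):
        """Eq. 5.6: P_Total = sum over unconditional nodes of O_u * P_u
                              + sum over RECEIVE/SEND nodes of P_c.
        P: node -> switching power when active; O: node -> operational factor."""
        return (sum(O[u] * P[u] for u in unconditional) +
                sum(P[c] for c in conditional))

    # Toy network: two logic nodes feeding a RECEIVE whose enable is 1 in 30%
    # of the iterations (so O = 0.3 per Eq. 5.2), plus the RECEIVE itself.
    P = {'F1': 4.0, 'F2': 6.0, 'rcv': 1.0}
    O = {'F1': 0.3, 'F2': 0.3}
    print(total_power(P, O, unconditional=['F1', 'F2'], conditional=['rcv']))  # 4.0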
Definition 5.1 (Reconditioning problem). In an iteration-stall-free 3V network
G = (V, E), using Lemma 3.3 and Lemma 3.4, rearrange the RECEIVE and SEND
nodes such that P_Total, as defined in Equation 5.6, is minimized, while equivalence
and iteration-stall-freedom are preserved.

In the following sections, we first describe the reconditioning network model,
and then we present algorithms to solve this problem.
Figure 5.2: A sample asynchronous sub-netlist
5.3.1 An Intuitive Example
Example 5.1
Consider the asynchronous netlist shown in Figure 5.2, whose reconditioning model
is illustrated in Figure 5.3a. This model was created by removing the conditional
nodes from the 3VL model and instead creating an edge between the pairs (u_1, u_3),
(u_2, u_3), and (u_3, u_4). We assign the sequence of removed conditional nodes to each
newly created edge, as shown in Figure 5.3a. This modified network is called a
reconditioning network.

We consider all possible reconditioning moves involving node u_3 (without loss of
generality, in this example and the following sections, we assume that using
RECEIVE_1 is not necessary):

1. Figure 5.3b: Move both RECEIVEs with enable e_1 to the outgoing edge of u_3.

2. Figure 5.3c: Move the RECEIVE with enable e_2 to the incoming edges of u_3.

3. Figure 5.3d: Subsequent to Move 2, move the SEND with enable e_3 to the
   incoming edges of u_3.
We assign a distance value D(u) ∈ N_0 to each unconditional or primary I/O node
u ∈ U ∪ I ∪ O, which specifies the maximum number of conditional nodes among
all the i ⇝ u paths, where i ∈ I. More precisely, for a node u ∈ U and
primary input i ∈ I, let d_{iu} be the maximum total number of conditional nodes
among all the p_{iu} paths. The distance D(u) is:

    D(u) = max_{i ∈ I} (d_{iu}).

The immediate conclusion of analyzing all the moves for node u_3 is that the
value of D(u_3) is bounded, i.e., 0 ≤ D(u_3) ≤ 3. We call this the range-of-motion
of the node u_3.
Also, we make the following important observations:

1. The sequence of all possible conditional nodes on the path from node u_1 to u_3
   is a subsequence of (RCV_{e_1}, RCV_{e_2}, SND_{e_3}), where RCV_{e_i} is a RECEIVE
   node with enable e_i and SND_{e_j} is a SEND node with enable e_j.

   We call this longest sequence the longest possible conditional vector (PCV)
   of an edge, and its subsequences are called possible conditional sub-vectors
   (PCSVs) of the edge. For example, (RCV_{e_1}, RCV_{e_2}) is a PCSV for the edge
   (u_1, u_3). We will show how to calculate the PCV of each edge.

(a) Reconditioning network model of Figure 5.2  (b) Move 1  (c) Move 2  (d) Move 3
Figure 5.3: Reconditioning network and possible moves involving node u_3

2. We do not move the conditional nodes through primary input and primary
   output nodes. Therefore, the distance of primary input and primary output
   nodes does not change as we rearrange the conditional nodes.
3. Using the distance of a node u and the PCVs of its incoming and outgoing
   edges, we can decide which conditional nodes' transitive fanout u is in, and
   hence its operational factor O_u can be uniquely identified.

Example 5.1 shows that as we rearrange the conditional nodes in a 3V network,
the distance values and the number of conditional nodes on the edges are modified.
Using the range-of-motion, we can set constraints on rearrangements in order to
preserve equivalence and iteration-stall-freedom.

If the distances change, then using the PCVs of incoming and outgoing edges, one can
calculate the updated operational factor of each node and hence calculate the new
power consumption of the network using Equation 5.6.

In the next sections, we formalize this model and present algorithms to solve
the reconditioning problem.
5.3.2 The Model
In a 3V network, a path p_{uv} is purely conditional if for every node c (c ≠ u, c ≠ v)
in p_{uv}, we have c ∈ R ∪ S.

In order to solve the reconditioning problem, we remove the RECEIVE and SEND
nodes from a given 3V network to obtain a new, modified network called the
reconditioning network G = (V, E), by contracting each purely conditional path p_{uv}
between unconditional nodes u and v into an edge (u, v). We call these new edges,
created by contraction, purely conditional edges. Notice that iteration-stall-freedom
requires that between any two unconditional nodes u and v either there
is only a single purely conditional path or, if there are multiple purely conditional
paths, the conditional nodes on those paths are exactly the same, with the same
enable values.
Further, for the reconditioning network we define the following:

  - Each edge e = (u, v) ∈ E is weighted with a conditional-node counter
    W(e) ∈ N_0, which specifies how many conditional nodes originally existed
    on the purely conditional path between u and v.

  - For each path p^j_{uv} = u →^{e_0} n_0 →^{e_1} n_1 ⋯ →^{e_{k−1}} v between u and v in the
    reconditioning network, where j denotes the j-th path, we define the path weight to
    be the number of conditional nodes that originally existed along that path:

        W(p^j_{uv}) = Σ_{l=0}^{k−1} W(e_l).

  - Each node v ∈ V is weighted with its distance D(v) ∈ N_0, specifying the
    maximum path weight among all the paths starting from a primary
    input and ending at v:

        D(v) = max_{i ∈ I, j} W(p^j_{iv}).
    It should be noted that, although finding the longest path in a graph is in
    general NP-hard, for directed acyclic graphs (which is the case for a reconditioning
    network) it can be solved in polynomial time by negating the edge weights
    and solving the shortest-path problem [CLRS09, Sch03]. A small sketch of this
    distance computation is given after this list.

  - To each edge e = (u, v) ∈ E, we assign an initial possible conditional sub-vector
    PCSV(e), where each item in PCSV(e) is a triple specifying the
    type (RECEIVE or SEND), the enable variable, and the probability of the
    enable being 1. This initial PCSV represents the purely conditional path that
    initially existed between the nodes u and v.

    As we rearrange the network, the size of the PCSVs changes. The longest
    PCSV for each edge is called the possible conditional vector (PCV) and specifies
    the longest purely conditional path that can be constructed by moving
    all the conditional nodes that could possibly be placed between the nodes u
    and v while equivalence and iteration-stall-freedom are preserved.

    Notice that we use a vector (rather than a set) to represent all the possible
    conditional nodes that can be moved and placed between the two unconditional
    nodes. We need to preserve the sequence of conditional nodes so
    that, given an optimized reconditioning network and the PCSVs of the edges,
    we can regenerate the complete network by de-contracting the
    PCSVs and regenerating the purely conditional paths.
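A minimal sketch of the distance computation mentioned above: since the reconditioning network is a DAG, D(v) can be computed in one topological pass instead of negating weights and running a shortest-path algorithm; the two approaches are equivalent here. The names and the small edge list below are illustrative assumptions (loosely following the counts of Example 5.1), not taken from the figures.

    from collections import deque

    def distances(nodes, edges, primary_inputs):
        """D(v): maximum number of conditional nodes (edge weights W) over all
        paths from a primary input to v, computed as a DAG longest path."""
        succ = {u: [] for u in nodes}
        indeg = {u: 0 for u in nodes}
        for (u, v, w) in edges:                  # edges given as (u, v, W(u, v))
            succ[u].append((v, w))
            indeg[v] += 1

        D = {u: 0 for u in primary_inputs}
        queue = deque(u for u in nodes if indeg[u] == 0)   # Kahn topological order
        while queue:
            u = queue.popleft()
            for v, w in succ[u]:
                D[v] = max(D.get(v, 0), D.get(u, 0) + w)
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
        return D

    # A toy reconditioning network with the edge weights of Example 5.1:
    nodes = ['i1', 'i2', 'u1', 'u2', 'u3', 'u4', 'o']
    edges = [('i1', 'u1', 0), ('i2', 'u2', 0), ('u1', 'u3', 1),
             ('u2', 'u3', 1), ('u3', 'u4', 2), ('u4', 'o', 0)]
    print(distances(nodes, edges, ['i1', 'i2']))   # D(u3) = 1, D(u4) = 3 initially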
Example 5.2

Figure 5.4a shows an example asynchronous netlist; the sources of the enables are
not shown. Figure 5.4b shows the corresponding reconditioning network, where
the initial distances of nodes are annotated with D and the initial weights of edges
are annotated with W, whereas the distances and edge weights after a rearrangement
of the network are annotated with d and w, respectively. Notice that only the enable
edges of the PCV of each edge are annotated.

(a) A sample asynchronous netlist (sources of enables are not shown)
(b) The corresponding reconditioning network
Figure 5.4: A sample asynchronous netlist and its reconditioning network
In the next section, we present an algorithm to find PCVs.
5.4 Creating Possible Conditional Vectors
In this section, we describe how to create the possible conditional vectors (PCVs) for
the edges of a reconditioning network. The goal is to find out how far each RECEIVE
and SEND node can be moved using the SEND and RECEIVE reconditioning
techniques presented in Chapter 3 (Lemmas 3.3 and 3.4), while preserving equivalence
and iteration-stall-freedom.

Initially, the PCSVs assigned to non-purely-conditional edges are empty,
whereas the PCSV of a purely conditional edge e = (u, v) is initialized to the
sequence of the RECEIVE and SEND nodes on the p_{uv} paths of the original
network from which we generated the edge e. We use the PCSVs of purely conditional
edges to grow the PCSVs of the other edges.
Denition 5.2 (Vector operations). For n vectors K
1
,, K
n
:
HEAD(K
1
;:::; K
n
): The longest sub-vector which all K
1
to K
n
start with.
TAIL(K
1
;:::; K
n
): The longest sub-vector which all K
1
to K
n
end with.
89
CAT(K
1
;:::; K
n
): The vector that has all members of K
1
in sequence, fol-
lowed by all members of K
2
in sequence,, followed by all members of K
n
in sequence.
Example 5.3

If K_1 = (a, b, c, e, f) and K_2 = (a, b, d, e, f), then:

    HEAD(K_1, K_2) = (a, b)
    TAIL(K_1, K_2) = (e, f)
    CAT(K_1, K_2)  = (a, b, c, e, f, a, b, d, e, f)
In order to grow the PCSVs of the outgoing (incoming) edges of a node v, using Lemma
3.3 and Theorem 3.6 and the PCSVs of the incoming (outgoing) edges of v, we define the
following two operations:

Definition 5.3. For a node u ∈ U with n incoming edges and m outgoing
edges, let the PCSVs of the incoming edges be I_1, I_2, ..., I_n, and the PCSVs of the
outgoing edges be O_1, O_2, ..., O_m. Also, let T = TAIL(I_1, I_2, ..., I_n) and
H = HEAD(O_1, O_2, ..., O_m). The PUSH and PULL operations are defined as:

    PUSH(u) = ( CAT(T, O_1), CAT(T, O_2), ..., CAT(T, O_m) )
    PULL(u) = ( CAT(I_1, H), CAT(I_2, H), ..., CAT(I_n, H) )
Given the PCSVs of the incoming edges of a node u, the PUSH operation creates
a longer PCSV for each outgoing edge. Similarly, given the PCSVs of the outgoing
edges of a node u, the PULL operation creates a longer PCSV for each incoming
edge.

Figure 5.5 shows an example of the PUSH and PULL operations on a node u_3.
(a) PUSH operation
(b) PULL operation
Figure 5.5: PUSH and PULL operations using HEAD and TAIL
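The vector operations of Definition 5.2 and the PUSH/PULL operations of Definition 5.3 are small enough to state directly in code. The sketch below is illustrative only (tuples stand in for vectors, and the triples describing conditional nodes are abbreviated to short strings); it reproduces Example 5.3 and a node like u_3 of Figure 5.5.

    def head(*vectors):
        """HEAD: longest common prefix of the given vectors (Definition 5.2)."""
        prefix = []
        for items in zip(*vectors):
            if len(set(items)) != 1:
                break
            prefix.append(items[0])
        return tuple(prefix)

    def tail(*vectors):
        """TAIL: longest common suffix, computed via HEAD on reversed vectors."""
        return tuple(reversed(head(*[tuple(reversed(v)) for v in vectors])))

    def cat(*vectors):
        """CAT: concatenation of the vectors in order."""
        return tuple(x for v in vectors for x in v)

    def push(incoming, outgoing):
        """PUSH(u): prepend TAIL of the incoming PCSVs to every outgoing PCSV."""
        T = tail(*incoming)
        return [cat(T, O) for O in outgoing]

    def pull(incoming, outgoing):
        """PULL(u): append HEAD of the outgoing PCSVs to every incoming PCSV."""
        H = head(*outgoing)
        return [cat(I, H) for I in incoming]

    # Example 5.3:
    K1, K2 = ('a', 'b', 'c', 'e', 'f'), ('a', 'b', 'd', 'e', 'f')
    assert head(K1, K2) == ('a', 'b')
    assert tail(K1, K2) == ('e', 'f')
    assert cat(K1, K2) == ('a', 'b', 'c', 'e', 'f', 'a', 'b', 'd', 'e', 'f')

    # A node like u3: two incoming PCSVs and one outgoing PCSV.
    I1 = I2 = ('RCV_e1',)
    O1 = ('RCV_e2', 'SND_e3')
    assert push([I1, I2], [O1]) == [('RCV_e1', 'RCV_e2', 'SND_e3')]
    assert pull([I1, I2], [O1]) == [('RCV_e1', 'RCV_e2', 'SND_e3')] * 2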
5.4.1 An Algorithm for Finding the Longest PCSV
We assume that a reconditioning network G = (V, E), as defined in Section 5.3.2,
is given. We assign a vector PCV_e to each edge e ∈ E. If the edge e is purely
conditional (i.e., contracted), we initialize PCV_e to represent the purely conditional
path that was contracted to create e; otherwise, we initialize it to an empty vector.
After this initialization, we use successive PUSH/PULL operations to grow the
Algorithm 5.1 The Longest PCSV Algorithm
 1: procedure SuccessivePush
 2:   F ← list of nodes v ∈ V in ascending topological order
 3:   S ← {}
 4:   while F is not empty do
 5:     remove node u from F's front and add it to S
 6:     update PCV_e of u's outgoing edges e using PUSH(u)
 7:   end while
 8: end procedure
 9:
10: procedure SuccessivePull
11:   B ← list of nodes v ∈ V in descending topological order
12:   S ← {}
13:   while B is not empty do
14:     remove node u from B's front and add it to S
15:     update PCV_e of u's incoming edges e using PULL(u)
16:   end while
17: end procedure
18:
19: procedure Main
20:   for all e ∈ E do
21:     if e is a purely conditional edge then
22:       initialize PCV_e to the PCSV representing the contracted path
23:     else
24:       initialize PCV_e to an empty vector
25:     end if
26:   end for
27:   SuccessivePush
28:   SuccessivePull
29: end procedure
PCV_e vectors for all the edges in the reconditioning network, as presented in
Algorithm 5.1.

We first sort the nodes in ascending topological order, since we would like to
perform PUSH on the outgoing edges of a node u only after PUSH has been performed
on all nodes in its fanin. Similarly, we sort the nodes in descending topological
order, since we would like to perform PULL on the incoming edges of a node u
only after PULL has been performed on all nodes in its fanout.

In SuccessivePush (SuccessivePull), we maintain a set S of the vertices
u for which the PUSH (PULL) operation has been performed. The set S represents the
"explored" nodes of the graph. Initially, S = {}.
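The following Python sketch mirrors Algorithm 5.1 end to end: it topologically sorts the reconditioning network, runs the push sweep and then the pull sweep, and returns the grown PCV of every edge. The edge and PCSV representation is an illustrative choice of mine, not the tool's; the small helper functions restate HEAD and TAIL so the sketch is self-contained.

    from graphlib import TopologicalSorter

    def head(*vs):
        out = []
        for items in zip(*vs):
            if len(set(items)) != 1:
                break
            out.append(items[0])
        return tuple(out)

    def tail(*vs):
        return tuple(reversed(head(*[tuple(reversed(v)) for v in vs])))

    def longest_pcsvs(nodes, edges, initial_pcsv):
        """Algorithm 5.1: grow PCV_e for every edge of the reconditioning network.
        edges: list of (u, v); initial_pcsv: dict (u, v) -> tuple of conditionals
        (non-empty only for purely conditional, i.e. contracted, edges)."""
        pcv = {e: tuple(initial_pcsv.get(e, ())) for e in edges}
        succ = {u: [] for u in nodes}
        pred = {u: [] for u in nodes}
        for (u, v) in edges:
            succ[u].append(v)
            pred[v].append(u)

        order = list(TopologicalSorter({v: pred[v] for v in nodes}).static_order())

        # SuccessivePush: prepend TAIL of the incoming PCVs to each outgoing PCV.
        for u in order:
            if pred[u] and succ[u]:
                T = tail(*[pcv[(p, u)] for p in pred[u]])
                for v in succ[u]:
                    pcv[(u, v)] = T + pcv[(u, v)]

        # SuccessivePull: append HEAD of the outgoing PCVs to each incoming PCV.
        for u in reversed(order):
            if pred[u] and succ[u]:
                H = head(*[pcv[(u, v)] for v in succ[u]])
                for p in pred[u]:
                    pcv[(p, u)] = pcv[(p, u)] + H
        return pcv

    # The network of Example 5.1 / Figure 5.3a, with conditionals abbreviated:
    nodes = ['u1', 'u2', 'u3', 'u4']
    edges = [('u1', 'u3'), ('u2', 'u3'), ('u3', 'u4')]
    init = {('u1', 'u3'): ('RCV_e1',), ('u2', 'u3'): ('RCV_e1',),
            ('u3', 'u4'): ('RCV_e2', 'SND_e3')}
    pcvs = longest_pcsvs(nodes, edges, init)
    print(pcvs[('u3', 'u4')])   # ('RCV_e1', 'RCV_e2', 'SND_e3') after the push sweep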
5.4.2 Correctness of the Algorithm

In order to prove the correctness of the algorithm, we first prove the following
lemma:

Lemma 5.1. SuccessivePush finds, for each edge, the longest possible PCSV that
can be generated by applying a sequence of PUSH operations on the nodes v ∈ V.

Proof by induction: We prove that at each iteration the PUSH operation on a
node u creates, on u's outgoing edges, the longest PCSV that can possibly
be created by applying only PUSH operations. Consider the set S at any point
in the algorithm's execution. We prove that for each u ∈ S, the outgoing edges of
u have the longest PCSV. We prove our claim by induction on the size of S:
Suppose that the claim holds whenjSj = k 1 . We grow S by adding
the next node u in the list F to size k + 1 . Let I
1
; I
2
;:::; I
n
be
the PCSVs of u's incoming edges and O
1
; O
2
;:::; O
m
be the PCSVs
of u's outgoing edges. Based on Denition 5.3, the PUSH operation
grows the PCSVs of O
1
; O
2
;:::; O
m
from the left-hand side by T, where
T = TAIL(I
1
; I
2
;:::; I
n
).
By the induction hypothesis, I
1
; I
2
;:::; I
n
are at the maximum possible
length, i.e., they initially represented the purely conditional paths between
the predecessors of u and u in the original network and grew from the left-
hand side by SuccessivePush. Consider any other possible sequence of
PUSH operations. Any PUSH operation in that sequence on u will be on
vectors I
0
1
; I
0
2
;:::; I
0
n
, where for 1 j n, I
j
is a vector that ends with I
0
j
.
As a result, the size of T will be greater than or equal to any other vector
that can be created by the TAIL operation on the PCSVs of the incoming
edges of the node u.
It follows that the resulting PCSVs on the outgoing edges, i.e.,
CAT(T; O
1
); CAT(T; O
2
);:::; CAT(T; O
m
), have the maximum possible
length, and hence the PUSH operation on u creates the longest PCSV on
each outgoing edge.
94
A similar claim can be made for SuccessivePull, with a similar proof, which
we present for the sake of completeness.

Lemma 5.2. SuccessivePull finds, for each edge, the longest possible PCSV that
can be generated by applying a sequence of PULL operations on the nodes v ∈ V.

Proof by induction: We prove that at each iteration the PULL operation on a
node u creates, on u's incoming edges, the longest PCSV that can possibly
be created by applying only PULL operations. Consider the set S at any point
in the algorithm's execution. We prove that for each u ∈ S, the incoming edges of
u have the longest PCSV. We prove our claim by induction on the size of S:

  - In the case when |S| = 1, we have only one node, with no outgoing edges.
    Therefore, the PULL on this node does not increase the length of the PCSVs
    on its incoming edges, and no future PULL operation changes this
    condition, since any such operation is on a node that is topologically before u.

  - Suppose that the claim holds when |S| = k ≥ 1. We grow S to size k + 1 by adding
    the next node u in the list B. Let I_1, I_2, ..., I_n be
    the PCSVs of u's incoming edges and O_1, O_2, ..., O_m be the PCSVs
    of u's outgoing edges. Based on Definition 5.3, the PULL operation
    grows the PCSVs I_1, I_2, ..., I_n from the right-hand side by H, where
    H = HEAD(O_1, O_2, ..., O_m).
By the induction hypothesis, O_1, O_2, ..., O_m are at the maximum possible length, i.e., they initially represented the purely conditional paths between u and the successors of u in the original network and grew from the right-hand side by SuccessivePull. Consider any other possible sequence of PULL operations. Any PULL operation in that sequence on u will be on vectors O'_1, O'_2, ..., O'_m, where for 1 ≤ j ≤ m, O_j is a vector that starts with O'_j. As a result, the size of H will be greater than or equal to that of any other vector that can be created by the HEAD operation on the PCSVs of the outgoing edges of the node u.
It follows that the resulting PCSVs on the incoming edges, i.e., CAT(I_1, H), CAT(I_2, H), ..., CAT(I_n, H), have the maximum possible length, and hence the PULL operation on u creates the longest PCSV on each incoming edge.
Theorem 5.1. Algorithm 5.1 finds the longest possible PCSV for each edge.
Proof: The PUSH and PULL operations grow each PCSV from different ends, i.e., PUSH grows a PCSV from the left-hand side and PULL grows a PCSV from the right-hand side. Therefore, any given sequence of PUSH and PULL operations can be projected onto an equivalent sequence in which we first perform all PUSH operations and then perform all PULL operations.
But this is similar to what Algorithm 5.1 does, except that SuccessivePush and SuccessivePull each maximize the length of a PCSV from the left and right, respectively. Therefore, the size of the resulting PCSV of each edge at the end of these procedures is greater than or equal to that produced by any other sequence of PUSH/PULL operations.
5.4.3 Complexity of the Algorithm
Let n = |V| and m = |E|. Topological sort can be done in time O(m + n) [CLRS09]. Removing a node from the list can be done in O(1). In the while loop of each of SuccessivePush and SuccessivePull, we visit each node exactly once and each edge at most twice. Therefore, the run-time of the while loop is O(m + n). Thus, the total running time is:

O(4(m + n)) = O(m + n)
5.4.4 Range of Motion
Using the longest PCSV of the incoming and outgoing edges of each node u, we can find two constants D_min(u) and D_max(u) which are the lower and upper bounds for the distance of node u:

D_min(u) ≤ D(u) ≤ D_max(u)

We call [D_min(u), D_max(u)] the range of motion of the node u.
5.5 Finding Optimal Distances of Nodes
In this section, we present an algorithm for solving the reconditioning problem as defined in Definition 5.1. We will use the reconditioning model described in Section 5.3 and find the optimal distance for each node. We assume that the PUSH and PULL operations and Algorithm 5.1 are applied to constrain the lower and upper bounds of the distance of each node. We first provide a solution based on integer linear programming (ILP). Next, we present a heuristic approach for the cases when the ILP is not solvable in reasonable time.
5.6 Integer Linear Program
Inspired by the ILP solution for the retiming problem [LS91], we can formulate the reconditioning problem as an ILP. First, we define the following parameters (which are known constant values, denoted by upper-case letters) and variables (which are unknown values, denoted by lower-case letters):

Parameters:

D_v ∈ {0, 1, ..., L}, v ∈ V: the initial distance of each node. This value is calculated as explained in Section 5.3.2.

L = max_{v ∈ V}(D_v): the maximum initial distance among all nodes v ∈ V.

L_v, U_v, v ∈ V: the lower and upper bounds on the distance of node v. Since we do not perform reconditioning through primary inputs and primary outputs, we set their ranges of motion to be only their initial distances: ∀v ∈ I ∪ O: L_v = U_v = D_v.

O_{vl} ∈ [0, 1], v ∈ V, l ∈ {0, 1, ..., L}: the operational factor of node v if its distance is l.

W_e, e ∈ E: the initial number of conditional cells on edge e.

S_v ∈ [0, ∞), v ∈ V: the switching power of node v.

S_c ∈ [0, ∞), c ∈ R ∪ S: the dynamic power of conditional cell (RECEIVE or SEND) c.

Variables:

d_{vl} ∈ {0, 1}, v ∈ V, l ∈ {0, 1, ..., L}: a binary variable indicating whether the final distance of node v is l.

r_v ∈ Z, v ∈ V: the difference between the final distance and the initial distance of node v (displacement).

Notice that the distance of each node v ∈ V can be stated in terms of d_{vl} as d_v = \sum_{l=1}^{L} l \, d_{vl}. Also, the number of conditional cells on an edge e = (u, v) after reconditioning can be calculated as w_e = r_v - r_u + W_{uv}. The total number of conditional nodes after reconditioning is given by:

R = \sum_{e \in E} w_e = \sum_{e \in E} W_e + \sum_{v \in V} r_v (|FI(v)| - |FO(v)|),    (5.7)

where FI(v) and FO(v) are the sets of incoming and outgoing edges of node v, respectively [LS91]. Let C_v = |FI(v)| - |FO(v)|. We get:

R = \sum_{e \in E} W_e + \sum_{v \in V} r_v C_v    (5.8)
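As a small illustration of Equations 5.7 and 5.8, the following C++ sketch recomputes the per-edge conditional-cell counts and their total R from a candidate displacement vector. It is only a bookkeeping aid under a hypothetical graph representation, not part of the optimization itself.

#include <cassert>
#include <vector>

// Hypothetical edge record: edge (u, v) carrying W initial conditional cells.
struct Edge { int u, v, W; };

// Equations 5.7/5.8: total conditional cells after reconditioning, given the
// displacement r[v] of every node; each edge contributes w_e = r_v - r_u + W_uv.
int totalConditionalCells(const std::vector<Edge>& edges, const std::vector<int>& r) {
    int R = 0;
    for (const Edge& e : edges) {
        int w = r[e.v] - r[e.u] + e.W;  // conditional cells left on this edge
        assert(w >= 0);                  // this is Constraint 5.11 of the ILP below
        R += w;
    }
    return R;
}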
The first term in Equation 5.8 is a constant. Therefore, using Equation 5.6 we can write the following ILP problem:

minimize:

F = \sum_{v \in V} \sum_{l=0}^{L} d_{vl} S_v O_{vl} + S_c \sum_{v \in V} C_v r_v    (5.9)

such that:

\sum_{l=0}^{L} d_{vl} = 1,    ∀v ∈ V    (5.10)

r_v - r_u + W_{uv} ≥ 0,    ∀(u, v) ∈ E    (5.11)

\sum_{l=0}^{L} l \, d_{vl} = D_v + r_v,    ∀v ∈ V    (5.12)

L_v ≤ \sum_{l=0}^{L} l \, d_{vl} ≤ U_v,    ∀v ∈ V    (5.13)

d_{vl} ∈ {0, 1},    ∀v ∈ V, ∀l ∈ {0, 1, ..., L}    (5.14)

r_v ∈ Z,    ∀v ∈ V    (5.15)

Constraint 5.10 forces the distance of each node to be exactly one value between 0 and L. Constraint 5.11 ensures that the number of conditioning cells on each edge in the reconditioning network after reconditioning is non-negative. Constraint 5.12 states the relationship between the initial distance of each node D_v, its displacement r_v, and its final distance d_v. The lower and upper bound constraints on distances are enforced by Constraint 5.13.
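For clarity, the sketch below evaluates a candidate assignment of final distances against the bounds and edge constraints above and returns the corresponding objective value. It is a validation aid only, written over hypothetical data structures (per-node D, bounds, S, and O arrays and an edge list); in our flow the model itself is handed to an ILP solver. Representing a solution directly by the final distance d[v] makes Constraints 5.10 and 5.12 hold by construction, and the conditional-cell term below differs from the second term of Equation 5.9 only by the constant S_c · ΣW_e dropped there.

#include <cassert>
#include <cstddef>
#include <vector>

struct Edge { int u, v, W; };           // edge (u, v) with W initial conditional cells

struct Instance {                        // hypothetical container for the ILP data
    std::vector<Edge> edges;
    std::vector<int> D, Lb, Ub;          // initial distance, lower/upper bounds per node
    std::vector<double> S;               // switching power S_v per node
    std::vector<std::vector<double>> O;  // O[v][l]: operational factor of v at distance l
    double Sc;                           // power of one conditional cell
};

// Checks Constraints 5.11 and 5.13 for a candidate vector of final distances d[v]
// (with r_v = d[v] - D[v]) and returns the objective of Equation 5.9 plus Sc*sum(W_e).
double evaluate(const Instance& in, const std::vector<int>& d) {
    const std::size_t n = in.D.size();
    double F = 0.0;
    for (std::size_t v = 0; v < n; ++v) {
        assert(in.Lb[v] <= d[v] && d[v] <= in.Ub[v]);   // Constraint 5.13
        F += in.S[v] * in.O[v][d[v]];                    // switching term of (5.9)
    }
    for (const Edge& e : in.edges) {
        int w = (d[e.v] - in.D[e.v]) - (d[e.u] - in.D[e.u]) + e.W;
        assert(w >= 0);                                  // Constraint 5.11
        F += in.Sc * w;                                  // conditional-cell term
    }
    return F;
}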
5.7 A Heuristic Approach for Reconditioning
Our experimental results show that the ILP problem can be solved in reasonable time for medium-size networks. However, they also show that as the size of the network grows beyond about 40,000 nodes, the ILP becomes intractable.
In this section we present a simple heuristic algorithm that respects the same constraints that the ILP problem adheres to, but moves the cells one at a time. In particular, for a node v ∈ V with current distance d_v, let w_{i1}, w_{i2}, ..., w_{in} be the weights of its incoming edges, and w_{o1}, w_{o2}, ..., w_{om} be the weights of its outgoing edges. We represent a move m by a pair m = (v, d), where d ∈ Z.
Such a move m = (v, d) is legal if:

L_v ≤ d_v + d ≤ U_v    (5.16)

∀j ∈ {1, ..., n}: w_{ij} + d ≥ 0    (5.17)

∀j ∈ {1, ..., m}: w_{oj} - d ≥ 0    (5.18)

Figure 5.6 shows the result of committing a move. Notice that d can be either positive or negative.
The power improvement of a legal move m = (v, d) can be calculated as:

p_m = -S_c C_v d + S_v (O_{v d_v} - O_{v(d_v + d)})    (5.19)
Figure 5.6: Committing a move m = (v, d)
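The legality test and the gain of a single move, following Equations 5.16 through 5.19, might be coded as in the sketch below. The NodeInfo record and its fields are hypothetical; the sign convention follows Equation 5.19, so a positive p_m means a net power saving, and the caller is expected to check legality before evaluating the gain.

#include <vector>

struct NodeInfo {
    int dist, Lb, Ub;                    // current distance d_v and its range of motion
    double S;                            // switching power S_v
    std::vector<double> O;               // O[l]: operational factor of the node at distance l
    std::vector<int*> win, wout;         // shared weights of incoming/outgoing edges
};

// Equations 5.16-5.18: is the move m = (v, d) legal?
bool isLegal(const NodeInfo& v, int d) {
    if (v.dist + d < v.Lb || v.dist + d > v.Ub) return false;   // (5.16)
    for (int* w : v.win)  if (*w + d < 0) return false;         // (5.17)
    for (int* w : v.wout) if (*w - d < 0) return false;         // (5.18)
    return true;
}

// Equation 5.19: power improvement of a legal move, with C_v = |FI(v)| - |FO(v)|.
double powerImprovement(const NodeInfo& v, int d, double Sc) {
    double Cv = static_cast<double>(v.win.size()) - static_cast<double>(v.wout.size());
    return -Sc * Cv * d + v.S * (v.O[v.dist] - v.O[v.dist + d]);
}

// Committing the move: update the distance and the edge weights (Figure 5.6).
void commit(NodeInfo& v, int d) {
    v.dist += d;
    for (int* w : v.win)  *w += d;
    for (int* w : v.wout) *w -= d;
}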
It is important to observe that committing a move m = (v, d) only affects the moves associated with v and its neighboring nodes, i.e., nodes in N_v = {u ∈ V | (u, v) ∈ E ∨ (v, u) ∈ E}. Any other move stays valid and the power improvement associated with it does not change. Consider the example of Figure 5.7, in which we commit the move (u_3, -1). As a result, the moves (u_1, +3), (u_2, +3), (u_3, -1), (u_3, -2), and (u_3, -3) become invalid, and the new moves (u_3, -1), (u_3, -2), and (u_4, +1) should now be considered as candidate moves, as shown in Figure 5.7b. Notice that the power improvements of the moves (u_3, -1) and (u_3, -2) before and after committing the move (u_3, -1) may be different, and hence we need to invalidate and reevaluate them.
Algorithm 5.2 shows the details of our approach. We initialize (line 17) and maintain a list of all possible legal moves and keep it sorted in non-increasing order with respect to p_m. By limiting the maximum displacement of moves in the Disp set, we can control the size of moveList.
In line 19, we greedily pick the move m = (v, d) with the highest p_m and commit it by updating the distance of v and the weights of v's incoming and outgoing edges according to Equations 5.16, 5.17, and 5.18.
(a) Before committing (u_3, -1)
(b) After committing (u_3, -1)
Figure 5.7: Invalid and new moves as a result of committing a move
After committing a move, the moves for the nodes u ∈ {v} ∪ N_v that become invalid, i.e., violate Constraints 5.16 to 5.18, are removed from the move list as shown in line 21 of Algorithm 5.2. Also, as a result of committing a move, new moves for the neighboring nodes of v which were not previously legal may now become legal. Hence, we call AddMoves in line 22 to add them to moveList.
Algorithm 5.2 Greedy reconditioning heuristic algorithm
1: procedure AddMoves(moveList, U)
2:     for all u ∈ U, d ∈ Disp do
3:         if move m = (u, d) is legal then    ▷ (Equations 5.16, 5.17, and 5.18)
4:             p_m ← amount of power saved    ▷ (Equation 5.19)
5:             if p_m > 0 AND (u, d) is not in moveList then
6:                 Insert (u, d) into moveList
7:                 Keep moveList in non-increasing order using p_m as the key
8:             end if
9:         end if
10:     end for
11:     return moveList
12: end procedure
13:
14: procedure Main
15:     Disp ← list of possible displacement values
16:     moveList ← empty
17:     AddMoves(moveList, V)
18:     while moveList is not empty do
19:         Remove and commit a move m = (v, d) from moveList's front
20:         N_v ← {u ∈ V | (u, v) ∈ E ∨ (v, u) ∈ E}    ▷ (Neighboring nodes)
21:         Remove invalid moves associated with nodes in {v} ∪ N_v from moveList
22:         AddMoves(moveList, {v} ∪ N_v)
23:     end while
24: end procedure
Complexity of the Algorithm
Let n = |V|, l = |Disp|, D_max = max_{v ∈ V}(Degree(v)), and R_max = max_{v ∈ V}(U_v - L_v), where Degree(v) = |N_v| is the number of edges incident to v. The maximum size of moveList is then nl.
An important observation is that as we move nodes, we never assign the same distance to a node twice. This is because we modify the distance of a node only if there is some positive power improvement associated with this move. Moving back to any of the previously assigned distances would cause a negative power improvement and would be rejected by the algorithm. This means that in the worst case, each node v will be assigned each possible distance in its range of motion at most once. Thus, an upper bound on the number of iterations of the while loop in line 18 is nR_max.
The first call to AddMoves in line 17 has O(nl log(nl)) running time if we use a priority queue. Also, removing the move with the highest power improvement has O(log(nl)) running time.
Removing invalid moves associated with nodes u ∈ N_v in line 21 has a worst-case running time of O(nl D_max), since we have to search moveList based on v and d rather than p_m, which was used as the key to sort the list. A separate array can be maintained that maps v and d to the entries of moveList, reducing the lookup complexity to O(D_max log(nl)).
The second call to AddMoves in line 22 has O(D_max l log(nl)) running time, since it only adds the moves for the neighbors of node v, i.e., for nodes u ∈ N_v.
Therefore, the total running time of the algorithm is:

O(nl log(nl)) + nR_max (O(log(nl)) + O(D_max log(nl)) + O(l D_max log(nl)))
= O(nl log(nl)) + nR_max O(l D_max log(nl))
= O(D_max R_max nl log(nl)).    (5.20)
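One possible realization of moveList along the lines discussed above keeps the moves in a std::set ordered by p_m, together with the suggested secondary index from (node, displacement) to the set entry, so that invalidating a move costs O(log(nl)) rather than a linear scan. This is only a sketch of that data-structure choice, with hypothetical types; the legality and gain computations shown earlier are omitted.

#include <map>
#include <set>
#include <utility>

struct Move { int v, d; double gain; };

// Order moves by decreasing power improvement; break ties deterministically.
struct ByGain {
    bool operator()(const Move& a, const Move& b) const {
        if (a.gain != b.gain) return a.gain > b.gain;
        return std::make_pair(a.v, a.d) < std::make_pair(b.v, b.d);
    }
};

class MoveList {
    std::set<Move, ByGain> queue_;                                            // sorted by p_m
    std::map<std::pair<int, int>, std::set<Move, ByGain>::iterator> index_;   // (v, d) -> entry
public:
    void add(const Move& m) {
        auto it = queue_.insert(m).first;
        index_[{m.v, m.d}] = it;          // enables O(log(nl)) invalidation by (v, d)
    }
    void erase(int v, int d) {            // invalidate one move, if present
        auto it = index_.find({v, d});
        if (it == index_.end()) return;
        queue_.erase(it->second);
        index_.erase(it);
    }
    bool empty() const { return queue_.empty(); }
    Move popBest() {                      // best move sits at the front of the set
        Move m = *queue_.begin();
        erase(m.v, m.d);
        return m;
    }
};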
Optimality of the Algorithm
Algorithm 5.2 uses a greedy approach in which we only commit moves that have a positive power improvement, and it is hence susceptible to getting trapped in a local minimum.
(a) This move is rejected since p_m < 0
(b) Possible optimal configuration which Algorithm 5.2 cannot achieve
Figure 5.8: Algorithm 5.2 gets trapped in a local minimum
For example, Figure 5.8a shows a reconditioning network and a possible move. Suppose that the probability of e_1 being 1 is a^{1}_{e_1} = 1/3. For the move shown in Figure 5.8a, the cost of four extra SEND cells may not be justified by placing only one node (v) in the fanout of SEND cells. Assuming that all nodes have uniform power consumption, S_c = S_{u_1} = S_{u_2} = S_{u_3} = S_{u_4} = S_v = 1, based on Equation 5.19:

p_{m_1} = -S_c C_v d + S_v (O_{v d_v} - O_{v(d_v + d)}) = -3 + 2/3 = -7/3

As shown in Figure 5.8b, however, this cost may be justified when 5 nodes are in the fanout of SEND cells, as the power improvement associated with this configuration would be:

-1 + (2/3) × 5 = +7/3,

which is a positive number. Algorithm 5.2 cannot achieve this configuration, since it moves nodes one by one, and the only initial possible move, shown in Figure 5.8a, has a negative power improvement value and hence is rejected.
Alternative strategies for selecting the best move can help tackle the problem of local minima. For example, rather than choosing the move with the highest p_m at each iteration, one could randomly choose from among all the moves in moveList until no further move is possible. This heuristic can be repeated a number of times, keeping the best solution found.
Variations of the Heuristic Approach
Our greedy heuristic performs differently if we modify the power cost of conditional nodes (S_c) relative to that of unconditional nodes (S_u). If S_c ≫ S_u in Equation 5.19, the algorithm does not accept moves whose execution increases the number of conditional nodes, and hence there is a higher chance that the algorithm gets trapped in local minima. On the other hand, setting S_c ≪ S_u allows the algorithm to accept any move that reduces the power of unconditional nodes and maximizes the number of conditioned nodes.
In Section 5.9, we evaluate this effect by setting S_u = 1 and sweeping S_c over 0, 0.5, and 1. In particular, we perform:
Greedy1: Setting the power cost of conditional nodes and unconditional nodes to be one unit in the Greedy algorithm (i.e., ∀v ∈ V: S_v = S_c = 1).
Greedy0.5: Setting the power cost of conditional nodes to be 0.5 (i.e., ∀v ∈ V: S_v = 1, S_c = 0.5), running the Greedy algorithm once, then setting S_c = 1 and running the Greedy algorithm again.
Greedy0: Setting the power cost of conditional nodes to be 0 (i.e., ∀v ∈ V: S_v = 1, S_c = 0), running the Greedy algorithm once, then setting S_c = 1 and running the Greedy algorithm again.
5.8 Sharing Conditional Communication Primitives
In retiming [LS91], latches on multiple fanouts of a node can be shared. Compared
to the case where latches are replicated on each outgoing edge of a logic gate,
sharing can reduce the total number of latches in the circuit.
The same technique can be used in reconditioning. Figure 5.9a shows the case
where a SEND node is replicated on all outgoing edges of an unconditional node.
Figure 5.9b shows an optimization, where only one SEND node is used for all three
outgoing edges. The implementation of sharing is left as future work.
(a) Replicating the conditional node on multiple fanouts
(b) Sharing a conditional node on multiple fanouts
Figure 5.9: Replicating versus sharing conditional nodes on multiple fanouts.
5.9 Experimental Results
In this section we evaluate the efficiency and performance of reconditioning. We use a simple unified power model, where we assume that for all unconditional nodes v, S_v = 1. In the ILP, we set S_c = 1, and in the greedy heuristics we modify it as described in the previous section. We performed a complete case study on the following circuits:
1. RECON1 (Figure 5.10): Dual mode arithmetic unit. This represents circuits that have multiple modes of operation. In each mode, the circuit performs a different calculation. Using conditional communication, in each mode data is only sent to the part that is performing a useful calculation.
2. RECON2 (Figure 5.11): A multiplier with enable. This represents circuits that conditionally perform a certain calculation. If the condition is not met, the circuit still receives the inputs, but generates a default value at the output.
3. ALU-OI (Figure 4.1): This is the ALU described in Chapter 4 for operand isolation. We show that using reconditioning, the result of operand isolation can be further improved.
In each case study, we performed reconditioning using the ILP and various flavors of the heuristic approach: Greedy0, Greedy0.5, and Greedy1, as introduced in Section 5.7.
To keep the implementation simple, we chose examples in which all nodes with equal distance have equal operational factor. For each case study, we repeated reconditioning while sweeping the operational factors over 0.25, 0.50, and 0.75. Finally, we also repeated each of the above cases while sweeping the bit width of the inputs and outputs of the circuit to evaluate the performance and efficiency of our algorithms and to compare the ILP versus the heuristic approach.
We have performed our tests on a 64-bit Linux server with eight Intel Xeon CPU cores. Our algorithms were implemented in C++ using the standard template library (STL). Also, we used the GNU GLPK package [GLP] for solving the integer linear program.
5.9.1 Results
Our experiments show that for the heuristic algorithm, we always get the best results when we set the power cost of conditional nodes to 0. Figure 5.12 shows the power comparison between the Greedy0, Greedy0.5, and Greedy1 heuristics for RECON1, ALU-OI, and RECON2.
Since Greedy0 achieves the best results, in Figures 5.13, 5.14, and 5.15 we only compare the power improvement of the ILP versus the Greedy0 algorithm for RECON1, RECON2, and ALU-OI. The results are shown for various values of input/output widths (and hence numbers of nodes) and various values of operational factors.
In the case of RECON1 and RECON2, we get about 80% improvement for an operational factor of 0.25, and 45% improvement for an operational factor of 0.75.
module RECON1 (
    e1of2_1.In E1,
    e1of2_16.In IA1, e1of2_16.In IA2,
    e1of2_16.In IB1, e1of2_16.In IB2,
    e1of2_16.In IC1,
    e1of2_32.Out O1, e1of2_32.Out O2
);
    parameter WI = 16;
    parameter WO = 32;
    logic e1;
    logic [WI-1:0] a1, a2, b1, b2, c1, c2, d1, d2;
    logic [WO-1:0] f1, f2;
    always begin
        forever begin
            a1=0; b1=0; c1=0; d1=0; f1=0;
            E1.Receive(e1);
            if (e1)
            begin
                IA1.Receive(a1);
                IB1.Receive(b1);
                IC1.Receive(c1);
                d1 = a1 + b1;
                if (d1 > 16'hAAAA)
                begin
                    f1 = d1 * c1;
                    O1.Send(f1);
                end
            end
            else
            begin
                IA2.Receive(a2);
                IB2.Receive(b2);
                f2 = a2 * b2;
                O2.Send(f2);
            end
        end
    end
endmodule
Figure 5.10: SVC description of RECON1: Dual mode arithmetic unit
module RECON2 (
    e1of2_1.In E,
    e1of2_8.In I1, e1of2_8.In I2,
    e1of2_16.Out O
);
    parameter WI = 8;
    parameter WO = 16;
    logic e;
    logic [WI-1:0] i1, i2;
    logic [WO-1:0] o;
    always begin
        forever begin
            i1=0; i2=0; o=0;
            E.Receive(e);
            if (e)
            begin
                I1.Receive(i1);
                I2.Receive(i2);
                o = i1 * i2;
            end
            O.Send(o);
        end
    end
endmodule
Figure 5.11: RECON2: A multiplier with enable and default value
On the other hand, the power consumption of ALU-OI does not significantly improve by reconditioning, since the conditioned area after operand isolation is almost maximized by the RTL synthesizer and the power consumption is already close to the optimized value.
Figures 5.16a, 5.16b, and 5.16c show the running-time comparison of the ILP and Greedy0 on our sample circuits. Notice that a logarithmic scale is used for the vertical axis.
(a) RECON1
(b) RECON2
(c) ALU-OI
Figure 5.12: Comparison of greedy heuristics with S_c = 0, 0.5, and 1
These charts show that the running times of the Greedy algorithms are at least an order of magnitude smaller than those of the ILP approach and grow much more slowly. In practice, the circuits synthesized by Proteus are not bigger than 50,000 gates. Our results show that the Greedy0 heuristic can finish in less than 10 seconds for circuits of that size.
Next, we used a commercial power analysis tool to estimate the post-layout power values of the final asynchronous netlist and compared the power improvements with our predicted values. We used a proprietary TSMC 65nm PCHB-based cell library for which internal power was unfortunately not available and thus ignored. To measure switching power, we used a Value Change Dump [IEE09] file generated from simulating the post-layout asynchronous netlist at maximum throughput (1.1 GHz) with random inputs.
Figure 5.17 shows this comparison for the 32-bit implementations of RECON1 and RECON2. The measured post-layout power improvement values are about 36% worse than the predicted values in the worst case (RECON2 with Greedy0 or ILP) and about 23% worse than the predicted values in the best case (RECON1 with ILP). The main reason for this difference is that in our implementation, we did not address sharing conditional communication primitives on multiple outgoing edges of unconditional nodes, as discussed in Section 5.8. The original circuit before reconditioning, however, shares the conditional nodes on multiple outgoing edges of unconditional nodes. After our optimization, sharing is lost and the conditional nodes are replicated on all outgoing edges of unconditional nodes, as shown in Figure 5.9a. Addressing this shortcoming is left as future work.
(a) RECON1 (16 bit)  (b) RECON1 (32 bit)  (c) RECON1 (64 bit)
Figure 5.13: Comparison of power consumption for RECON1
(a) RECON2 (16 bit)  (b) RECON2 (32 bit)  (c) RECON2 (64 bit)
Figure 5.14: Comparison of power consumption for RECON2
(a) ALU-OI (16 bit)  (b) ALU-OI (32 bit)  (c) ALU-OI (64 bit)
Figure 5.15: Comparison of power consumption for ALU-OI
(a) RECON1  (b) RECON2  (c) ALU-OI
Figure 5.16: Running time comparison for RECON1, RECON2, ALU-OI for 16, 32, 64, and 128 bit datapaths. The ILP results are not shown for 128 bit datapaths due to very long running times.
(a) RECON1  (b) RECON2
Figure 5.17: Comparison of the predicted versus the post-layout power improvements
Chapter 6
Another Application of the 3VL Model: Formal Verification
6.1 Introduction
In this chapter, we describe another application of our 3VL model, which is formal verification of asynchronous circuits in the Proteus flow. It should be noted that this chapter only includes the general ideas and initial experimental results. Detailed automatic implementation and experimental results are left as future work.
6.2 Formal Verification
In synchronous circuits, the most common method for validating that two specifications are equivalent is by simulation. Checking equivalence for non-trivial and larger circuits using a simulation-based methodology, however, is infeasible because there are practically an infinite number of vectors in the input space. Alternatively, formal verification uses rigorous mathematical reasoning to show that a design meets all or parts of its specification [KG99, Lam05].
Formal verification can be classified further into two categories: equivalence checking and property verification. The former determines whether two implementations are functionally equivalent, while the latter proves or disproves that the design has a certain property [Lam05].
Equivalence checking of two combinational circuits involves proving that the functions modeling those combinational circuits are equivalent. Formal sequential equivalence checking, on the other hand, is the process of verifying whether two sequential circuits are equivalent. Formal sequential equivalence checking is generally recognized as a hard problem that cannot be solved efficiently for large industrial designs, except in a few special cases [PH09]. However, if the two sequential circuits being checked for equivalence share the same set of inputs I, outputs O, and flip-flops T, then it can be shown that it is sufficient to check their combinational portions for equivalence [PH09].
Combinational equivalence checking (CEC) of register-transfer-level (RTL) or gate-level designs is the most widely adopted and successful formal validation technology used in modern-day IC design flows. In such a scenario, the sequential equivalence-checking problem can be solved as a sequence of two sub-problems: finding a mapping between the latches of the two circuits, and then checking the combinational portions of the two circuits for equivalence under this mapping. The former is known as the latch-mapping problem and the latter as combinational equivalence checking (CEC). This process involves removing the latches and including the present-state variable of each latch in the set of primary input signals and the next-state variable in the set of primary output signals for the respective circuit. Most industrial formal verification tools are able to solve the latch-mapping and CEC problems efficiently.
The above principle can be extended to three-valued logic, i.e., if two sequential 3V networks being checked for equivalence share the same set of inputs I, outputs O, and flip-flops T, then it can be shown that it is sufficient to check their combinational portions for equivalence. Therefore, here, we only focus on the combinational equivalence check of 3V networks.
Motivation
Equivalence checking of asynchronous circuits can be desired in the scenarios shown in Figure 6.1:
Comparison of two SVC descriptions, e.g., before and after decomposition
Comparison of one SVC description versus an asynchronous netlist
Comparison of two asynchronous gate-level netlists, e.g., before and after conditioning and reconditioning
Notice that in our model, for comparison of SVC blocks, we first convert them to their WRAPPER structure as described in Section 2.4.
Figure 6.1: Desired comparison scenarios in the Proteus flow
Scope of the work
Approaches for equivalence checking of synchronous circuits are surveyed in [KG99, Lam05]. In this section, we focus on equivalence checking of 3VL networks using industrial synchronous 2VL verification tools. In other words, rather than creating a new approach for verification of 3VL networks, we encode a 3VL network into a 2VL network and use the encoded version as the input of the formal verification tool.
6.2.1 Binary Coded Three-Valued Logic (BC3VL)
In order to use a two-valued logic formal verification tool, we encode each variable of a 3VL network using two bits: valid (v) and data (d), as shown in Table 6.1.

Table 6.1: BC3VL (Binary Coded 3VL)

  Valid  Data  |  3VL
    1     0    |  0
    1     1    |  1
    0     0    |  N
    0     1    |  Not used

In our proposed method of comparing two 3VL networks, one should first encode both networks using the above BC3VL. Then a 2VL formal verification tool can be used to compare the two BC3VL networks, as shown in Figure 6.2.
Figure 6.2: Equivalence check using BC3VL transformation
Generating Data and Valid Bits in a Gate-Level Netlist
Figure 6.3 shows the BC3VL encoding for an AND, a RECEIVE, a SEND, and a DFF cell. Each port is encoded using two bits: valid (v) and data (d). Each DFF is duplicated: one for the data bit and one for the valid bit.
Figure 6.3: Examples of BC3VL encoded cells
Using Lemma 3.1 and the tables of Figure 3.3, we can find the equations for the valid and data bit outputs of every library cell as a function of the valid and data bits of their inputs:

A gate implementing an unconditional function o = f(i_1, i_2, ..., i_n):
o.v = i_1.v ∧ i_2.v ∧ ... ∧ i_n.v
o.d = o.v ∧ f(i_1.d, i_2.d, ..., i_n.d)

A RECEIVE cell:
r.v = (e.v ∧ ¬e.d) ∨ (l.v ∧ e.v ∧ e.d)
r.d = r.v ∧ l.d ∧ e.d

A SEND cell:
r.v = l.v ∧ e.v ∧ e.d
r.d = r.v ∧ l.d

Notice the logical AND between the output valid signal (o.v for logic gates and r.v for RECEIVE/SEND cells) and the output data signal (o.d for logic gates and r.d for RECEIVE/SEND cells). This ensures that when the value of the output in the 3VL model is N, the equivalent BC3VL-encoded data and valid signals are both zero, as defined in Table 6.1.
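A small software model of this encoding, useful for sanity-checking the equations before generating an encoded netlist, could look like the C++ sketch below. The BC3VL struct and helper functions are illustrative only; they mirror the equations above for a two-input AND gate, a RECEIVE cell, and a SEND cell.

// BC3VL value: valid/data pair per Table 6.1 (v=0, d=0 encodes N; v=0, d=1 is unused).
struct BC3VL { bool v, d; };

// Unconditional two-input AND gate: o.v = i1.v & i2.v, o.d = o.v & (i1.d & i2.d).
BC3VL andGate(BC3VL i1, BC3VL i2) {
    BC3VL o;
    o.v = i1.v && i2.v;
    o.d = o.v && (i1.d && i2.d);
    return o;
}

// RECEIVE cell with data input l and enable input e:
// r.v = (e.v & !e.d) | (l.v & e.v & e.d), r.d = r.v & l.d & e.d.
BC3VL receiveCell(BC3VL l, BC3VL e) {
    BC3VL r;
    r.v = (e.v && !e.d) || (l.v && e.v && e.d);
    r.d = r.v && l.d && e.d;
    return r;
}

// SEND cell with data input l and enable input e:
// r.v = l.v & e.v & e.d, r.d = r.v & l.d.
BC3VL sendCell(BC3VL l, BC3VL e) {
    BC3VL r;
    r.v = l.v && e.v && e.d;
    r.d = r.v && l.d;
    return r;
}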
Generating Data and Valid Bits for an Image RTL Description
In this section we propose a method for performing the BC3VL transformation for a state-less image RTL block, such that the BC3VL circuit can be used as the input of a commercial formal verification tool. In particular, we should calculate the valid bits of the primary outputs of the RTL BODY from the primary inputs of the RTL BODY. For example, in Figure 6.4, RO.v is 1 whenever LI.v is 1.
Figure 6.4: The valid bit of the primary output of the RTL BODY is calculated from the valid bit of the primary input of the RTL BODY
In general, based on Lemma 3.1, the valid bit of each primary output o is the AND of the valid bits of all inputs in the support (i.e., the set of variables on which o actually depends [HS96]) of o. For example, for an ALU with primary inputs I1, I2, and OP, the valid bit of the primary output O should be added to the RTL BODY description as follows:

O.v = I1.v ∧ I2.v ∧ OP.v

Currently, adding the validity of primary outputs to the RTL BODY is done manually. Automation of this task is left as future work.
Initial Experimental Result
We have successfully tested our proposed BC3VL method on several circuits. In this section, we provide the comparison of two designs. In the first design, a 32-bit ALU is described at the SVC level, to which we manually added the primary output valid bits. In the second design, the ALU is decomposed into multiple blocks, each described at the SVC level. Again, we manually added the valid bits of the primary outputs to the RTL BODY description of each sub-block. The two designs were given to the Cadence Conformal verification tool. Figure 6.6 shows a screenshot of the tool in which all comparison points (a 32-bit output data plus its valid bit) are declared to be equivalent. This example shows that an industrial synchronous logic equivalence checker can be used to compare two asynchronous circuits that are modeled using the binary coded three-valued logic model.
Figure 6.5: Decomposed ALU. Each sub-block is described in SVC
Figure 6.6: Screenshot of the Conformal tool declaring two binary coded 3V networks equivalent
Chapter 7
Conclusion and Future Work
Achieving low power in asynchronous circuits is not automatic and does not come for free. It often requires careful designer attention and/or manual intervention. On the other hand, various mature and automatic power reduction techniques for synchronous circuits exist. Bridging this gap was the main motivation for, and is the main contribution of, this thesis.
An important conclusion is that using an abstract model of asynchronous circuits that is "close" to synchronous circuits, i.e., the 3VL model, can enable designers not only to adopt synchronous power reduction methods in asynchronous systems, but also, for the most part, to take advantage of the same CAD tools that are used by synchronous designers. In fact, the two main contributions of this work, conditioning (based on observability conditions) and reconditioning (based on retiming), have their roots in classic Boolean logic synthesis and optimization. Moreover, the automatic formal verification method proposed by this work also confirms that such a "close-to-synchronous" model provides a gateway for adopting already-existing solutions to asynchronous circuit design.
7.0.2 Future Work
This section presents future directions of this work and suggests how the ideas presented by this thesis can impact asynchronous CAD/VLSI.
Conditioning
We presented operand isolation as an application of global observability partial care (GOPC)-based conditioning in Section 3.8. Another natural application of GOPC-based conditioning is clock gating, where the output of a logic block is not observable if the clock of the flip-flop in its fanout cone is gated.
Furthermore, stability condition (STC)-based clock gating [FKM], as a dual of the observability condition, has been shown to effectively contribute to the power reduction of synchronous circuits. We believe that stability conditions can easily be extended to three-valued logic and provide more conditioning opportunities.
Reconditioning
In Chapter 5, we presented an ILP and a simple greedy heuristic algorithm for solving the problem of reconditioning. A natural follow-up would be exploring other heuristic algorithms, such as simulated annealing and genetic algorithms. Moreover, as discussed in Section 5.8, our initial formulation and experimental results do not capture the fact that conditional nodes on multiple outgoing edges of an unconditional node can be shared, and hence our results are sub-optimal. Enabling the sharing of conditional nodes is then an important subject of future research.
Formal Verification
Formal verification is an important step in circuit design and can play a major role in the wide adoption of asynchronous methodologies in industry. Therefore, implementation and evaluation of the method discussed in Chapter 6 and a comparison with other available methods is an important subject of research in asynchronous design. We believe that besides functional equivalence, the three-valued logic model can also be used for property checking, such as conservative deadlock-freedom checks.
Bibliography
[BDL11] P. A. Beerel, G. D. Dimou, and A. M. Lines. Proteus: An ASIC Flow for GHz Asynchronous Designs. IEEE Design and Test of Computers, 28(5):36-51, 2011.
[BK99] R. K. Brayton and S. P. Khatri. Multi-valued logic synthesis. In Proceedings of 12th International Conference on VLSI Design, pages 196-205, 1999.
[BLDK06] P. A. Beerel, A. M. Lines, M. Davies, and N. Kim. Slack matching asynchronous designs. In Proceedings of 12th IEEE International Symposium on Asynchronous Circuits and Systems, pages 184-194, 2006.
[BOF10] P. A. Beerel, R. O. Ozdag, and M. Ferretti. A Designer's Guide to Asynchronous VLSI. Cambridge University Press, 2010.
[BR07] P. A. Beerel and M. E. Roncken. Low power and energy efficient asynchronous design. Journal of Low Power Electronics, 3(3):234-253, 2007.
[CCL06] E. Choi, K. Cho, and J. Lee. New Data Encoding Method with a Multi-Value Logic for Low Power Asynchronous Circuit Design. In Proceedings of the 36th International Symposium on Multiple-Valued Logic, 2006.
[CGK+06] A. Chattopadhyay, B. Geukes, D. Kammler, E. M. Witte, O. Schliebusch, H. Ishebabi, R. Leupers, G. Ascheid, and H. Meyr. Automatic ADL-based operand isolation for embedded processors. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 600-605, 2006.
[CJ95] A. Correale Jr. Overview of the power minimization techniques employed in the IBM PowerPC 4xx embedded controllers. In Proceedings of the International Symposium on Low Power Design, pages 75-80, 1995.
[CK97] C. Chen and K. Kucukcakar. An architectural power optimization case study using high-level synthesis. In Proceedings of IEEE International Conference on Computer Design, pages 562-570, 1997.
[CKLS06] J. Cortadella, A. Kondratyev, L. Lavagno, and C. P. Sotiriou. Desynchronization: Synthesis of Asynchronous Circuits From Synchronous Specifications. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(10):1904-1921, 2006.
[CLRS09] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. MIT Press, 2009.
[Cun04] P. A. Cunningham. Verification of asynchronous circuits. Technical Report UCAM-CL-TR-587, University of Cambridge, 2004.
[DBL11] G. D. Dimou, P. A. Beerel, and A. M. Lines. Performance-driven clustering of asynchronous circuits. In Integrated Circuit and System Design. Power and Timing Modeling, Optimization, and Simulation, Lecture Notes in Computer Science, pages 92-101. Springer Berlin / Heidelberg, 2011.
[FB96] K. M. Fant and S. A. Brandt. NULL Convention Logic: a complete and consistent logic for asynchronous digital circuit synthesis. In Proceedings of International Conference on Application Specific Systems, Architectures and Processors, pages 261-273, August 1996.
[FF03] T. Felicijan and S. B. Furber. An asynchronous ternary logic signaling system. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 11(6):1114-1119, December 2003.
[FKM] R. Fraer, G. Kamhi, and M. K. Mhameed. A new paradigm for synthesis and propagation of clock gating conditions. In Proceedings of 45th Design Automation Conference, pages 658-663.
[GB11] P. Golani and P. A. Beerel. An area-efficient multi-level single-track pipeline template. In Proceedings of Design, Automation and Test in Europe Conference and Exhibition, pages 1-4, 2011.
[GLP] GLPK (GNU Linear Programming Kit). Online. http://www.gnu.org/software/glpk/. Accessed 23-October-2012.
[Gup09] K. Gupta. Discrete Mathematics. Krishna Prakashan, 10th edition, 2009.
[Hoa85] C. A. R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985.
[HS96] G. D. Hachtel and F. Somenzi. Logic Synthesis and Verification Algorithms. Kluwer Academic Publishers, 1996.
[IEE09] IEEE Standard for SystemVerilog - Unified Hardware Design, Specification, and Verification Language. IEEE Std 1800-2009 (Revision of IEEE Std 1800-2005), 2009.
[JKB+02] H. M. Jacobson, P. N. Kudva, P. Bose, P. W. Cook, S. E. Schuster, E. G. Mercer, and C. J. Myers. Synchronous interlocked pipelines. In Proceedings of 8th IEEE International Symposium on Asynchronous Circuits and Systems, pages 3-12, 2002.
[KG99] C. Kern and M. R. Greenstreet. Formal verification in hardware design: a survey. ACM Transactions on Design Automation of Electronic Systems (TODAES), 4(2):123-193, 1999.
[KL02] A. Kondratyev and K. Lwin. Design of asynchronous circuits by synchronous CAD tools. In Proceedings of 39th Design Automation Conference, pages 411-414, 2002.
[Kon01] X. Kong. Formal Verification of Peephole Optimization in Asynchronous Circuits. PhD thesis, 2001.
[Lam05] W. K. Lam. Hardware Design Verification: Simulation and Formal Method-Based Approaches (Prentice Hall Modern Semiconductor Design Series). Prentice Hall PTR, 2005.
[LGC11] C. Law, B. Gwee, and J. S. Chang. Modeling and Synthesis of Asynchronous Pipelines. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 19(4):682-695, 2011.
[Lin98] A. M. Lines. Pipelined asynchronous circuits. Technical Report CaltechCSTR:1998.cs-tr-95-21, California Institute of Technology, 1995 (Revised 1998).
[Lin04] A. M. Lines. Asynchronous interconnect for synchronous SoC design. IEEE Micro, 24(1):32-41, 2004.
[LS91] C. Leiserson and J. Saxe. Retiming synchronous circuitry. Algorithmica, 6:5-35, 1991.
[Man07] R. Manohar. Systems and methods for performing automated conversion of representations of synchronous circuit designs to and from representations of asynchronous circuit designs. United States Patent Application Publication (20070256038), 2007.
[Mar81] A. J. Martin. An axiomatic definition of synchronization primitives. Acta Informatica, 16:219-235, 1981.
[MKWS04] N. Magen, A. Kolodny, U. Weiser, and N. Shamir. Interconnect-power dissipation in a microprocessor. In Proceedings of the 2004 International Workshop on System Level Interconnect Prediction, SLIP '04, pages 7-13. ACM, 2004.
[MNW03] A. J. Martin, M. Nystrom, and C. G. Wong. Three generations of asynchronous microprocessors. IEEE Design and Test of Computers, 20(6):9-17, November-December 2003.
[MRST97] R. Mariani, R. Roncella, R. Saletti, and P. Terreni. On the realisation of delay-insensitive asynchronous circuits with CMOS ternary logic. In Proceedings of 3rd International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 54-62, 1997.
[Mur89] T. Murata. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4):541-580, 1989.
[MWM+00] M. Münch, B. Wurth, R. Mehra, J. Sproch, and N. Wehn. Automating RT-level operand isolation to minimize power consumption in datapaths. In Proceedings of the Conference on Design, Automation and Test in Europe, pages 624-633, 2000.
[Neg98] R. Negulescu. Process Spaces and Formal Verification of Asynchronous Circuits. PhD thesis, 1998.
[NS99] L. S. Nielsen and J. Sparsø. Designing asynchronous circuits for low power: an IFIR filter bank for a digital hearing aid. Proceedings of the IEEE, 87(2):268-281, February 1999.
[OB04] R. O. Ozdag and P. A. Beerel. A channel based asynchronous low power high performance standard-cell based sequential decoder implemented with QDI templates. In Proceedings of 10th International Symposium on Asynchronous Circuits and Systems, pages 187-197, 2004.
[PH09] D. K. Pradhan and I. G. Harris. Practical Design Verification. Cambridge University Press, 2009.
[Roi97] O. Roig. Formal Verification and Testing of Asynchronous Circuits. PhD thesis, 1997.
[RVFG05] D. Rostislav, V. Vishnyakov, E. Friedman, and R. Ginosar. An asynchronous router for multiple service levels networks on chip. In Proceedings of 11th IEEE International Symposium on Asynchronous Circuits and Systems, pages 44-53, 2005.
[SB11] A. Saifhashemi and P. A. Beerel. SystemVerilogCSP: Modeling Digital Asynchronous Circuits Using SystemVerilog Interfaces. In Proceedings of Communicating Process Architectures - WoTUG-33, pages 287-302, 2011.
[Sch03] A. Schrijver. Combinatorial Optimization. Algorithms and Combinatorics. Springer, 2003.
[SF01] J. Sparsø and S. B. Furber. Principles of Asynchronous Circuit Design: A Systems Perspective. Springer Netherlands, 2001.
[SM12] B. Sheikh and R. Manohar. An Asynchronous Floating-Point Multiplier. In Proceedings of 18th IEEE International Symposium on Asynchronous Circuits and Systems, pages 89-96, 2012.
[SMPG07] A. Sheibanyrad, I. Miro Panades, and A. Greiner. Systematic Comparison between the Asynchronous and the Multi-Synchronous Implementations of a Network on Chip Architecture. In Proceedings of Design, Automation and Test in Europe Conference and Exhibition, pages 1-6, 2007.
[ST06] A. Smirnov and A. Taubin. Synthesizing Asynchronous Micropipelines with Design Compiler. In Proceedings of Synopsys Users Group (SNUG) Boston, 2006.
[TMA98] V. Tiwari, S. Malik, and P. Ashar. Guarded evaluation: pushing power management to logic synthesis/design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17(10):1051-1060, 1998.
[VBBK+94] K. Van Berkel, R. Burgess, J. Kessels, M. Roncken, F. Schalij, and A. Peeters. Asynchronous circuits for low power: a DCC error corrector. IEEE Design and Test of Computers, 11(2):22-32, 1994.
[VGVBP+98] H. Van Gageldonk, K. Van Berkel, A. Peeters, D. Baumann, D. Gloor, and G. Stegmann. An asynchronous low-power 80C51 microcontroller. In Proceedings of 4th International Symposium on Advanced Research in Asynchronous Circuits and Systems, pages 96-107, 1998.
[WF80] A. S. Wojcik and Kwang-Ya Fang. On the Design of Three-Valued Asynchronous Modules. IEEE Transactions on Computers, 29(10):889-898, October 1980.
[WM03] C. G. Wong and A. J. Martin. High-level synthesis of asynchronous systems by data-driven decomposition. In Proceedings of 40th Design Automation Conference, pages 508-513, 2003.
[YB00] J. Yunjian and R. K. Brayton. Don't cares and multi-valued logic network minimization. In Proceedings of International Conference on Computer Aided Design, pages 520-525, 2000.
Abstract
Asynchronous circuit design has long been considered a suitable alternative to synchronous design due to its potential for achieving lower power consumption, higher robustness to process variations, and faster throughput. The lack of commercial CAD tools, however, has been a major obstacle to its wide-spread adoption. Although there is no central clock, the use of handshaking protocols in asynchronous circuits often introduces excessive switching activity, which then translates to high power consumption. This work is about reducing unnecessary switching activity and automatically optimizing the power consumption of asynchronous circuits. Our focus is on circuits synthesized by a recently commercialized high-throughput asynchronous ASIC CAD flow called Proteus.

We propose a formal framework based on three-valued logic in which we model the conditional communication primitives of asynchronous circuits as three-valued operators. Using this framework, we introduce two systematic power reduction techniques for asynchronous circuits: conditioning (adding conditional communication) and reconditioning (moving conditional communication primitives).

To demonstrate an application of conditioning, an automatic method is introduced for the adoption of operand isolation in asynchronous circuits using commercial synchronous CAD tools. Our experimental results show that for a 32-bit ALU, we achieve an average of 53% power reduction for about a 4% increase in area with no impact on performance.

An integer linear program (ILP) formulation is presented for the reconditioning problem. Our experimental results show that our ILP can be solved in reasonable time for medium-size circuits and can achieve up to 80% power improvement. For larger circuits, when the ILP formulation is not tractable, a fast heuristic algorithm is provided. Our experimental results show that our heuristic algorithm can still significantly reduce power and can achieve close-to-optimal results.

Finally, a method for formal verification of asynchronous circuits based on the three-valued logic model is presented. In particular, we show how our three-valued logic model can enable the use of powerful commercial synchronous formal verification tools for equivalence checking of asynchronous circuits.