THE DESIGN AND SYNTHESIS OF CONCURRENT ASYNCHRONOUS SYSTEMS
by
Sunan Tugsinavisut
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2006
Copyright 2006 Sunan Tugsinavisut
UMI Number: 3237182

UMI Microform 3237182
Copyright 2007 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
Dedication
To my parents, my sister and my girlfriend who have supported, motivated and inspired
me throughout the course of this thesis. Without them lifting me up when this thesis
seemed interminable, I doubt it would ever have been completed.
Acknowledgements
My doctoral study has been a long and exciting journey during which I have learned
many aspects of life. There were many enjoyable experiences mixed with some hardships.
Thanks to a group of special people in my life, the hardships were more bearable and the
good experiences became great ones. This thesis is the result of six years of work in
which I have been accompanied and supported by many people. It is a pleasure that I now
have the opportunity to express my gratitude to all of them.
First and foremost, I am deeply indebted to my thesis advisor, Dr. Peter A. Beerel,
who always dedicated his time to our discussions, provided me with invaluable guidance,
and encouraged me during difficult times. During these years I have come to know Peter
as a sympathetic and principle-centered person. His great enthusiasm, global vision for
research, and insistence on high-quality work have made a deep impression on me. I owe
him much gratitude for having shown me this way of doing research. His vast knowledge
and incredible patience have made my academic career an enjoyable one.
My appreciation also goes to the other members of my PhD committee, who monitored
my work and took the effort to read and provide constructive comments
on earlier versions of my proposal and my thesis: Dr. Massoud Pedram, Dr. Roger
Zimmermann, Dr. Antonio Ortega, and Dr. Sandeep Gupta.
Also, I would like to thank all the EE staff, including Neila Popat, Rosine Sarafian,
Alma Hernandez, Susan Moore, Tim Boston, and Diane Demetras, who provided me with
great assistance.
Thanks are extended to my colleagues for providing me with invaluable discussion:
Dr. Marcos Ferretti, Dr. Recep Ozdag, Dr. Sangyun Kim, Dr. Joong-Seok Moon, Dr.
Hoshik Kim, Nam-hoon Kim, Roger Su, Pankaj Golani, and Arash Saifhashemi. I would
especially like to thank my roommates and my friends: Dr. Nopparit Intharassombat, Dr.
Phoom Sagetong, Ekaluck Chaiyaporn, Kanate Ungkasritongkul, Suwicha
Jirayucharoensak, who shared many of my experiences during these past few years.
Most importantly, I would like to express my deepest gratitude to my parents Mr.
Sanan Tugsinavisut and Mrs. Prapasri Tugsinavisut whose unwavering love, trust and
support have been immeasurable. I would like to share this moment of happiness with my
parents and my sister. They have given me enormous support, inspiration and
encouragement during the whole tenure of my research.
Last but definitely not least, I would like to deeply thank my lovely girlfriend, Rattana
Kiattansakul, who stood by me with incredible patience, understanding and encouragement.
Her love and support have helped me through several turbulent times during my study. She
has shown me the true meaning of love, care and support.
Table of Contents
Dedication ii
Acknowledgements iii
List Of Tables viii
List Of Figures ix
Abstract xii
1 Introduction 1
1.1 Potential advantages of asynchronous designs . . . . . . . . . . . . . . . 4
1.2 Asynchronous design styles and comparisons . . . . . . . . . . . . . . . 7
1.3 High-level synthesis for asynchronous systems . . . . . . . . . . . . . . 9
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2 Background 14
2.1 Asynchronous channel-based architecture . . . . . . . . . . . . . . . . . 15
2.2 Communication channels and encoding styles . . . . . . . . . . . . . . . 16
2.3 Basic handshaking styles . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4 Delay models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Pipelines and their non-linear behavior . . . . . . . . . . . . . . . . . 21
2.5.1 Bundled-data pipelines . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.2 1-of-N QDI fine-grain pipelines . . . . . . . . . . . . . . . . . . 23
2.5.3 Non-linear pipelines . . . . . . . . . . . . . . . . . . . . . . . . 24
2.6 Measurement metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.6.1 Cycle time (τ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.6.2 Forward latency (FL) . . . . . . . . . . . . . . . . . . . . . . . . 27
2.6.3 Backward latency (BL) . . . . . . . . . . . . . . . . . . . . . . . 27
2.6.4 Energy (E) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.5 Voltage-independent metric (Eτ²) . . . . . . . . . . . . . . . . . 29
2.7 High-level synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3 Control Circuit Templates for Asynchronous Bundled-data Pipelines 37
3.1 PCFB template for bundled-data pipelines . . . . . . . . . . . . . . . . . 39
3.1.1 Non-linear PCFB pipelines . . . . . . . . . . . . . . . . . . . . . 41
3.1.2 Timing and performance analysis . . . . . . . . . . . . . . . . . 43
3.2 T4PFB templates for bundled-data . . . . . . . . . . . . . . . . . . . . . 45
3.2.1 Non-linear T4PFB pipelines . . . . . . . . . . . . . . . . . . . . 49
3.2.2 Timing and performance analysis . . . . . . . . . . . . . . . . . 50
3.3 Zero overhead T4PFB templates for Bundled-data . . . . . . . . . . . . . 55
3.3.1 Nonlinear pipeline . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.3.2 Timing and Performance analysis . . . . . . . . . . . . . . . . . 58
3.4 Comparison of control templates . . . . . . . . . . . . . . . . . . . . . . 59
3.5 Speculative delay matching templates . . . . . . . . . . . . . . . . . . . 61
3.5.1 Asymmetric delay line templates . . . . . . . . . . . . . . . . . . 63
3.5.2 Symmetric delay line templates . . . . . . . . . . . . . . . . . . 64
3.5.3 Power-efficient asymmetric delay line . . . . . . . . . . . . . . . 66
3.6 Matrix-vector multiplication architecture . . . . . . . . . . . . . . . . . . 67
3.6.1 Matrix-vector multiplication . . . . . . . . . . . . . . . . . . . . 67
3.6.2 Asynchronous pipelined architecture: an overview . . . . . . . . 68
3.6.3 Zero detection stage . . . . . . . . . . . . . . . . . . . . . . . . 70
3.6.4 (Hardwired) multiplier stage . . . . . . . . . . . . . . . . . . . . 71
3.6.4.1 Bit-slice partitioning multipliers . . . . . . . . . . . . . 71
3.6.4.2 Speculative completion sensing circuit . . . . . . . . . 74
3.6.4.3 Multiplier controller . . . . . . . . . . . . . . . . . . . 75
3.6.4.4 Timing constraints . . . . . . . . . . . . . . . . . . . . 76
3.6.5 Accumulator stage . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.6.5.1 Bit-slice partitioning accumulator . . . . . . . . . . . . 77
3.6.5.2 Speculative completion sensing circuit . . . . . . . . . 78
3.6.5.3 Accumulator controller . . . . . . . . . . . . . . . . . 78
3.6.5.4 Timing constraints . . . . . . . . . . . . . . . . . . . . 79
3.6.6 Output storing and recovering stages . . . . . . . . . . . . . . . . 79
3.6.7 Controller alternatives . . . . . . . . . . . . . . . . . . . . . . . 79
3.7 Design flow, experimental results and comparisons . . . . . . . . . . . . 80
3.7.1 Postlayout timing validation . . . . . . . . . . . . . . . . . . . . 81
3.7.2 Energy and throughput comparisons . . . . . . . . . . . . . . . . 82
3.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4 Comparisons of Asynchronous Pipelines 87
4.1 Bundled-data pipeline design . . . . . . . . . . . . . . . . . . . . . . . . 88
4.2 QDI pipeline design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2.1 Micro-architecture alternatives . . . . . . . . . . . . . . . . . . . 89
4.2.1.1 Delay-insensitive encoding selection . . . . . . . . . . 90
4.2.1.2 Cell size choice . . . . . . . . . . . . . . . . . . . . . 90
4.2.1.3 Skewed datapath choice . . . . . . . . . . . . . . . . . 90
4.2.1.4 Complete loop-unrolling . . . . . . . . . . . . . . . . 92
4.2.2 Basic QDI pipeline architectures . . . . . . . . . . . . . . . . . . 93
4.2.3 QDI pipeline micro-architectural alternatives . . . . . . . . . . . 93
4.3 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
5 High-level Synthesis of Highly Concurrent Asynchronous Systems 98
5.1 Modeling highly concurrent systems . . . . . . . . . . . . . . . . . . . . 101
5.2 Scheduling concurrent systems . . . . . . . . . . . . . . . . . . . . . . . 104
5.3 Scheduling and allocation . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3.2 Exact algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.3.3 Heuristic list algorithm . . . . . . . . . . . . . . . . . . . . . . . 110
5.4 Many-to-many resource binding problem . . . . . . . . . . . . . . . . . 117
5.5 Concurrent scheduling and binding . . . . . . . . . . . . . . . . . . . . . 118
5.5.1 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.5.2 Exact algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.5.3 Heuristic algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.6 Experimental results and comparisons . . . . . . . . . . . . . . . . . . . 129
5.7 Conclusion and future works . . . . . . . . . . . . . . . . . . . . . . . . 136
Bibliography 139
List Of Tables
3.1 Comparison of the PCFB, T4PFB and ZO T4PFB controllers, including
forward latency, overhead, area, energy, and degree of timing assumption.
The cycle time is equal to the forward latency plus the overhead. . . . . . 60
3.2 Comparisons of PCFB-based designs using different asymmetric delay lines. 82
3.3 Timing analysis of the PCFB-based, T4PFB-based and ZO T4PFB-based
designs, including forward latency, overhead, and cycle time. . . . . . . . 83
3.4 Detail timing and energy analysis of PCFB- and T4PFB-based designs
(control and datapath). . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.1 Area, cycle time, energy per cycle and Eτ² statistics . . . . . . . . . . . . 96
5.1 Experimental results of the exact and heuristic scheduling and allocation
algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.2 Experimental results of the exact scheduling and allocation and exact
scheduling and binding algorithms. . . . . . . . . . . . . . . . . . . . . . 133
5.3 Experimental results of the exact scheduling and allocation (ESA) and
two heuristic algorithms (LSA and LSB). . . . . . . . . . . . . . . . . . 135
5.4 Experimental results of the ESA and LSB algorithms applied to several
multi-threaded DSP applications. . . . . . . . . . . . . . . . . . . . . . . 137
List Of Figures
1.1 Digital system designs: synchronous vs asynchronous systems. . . . . . . 2
2.1 Classification of asynchronous channels. . . . . . . . . . . . . . . . . . . 16
2.2 Basic asynchronous handshaking protocols. . . . . . . . . . . . . . . . . 17
2.3 Bundled-data linear pipelines. . . . . . . . . . . . . . . . . . . . . . . . 22
2.4 2-D QDI fine-grain pipelines. . . . . . . . . . . . . . . . . . . . . . . . . 23
2.5 Non-linear asynchronous pipeline. . . . . . . . . . . . . . . . . . . . . . 24
3.1 PCFB template and a detailed circuit. . . . . . . . . . . . . . . . . . . . 39
3.2 STG of the abstract PCFB protocol where each edge is labeled with its
delay (# of gate delays). . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.3 R gen circuit of the PCFB fork stage. . . . . . . . . . . . . . . . . . . . 42
3.4 Circuits of the PCFB join stage. . . . . . . . . . . . . . . . . . . . . . . 42
3.5 T4PFB circuit template and detailed circuits. . . . . . . . . . . . . . . . . 46
3.6 STG of the abstract T4PFB protocol where gray edges represent timing
constraints and dashed edges indicate ordering maintained by the envi-
ronment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.7 The modified Le gen circuit. . . . . . . . . . . . . . . . . . . . . . . . . 48
3.8 Circuits of the T4PFB fork stage. . . . . . . . . . . . . . . . . . . . . . . 49
3.9 R gen and CL circuit (in dash boxes) of the T4PFB join stage. . . . . . . 49
3.10 Zero-overhead T4PFB template and detailed circuit implementation. . . . 56
3.11 The STG of the zero overhead T4PFB template. . . . . . . . . . . . . . . 57
3.12 Examples of nonlinear pipeline stages. . . . . . . . . . . . . . . . . . . . 58
3.13 Speculative delay matching templates. . . . . . . . . . . . . . . . . . . . 62
3.14 (a) An example of a power-efficient asymmetric delay line. (b) STG of
the D-element used in bundled-data pipelines. (c) A speed-independent
D-element implementation. . . . . . . . . . . . . . . . . . . . . . . . . 65
3.15 Matrix multiplication with a 5 stage asynchronous pipeline. . . . . . . . . 70
3.16 Mask signals generation unit based on static logic. . . . . . . . . . . . . . 72
3.17 Example of the proposed mechanism for sign bit extension in the multi-
plier array. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.18 Proposed asynchronous fine-grained carry-save hardwired multiplier for
0.35352*x1, where 0.35352 is expressed as (2⁻⁹*x1) + (2⁻⁷*x1) + (2⁻⁵*x1)
+ (2⁻⁴*x1) + (2⁻²*x1). . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.19 Static fine-grain partitioned adder architecture. . . . . . . . . . . . . . . . 75
3.20 An example of partial sign bit recovery logic (PSBR b). . . . . . . . . . . 76
3.21 Controller alternatives: (a) asynchronous controller (b) synchronous con-
troller. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.22 Hierarchical design flow. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.1 Matrix multiplier using a 4-stage bundled-data pipeline. . . . . . . . . . . 88
4.2 Skewed pipelines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.3 Basic QDI pipeline design. . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4 Loop and complete loop-unrolled designs of the ACC block. . . . . . . . 92
4.5 Energy and cycle time statistics of the 4 QDI designs . . . . . . . . . . . 94
5.1 Examples of marked graphs . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.2 A marked graph (a) and an equivalent CDFG (b) for a four-stage ring design . . . 104
5.3 A valid schedule of a three-stage pipeline with target cycle time of 2 . . . 106
5.4 Resource allocation for a 3-stage linear pipeline with target cycle time of 4 110
5.5 Many-to-many resource binding . . . . . . . . . . . . . . . . . . . . . . 118
5.6 An example of time execution . . . . . . . . . . . . . . . . . . . . . . . 120
5.7 An example of the list-based concurrent scheduling and binding . . . . . 128
Abstract
This dissertation explores the design and synthesis of concurrent asynchronous systems.
In the first part, we perform an extensive case study of a DCT matrix-vector multiplier
using several different micro-architectural designs and circuit styles. This includes a
bundled-data implementation that motivates the development of efficient control circuit
templates. By adopting templates that can easily handle complex control, the designer
can save significant design time. In addition, a quasi-delay-insensitive (QDI)
implementation motivates design exploration of several different micro-architectural
tradeoffs and optimizations. Experimental results comparing these designs to a
full-custom synchronous counterpart show that various implementations yield different
advantages. The bundled-data designs achieve higher average performance with negligible
power and area increases, while the QDI designs provide much higher throughput, greater
robustness to process variations, and reduced design time at the expense of higher power
and larger area.
In the second part, we present a high-level synthesis framework for highly concurrent
systems that can handle multi-threading and pipelined behaviors, which are typical in
asynchronous systems. We propose the use of marked graphs because they can naturally
express highly concurrent systems. We address several issues relating to the cyclic nature
of marked graphs and propose both exact and heuristic performance-driven scheduling
and allocation algorithms. The scheduling and allocation algorithms, however, do not
address the binding problem. Hence, we propose performance-driven concurrent scheduling
and binding algorithms: one exact algorithm and one heuristic algorithm. A coloring
approach is used to formulate and solve the binding problem. Experimental results show
the performance and area tradeoffs of various designs by controlling concurrency in the
design and also highlight the tradeoffs of the proposed exact and heuristic algorithms.
Chapter 1
Introduction
Since the advent of MOS devices, the design and development of digital circuits has
changed dramatically over generations of integrated circuits that span from
thousand-transistor large-scale integration (LSI) to billion-transistor ultra-large-scale
integration (ULSI). As manufacturing technology improves, shrinking feature sizes yield
density, speed, and power improvements. However, each generation has also added new
constraints and limitations, and changed the cost functions for designs. In particular,
in today's deep-submicron technology, the delay of long interconnect has become the
center of attention, making the synchronization of blocks across on-chip interconnect
impractical. Additionally, increasing process and environmental variability may force
designers to allow higher timing margins in order to meet timing constraints, degrading
system performance. Finally, ineffective clock generation and distribution can cause
severe clock skew, which can also penalize system performance. These effects have caused
Figure 1.1: Digital system designs: synchronous vs asynchronous systems.
synchronous designers to spend more effort on simulation and verification during
physical design, increasing both overall design time and risk. Thus, the need for
alternative design methodologies that alleviate these problems has become more and more
important. One promising alternative methodology is asynchronous design, because it
offers many elegant solutions, including the handling of long interconnect, robustness
toward process variation, and the avoidance of clock-related problems.
From a historical perspective, asynchronous design was developed in the 1950s by
Huffman and Muller. It has been applied to many applications including the design of
the ILLIAC and ILLIAC II computers [79]. However, asynchronous design has largely
disappeared from industrial practice for the last two decades due to the dominance of
synchronous design that uses a clock-based paradigm in which a globally distributed clock
(as shown in Fig 1.1(a)) provides synchronization among blocks. In particular, as long as
the worst-case computation delay of the combinational logic between flip-flops is smaller
than the system clock period, the system works correctly. However, the propagation delay of
different clock paths may vary, resulting in a time deviation between the edges of the
local clock and the central clock, a phenomenon called clock skew. Recently proposed
solutions can manage clock skew to within 4-5% of the cycle time [50]. However, with
future deeper submicron technologies, this problem is becoming more severe, with
increasingly expensive solutions.
In contrast to synchronous design, asynchronous blocks synchronize and communicate
with each other via event-based handshaking over communication channels, as shown in
Fig 1.1(b). Hence, the synchronization and communication of asynchronous designs occur
only when necessary instead of at every clock cycle. This chapter provides an overview
of current synchronous design methodologies, reviews the potential advantages of
asynchronous design methodologies, emphasizes the need for high-level synthesis tools,
and concludes with a detailed overview of the thesis, including the expected
contributions and the thesis's organization.
Current synchronous design methodologies can be divided into two categories that
trade off design time and performance. The first category, full-custom design, generally
targets high-speed applications such as high-end microprocessors. To achieve high
performance, advanced logic families, i.e., dynamic logic (as opposed to static logic),
are used extensively, particularly in the critical path. Moreover, advanced techniques
such as time-borrowing, cycle stealing, and skew-tolerant design [49] are often applied
to improve
performance as well as reduce the clock skew penalty. Because of the high design
complexity, the design time can be very long, mainly due to the lack of automated CAD
tools that support these advanced techniques.
In many applications, however, high speed is not as critical as time-to-market. A
semi-custom design style using pre-laid-out cell libraries coupled with an automated
design flow offers faster design time. However, the synthesis CAD tools for standard-cell
library-based design are generally limited to handling conventional gate libraries using
CMOS static logic, where each gate is carefully designed, characterized and verified
separately. Additionally, the clocking methodology is limited to the conventional
flip-flop-based clocking methodology, which causes significant timing overhead.
Nevertheless, there are some advanced CAD tools that can support a limited form of gated
clocking to reduce power consumption [3].
1.1 Potential advantages of asynchronous designs
With the increasing limitations and design complexity of synchronous design
methodologies, asynchronous design has recently attracted attention with the following
potential advantages over its synchronous counterparts [84].
1. Low power consumption: For CMOS circuits, power dissipation occurs in both active
and standby modes. In active mode, the largest part of power consumption is
associated with switching activity. Since asynchronous systems are event-based,
the activity varies with the number of incoming active events. Hence, a comparable
asynchronous system can achieve power equivalent to an optimally gated-clock
synchronous system. Examples of low-power implementations include an asynchronous
instruction-length decoder (RAPPID) [101] and an asynchronous FIR filter [9]. In
standby mode, the power dissipation is determined by the leakage current of all
transistors in the system. With smaller feature sizes, leakage has become a more
serious problem for both synchronous and asynchronous designs, and is a topic of
active ongoing research.
2. Average-case performance: In principle, each block in an asynchronous system
can communicate with its environment immediately upon finishing its own task. In
other words, the computation time is a function of the actual delay of each block
given the specific data it operates on, instead of the worst-case cycle time of the
synchronous counterpart. This property allows designers to optimize the common
critical paths of the system, yielding overall average-case performance. Numerous
applications that exploit this advantage have been demonstrated, including a DCT
optimized for small-valued inputs [59] and the RAPPID design optimized for common
instructions [9].
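The average-case argument can be made concrete with a toy model (an illustration under assumed delays, not a result from this dissertation): a synchronous stage pays the worst-case delay on every cycle, while an asynchronous stage pays only each datum's actual delay.

```python
import random

random.seed(0)

def op_delay(x):
    # Assumed data-dependent delay model: small operands finish faster,
    # e.g. a multiplier that skips leading-zero partial products.
    # Units are arbitrary.
    return 1 + x.bit_length()

inputs = [random.getrandbits(8) for _ in range(10_000)]

worst_case = 1 + 8                      # the clock must cover the slowest datum
sync_time = worst_case * len(inputs)    # synchronous: every cycle is worst-case
async_time = sum(op_delay(x) for x in inputs)  # asynchronous: actual delays

print(sync_time, async_time)            # async total is strictly smaller here
```

The gap between the two totals is exactly the average-case advantage that designs such as the small-input-optimized DCT [59] and RAPPID [9] exploit.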
3. Design modularity and composability: A block in an asynchronous design
communicates with its neighboring blocks using well-defined handshaking protocols,
which encapsulate the internal computation within a block. This enables each block
to be independently designed and thus easily connected to others to build larger
systems.
4. Better handling of long interconnect: A key feature of asynchronous design is that
channel-based communication with well-defined handshaking protocols facilitates
synchronization regardless of the distance between blocks, promoting
system-on-chip (SoC)-based design. In particular, to reduce the performance
bottleneck in long-distance channels, asynchronous buffer stages can easily be
inserted. To manage long latency in synchronous systems, a technique called
latency-insensitive design has been developed, which uses wrappers to encapsulate
synchronous modules coupled with latency-insensitive protocols to compose a system
that behaves correctly independently of the delays in the channels [24]. This
technique, however, requires larger area and more complex control circuits.
5. Avoidance of clock-related problems: Since asynchronous systems use well-defined
handshaking over communication channels to communicate with each other, instead of
the global clock signal used in synchronous systems, all clock-related problems
such as clock skew and clock distribution are completely avoided, significantly
reducing design time.
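The channel-based synchronization described above can be modeled abstractly in software. This is a behavioral sketch only (the `Channel` class and its rendezvous semantics are an assumption for illustration; real hardware uses request/acknowledge wires, as Chapter 2 discusses):

```python
import queue
import threading

class Channel:
    """Blocking rendezvous channel: send() returns only after the receiver
    has taken the data and acknowledged it, so synchronization happens
    only when a communication actually occurs -- never on a global clock."""
    def __init__(self):
        self._slot = queue.Queue(maxsize=1)

    def send(self, data):
        self._slot.put(data)    # "request": offer the data
        self._slot.join()       # block until the receiver acknowledges

    def recv(self):
        data = self._slot.get()
        self._slot.task_done()  # "acknowledge": release the sender
        return data

# Two blocks communicating with no shared clock:
ch = Channel()
received = []
producer = threading.Thread(target=lambda: [ch.send(x) for x in (1, 2, 3)])
consumer = threading.Thread(target=lambda: [received.append(ch.recv()) for _ in range(3)])
producer.start(); consumer.start()
producer.join(); consumer.join()
print(received)   # [1, 2, 3]
```

Because each block only waits on its own channels, adding a buffer stage or moving a block farther away changes latency but never correctness, which is the modularity property items 3-5 describe.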
6. Robustness towards supply and process variation: This robustness comes from the
property of asynchronous systems that the completion detection mechanism is either
based on matched delays or insensitive to gate and wire delays. Thus, variations
in power supply, temperature, and fabrication process, and even technology
migration, do not affect the functionality of the system, since the completion
detection varies in accordance with these changing conditions [76, 82].
7. Improved EMI: All activity in a synchronous system is concentrated in a very
narrow window around each clock edge, resulting in dense energy at the clock
frequency and its harmonics and therefore substantial electrical noise at these
frequencies. Activity in asynchronous systems, however, tends to spread out
randomly, resulting in a more distributed noise spectrum and a lower peak noise
value [42, 89].
1.2 Asynchronous design styles and comparisons
This section introduces a variety of asynchronous pipelined design styles and discusses
the tradeoffs of each design style in terms of performance, power, area, robustness and
design time. Lloyd et al. address the comparison of several asynchronous design styles,
including a bundled-data design using a single-rail datapath and delay-insensitive
designs using dual-rail and 1-of-4 rail datapaths [68]. The comparison, however, is
restricted to coarse-grain pipelined designs, where a pipeline stage contains multiple
gate delays. As part of this dissertation, we analyze a broader range of design styles,
including medium-grain bundled-data pipelined designs, QDI fine-grain pipelined designs
and gated-clocking synchronous pipelined designs. To the best of our knowledge this is
one of the first apples-to-apples comparisons of these design styles.
Many template-based fine-grain pipelined asynchronous designs have recently been
proposed with varying degrees of robustness and performance [39, 85, 86, 87, 96, 103].
The most robust fine-grain pipelined design is the quasi-delay insensitive (QDI) design.
The QDI pipelined design has no timing assumptions among blocks and only local timing
assumptions within a block, significantly reducing design time. Moreover, it also
supports nonlinear complex behaviors with conditional/unconditional and/or multiple
input(s) and output(s) [67]. However, it often requires more area and consumes more
energy compared to other design styles [12].
The bundled-data design, often used for medium-grain pipelining, can offer medium
throughput, smaller area, and lower power compared to the QDI pipelined design, with
setup and hold time assumptions similar to those of synchronous designs. The most
challenging problem for bundled-data designs is the design of the control circuitry.
Practical control circuits must handle complex nonlinear control [67]. Previous works
have addressed these problems using logic synthesis approaches, but these approaches
rely on the designer to produce correct and efficient specifications, which is often
difficult and error-prone [31, 32, 40, 118]. We propose control circuit templates that
efficiently handle the control of such complex nonlinear pipelines, simplifying control
circuit design.
We compare the proposed bundled-data designs with a synchronous counterpart and
several QDI pipelined designs as applied to a matrix-vector multiplication core of the
DCT. The comparison to the synchronous design demonstrates that the bundled-data design
can achieve better performance with negligible energy penalty. Additionally, the
comparison to the QDI pipelined designs with various micro-architectural differences
shows the tradeoffs of each design style in terms of area, power, performance, robustness
and design time. The results suggest that the bundled-data design achieves lower power
and smaller area, while the QDI designs yield higher performance, greater robustness to
process variations, and reduced design time, at the cost of higher power and larger area.
1.3 High-level synthesis for asynchronous systems
In synchronous design, design flows and numerous commercial CAD tools, including
synthesis, analysis, and verification tools, have been developed at various levels of
abstraction from high-level specification to physical design [48, 94]. There are a
limited number of CAD tools for asynchronous circuit design, and most of them are still
under development at educational or research institutions. These include high-level
synthesis tools [1, 4, 6, 20], performance analysis tools [13, 21, 60], relative timing
verification tools [58, 91] and physical design tools [21].
One of the most important steps in the design flow is high-level synthesis. High-level
synthesis is the step that chooses the design best optimized for a given set of objective
functions while satisfying design constraints. In particular, performance-driven
high-level synthesis plays a key role in the synthesis of high-performance systems. Most
high-performance designs exploit a mixture of multi-threading, which allows multiple
independent problem instances to execute simultaneously, and pipelining, which decomposes
a problem into a series of smaller operations operating concurrently. However, most
existing high-level synthesis CAD tools are limited to handling designs with only one
thread of execution due to the limitations of control data flow graphs (CDFGs) as the
input model. We propose a performance-driven high-level synthesis approach for concurrent
asynchronous
systems using marked graphs, a restricted form of Petri nets, as the input model. Unlike
CDFGs, marked graphs can easily express highly concurrent asynchronous systems,
including those with pipelined and multi-threading behaviors.
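To make the modeling choice concrete, here is a minimal, hypothetical marked-graph interpreter (the class name, structure, and example graph are illustrative assumptions, not the dissertation's tool). A node is enabled when every incoming edge carries a token; firing it consumes those tokens and produces one on each outgoing edge, which is how pipelined, multi-threaded concurrency arises naturally:

```python
class MarkedGraph:
    """Marked graph: a Petri net in which every place has exactly one
    producer and one consumer, modeled here as tokens on edges."""
    def __init__(self, edges, marking):
        self.edges = edges            # list of (src, dst) pairs
        self.tokens = dict(marking)   # (src, dst) -> token count

    def enabled(self, node):
        ins = [e for e in self.edges if e[1] == node]
        return bool(ins) and all(self.tokens.get(e, 0) > 0 for e in ins)

    def fire(self, node):
        assert self.enabled(node)
        for e in self.edges:
            if e[1] == node:          # consume one token per input edge
                self.tokens[e] -= 1
            if e[0] == node:          # produce one token per output edge
                self.tokens[e] = self.tokens.get(e, 0) + 1

# Two-stage ring: A -> B -> A, with one initial token on the B->A edge.
g = MarkedGraph(edges=[("A", "B"), ("B", "A")],
                marking={("B", "A"): 1})
assert g.enabled("A") and not g.enabled("B")
g.fire("A")                # the token moves onto the A->B edge
assert g.enabled("B")      # now B can fire, and the ring cycles forever
```

Several independent tokens circulating in such a graph model multiple threads in flight at once, which a single-threaded CDFG cannot express directly.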
This work forms an important step for a complete CAD framework from system spec-
ification to implementation. The proposed algorithms can be used to limit the number
of operations that are run concurrently, constraining peak power. They also have direct
applicability to the high-level synthesis of asynchronous circuits, in which Petri net spec-
ifications are typically used, but can be applied to any concurrent hardware or software
system modeled by a marked graph.
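As a hypothetical illustration (not the dissertation's actual tool), a marked graph can be represented as places holding token counts between transitions, where a transition fires once every one of its input places holds a token:

```python
# Minimal marked-graph simulator (illustrative sketch, not the thesis tool).
# A marked graph is a Petri net in which every place has exactly one
# producer transition and one consumer transition.

class MarkedGraph:
    def __init__(self):
        self.places = {}          # place name -> token count
        self.inputs = {}          # transition -> list of input places
        self.outputs = {}         # transition -> list of output places

    def add_place(self, name, src, dst, tokens=0):
        self.places[name] = tokens
        self.outputs.setdefault(src, []).append(name)
        self.inputs.setdefault(dst, []).append(name)

    def enabled(self, t):
        # A transition is enabled when every input place holds a token.
        return all(self.places[p] > 0 for p in self.inputs.get(t, []))

    def fire(self, t):
        assert self.enabled(t), f"{t} is not enabled"
        for p in self.inputs.get(t, []):
            self.places[p] -= 1
        for p in self.outputs.get(t, []):
            self.places[p] += 1

# Two-stage pipeline: t1 feeds t2, with a feedback place modeling buffer space.
g = MarkedGraph()
g.add_place("data", "t1", "t2", tokens=0)   # forward data token
g.add_place("space", "t2", "t1", tokens=1)  # backward "bubble" token

assert g.enabled("t1") and not g.enabled("t2")
g.fire("t1")                                # t1 produces a token for t2
assert g.enabled("t2") and not g.enabled("t1")
```

The feedback place is what lets the same formalism capture pipelining: the number of tokens on the backward place bounds how many data tokens can be in flight.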
1.4 Contributions
This section presents the key contributions of this dissertation. First, we develop control
circuit templates for asynchronous medium-grain bundled-data pipelines for low-power
and low-area applications. Second, we compare and evaluate two different pipelined de-
sign styles, the proposed bundled-data designs and the QDI fine-grain pipelines, in terms
of area, performance, and energy. Third, we develop a performance-driven synthesis tool
for highly concurrent asynchronous systems. More detailed descriptions of each contri-
bution are as follows:
• Develop three novel control circuit templates for an asynchronous bundled-data
pipelined design that simplify the design of non-linear pipelines. The proposed
bundled-data pipelines include novel data-dependent delay lines with integrated
control circuitry to efficiently implement speculative completion sensing. We demonstrate
the advantages of our control templates by applying them to a matrix-vector
multiplication core of the discrete cosine transform. Comparisons with a comparable
gated-clock pipelined synchronous counterpart suggest that the best bundled-data
design yields 30% higher throughput with 5% energy overhead and has a 50%
better voltage-independent Eτ² metric [105].
• Evaluate energy, throughput, and area comparisons of two well-known asynchronous
design styles (bundled-data pipelines vs. 2-D QDI fine-grain pipelines [67]) applied
to matrix-vector multiplication. We implement and compare a bundled-data design
and various QDI designs with different micro-architectural choices, considering
bit/block-skewed datapaths and the impact of loop unrolling. The results suggest
that the best QDI design yields significantly better worst-case performance than the
average-case performance of the bundled-data design and has a better Eτ². In particular,
the best QDI design has 3.1 times higher throughput, consumes 7.5 times
more energy, but has 22% better Eτ² than the bundled-data design. It is, however,
5.3 times bigger than the bundled-data design. This demonstrates that, for at least
this application, the QDI design style can provide higher performance, lower design
effort, and improved energy consumption for a given performance at the cost of larger
area.
• Develop a high-level synthesis tool for highly concurrent asynchronous systems.
We first study key behaviors of highly concurrent systems, including multi-threading
and pipelining, and propose the use of marked graphs (instead of CDFGs) as the
input model because they can easily express such systems. Then, we develop a
performance-driven synthesis tool for systems modeled as marked graphs. We first
define a valid schedule and develop a novel approach for calculating a valid scheduling
time frame for each operation in the marked graph specification. We implement
performance-driven exact and heuristic scheduling and allocation algorithms. We
observe that these algorithms define the number of resources needed, but they do not
model routing and control complexity. Hence, we propose two novel performance-driven
concurrent scheduling and binding algorithms: one exact algorithm and one
heuristic algorithm. The proposed algorithms minimize the area of the design, including
the cost of routing and control complexity. The experimental results highlight
the performance/area tradeoffs of various designs. They also suggest that the
proposed heuristic algorithms have far lower runtimes compared to the exact algorithms,
with only a modest loss of optimality.
1.5 Thesis organization
The organization of the remainder of this thesis is as follows. Chapter 2 presents rel-
evant background on asynchronous circuit design and reviews previous works on high-
level synthesis. Chapter 3 describes the proposed control circuit templates for a bundled-
data pipeline and the comparisons to its synchronous counterpart. Chapter 4 extends the
comparison to include QDI pipelined designs. Chapter 5 presents new approaches for
scheduling, allocation and binding for highly concurrent asynchronous systems modeled
with marked graphs. Finally, we conclude and discuss possible improvements to our work.
Chapter 2
Background
This chapter presents background material on asynchronous circuits used throughout the
thesis. In particular, we first introduce the notion of an asynchronous channel-based
architecture. Then, we present various basic components of asynchronous design, including
a variety of implementations of asynchronous channels and cells. This includes descriptions
of several linear and non-linear pipeline structures that these architectures often include.
Next, we present a taxonomy of asynchronous performance and power metrics used for
quantifying asynchronous designs. Finally, we review related work on high-level synthesis,
in particular previous work addressing the synthesis of highly concurrent systems.
2.1 Asynchronous channel-based architecture
The asynchronous channel-based architecture is a network of blocks communicating via
abstract channels, as illustrated in Fig. 1.1. The channel-based architecture serves as an
intermediate architecture between the front-end and the back-end of the design flow. From
the front-end perspective, the channel-based architecture provides a common representa-
tion to various back-end designs. In particular, the detailed implementations of blocks and
channels are encapsulated, thereby simplifying the synthesis and facilitating analysis and
optimizations.
From the back-end viewpoint, the channel-based architecture facilitates several dif-
ferent back-end implementations, ranging from the most robust to the most aggressive
handshaking protocols. For example, the resulting channel-based protocol can be im-
plemented with delay-insensitive channels and quasi-delay-insensitive blocks, providing
the most timing-robust design [67]. Alternatively, the channels can be implemented with
the recently proposed single-track full-buffer templates, which communicate using the
single-track handshaking protocol, for very aggressive implementations [38, 39]. Moreover, different
portions of the design can be implemented with different protocols and block alternatives.
Thus, the channel-based architecture provides a common intermediate target to a wide
range of implementations.
[Figure 2.1: Classification of asynchronous channels: (a) bundled-data channel (Req, Ack, Data); (b) conventional 1-of-N channel (Data, Ack); (c) single-track 1-of-N channel (shared Data/Ack wires).]
2.2 Communication channels and encoding styles
A communication channel is a bundle of wires between a sender and a receiver, and a pro-
tocol for communicating information discretized into tokens (representing data, control,
or a mixture). In a bundled-data channel, as illustrated in Fig. 2.1(a), tokens are encoded
using one wire per bit of information, a request line (Req) is used to tell the receiver when
the token is valid, and an acknowledge line (Ack) is used to tell the sender when the token
has been received. In other words, the data is bundled with the request line. In a 1-of-N
channel, as illustrated in Fig. 2.1(b), on the other hand, N wires are used to encode log₂N
bits and no request line is needed. In particular, a widely used form of 1-of-N encoding
is 1-of-2 (also called dual-rail encoding), in which two wires are used to encode one bit of
information. In 1-of-N encoding, also called one-hot encoding, the validity of the data is
encoded in the values of the N wires, where all zeros indicate that the bundle of wires is
reset and holds no token.
[Figure 2.2: Basic asynchronous handshaking protocols: (a) basic 2-phase handshaking protocol; (b) conventional 4-phase handshaking protocol; (c) single-track handshaking protocol.]
There exist two types of 1-of-N communication channels: the conventional 1-of-N channel
and the single-track 1-of-N channel [14, 38]. In the conventional channel, the sender uses
a 1-of-N channel to send a token to the receiver, and the receiver responds to the sender
using a separate acknowledgement signal, as shown in Fig. 2.1(b). In the single-track
1-of-N channel, there is no separate acknowledgement signal. Instead, the sender sends
a token by asserting one wire of the 1-of-N channel and then floats the wires in tri-state.
The receiver responds by de-asserting the asserted wire and then releasing the wires into
tri-state, completing the handshake and readying the channel for a new token. The
single-track 1-of-N channel is depicted in Fig. 2.1(c).
2.3 Basic handshaking styles
This section presents a classification of the basic handshaking styles used across asynchronous
communication channels. Handshaking styles can be largely classified into
two classes: two-phase handshaking and four-phase handshaking, illustrated in Fig. 2.2.
In two-phase handshaking, also called non-return-to-zero handshaking, each transaction
performs two handshaking phases: one phase for a request and another phase for an
acknowledge, as shown in Fig. 2.2(a). In four-phase handshaking, also called return-to-zero
handshaking, one transaction comprises four phases: after the sender sends
the data (the first phase), the acknowledgement is asserted by the receiver (the second
phase). Next, the sender resets the data (the third phase), followed by the receiver resetting
the acknowledgement (the fourth phase). Fig. 2.2(b) illustrates the handshaking
sequence in four-phase handshaking. If the acknowledgement is active low, it is often
referred to as an enable signal (En).
Although we demonstrate both handshaking styles using separate request and acknowledgement
wires, the handshaking style of the single-track channel is a special case:
it is two-phase but return-to-zero. In Fig. 2.2(c), the sender asserts a request transition
to send data, and the receiver de-asserts the same wire to acknowledge the sender.
The fact that two-phase handshaking has half the number of transitions per transaction
compared to four-phase handshaking might suggest an advantage in both performance
and power consumption. In practice, however, the non-return-to-zero nature of two-phase
handshaking typically introduces more complex control circuitry that yields slower
and bigger designs [99]. Hence, four-phase handshaking has often been used in recent
asynchronous designs. One exception, however, is on-chip interconnect, where complicated
logic is not necessary and simple two-phase handshaking can yield both power and
performance advantages. Another exception is two-phase single-track handshaking,
where return-to-zero signaling enables the handling of complex control, yielding
very high performance circuits [38, 39].
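The four-phase sequence described above can be sketched as a tiny transaction model (an illustration of the protocol's phase ordering, not a circuit from the thesis):

```python
# Illustrative model of one four-phase (return-to-zero) transaction.
# Each phase toggles exactly one of the two wires, in the order
# described in the text: Req+, Ack+, Req-, Ack-.

def four_phase_transaction(req, ack):
    """Yield (req, ack) after each of the four phases, starting from (0, 0)."""
    assert (req, ack) == (0, 0), "channel must start in the reset state"
    req = 1; yield (req, ack)   # phase 1: sender asserts request (data valid)
    ack = 1; yield (req, ack)   # phase 2: receiver asserts acknowledge
    req = 0; yield (req, ack)   # phase 3: sender resets request (data reset)
    ack = 0; yield (req, ack)   # phase 4: receiver resets acknowledge

phases = list(four_phase_transaction(0, 0))
assert phases == [(1, 0), (1, 1), (0, 1), (0, 0)]
```

The final state equals the initial all-zeros state, which is exactly the return-to-zero property; a two-phase transaction would end with the wires in the opposite parity instead.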
2.4 Delay models
This section discusses the various delay models of asynchronous circuit design. Most
design styles make timing assumptions on wires and gates; it is thus the task of
the designers to extract and verify a list of timing constraints. For example, in synchronous
systems, there are two major timing constraints: setup and hold. In asynchronous
systems, designs can be classified by their timing assumptions on wires
and gates as follows.
• Delay Insensitive (DI): Delay-insensitive design is the most conservative and robust
style, making no timing assumptions on either gate or wire delays [72]. However, it
has been shown that very few gate-level delay-insensitive designs can exist [73].
• Quasi-Delay Insensitive (QDI): Quasi-delay-insensitive design works correctly regardless
of gate and wire delays except for very loose timing assumptions on some
internal wire forks called isochronic forks [73]. An isochronic fork is a wire fork
assumed to have the same delay at each of its ends. In other words, the difference
in time at which the signal arrives at the ends of an isochronic fork must be
less than the minimum gate delay. In practice, if these timing assumptions are local
within a functional block, the block can be as robust as a DI block. An extension
of the isochronic fork that relaxes timing constraints through a number of logic gates,
called the extended isochronic fork, has also been proposed [16].
• Speed Independent (SI): This delay model assumes that all gate delays can be
arbitrarily large but all wire delays are negligible and can be ignored [11]. Thus, this
delay model implicitly assumes that all forks are isochronic. For a small circuit,
this timing assumption can be easily satisfied [111].
• Scalable Delay Insensitive (SDI): In the SDI delay model, a large chip is divided into
many small blocks that communicate delay-insensitively. Within each block, the
circuit is designed under a more relaxed delay model than the QDI model: SDI
assumes there exists a relative delay bound between any two
components [81]. This delay model has been extensively applied to microprocessor
design [80].
• Bounded delay: The bounded-delay model assumes minimum and maximum delay
bounds on all gates in the circuit; the circuit works correctly within
these bounds. Such timed circuits can be faster, smaller, and
lower power than their QDI or SI counterparts but require extensive analog timing
simulation/verification to meet all timing constraints [104].
• Relative timing: In this timed delay model, a set of relative orderings of events is
listed so that circuit designers can easily identify and validate the list to ensure
correctness. These circuits can have the same benefits as timed circuits and may be
easier to validate [58].
2.5 Pipelines and their non-linear behavior
Pipelining is a well-known technique for effectively increasing the throughput of a
system [25]. In synchronous systems, pipeline stages are synchronized by a global clock
signal whose period must accommodate each stage's worst-case computation delay.
Pipelining applied to asynchronous design allows the circuit to be packed with more tokens,
yielding very high throughput. This section introduces two asynchronous pipeline design
styles along with the importance of non-linear pipeline behavior.
[Figure 2.3: Bundled-data linear pipelines: datapath units (DPU) with delay lines, asynchronous controllers (AC), output flip-flops (FF), conditional inputs (CIs), local clocks (lclk), and req/en/info channel wires.]
2.5.1 Bundled-data pipelines
A bundled-data pipeline, or micropipeline [104], uses a single-rail synchronous datapath
coupled with asynchronous controllers, yielding low area, good average performance, and
low power. A bundled-data linear pipeline stage operates on left and right bundled-data
channels, as illustrated in Fig. 2.3. It consists of a standard synchronous datapath (DPU)
in which a combination of a delay line and an asynchronous control circuit (AC) controls
an output flip-flop (FF). The setup and hold requirements on the flip-flop are often called
bundling constraints. Additional setup and hold requirements on the conditional inputs
(CIs) from the datapath to the asynchronous control may also exist. The controller is
responsible for triggering the FF via the local clocks (lclk) and generating an output control
token to communicate with the next pipeline stage.
[Figure 2.4: 2-D QDI fine-grain pipelines: (a) 2-D pipelining among cells F11, F12, F21, F22; (b) PCHB pipeline template with functional block F, left/right completion detectors (LCD, RCD), and a C-element.]
2.5.2 1-of-N QDI fine-grain pipelines
A QDI communicating cell, shown in Fig. 2.4(a), can be implemented with various
circuit templates depending on the selected protocol. One common template is the
Pre-Charged Half-Buffer (PCHB) template, shown in Fig. 2.4(b), in which F is a functional
block comprised of controlled dynamic gates followed by inverter drivers. The LCD and
RCD are the left and right completion-sensing circuits, respectively. Each cell communicates
with its left and right environments using the PCHB four-phase handshaking protocol [67].
Detailed implementations and other template alternatives can be found in [67].
The functional decomposition into small communicating cells and the 2-D communication
among them, shown in Fig. 2.4(a), is the key to achieving high throughput. The
reason is that each cell's completion-sensing logic operates on only a few bits,
thereby reducing the impact on the global cycle time. Moreover, since the forward latency
path through each cell is a single stage of domino logic with no related setup and hold timing
assumptions, the 2-D array yields very low latency.
[Figure 2.5: Non-linear asynchronous pipeline stages: (a) fork stage (fork/copy); (b) join stage (join/adder); (c) conditionally producing output (split, Sel = 0); (d) conditionally reading input (merge, Sel = 0).]
2.5.3 Non-linear pipelines
Complex system design inevitably involves non-linear behaviors such as writing multiple
tokens (forks) and reading multiple tokens (joins) [67, 104]. It is therefore important
that any pipeline template can handle multiple inputs and outputs as well as conditional
reading and writing of tokens.
A fork stage is a stage that receives an input and generates multiple output tokens. An
example is a copy stage, which receives an input and forwards it to multiple outputs,
as shown in Fig. 2.5(a). A join stage, depicted in Fig. 2.5(b), receives multiple
inputs, performs computation, and produces one output token. A common join example is
a full-adder cell, in which the sum and carry circuits receive three inputs and each produces
one output.
There are some complex pipeline stages in which token forwarding depends on a
control token. We call these stages conditional pipeline stages or routing stages. The
conditionally-writing-output stage, called a split, receives multiple inputs and conditionally
generates one output. In a split, one of the joined inputs (the control input) is unconditionally
read and used to route an input data token to one of the outputs based on its value. This stage
behaves similarly to a de-multiplexer in synchronous design. Fig. 2.5(c) shows an
abstract view of the split stage. It is worth mentioning that there exists a special
conditionally-writing stage called a skip, which conditionally suppresses its output (no output
generated) [67]. Another complex behavior is conditionally reading an input, known as a
merge. The control value in a merge stage is used to conditionally read one of the inputs and
forward it to the output, as shown in Fig. 2.5(d). This behavior is equivalent to a
multiplexer in synchronous design.
Note that other non-linear pipeline behaviors can be derived similarly by the compo-
sition of the previously introduced non-linear pipeline stages.
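The split and merge behaviors above can be sketched as token-routing functions, following the de-multiplexer/multiplexer analogy in the text (token-level models only, not circuit code):

```python
# Token-level models of split (conditional write) and merge (conditional read).
# Illustrative only: real stages handshake on channels rather than call functions.

def split(ctrl, data, n_outputs):
    """Route the data token to output `ctrl`; other outputs get no token."""
    outs = [None] * n_outputs
    outs[ctrl] = data
    return outs

def merge(ctrl, inputs):
    """Read only input `ctrl` and forward it to the single output."""
    return inputs[ctrl]

assert split(0, "tok", 2) == ["tok", None]   # like a de-multiplexer
assert merge(1, ["a", "b"]) == "b"           # like a multiplexer
```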
2.6 Measurement metrics
In this section, we discuss the performance and energy metrics applied to asynchronous
design. These metrics are used primarily in the design comparisons throughout
the thesis. For most metrics, we discuss two aspects: one that applies to a single
asynchronous block and one that applies to a network of blocks.
2.6.1 Cycle time (τ)
An asynchronous circuit consists of a complex network of inverting rings (for the circuit
to oscillate, any ring must contain an odd number of inverting gates) in which tokens
propagate in a circular fashion. The cycle time of a simple ring without any fork-join branches
is defined as the accumulated delay of the gates in the ring divided by the number of tokens
circulating in that ring.
First we consider a single asynchronous block under the assumption that its environments
are ideal, i.e., always ready to produce and consume tokens. An asynchronous block
may consist of multiple rings with more complicated cycles of fork-join signals, involving
rings both within and across the block boundary. Moreover, a ring may contain
multiple tokens. Thus, the cycle time of a block is defined to be the maximum
among the cycle times of all associated rings. In other words, in terms of token flow, the cycle
time of a block is the delay from the processing of the current token to the same
processing of the next token.
The cycle time of a network of asynchronous blocks, where the environment of each
block may not be ideal, depends not only on the local cycles of each block but also on cycles
through its neighboring cells. Loop and fork-join structures in the network
may become the bottleneck of the design. Pipeline optimizations that analyze and optimize
the pipeline structure to achieve the minimum cycle time are addressed in [60, 67]. Note that
the cycle time is the inverse of throughput, which indicates how many result tokens
can be produced per unit time.
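The ring-based definition above can be sketched numerically (the gate delays and token counts below are made-up illustrative values, not measurements from the thesis):

```python
# Cycle time of a block as the maximum over its rings of
# (accumulated gate delay in the ring) / (tokens circulating in the ring).
# The delays and token counts below are hypothetical illustrative numbers.

def ring_cycle_time(gate_delays, tokens):
    assert tokens >= 1, "a ring must hold at least one token to oscillate"
    return sum(gate_delays) / tokens

def block_cycle_time(rings):
    """rings: list of (gate_delays, token_count) pairs; the worst ring dominates."""
    return max(ring_cycle_time(d, n) for d, n in rings)

rings = [
    ([0.2, 0.3, 0.2], 1),        # local ring: 0.7 / 1 = 0.70
    ([0.2, 0.2, 0.2, 0.2], 2),   # cross-boundary ring: 0.8 / 2 = 0.40
]
tau = block_cycle_time(rings)
assert abs(tau - 0.70) < 1e-9    # the slowest ring sets the block's cycle time
throughput = 1 / tau             # cycle time is the inverse of throughput
```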
2.6.2 Forward latency (FL)
Forward latency is the time from when tokens start computation to when
the resulting tokens are generated, under the assumption that the block is empty and its
right environment is ready to consume tokens. In other words, the forward latency of an
asynchronous block measures the computation delay of a token through the block.
The forward latency of a network of blocks is the delay between when input tokens
enter and when the corresponding output tokens are generated. Hence, forward latency is
a key performance parameter, particularly along a critical path of the design where
the results drive the decision making of the next operation, as in a speculative
control block.
2.6.3 Backward latency (BL)
For each asynchronous block, a token is first computed with the delay of the forward latency;
the block then resets and prepares for the next cycle. The backward latency is the time
period from the end of the forward latency to the beginning of the next cycle. For a handshaking
protocol in which the backward latency involves only the immediately neighboring blocks,
the cycle time can be described by the following equation:
τ = FL + BL
During the backward-latency period, a token carries no information; we call such a
token a bubble. Thus, backward latency is a performance metric indicating the
bottleneck at the block level. Aggressive protocols, such as the single-track handshaking
protocol, generally have very small backward latency. The backward latency of a network of
blocks operating concurrently is the delay between when the bubbles at the output
are released and when new input tokens enter, under the assumption that all blocks
are initially full with tokens. A detailed analysis of the dynamic behavior of asynchronous
systems can be found in [67].
2.6.4 Energy (E)
Energy is a key parameter for the design of low-power portable applications. The dominant
energy consumption is the dynamic energy caused by the switching activity of transistors.
Dynamic energy depends on the switched load capacitance and the operating voltage as
follows:
E = (1/2)CV²
Here, C is the total accumulated switched capacitance and V is the operating voltage.
This equation indicates that V has a quadratic impact on energy
compared to the linear impact of C. Thus, reducing the supply voltage is one way
to achieve a low-energy design. However, one must note that reducing the supply voltage
results in longer switching times, yielding a performance penalty.
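As a quick numeric illustration of the quadratic dependence (the capacitance and voltage values are made up):

```python
# Dynamic energy E = (1/2) * C * V^2; illustrative numbers only.

def dynamic_energy(c_farads, v_volts):
    return 0.5 * c_farads * v_volts ** 2

e_full = dynamic_energy(10e-12, 1.2)   # 10 pF switched at 1.2 V
e_low  = dynamic_energy(10e-12, 0.6)   # same capacitance at half the voltage
assert abs(e_low / e_full - 0.25) < 1e-9  # halving V quarters the energy
```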
2.6.5 Voltage-independent metric (Eτ²)
One great advantage of asynchronous circuits, mentioned earlier, is their robustness to
supply and process variations. This advantage, coupled with supply-tuning techniques,
provides the ability to trade off energy and performance in comparable systems. Since energy
decreases quadratically with supply voltage reduction while cycle time increases
linearly with supply voltage reduction, the quantity Eτ² provides a metric
independent of supply-voltage variation, as follows [75, 105]:
E = k1V² and V = k2/τ
So, Eτ² = k1k2² = k
This metric is useful for comparing various synchronous and asynchronous design
styles. The metric suggests that a design with a better (lower) Eτ² consumes lower
energy for a given performance. Thus one way to achieve an energy-efficient design is to
design a high-throughput system; the supply voltage can then be lowered to achieve a given
performance with quadratic energy savings.
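The voltage independence of Eτ² can be checked numerically under the first-order model above (k1 and k2 are arbitrary illustrative constants):

```python
# First-order model from the text: E = k1*V^2 and V = k2/tau,
# so E*tau^2 = k1*k2^2 is independent of the supply voltage V.
# k1 and k2 are arbitrary illustrative constants.

k1, k2 = 3.0e-12, 0.9

def metrics_at(v):
    energy = k1 * v ** 2      # energy scales quadratically with V
    tau = k2 / v              # cycle time scales inversely with V
    return energy, tau

e_hi, tau_hi = metrics_at(1.2)
e_lo, tau_lo = metrics_at(0.8)
assert abs(e_hi * tau_hi**2 - e_lo * tau_lo**2) < 1e-18
assert abs(e_hi * tau_hi**2 - k1 * k2**2) < 1e-18
```

Evaluating the product at two different supply voltages yields the same constant, which is why the metric supports fair comparisons across designs operated at different voltages.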
2.7 High-level synthesis
High-level synthesis (HLS) is the design task that maps a behavioral description of the
design to an intermediate representation of concurrent architectural processes. High-level
synthesis typically involves exploring the design space, evaluating the design against a
complex cost function (including aspects of area, performance, and power), and choosing the
architecture that best optimizes the objective function while meeting a given set of design
constraints.
High-level synthesis typically consists of three refinement tasks: scheduling, alloca-
tion and binding. Scheduling defines the mapping of operations to time slots. Allocation
indicates the number of resources needed for a given schedule while binding maps opera-
tions to specific resources.
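As a hedged illustration of these three tasks (not an algorithm from this thesis), a toy resource-constrained list scheduler can show scheduling, allocation, and binding on a small dependence graph:

```python
# Toy list scheduler: schedules operations into time slots subject to
# data dependences and a resource limit, then binds each operation to a
# specific resource instance. Purely illustrative; assumes an acyclic graph.

def list_schedule(deps, n_resources):
    """deps: op -> list of predecessor ops. Returns (schedule, binding)."""
    schedule, binding = {}, {}
    remaining = set(deps)
    t = 0
    while remaining:
        # Ready = all predecessors already scheduled in an earlier slot.
        ready = sorted(op for op in remaining
                       if all(p in schedule and schedule[p] < t for p in deps[op]))
        for slot, op in enumerate(ready[:n_resources]):
            schedule[op] = t            # scheduling: op -> time slot
            binding[op] = slot          # binding: op -> resource instance
            remaining.remove(op)
        t += 1
    return schedule, binding

deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
sched, bind = list_schedule(deps, n_resources=1)
# With one resource, a and b serialize, so c waits until slot 2.
assert sched["c"] > max(sched["a"], sched["b"])
```

Here the resource limit plays the role of allocation (how many instances exist), while the `slot` index chosen per time step is a naive binding; real HLS tools optimize both jointly against cost functions, as discussed below.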
Numerous high-level synthesis works have been published over several decades
[23, 26, 45, 46, 52, 53, 54, 55, 56, 61, 77, 78, 88, 93, 100, 102, 108, 116, 121]. Since
our work involves the design of systems with cyclic, multi-threading, and/or pipelined
behaviors, we first review related work on the high-level synthesis of concurrent systems,
including both synchronous and asynchronous hardware as well as software systems.
For synchronous hardware systems, many works use a CDFG or a similar model as
the input. Gebotys et al. [44] propose scheduling algorithms that support loop pipelining
and loop winding. Tosun et al. introduce an MILP formulation that can solve the binding
problem by allowing non-overlapping operations to map to the same resource [108]. Both
of these works, however, are limited to handling designs with only one thread of execution.
For asynchronous hardware systems, Badia et al. adopt list-based algorithms and apply
them to asynchronous design, but their approach utilizes data flow graphs as the input
model, which have the same limitation as CDFGs [5]. Other HLS works in asynchronous
design, such as controller synthesis [17] and process decomposition [106, 113], do not
address the scheduling and binding problem.
There is also a very large body of related research addressing software pipelining.
Here the problem is somewhat different, as it focuses on scheduling software loops on a
fixed hardware platform. Allan et al. give a comprehensive review of software-pipelining
techniques and approaches, some of which can handle multi-threading behavior [2].
For example, Govindarajan et al. propose an approach to scheduling for the
software-pipelining problem using cyclic dependence graphs as the input model [47]. In
particular, this work presents an exact ILP algorithm to minimize the number of registers
needed to store temporary variables across loop iterations. Cortadella et al. introduce
quasi-static scheduling using a Petri net as the input model [33]. This work describes
an approach to avoid deadlock in the scheduling step, particularly for non-deterministic
systems, but scheduling and binding optimization is not addressed.
To the best of our knowledge, ours is the first ILP formulation of HLS using marked
graphs as the input specification. Our exact algorithm has some similarities to proposed
approaches in the software-pipelining domain, but ours is based on Petri net formalisms
rather than data dependence graphs. Also, our formulation has different objective functions
and is subject to different constraints. Moreover, we believe our heuristic algorithms
are the first to target performance-driven list scheduling that supports multi-threading. We
have not found any other work in which linear programming is used to calculate valid time
frames.
Other synthesis and optimization works in asynchronous design that do not address
the scheduling, allocation, and binding problems are summarized as follows. The synthesis
of asynchronous control circuits has been studied extensively. The community has
recently focused on two particular specification styles. The first, burst mode, is a Mealy-type
state-machine description [119, 120]. The second, signal transition graphs (STGs), is a
Petri-net-based formulation [32]. For more detailed information, [17] provides a good summary.
On the other hand, several works have focused on synthesis approaches that
syntactically translate behavioral descriptions to concurrent VLSI processes, often called
syntax-directed translation. The well-known language for describing the concurrent behavior
of asynchronous systems is communicating sequential processes (CSP), pioneered by [51].
The key goal of these approaches is a correct-by-construction synthesis
methodology, implying that the resulting processes must behave equivalently
to their input behavioral descriptions. Burn proposes a syntax-directed translation in which
each language construct is syntactically translated into smaller processes and then mapped
to asynchronous blocks [20]. This approach, however, leaves the designer to determine the
concurrent processes in the system manually, yielding longer design times.
Similarly, Philips develops a synthesis tool using a CSP-like programming language,
Tangram [18]. A behavioral description expressed in Tangram is translated via syntax-directed
translation into an intermediate form called handshake circuits, which represent
abstract asynchronous blocks. Handshake circuits can map to multiple back-end
implementations, i.e., QDI and/or bundled-data [10, 90]. Additionally, peephole optimizations
have been proposed to further improve the area and speed of the design [90]. Next, Bardsley
et al. adopt the syntax-directed translation approach from Tangram and build a new synthesis
tool called Balsa [7, 36]. Balsa has successfully synthesized numerous practical chips
[8, 15, 43, 57]. Additionally, Chelcea et al. develop controller optimizations targeting
a burst-mode-oriented back-end implementation for Balsa [27, 28, 29]. Lastly, Yonada
et al. introduce a design flow similar to Balsa. This approach synthesizes a high-level
C-like language (SpecC) into asynchronous timed gate-level circuits. Instead of using
syntax-directed translation, control is modeled and synthesized using an STG-based
approach, allowing global timing optimization [117].
The synthesis of high-speed asynchronous designs introduced in [112, 113] is called
data-driven decomposition. Data-driven decomposition decomposes a sequential CSP
process into a set of smaller concurrent CSP processes based on data dependencies and
a projection technique proposed by [71]. Similarly, Teifel et al. demonstrate data-driven
decomposition that efficiently maps CSP processes onto pre-defined pipelined circuit
blocks implemented in an FPGA [106]. The key disadvantage of these approaches is
that they are restricted to high-speed applications.
Many synthesis works utilize commercial synchronous high-level synthesis engines,
which translate behavioral HDL languages to gate-level netlists, as the front-end of their
tools. The back-end tools then map and optimize such netlists into asynchronous designs. The
key drawback of this approach is that the generated netlists may not be optimally synthesized
for the required targets, since current commercial synchronous synthesis tools do not support
several key parameters that can profoundly impact such targets. In particular, the
tools typically do not understand the notion of global and local cycle times, eliminating the
opportunity to share resources. We briefly describe these approaches as follows.
Kudva et al. propose a synthesis framework called ACK. This approach utilizes the existing synchronous synthesis flow and tools for datapath synthesis [63]. Control is modeled and synthesized using a distributed burst-mode specification [64].
Linder et al. develop a synthesis approach that uses commercial synthesis tools to generate a netlist and translates it into a delay-insensitive asynchronous implementation called Phased Logic [66]. Each gate in the netlist is replaced with a dual-rail gate using a special encoding called level-encoded two-phase dual-rail (LEDR) [35]. This encoding scheme is beneficial for power consumption, since fewer transitions are required per data token propagating through one stage. However, it requires additional feedback connections inserted in the original netlist to ensure safeness [66].
Theseus Logic introduces a design flow that uses a new logic family called Null Convention Logic (NCL) [62, 65]. In this design flow, large designs are synthesized using commercial synthesis tools. Then, synchronous registers in the datapath netlist are replaced by asynchronous registers that communicate using a delay-insensitive handshaking protocol. However, this design flow produces pipelined circuits architecturally equivalent to the synchronous RTL implementation, which is typically coarse-grained. Thus, large completion detection circuits slow down the design. Another key drawback is that the behavioral specification is extended to include the notion of channels to implement handshake mechanisms between datapath and control. This requires the designer to manually specify some handshake signals in the specification, which makes the existing RTL hard to reuse [62].
Another synthesis paradigm that automates the design of asynchronous circuits from synchronous netlists is de-synchronization, proposed by [19]. This approach translates synchronous netlists synthesized by commercial synthesis tools into asynchronous implementations by replacing the global clock network with a set of asynchronous controllers. Several new and existing four-phase handshaking protocols for latch controllers are proposed. The benefits of this approach are that it provides a fully automated synthesis flow, does not require any knowledge of asynchronous design by the designer, and does not change the structure of the synchronous datapath and controller implementation, but only affects the synchronization network. However, the target architecture of this method is restricted to micropipeline design.
None of the above approaches supports the synthesis of automatic pipelining; their performance remains at the level of the original specification. Smirnov et al. propose a design flow that translates a behavioral specification into QDI pipelined asynchronous circuits by synthesizing a synchronous implementation using commercial tools and translating (weaving) it into an asynchronous pipelined implementation [97, 98]. The advantage of this approach is that it demonstrates a general framework for automated synthesis of pipelined asynchronous circuits. However, fine-grained pipelining can eliminate opportunities for resource sharing in low-performance applications and also incurs a high area penalty.
Chapter 3
Control Circuit Templates for
Asynchronous Bundled-data Pipelines
This chapter demonstrates the design of efficient asynchronous bundled-data pipelines,
optimized for average-case performance. The greatest design challenge for bundled-data
design is in the development of efficient control circuits. We address two major control
circuit design challenges. First, due to the large control overhead, bundled-data designs
are generally slower than their synchronous counterparts. The proposed control protocols
reduce this overhead significantly. Second, most existing methodologies are limited to
simple linear pipeline designs and the adaptation to more complex control is generally
difficult and error-prone.
Furber et al. propose circuits for simple linear pipelines [41]. New, true four-phase circuits that better hide control overhead have also been developed and proposed by [37, 90]. Neither of these works, however, addresses the design of the more complicated control circuits required for nonlinear pipelines, such as forks, joins, splits and merges [104]. For these nonlinear pipelines, synthesis-based approaches using burst-mode diagrams (BMs) and/or signal transition graphs (STGs) are required [30, 31, 32, 40]. These approaches, however, rely on designers to produce correct and efficient specifications, which is often difficult and error-prone [118]. Initial efforts to automate this approach are presented in [107, 115].
In this chapter, we propose to adopt and extend 1-of-N rail circuit templates developed for QDI circuits to the design of bundled-data pipeline control. These templates provide a unified block-level decomposition of complex control circuits, where the implementation of each block can be easily derived manually from the overall specification. The templates greatly simplify the complex and error-prone process of designing complex control circuits using STGs or burst-mode machines. Additionally, the design of efficient templates would simplify the task of future synthesis tools. Instead of performing logic synthesis as in [107, 115], the tools need only perform a mapping process that maps designs to the target templates. Thus, we can exploit both the low area and power of single-rail datapaths and the simplicity of a template-based control design methodology. Specifically, we show how to adopt an existing QDI template called the precharged full-buffer (PCFB) [67] to bundled-data pipelines, develop an advanced, true four-phase full-buffer (T4PFB) template that better hides control overhead, and finally, optimize the T4PFB template into a zero-overhead T4PFB template which completely hides control overhead.
[Figure 3.1: PCFB template and a detailed circuit: (a) PCFB circuit template for 1-of-N input and 1-of-M output channels; (b) R_gen circuit for the i-th output rail.]
3.1 PCFB template for bundled-data pipelines
The adopted PCFB template for 1-of-N linear pipelines is shown in Fig. 3.1. Our template differs from the original PCFB in that the conditional inputs (CIs) can be single-rail and the local clock signal(s) lclk have no associated acknowledgment. There is one R_gen block for each output rail R_i, as depicted in Fig. 3.1(b). The local clock signal can be generated like any other R_i output or be generated via combinational logic with the R_i's as inputs. The iLCD and iRCD blocks are inverting left and right completion sensing circuits [67].
[Figure 3.2: STG of the abstract PCFB protocol, where each edge is labeled with its delay in gate delays.]

The abstract protocol of this template is defined by the STG in Fig. 3.2. When a left token arrives (L+), the R_gen dynamic logic blocks evaluate, generating a valid output token, the local clock fires (~R_i−, R_i+), and simultaneously the iLCD block detects the token arrival (ilcd−). Next, the iRCD block detects that the right data is valid (ircd−), which causes the left enable to be deasserted (Le−) and the internal state to be reset (en−). Once the left data is reset (L−), the iLCD block detects that the data is null (ilcd+) and, together with the reset of enable, causes the left enable to be reasserted (Le+). This completes the cycle for the left environment, allowing it to send a new token even if the right environment is slow or stalled, thereby avoiding a significant performance penalty [41].¹ The right environment operates concurrently. After receiving valid data, the right environment will deassert the right enable (Re−), allowing the R_gen blocks to precharge. This allows the right environment to reassert the right enable (Re+) and, simultaneously, the internal enable to be reasserted (en+). This in turn allows the R_gen blocks to reevaluate in response to a new token.

¹ This property of a full buffer [67] or fully decoupled buffer [41], which allows the left environment to reset immediately without waiting for the right environment to reset, is well suited to bundled-data pipeline design, since bundled-data design usually involves a slow right environment associated with the datapath delay of the next pipeline stage.
It is important to note that this STG is a description of the abstract protocol and, while useful to convey the level of parallelism and the timing assumptions inherent in the protocol, it is insufficient for the purposes of synthesizing control circuits. The principal reason is that it does not explicitly describe the functionality of the R_gen blocks, which can be quite complex and difficult to specify using the STG (often involving OR causality) [114]. The STG also does not describe how the conditional inputs from the datapath can induce a skip.
3.1.1 Non-linear PCFB pipelines
Fork stages need to wait for all output enable signals to set/reset before setting/resetting
the output tokens. A solution, adopted from standard PCFB, is to insert a C-element to
combine all output enable signals. If the number of fork stages is small, the C-element
can be integrated into the R_gen circuit, as illustrated in Fig. 3.3.
Join stages need to wait for all input data to be set/reset before setting/resetting the input enable. One solution is to combine the iLCD of all input channels with a C-element to detect completion of all input data. An example of the OR of L1 and L2 dual-rail channels is shown in Fig. 3.4. If one of the true rails of L1 or L2 is asserted, the true rail of R is asserted. However, both false rails of L1 and L2 need to be asserted to cause the false rail of R to be asserted. The iLCD circuit combines the completion detection of both input data (L1, L2) with the C-element shown in Fig. 3.4(b). Interestingly, this type of join causes significant timing problems with other pipeline design styles, such as PS0 [86, 111].

[Figure 3.3: R_gen circuit of the PCFB fork stage.]

[Figure 3.4: Circuits of the PCFB join stage: (a) R_gen circuit implementing L1 OR L2; (b) iLCD circuit.]
Supporting conditional reading and writing is only slightly more complex. To conditionally read a channel, the associated Le generation block generates a left enable only if the channel is read. To conditionally write a channel, the R_gen block must conditionally evaluate and handshake with the right enable only when it evaluates. In particular, a skip can be implemented by triggering the evaluation of a separate output signal (not routed out of the controller) that acts like an (M+1)-th output rail and immediately sending an acknowledgment back to the left environment without waiting for the right environment.
3.1.2 Timing and performance analysis
The original PCFB template is robust in that there are no internal timing assumptions on gate delays [73], i.e., it is quasi-delay-insensitive. Our adaptations, however, have setup and hold constraints on the conditional inputs, typical of bundled-data designs. Additionally, the local clock signal must have a sufficient pulse width to transfer information across the flip-flops. In particular, the pulse width of the clock is the same as the pulse width of the R_gen circuits if implemented as combinational logic of the R signals. If it is implemented using an R_gen circuit, the Re PMOS transistor is optional. If removed, the pulse width reduces to the sum of the delays of the iRCD (ircd−), left enable (Le−), enable (en−), and R_gen clock circuits. It is assumed that this pulse width is sufficient to latch the outputs, which is easily satisfied if the flip-flops are properly designed.
A quantitative performance analysis is based on the following assumptions. First, the delay is calculated by counting latency in terms of gate (unit) delays. The abstract STG shown in Fig. 3.2 illustrates the sequencing of events for a PCFB pipeline stage, where each edge is labeled with the above delays. Second, the analysis is performed on a homogeneous linear pipeline, assuming the completion sensing of each stage takes only one gate delay, which is a reasonable assumption for a single input/output channel of up to four rails (a 1-of-4 channel). Third, the delay calculation includes the set (DL_set) and reset (DL_reset) delays of the delay line attached to the left request input of the controller, as shown in Fig. 2.3.
Thus, the performance analysis of the PCFB template is as follows.

FL = R+_cur → R+_next
   = DL_set + (L+ → R+)
   = DL_set + 2

OH = R+_next → R+_cur,nextcycle
   = (R+_next → Le−) + (Le− → L− → ilcd+ → Le+ → R+_cur,nextcycle)
   = DL_reset + 10

τ = FL + OH
  = DL_set + DL_reset + 12
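The arithmetic above can be restated as a small helper. This is a sketch: the gate-delay constants come from the STG analysis above, while the example delay-line values are arbitrary illustrative numbers.

```python
def pcfb_metrics(dl_set, dl_reset):
    """Forward latency, overhead and cycle time of the PCFB
    template, in gate delays (per the STG of Fig. 3.2)."""
    fl = dl_set + 2        # R+_cur -> R+_next: only the set phase of the delay line
    oh = dl_reset + 10     # reset-phase handshaking is all overhead
    return fl, oh, fl + oh

# Arbitrary example: an 8-gate set phase and a minimized 2-gate
# reset phase of an asymmetric delay line.
assert pcfb_metrics(8, 2) == (10, 12, 22)
```

This makes the template's weakness explicit: shrinking DL_reset shortens the cycle, while DL_set only shifts delay between latency and the matched datapath.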
The main disadvantage of this protocol is its large control overhead in the reset phase of the protocol. The second drawback is that the forward latency contains only the set phase of the delay line. This means that the reset phase of the delay line must be minimized, motivating the use of asymmetric delay lines [95]. Lastly, the combinational logic necessary to determine the R outputs is limited to what can be implemented in a single R_gen gate.
3.2 T4PFB templates for bundled-data
To reduce control overhead, we propose a new circuit template that follows the true four-phase handshaking protocol. In particular, our template, as illustrated in Fig. 3.5, differs from the PCFB template in that it waits for the left token to arrive, the left enable to be sent back, and the left token to reset before generating a right token. In other words, the T4PFB explicitly decouples the control by forcing the handshaking with the left environment to essentially finish before beginning to communicate with the right environment. Consequently, the forward latency includes both phases of the delay line, enabling the use of either asymmetric or symmetric delay lines and facilitating lower control overhead.
[Figure 3.5: T4PFB circuit template and detailed circuits: (a) T4PFB circuit template for many 1-of-N input channels and one 1-of-M output channel; (b) latch circuit for the j-th input rail and R_gen circuit for the i-th output rail.]
[Figure 3.6: STG of the abstract T4PFB protocol, where gray edges represent timing constraints and dashed edges indicate ordering maintained by the environment.]

The STG of the abstract protocol for this template is shown in Fig. 3.6. When a left token arrives (L+), the iLCD detects that the token is valid (ilcd−) and opens the dynamic latches, allowing the token to propagate (lt+). At the same time, the inverting asymmetric C-element (iaC) deasserts the left enable (~Le+, Le−). While waiting for the left token to reset, the CL block can perform precomputation with control tokens from other input channels as needed (d+). Once the left token is reset, the iLCD detects that the token is reset (ilcd+) and isolates the latches from the arrival of new tokens. At this step, the iLCD triggers two concurrent operations. First, the iLCD triggers the functional blocks to evaluate and generate a right token (~R−, R+). After a right token is generated, the right token validity is detected (rcd+), causing the internal signals to reset (lt−, d−) in preparation for a new token. Second, the iLCD also triggers the left enable to reassert (~Le−, Le+), acknowledging the left environment. This completes the left environment protocol, allowing the left environment to send a new token. Concurrently with the left environment, when the right token is consumed, the right enable is deasserted (Re−), and the right token is reset to null (~R+, R−). Then, the right environment will reassert the right enable (Re+), thereby making the circuit ready to accept a new input token.
[Figure 3.7: The modified Le_gen circuit.]
The significant overhead reduction comes from concurrent assertion of a right token
(R+) and a left enable (Le+) enabling the left environment to latch a new datum as soon
as it receives the left enable signal Le+.
We can also improve performance by allowing the right token to reset (R−) concurrently with the resetting of the left enable (Le−). This is implemented with two parallel transistors in the PMOS stack shown in Fig. 3.7. A transistor connected to the input signal rcd enables Le− in the first cycle after global reset. A transistor connected to the Re input signal drives Le− in the remaining cycles without waiting for rcd−, thereby reducing the delay in the longest cycle, i.e., Le− → L− → R+ → Re− → Le−, from 12 to 10 gate delays (not including the delay line delay). However, this additional concurrency introduces timing margin TM7, discussed later in the timing analysis section. A more robust but lower-performance version of the T4PFB template, with no concurrency between R and Le, is discussed in [109].
Compared to the PCFB template, the functional block (R_gen) has the same NMOS network complexity but one less PMOS transistor. However, the T4PFB template provides an additional CL block that allows precomputation while waiting for the left environment to reset. This may further simplify the NMOS network in the R_gen block. Implementation and timing issues of the conditional input/output signals to/from the datapath (CIs, lclk and skip) are the same as those discussed for the PCFB template.

[Figure 3.8: Circuits of the T4PFB fork stage.]

[Figure 3.9: R_gen and CL circuits (in dashed boxes) of the T4PFB join stage.]
3.2.1 Non-linear T4PFB pipelines

The same techniques discussed in Section 3.1.1 are applicable to the design of T4PFB templates for fork and join stages. An example of a fork stage that copies input tokens to two output stages is shown in Fig. 3.8. Re1 and Re2 are connected directly to both the PMOS and NMOS networks in the R_gen circuit. An alternative is to combine Re1 and Re2 with a C-element before controlling the R_gen circuits. An example of a nonlinear join stage implementing the OR of two dual-rail inputs, L1 and L2, is depicted in Fig. 3.9. The iLCD circuit for this template is depicted in Fig. 3.4(b). The OR functionality is precomputed within the CL block. The left CL block is asserted only when the false rails lt0_f and lt1_f are both asserted, and the right CL block is asserted when either lt0_t or lt1_t is asserted. More complex nonlinear control circuits (e.g., merge and split) are derived in the same manner as their PCFB counterparts.
3.2.2 Timing and performance analysis
The T4PFB template has several easily met timing assumptions that are needed to ensure high performance. These assumptions, identified by the gray ordering edges in the STG shown in Fig. 3.6, are now analyzed in detail. The first four timing assumptions relate to the validity of the local data stored in the latches. The remaining three timing assumptions are due to the concurrent setting and resetting of R and Le.
1. Latch propagation timing margin (TM1). The left token must be properly stored in the dynamic latch (lt+) before the data are reset by the left environment (~L+). In other words, we have the following timing constraint:

TM1 = (L+ → ~L+) − (L+ → lt+)

where:
(L+ → ~L+) = (L+ → Le− → L− → ~L+) = 6 + DL_reset
(L+ → lt+) = max(L+ → ilcd− → lt+, L+ → ~L− → lt+) = 2

So TM1 = 4 + DL_reset.
2. Latch reset timing margin (TM2). After the RCD initiates the reset of the latch, the latch should have enough time to reset (lt−) before the RCD changes its output (rcd−). Thus, we have that

TM2 = (rcd+ → rcd−) − (rcd+ → lt−)

where:
(rcd+ → rcd−) = 5 + DL_set
(rcd+ → lt−) = 1

So TM2 = 4 + DL_set.
3. Data reset timing margin (TM3). To avoid reevaluation of the R_gen block with stale input data, after R_gen evaluates, the output of the CL block should reset (d−) before a new arrival of ilcd (ilcd+). Thus, we have that

TM3 = (~R− → ilcd+) − (~R− → d−)

where:
(~R− → ilcd+) = 9 + DL_set + DL_reset
(~R− → d−) = 4

So TM3 = 5 + DL_set + DL_reset.
4. Data stable timing margin (TM4).² The outputs of the CL blocks need to be stable (d+) before the output of the iLCD block is asserted (ilcd+) to prevent a glitch from the CL block from causing a spurious evaluation of an R_gen block. Thus, we have that

TM4 = (ilcd− → ilcd+) − (ilcd− → d+)

where:
(ilcd− → ilcd+) = 5 + DL_reset
(ilcd− → d+) = 3

So TM4 = 2 + DL_reset.

² Note that if the CL block is glitch free, this constraint can be ignored.
5. Output validity timing margin (TM5). Since a right token (R+) and the left enable (Le+) are generated concurrently, enough time must be given to ensure that the output validity is detected (rcd+) before a new token arrives and deasserts the left enable (~Le−). Thus, we have that

TM5 = (ilcd+ → ilcd−) − (ilcd+ → rcd+)

where:
(ilcd+ → ilcd−) = 5 + DL_set
(ilcd+ → rcd+) = 2

So TM5 = 3 + DL_set.
6. Left enable stable timing margin (TM6). Since a right token (R+) and the left enable (Le+) are generated at the same time, the left enable must be stable (~Le−) before the right enable is deasserted (Re−).

TM6 = (Re+ → Re−) − (Re+ → ~Le−)

where:
(Re+ → Re−) = 5 + DL_set
(Re+ → ~Le−) = 1

So TM6 = 4 + DL_set.
7. Left enable reset timing margin (TM7). Since a right token (R−) and the left enable (Le−) are reset at the same time, the left enable must be stable (~Le+) before the right enable is asserted (Re+).

TM7 = (Re− → Re+) − (Re− → ~Le+)

where:
(Re− → Re+) = 5 + DL_reset
(Re− → ~Le+) = 1

So TM7 = 4 + DL_reset.
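The seven margins can be tabulated to check that each stays comfortably positive. This sketch transcribes the expressions derived above; the delay-line values in the example are illustrative only.

```python
def t4pfb_margins(dl_set, dl_reset):
    """Timing margins TM1..TM7 of the T4PFB template, in gate
    delays, as derived above."""
    return {
        "TM1": 4 + dl_reset,           # latch propagation
        "TM2": 4 + dl_set,             # latch reset
        "TM3": 5 + dl_set + dl_reset,  # data reset
        "TM4": 2 + dl_reset,           # data stable (ignorable if CL is glitch free)
        "TM5": 3 + dl_set,             # output validity
        "TM6": 4 + dl_set,             # left enable stable
        "TM7": 4 + dl_reset,           # left enable reset
    }

# With an illustrative 4+4 gate-delay delay line, every margin is
# at least six gate delays, comfortably met by transistor sizing.
assert min(t4pfb_margins(4, 4).values()) == 6
```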
This analysis indicates that the worst timing margin is four or more gate delays (not including the delay line). These margins are thus easily met with proper transistor sizing. Timing constraints on the conditional inputs and the local clock (CIs, lclk) are the same as the PCFB's and are also easily met with transistor sizing and delay line design. The same performance metrics discussed in Section 3.1.2 are derived for the proposed T4PFB template as follows.
FL = R+_cur → R+_next
   = DL_set + (L+ → ilcd− → Le−) + (Le− → L− → ilcd+ → R+)
   = DL_set + DL_reset + 8        (3.1)

OH = R+_next → R+_cur,nextcycle
   = R+_next → Le+ → R+_cur,nextcycle
   = 2

τ = FL + OH
  = DL_set + DL_reset + 10
The analysis shows that the overhead of T4PFB is independent of the length of the delay line, supporting the use of both asymmetric and symmetric delay lines. Moreover, compared to PCFB, the control overhead is smaller by 4 + DL_reset gate delays, a significant improvement.
3.3 Zero-overhead T4PFB templates for bundled-data

The concurrent assertion of the right token (R+) and the left enable (Le+) in the T4PFB control template demonstrates that part of the control overhead can be hidden in the forward latency. However, the control overhead still contains a two gate delay penalty associated with the right token generation of the previous pipeline stage (from Le+ to R+). A new protocol called zero-overhead T4PFB extends the original T4PFB by hiding the remaining overhead. In particular, by adding two gate delays in the forward path of the T4PFB controller, the new template illustrated in Fig. 3.10 achieves zero overhead.
[Figure 3.10: Zero-overhead T4PFB template and detailed circuit implementation: (a) zero-overhead T4PFB circuit template for many 1-of-N input channels and one 1-of-M output channel; (b) R_gen circuit for the i-th output rail and Le_gen circuit.]
[Figure 3.11: The STG of the zero-overhead T4PFB template.]

The STG of the abstract protocol is shown in Fig. 3.11. This control protocol functions similarly to the T4PFB control protocol, as follows. First, a left token arrives, is acknowledged, and then resets (L+, Le−, and L−). After this reset, an internal token and a right token (R+) are generated concurrently with the assertion of the left enable (Le+). Notice that the assertion of the left enable (Le+) occurs two gate delays earlier than the generation of the right token (R+) (assuming the right enable was previously asserted (Re+) before the arrival of the right token (R+)). This enables both the current and previous pipeline stages to latch data at the same time, achieving zero overhead.
3.3.1 Nonlinear pipeline

The zero-overhead template can be divided into two blocks: Block1 with R_gen1 and Block2 with R_gen2, as shown in Figs. 3.10 and 3.12(a). Nonlinear pipeline functionality can be implemented in either the R_gen1 or the R_gen2 block. However, it is more robust to implement the complex behavior in the R_gen2 block, since the forward latency may include the latency of the R_gen1 block, which can cause a setup constraint violation. Thus, the R_gen1 block is generally used to implement a simple buffer and the R_gen2 block is used to handle nonlinear behaviors. Fig. 3.12 illustrates several suggested implementations of nonlinear pipeline stages.

[Figure 3.12: Examples of nonlinear pipeline stages: (a) buffer stage; (b) fork stage; (c) join stage; (d) split stage.]
3.3.2 Timing and performance analysis

Since this template is adapted from the T4PFB control template, the timing assumptions listed for the T4PFB template also apply here, except that TM2, TM6 and TM7 are more stringent, since no delay-line delay is involved in their equations. The performance metrics of the zero-overhead T4PFB template are derived from the STG shown in Fig. 3.11 as follows.
FL = R+_cur → R+_next
   = DL_set + (L+ → Le−) + (Le− → L− → M+ → R+)
   = DL_set + DL_reset + 10

OH = R+_next → R+_cur,nextcycle
   = 0

τ = FL + OH
  = DL_set + DL_reset + 10
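As a cross-check of the derivation, the two T4PFB variants can be compared side by side. This is a sketch using the formulas just derived, with arbitrary delay-line values.

```python
def t4pfb_metrics(dl_set, dl_reset):
    """T4PFB forward latency, overhead and cycle time (gate delays)."""
    fl = dl_set + dl_reset + 8   # per Equation (3.1)
    oh = 2
    return fl, oh, fl + oh

def zo_t4pfb_metrics(dl_set, dl_reset):
    """Zero-overhead T4PFB metrics (gate delays)."""
    fl = dl_set + dl_reset + 10  # two extra gates in the forward path
    oh = 0                       # previous stage's Le+ and this stage's R+ fire together
    return fl, oh, fl + oh

# Same cycle time: ZO T4PFB only relocates the residual two gate
# delays of overhead into the forward latency.
assert t4pfb_metrics(6, 6)[2] == zo_t4pfb_metrics(6, 6)[2] == 22
```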
Note that while the hold time in the datapath of this template is more critical than in the PCFB and T4PFB approaches, it is no more stringent than in the synchronous counterpart, since both designs are zero-overhead pipelines. Additionally, by adding more forward latency, a negative-overhead pipeline, in which more than one data token is processed within a pipeline stage, can be derived at the cost of more aggressive hold time constraints.
3.4 Comparison of control templates
This section compares and contrasts the advantages and disadvantages of the three proposed control protocols: PCFB, T4PFB and ZO T4PFB.
Protocol     FL (gate delays)          OH (gate delays)   Area &    Margin (gate delays)
                                                          energy    control   datapath (hold)
PCFB         DL_set + 2                DL_reset + 10      1X        QDI       DL_reset + 10
T4PFB        DL_set + DL_reset + 8     2                  2X        3         2
ZO T4PFB     DL_set + DL_reset + 10    0                  3X        3         0

Table 3.1: Comparison of the PCFB, T4PFB and ZO T4PFB controllers, including forward latency, overhead, area, energy, and degree of timing assumption. The cycle time is equal to the forward latency plus the overhead.
The following equations list the flip-flop's setup time (T_s) and hold time (T_h) requirements of a bundled-data pipeline design, where D_min and D_max are the minimum and maximum delays of the datapath, D_clk_to_q is the clock-to-output delay of the flip-flop, and OH is the overhead of the asynchronous controller.

T_s < τ − D_max        (3.2)
T_h < D_min + D_clk_to_q + OH        (3.3)
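These two constraints are easy to check mechanically. The sketch below transcribes them directly; the numbers in the example are hypothetical delay figures, not taken from any design in this thesis.

```python
def check_bundled_data_timing(t_s, t_h, tau, d_min, d_max, d_clk_to_q, oh):
    """Check Equations (3.2) and (3.3); returns (setup_ok, hold_ok).

    All arguments are delays expressed in the same unit
    (e.g., gate delays).
    """
    setup_ok = t_s < tau - d_max                 # Equation (3.2)
    hold_ok = t_h < d_min + d_clk_to_q + oh      # Equation (3.3)
    return setup_ok, hold_ok

# Hypothetical numbers: a 20-gate cycle, 16-gate worst-case and
# 3-gate minimum datapath, zero-overhead controller.
assert check_bundled_data_timing(2, 1, 20, 3, 16, 1, 0) == (True, True)
```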
Equation (3.2) states that the setup time (T_s) must be less than the cycle time (τ) minus the maximum delay of the datapath (D_max), and Equation (3.3) states that the hold time (T_h) must be less than the accumulated delay of the minimum datapath delay (D_min), the clock-to-output delay (D_clk_to_q), and the control overhead (OH). Notice that the hold time constraint is generally easy to meet, particularly if the overhead delay is positive. Table 3.1 compares the performance and robustness spectrum of the three proposed protocols. The PCFB controller offers the best robustness, area and energy, but suffers from the largest
overhead, yielding the worst performance of the three. The T4PFB controller offers relatively high performance with reasonable timing assumptions in both the control and datapath. The last controller, ZO T4PFB, is the most aggressive and achieves the highest speed at the cost of the most critical timing margins.
For shallow pipelines of two gate delays, the T4PFB and ZO T4PFB templates have longer overall latency compared to the PCFB template due to their long control latency. For medium-grain pipelines, with a datapath delay plus setup time of over ten gate delays, the controller latency is not the limiting factor, because it is used together with the delay line delay to match the datapath delay. In addition, notice that for such medium-grain pipelines, the benefit of the ZO T4PFB template over the T4PFB template becomes negligible, as the overhead saving is only a small fraction of the overall cycle time.
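The first two columns of the comparison can be checked numerically. This sketch uses the cycle-time formulas derived earlier; the particular split of the matched delay across the two phases is an arbitrary example.

```python
def cycle_times(dl_set, dl_reset):
    """Cycle time of each template, per Table 3.1 (gate delays)."""
    return {
        "PCFB":     (dl_set + 2) + (dl_reset + 10),
        "T4PFB":    (dl_set + dl_reset + 8) + 2,
        "ZO T4PFB": (dl_set + dl_reset + 10) + 0,
    }

# With a 20-gate matched delay split 15/5 across the phases, the
# two T4PFB variants tie, while PCFB pays for whatever part of the
# reset phase cannot be minimized away.
taus = cycle_times(dl_set=15, dl_reset=5)
assert taus["T4PFB"] == taus["ZO T4PFB"] == 30
assert taus["PCFB"] == 32
```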
3.5 Speculative delay matching templates
A delay matching element (delay line) is combinational logic whose propagation delay is
matched with the worst case logic delay of some associated block of logic. Generally, a
delay line is implemented by replicating portions of the block’s critical path.
To take advantages of average performance, a more complicated delay line design
based on speculative completion sensing is adopted [83]. The original speculative delay
line proposed in [83] uses multiplexors to select among several independent delay lines,
thus wasting power and area. Kim et. al. proposed a more compact delay line by reusing
previous delay elements to generate the next larger matched delay [59]. However, in their
61
(a) Speculative asymmetric delay matching template
ADL
Sel
start
done
ADLC
ADL
ADLC
ADL
ADLC*
NR0
NR1
(b) ADLC circuit implementation
start
~Sel
LD
i NR
i
start
d
i
d
i
Sel
(c) Speculative symmetric delay matching template
SDL
Sel
start
done
SDLC
SDL
SDLC
SDL
SDLC*
LD 0
NR 0
(d) SDLC circuit implementation
~Sel
LD
i
NR
i
d
i
Sel
d
i
LD 1
NR 1
LD n
d0
d1
dn
start
d0
d1
dn
LD 0
LD 1
LD n
Figure 3.13: Speculative delay matching templates.
62
design the input signal still needlessly propagates through the entire delay line indepen-
dent of the data value, thereby wasting power.
We propose two novel speculative delay matching templates that are both compact and
power saving: one for an asymmetric delay line and one for a symmetric delay line. Our
templates are adapted from [59] but replace the multiplexors with delay line controllers,
one per delay element, as shown in Fig. 3.13. Each controller functions similarly to an
asynchronous split in that its input signal is routed to one of its output signals based on
the select control lines. If the select lines indicate that the target delay is obtained, the
controller generates the done signal by routing the input to LD_i. Otherwise, it propagates
the input signal to the next delay element via NR_i. Since the input signal stops at the
target delay element, power is significantly reduced.
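The routing behavior of these controllers can be sketched at a behavioral level. The function below is our own illustration, not the circuit itself; stage delays are arbitrary placeholders:

```python
# Behavioral sketch (ours) of a speculative delay line: each stage's
# controller routes the event either to a local done (LD_i) or to the
# next stage (NR_i) based on the select lines.

def speculative_delay(stage_delays, select):
    """Return (matched_delay, stages_toggled) when the select lines pick
    stage `select` as the target matched delay."""
    total = 0.0
    for i, d in enumerate(stage_delays):
        total += d               # the event ripples through stage i
        if i == select:          # controller fires LD_i -> done
            return total, i + 1
        # otherwise the controller fires NR_i, activating stage i + 1
    return total, len(stage_delays)   # last controller always fires LD_n

# The event stops at the target stage, so later stages never toggle;
# this is the power saving over a multiplexor-based delay line.
delay, toggled = speculative_delay([1.0, 1.0, 1.0, 1.0], select=1)
```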
3.5.1 Asymmetric delay line templates
The asymmetric delay line is depicted in Fig. 3.13(a). When used with the PCFB control
template, the set phase of the delay line is matched with the worst case delay of the logic
and the reset phase of the delay line is strictly overhead.
The operation begins with the set phase. When a start signal arrives (start+), it propa-
gates to the first asymmetric delay element (ADL), asserting a delayed signal (d_0+). This
delayed signal (d_0+) and the select lines (Sel) are input signals of an asymmetric con-
troller (ADLC), whose implementation is shown in Fig. 3.13(b). This controller decides to
assert either a local done signal (LD_0+) or the next request signal (NR_0+). If one of the local
done signals (LD_i+) fires, a done signal (done+) is generated, finishing the set phase.
Otherwise, a next request signal (NR_i+) activates the next delay element. Note that the
last controller (ADLC*) makes no routing decision and generates only a local done signal (LD_n+).
The reset phase begins when the start signal is reset (start-). It causes the done signal
to reset quickly (done-, in two gate delays), bypassing all delay elements via an AND
gate. Simultaneously, the start signal actively resets all delay elements and controllers.
Two timing constraints associated with the delay line must be satisfied. First, the select
lines of each controller must be set up and valid before its associated delayed signal (d_i+)
arrives, referred to as the select line setup constraint, to avoid a wrong routing decision.
Second, all internal signals must be reset before the next start signal arrives, referred to as
the delay line reset constraint.
3.5.2 Symmetric delay line templates
The symmetric delay line depicted in Fig. 3.13(c) and (d) utilizes both set and reset phases
to match the worst case logic delay. It is well suited to the T4PFB control protocol since
it transfers data to the next stage after passing through both set and reset phases of the
delay line.
There are two timing constraints associated with the symmetric delay line. First, the
select line setup constraint described for the asymmetric delay line also applies to the
symmetric delay line. Notice, however, that this setup constraint is more stringent than in
the asymmetric delay line case because the matched delay elements are half as long. In
Figure 3.14: (a) An example of a power-efficient asymmetric delay line. (b) STG of the
D-element used in the bundled-data pipeline. (c) A speed-independent D-element imple-
mentation.
addition, the select lines must be stable until after the end of the reset phase, referred to as
the select line hold constraint.
Satisfying both of these constraints, however, is significantly easier than satisfying the
reset constraint of the asymmetric delay line. In particular, the lack of the reset constraint
allows us to eliminate the final AND gate and alleviates the heavy load on the start signal
in the SDLC controller shown in Fig. 3.13(a). The symmetric delay line is also approx-
imately half the length of the asymmetric delay line, saving both area and power. These
advantages make the use of the symmetric template very attractive.
3.5.3 Power-efficient asymmetric delay line
It is also interesting to note that a power-efficient asymmetric delay line can be constructed
using a combination of a symmetric delay line and a D-element [22, 74]. A simple exam-
ple of this delay line is illustrated in Fig. 3.14(a)³. The D-element operates as follows.
After receiving a left request, it completes a full handshake on the right environment be-
fore acknowledging the left environment, enabling the use of a symmetric delay line on
its right environment. In the reset phase, the D-element shown in Fig. 3.14(c) can reset
in four gate delays.
To compare this delay line with a standard one, the timing analysis of the PCFB control
template using this delay line is illustrated in Fig. 3.14(b) and detailed as follows.

FL = R+_cur => R+_next
   = DL_set + DL_reset + D-element delay (L+ => R+)
   = DL_set + DL_reset + 8

OH = R+_next => R+_cur (next cycle)
   = PCFB_OH1 + PCFB_OH2 + D-element reset
   = 14

τ = FL + OH = DL_set + DL_reset + 22
³The SDL unit in Fig. 3.14(a) can be implemented to support a more complex delay line, such as the
symmetric speculative matching template.
The analysis shows that the forward latency includes both phases of the delay line
plus a small delay through the D-element (six gate delays). Additionally, the overhead is
independent of the delay line delay but still large due to the combined overhead of the
PCFB control (ten gate delays) and the reset delay of the D-element (four gate delays).
Compared to the standard asymmetric delay line, this delay line can reduce both area
and power by approximately half. However, due to its large forward latency, the delay line
is not suitable for shallow pipeline stages; it can only support a pipeline stage whose
forward latency is larger than eight gate delays. Thus, the standard asymmetric delay line
is more suitable for smaller pipeline stages.
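The bookkeeping above can be expressed directly. This helper is ours, for illustration; all values are in gate delays:

```python
# Timing of the power-efficient asymmetric delay line (values in gate
# delays): FL spans both delay-line phases plus the D-element forward
# path (8), and OH is the fixed PCFB overhead (10) plus the D-element
# reset (4).

def power_efficient_adl_timing(dl_set, dl_reset):
    fl = dl_set + dl_reset + 8   # forward latency
    oh = 10 + 4                  # PCFB overhead plus D-element reset
    return fl, oh, fl + oh       # tau = FL + OH = DL_set + DL_reset + 22

fl, oh, tau = power_efficient_adl_timing(dl_set=6, dl_reset=6)
```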
3.6 Matrix-vector multiplication architecture
In this section, we review the matrix-vector multiplication operation and discuss our
proposed architecture in detail.
3.6.1 Matrix-vector multiplication
The matrix-vector specification that we are implementing can be expressed as follows:
| y0 |   | a   a   a   a | | x0 |   | (a*x0) + (a*x1) + (a*x2) + (a*x3) |
| y1 | = | c   f  -f  -c | | x1 | = | (c*x0) + (f*x1) - (f*x2) - (c*x3) |
| y2 |   | a  -a  -a   a | | x2 |   | (a*x0) - (a*x1) - (a*x2) + (a*x3) |
| y3 |   | f  -c   c  -f | | x3 |   | (f*x0) - (c*x1) + (c*x2) - (f*x3) |
where a, c, and f are constant coefficients⁴.
3.6.2 Asynchronous pipelined architecture: an overview
At the algorithmic level, we adopt the basic strategy of implementing each matrix-vector
multiplication in four iterations, one per column of the matrix. In iteration i, the i-th column
is multiplied by the i-th element of X. This involves multiplying an input X_i with three
different coefficients and optionally inverting the result, thereby motivating the use of
three distinct hardwired multipliers. The results of each iteration are stored in four distinct
accumulators whose results are written to Y after the fourth iteration and then reset in
preparation for the next input vector X.
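The iteration schedule can be sketched as follows. The function and the rounded coefficient values are our own illustration, not the hardware:

```python
# Behavioral sketch (ours) of the four-iteration schedule: in iteration i
# the i-th matrix column is scaled by x[i] and added into four
# accumulators, which hold the output vector after the last iteration.

def column_iteration_mv(A, x):
    acc = [0.0] * 4                       # accumulators reset per vector
    for i in range(4):                    # one iteration per matrix column
        for row in range(4):
            acc[row] += A[row][i] * x[i]  # hardwired multiply + accumulate
    return acc

# DCT-like coefficient matrix from the text (rounded values of a, c, f)
a, c, f = 0.35, 0.46, 0.19
A = [[a, a, a, a], [c, f, -f, -c], [a, -a, -a, a], [f, -c, c, -f]]
y = column_iteration_mv(A, [1, 2, 3, 4])
```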
At the architectural level, we propose the novel five stage pipelined architecture shown
in Fig. 3.15. The upper portion (highlighted in gray) of the picture shows asynchronous
⁴a = 2^-2 + 2^-4 + 2^-5 + 2^-7 + 2^-9 ≈ 0.35, c = 2^-1 - 2^-5 - 2^-7 + 2^-10 ≈ 0.46, and f = 2^-3 + 2^-4 +
2^-8 + 2^-14 ≈ 0.19.
controllers that communicate with the datapath and other controllers using four-phase hand-
shaking signals rather than a global clock. To obtain low power, the datapath is imple-
mented using single-rail static logic. Numerous power optimizations taking advantage
of small-valued input statistics are applied. The general idea is to dynamically deactivate
groups of bit-slices that contain only sign extension bits (SEBs).
The multipliers and accumulators in the datapath consist of groups of partitioned bit-
slices that are selectively activated by mask control signals. In particular, the MASK and
ZD units respectively identify bit-slices of input data that contain non-SEBs and detect
the special case in which the data is zero. The mask signals (m(·)) are used to deacti-
vate non-required SEBs by forcing them to zero via the input ANDing logic and are sent
to control the delay matching units in the multiplier stage (containing the matched delay lines).
Additionally, the same mask signals when latched (m′) are ORed with their previously
registered versions (m′′). The resulting mask signals (ORed m) identify the bit-slices of
the accumulators that contain non-SEBs and control the delay matching units in the accumulator
stage.
Notice that because the input data is fed into multiple multipliers, the delay matching
unit is shared over multiple multipliers and accumulators, thereby making its overhead a
small percentage of the overall design. In the special case that the data is zero-valued,
the ZD unit asserts a zero detect signal and sends it to the controllers to disable the entire
computation. Additionally, the Partial Sign Bit Recovery (PSBR) logic extends the sign
Figure 3.15: Matrix multiplication with a 5 stage asynchronous pipeline.
bit of newly activated bit-slices in the accumulator to ensure that both inputs to the accu-
mulator have the same number of activated bit-slices. Lastly, the Full Sign Bit Recovery
(FSBR) logic recovers the suppressed zero bits of the accumulator results to attain the correct
final results. In the following sections, each pipeline stage is discussed in detail.
3.6.3 Zero detection stage
As mentioned earlier, it is not necessary to perform multiply-accumulate operations on
zero-valued data since the result would remain the same. To save power, zero data is
detected and stalled at this stage, and only non-zero data is forwarded to the next stage.
If the input data is zero (x = 0), the ZD unit asserts a zero detect signal. When the
controller (Zd Ctr) detects that the zero detect signal is asserted, it gates a local clock
signal (zd nz), thereby stalling the zero-input data. The controller also communicates with
the controller of the next pipeline stage (Mul Ctr) using a dual-rail channel called zd.
If the input data is zero, it asserts zd z; otherwise, it asserts zd nz. The controller is
implemented similarly to an asynchronous split cell with the zero detect signal acting as
the select control channel. Additionally, regardless of the input data, the controller asserts
an extra rail zd always to latch the zero detect signal to the next stage. The zd always
rail is implemented by simply ORing zd z and zd nz. The details of our implementation are
illustrated in Fig. 3.21(a). Note that for correct operation the zero detect signal must be
valid before the bundled-data delayed signal matched with the ZD logic becomes stable.
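A minimal behavioral sketch of this dual-rail convention (our own simplification; the handshaking itself is omitted):

```python
# Behavioral sketch (ours) of the dual-rail zd channel: zero data raises
# zd_z, non-zero data raises zd_nz, and zd_always is the OR of the rails.

def zd_channel(x):
    zd_z = (x == 0)
    zd_nz = not zd_z
    zd_always = zd_z or zd_nz    # fires for every input, latching the flag
    return zd_z, zd_nz, zd_always
```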
3.6.4 (Hardwired) multiplier stage
In this stage, non-zero data from the zero detection stage is multiplied by three constant
matrix coefficients simultaneously. The implementation details are discussed below.
3.6.4.1 Bit-slice partitioning multipliers
Ideally, we might like to selectively activate only the effective non-zero bits. However,
this would require control logic for every bit whose overhead would be difficult to over-
come. Thus, it is important to organize the activated bits into bit-slices and optimize the
number of bit-slices that can be activated taking into account the overhead of the control
Figure 3.16: Mask signals generation unit based on static logic.
logic. To this end, we performed bit-level simulations of well-known image sequences
which showed that a zero detect flag along with 3-bit mask signals (m(3), m(2), and m(1))
for the DCT yields bit-activity reductions within 10% of optimal. Our proposed
mask generation unit, illustrated in Fig. 3.16, has a longest path of about four gate delays.
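The mask computation can be sketched behaviorally. The slice boundaries below are an assumption for illustration only; the function is ours, not the static-logic unit in Fig. 3.16:

```python
# Sketch of mask generation (ours): m[k] is asserted when significant
# (non-sign-extension) bits of a 16-bit two's-complement input reach past
# an assumed slice boundary.

def gen_mask(x, width=16, boundaries=(9, 11, 13)):
    u = x & ((1 << width) - 1)                   # two's-complement pattern
    if u == 0:
        return True, [False] * len(boundaries)   # zero detect flag set
    sign = (u >> (width - 1)) & 1
    sig = 1                                # at least the sign bit itself
    for pos in range(width - 2, -1, -1):   # scan from MSB-1 downward
        if (u >> pos) & 1 != sign:
            sig = pos + 2                  # value bits plus one sign bit
            break
    return False, [sig > b for b in boundaries]
```

A small input such as 5 needs only the lowest slice (all mask bits low), while a large value activates every slice.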
Our fine-grain hardwired multiplier is based on a bit-partitioned carry-save multiplier,
illustrated in Fig. 3.18. The carry-save multiplier’s critical path is mainly along the final,
vector-merging adder, which we propose to implement as a bit-partitioned ripple carry
adder for two reasons. First, ripple-carry adders consume significantly lower power than
faster (e.g., carry select or bypass) adders [25]. Secondly, while ripple-carry adders have
relatively long worst-case delay, the bit-partitioning of the multiplier array (including
the ripple-carry adder) leads to very good average case delay for this application. The
staircase-patterned bit-slices, as illustrated by the dotted lines in Fig. 3.18, allow the
Figure 3.17: Example of the proposed mechanism for sign bit extension in the multiplier
array.
adders to be dynamically configured for different input bit-widths. For example, if the
first two bit-slices are activated, the multiplier behaves exactly as a typical multiplier that
handles 9-bit inputs.
There are two key aspects of the architecture that enable this type of reconfigurable
bit-width. The first is that when only the first two bit-slices are activated, the inputs
to the second input bit-slice that emanate from the third input slice (i.e., that cross the
dotted line) are forced to zero by the input ANDing logic. The second is the sign
extension of the rightmost-shifted input to the bit-slice boundary. Fig. 3.17 illustrates an
example of the issue and our proposed solution. In particular, it illustrates the case when
x_0 >> 9 is added to x_0 >> 7 when three bit-slices of x_0 are activated, i.e., when bits b13
through b15 are forced to zero. The further right-shifted input in this case is the x_0 >> 9
input, and it must be sign extended two bits to the bit-slice boundary. Our solution is to
add two MUXes that are controlled by the MASK logic. The MUXes output the x_0 input
bit except in the case when exactly three bit-slices are activated, in which case the MUXes
Figure 3.18: Proposed asynchronous fine-grained carry-save hardwired multiplier for
0.35352*x_1, where 0.35352*x_1 is expressed as (2^-9*x_1) + (2^-7*x_1) + (2^-5*x_1) +
(2^-4*x_1) + (2^-2*x_1).
output the sign extension bit (which in this case is the b12 bit of x_0). As illustrated in Fig.
3.18, the number of MUXes needed is relatively small and they are typically not on the
critical path.
Notice that some adders are eliminated in the area of the highest bit-slice due to the
precomputation of their sign bits, enabling further area and power savings. For example,
bit b14 of x_0 >> 5 is precomputed and forwarded to the next adder, block 14 of the
second row.
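The shift-and-add decomposition in the caption of Fig. 3.18 can be checked numerically. This floating-point helper is our own illustration and ignores the fixed-point truncation of the real array:

```python
# Numeric check (ours) of the constant decomposition: multiplying by
# 0.35352 is a sum of right-shifted (power-of-two-scaled) copies of x.

def hardwired_mul_035352(x):
    return sum(x * 2.0 ** (-s) for s in (2, 4, 5, 7, 9))

coeff = hardwired_mul_035352(1.0)   # 2^-2 + 2^-4 + 2^-5 + 2^-7 + 2^-9
```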
3.6.4.2 Speculative completion sensing circuit
Let us focus on the completion-sensing unit for our proposed hardwired multiplier. The
critical path of the array depends on the carry chain of the ripple carry adder highlighted in
Figure 3.19: Static fine-grain partitioned adder architecture.
Fig. 3.18. This path is partitioned into four bit-slices, as illustrated in Fig. 3.19. To sense
the completion of this adder, we use the speculative delay matching template discussed
earlier. The completion-sensing unit is composed of four delay lines matched to the four
different bit-slice activations, as shown in Fig. 3.21(a). The mask signals m from the datapath
are fed as the select lines to control the speculative delay line.
3.6.4.3 Multiplier controller
There are two types of matched delay lines used in the multiplier stage illustrated in
Fig. 3.15: a short delay line (driven by zd z) that matches the computation delay asso-
ciated with zero input data and a speculative delay line (driven by zd nz coupled with the
mask signals) that matches the data-dependent multiplier computation. In both cases, the
Mul Ctr generates the mul z and mul nz signals using simple controllers illustrated in Fig.
3.21(a). By ORing both signals together, it generates the non-conditional mul always signal
to trigger the FFs forwarding all control signals to the accumulator stage. For low power,
the mul nz signal latches the multiplier results only when the input data is non-zero.
Figure 3.20: An example of partial sign bit recovery logic (PSBR b).
3.6.4.4 Timing constraints
The setup constraint from the delay matching template is that the mask signals m must be
valid before the first matched delay signal is valid. This ensures that the setup constraints
for the subsequent matched delay lines are also satisfied. In addition, the reset and hold
constraints for the asymmetric and symmetric delay templates must be satisfied. However,
since there are no conditional inputs connected to the controller, there are no other timing
constraints associated with the controller.
3.6.5 Accumulator stage
Our 4x4 matrix-vector multiplier consists of four accumulators, each responsible for sum-
ming the multiplication results for a different matrix row. For each computation, the
accumulators accumulate four inputs corresponding to the four matrix columns before assert-
ing one output result.
(Legend: B = R_gen w/ buffer function; S = R_gen w/ split function; CG = clock gating
module. zd_detect' is a latched signal of zd_detect; last' is a latched signal of last.)
Figure 3.21: Controller alternatives: (a) asynchronous controllers of the five-stage
pipelines; (b) gated-clocking synchronous controllers of the five-stage pipelines.
3.6.5.1 Bit-slice partitioning accumulator
The bit-sliced architecture extends to the accumulator stage. By extending the bit-width
of each bit-slice by two in the accumulator stage, overflow/underflow is guaranteed not
to occur during the four iterations of accumulation. In order to ensure that both input
operands to each accumulator have the same number of activated bit-slices, both operands
are partially sign extended by PSBRs.
An example of PSBR b is shown in Fig. 3.20. The PSBR b first extracts the sign
bit of the current accumulation result using its associated mask signal m′′. It then sign
extends any newly activated bit-slices using a bank of MUXes that either pass the current
bit or the extracted sign bit depending on the AND of the stored (m′′) and current mask
signals (m′). Notice that the least significant 12 bits need no sign extension since they
are never forced to zero.
The mask signals associated with both input operands (m′, m′′) are ORed to produce
new mask signals (ORed m) representing the worst-case mask. The multiplexors M0
selectively feed the proper multiplier results to the first accumulator operand. The
multiplexors M1 route either the previous accumulator results or zero data as the initial input
operand. To save power, the results are latched only if the data is non-zero. We latch initial
zero results at the beginning of each iteration by introducing the multiplexors M2.
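The sign-extension step can be sketched on plain integers. The bit positions below are illustrative and the function is our own abstraction, not the PSBR netlist:

```python
# Sketch (ours) of partial sign bit recovery: the sign bit at the old
# operand's top position is copied into newly activated bit positions,
# mimicking the PSBR MUX bank.

def psbr(value, old_msb, new_msb):
    """Sign-extend `value` (valid up to bit old_msb) out to bit new_msb."""
    sign = (value >> old_msb) & 1
    for pos in range(old_msb + 1, new_msb + 1):
        value |= sign << pos      # MUX selects the extracted sign bit
    return value
```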
3.6.5.2 Speculative completion sensing circuit
The critical path of the accumulators depends on the carry chain of the ripple carry adder.
The speculative delay matching circuitry is therefore similar to that in the multiplier, with
the mask signals ORed m acting as the select lines.
3.6.5.3 Accumulator controller
Similar to the multiplier stage, two delay lines (driven by mul nz and mul z) are matched
to non-zero and zero data computations, respectively. In addition, the controller Acc Ctr
asserts the acc req signal at the end of each computation, indicating that the results are
ready. The acc latch nz first signal conditionally latches in zero data at the beginning of
every computation and the intermediate results after every iteration in which the input
data is non-zero (i.e., mul nz is asserted). The acc latch nz last signal
updates the mask signals am′ with zero data at the end of every computation and with the
current mask (ORed m) after every iteration in which the input data is non-zero. Fig.
3.21(a) shows that all R gen blocks are implemented using conditional-output control
templates (split or skip).
3.6.5.4 Timing constraints
The delay line has the setup constraint that the mask signals (ORed m) must be valid
before the first matched delay signal is valid. In addition, there is a setup constraint on the
controller stating that the conditional signals (c0, c1) must be valid before a done signal
from either delay line is asserted.
3.6.6 Output storing and recovering stages
The output storing stage latches the results from the accumulator stage at the end of each
computation. The output recovering datapath (FSBR) then recovers the sign bits using its
associated mask signals (m′′′). It is implemented using logic similar to the PSBR blocks.
Note that there are no timing constraints for either of these two controllers.
3.6.7 Controller alternatives
Both synchronous and asynchronous controllers can be integrated with the same datap-
ath. To fairly compare with our asynchronous designs, we implemented a gated-clocking
[Design flow stages: Algorithmic Design, Architectural Design, Gate Level Design,
Transistor Level Design, and Layout, each with associated verification and
timing/performance analysis steps.]
Figure 3.22: Hierarchical design flow.
synchronous controller using the same clocking conditions as the asynchronous design il-
lustrated in Fig. 3.21(b). In addition, the controllers in Fig. 3.21(a) are implemented using
PCFB, T4PFB and ZO T4PFB templates, yielding three different asynchronous designs
for us to compare to. Both standard and power-efficient asymmetric delay lines are used
with the PCFB-based design for comparison while symmetric delay lines are used with
both T4PFB-based designs.
3.7 Design flow, experimental results and comparisons
Our designs use the hierarchical design flow shown in Fig. 3.22. First, after the behavioral
specification of the design is complete, an architectural specification is constructed
by describing each block behaviorally using Verilog. In particular, the handshaking proto-
cols between controller blocks are explicitly modeled. At this step, functional correctness
of our architecture is verified by simulation. Next, each block is decomposed to the gate
level, where each gate is described behaviorally using Verilog. Dynamic timing analysis
and optimization are performed to find the actual critical path in the datapath in terms of
gate delays. Additionally, timing analysis is applied to the control to estimate average
cycle time, forward latency, and control overhead. Gate-level simulation of each block is
performed to ensure correct operation. The next step is to map each gate in our library to
its transistor-level implementation. A set of transistor-level simulations is performed to
verify correctness and to ensure that all timing constraints are met. In particular, the delay
lines' delays, including setup and hold margins, are adjusted more precisely at this step.
The final step is to hierarchically generate the layout. At this step, correctness and timing
analysis are performed by extracting wire capacitance and thus considering the impact of
interconnect delays.
3.7.1 Postlayout timing validation
All designs discussed above were laid out in Hynix 0.35μm CMOS technology. We simu-
lated our designs on the extracted layout using Nanosim in a typical environment, i.e.,
3.3 V and 25°C.
We validated timing constraints manually in postlayout and allowed all timing margins
to be between 10% and 20%. Where necessary these margins were achieved by careful
design of both the clock tree (for the synchronous design) and the delay lines (for the
asynchronous designs).
Test      PCFB_ASYM               PCFB_SYM                % lower   % lower
patterns  Power   τ      E/cyc    Power   τ      E/cyc    overall   controller
          (mW)    (ns)   (pJ)     (mW)    (ns)   (pJ)     energy    energy
zero      12.5    7.4    92.5     13.1    7.1    92.3     0.16%     2-3%
bs1       43.5    16.6   722      42.7    16.8   715      1%        10-19%
bs2       45.3    18.6   843      43.9    18.8   825      2%        20-40%
bs3       48.5    21.8   1055     47.8    21.7   1037     1.9%      19-38%
bs4       46.4    23.9   1109     45.5    24     1092     1.5%      15-31%

Table 3.2: Comparisons of PCFB-based designs using different asymmetric delay lines.
3.7.2 Energy and throughput comparisons
Our first experiment compares asynchronous designs using the PCFB control with two
different delay lines: one using a standard asymmetric delay line (PCFB_ASYM) and one
using the power-efficient delay line (PCFB_SYM).
We simulated our designs by applying five different inputs which activate from zero to
all bit-slices. Table 3.2 displays average power, cycle time, and energy per cycle. The results
suggest that with comparable performance the design using PCFB_SYM control yields up
to 2% lower energy than the one using PCFB_ASYM control. Nevertheless, since the controller
contributes as little as 5% of the overall energy, the PCFB_SYM controller itself yields up to 40%
lower energy than the PCFB_ASYM controller. Thus, we choose the PCFB_SYM design as
the candidate PCFB-based design for the remaining comparisons.
Next we compare three different asynchronous designs. Table 3.3 illustrates the worst-
case forward latency (FL), cycle time (τ), and controller overhead (OH) of the three designs
for each type of input from zero to all bit-slices activated. The results suggest that the
Test      PCFB               T4PFB                          ZO T4PFB
patterns  FL    τ     OH     FL    τ     OH    % faster    FL    τ     OH    % faster
          (ns)  (ns)  (ns)   (ns)  (ns)  (ns)  (vs PCFB)   (ns)  (ns)  (ns)  (vs T4PFB)
zero      3.4   7.1   3.7    4.1   4.6   0.5   35%         4.1   4.2   0.1   8.7%
bs1       12.7  16.8  4.1    12.6  13.1  0.5   22%         12.6  12.8  0.2   2.3%
bs2       14.7  18.8  4.1    14.6  15    0.4   20%         14.5  14.7  0.2   2.0%
bs3       17.5  21.7  4.2    17.6  18.1  0.5   17%         17.6  17.8  0.2   1.6%
bs4       19.8  24    4.2    19.8  20.2  0.4   16%         19.8  20    0.2   1.3%

Table 3.3: Timing analysis of the PCFB-based, T4PFB-based and ZO T4PFB-based de-
signs, including forward latency, overhead, and cycle time.
T4PFB controllers operate 17-35% faster than PCFB’s and the ZO T4PFB controllers run
1-9% faster than T4PFB’s.
These results also suggest that the advantage of the ZO T4PFB template over the T4PFB template
depends on the datapath length. For example, ZO T4PFB yields a 9% advantage for
the zero-data case while it yields only 1% when all bit-slices are activated. Thus, the
ZO T4PFB template is more advantageous for designs with shallower datapaths.
Furthermore, we simulated our synchronous counterpart by setting the cycle time to
slightly more than the worst-case forward latency (to compensate for clock skew). In
particular, the worst-case latency of the accumulators (acc bs3) is 19.8 ns and we set the
synchronous cycle time to 20 ns.
To quantify the performance-power tradeoff, we set up 10 test cases as follows. The first 7
test cases, each having 20 input vectors, are simulated using Nanosim on the extracted lay-
out. Of these, the first 5 test cases compare average cycle time and energy for zero data
and for one (bs1) through four (bs4) activated bit-slices. Test case 6 is dedicated to mixed
inputs activating all
Test      SYNC               ASYNC-PCFB         ASYNC-T4PFB        ASYNC-ZO T4PFB
patterns  τ     E/cyc  Eτ²   τ     E/cyc  Eτ²   τ     E/cyc  Eτ²   τ     E/cyc  Eτ²
          (ns)  (pJ)         (ns)  (pJ)         (ns)  (pJ)         (ns)  (pJ)
zero      20    96     38    7.1   92     4.6   4.6   90     1.9   4.2   100    1.8
bs1       20    672    269   16.8  687    193   13.1  673    115   12.8  700    115
bs2       20    776    310   18.8  818    289   15    786    177   14.7  834    180
bs3       20    982    393   21.7  1037   488   18.1  962    313   17.8  983    311
bs4       20    1016   406   24    1099   633   20.2  1036   423   20    1047   417
mixed     20    830    332   18.9  894    319   15    863    194   14.8  870    191
LB        20    568    227   17.9  628    201   14.2  581    117   14    611    119
UB        20    826    330   21.4  890    406   17.7  860    270   17.5  875    268
Flower    20    705    282   17.7  738    231   14.3  706    144   14.0  740    145
Football  20    705    282   17.8  738    234   14.4  706    146   14.1  740    147
Tennis    20    705    282   18.1  738    242   14.7  706    152   14.4  740    153

Table 3.4: Detailed timing and energy analysis of the PCFB- and T4PFB-based designs
(control and datapath).
bit-slices. Test cases 7 and 8 derive bounds on the cycle time by arranging input sequences as
follows. First, 20 inputs with the same bit-slice-activation distribution as real images are
generated. Since the cycle time for a smaller bit-slice is shorter than that for a larger bit-
slice, the lower bound (LB) is simulated by ordering inputs from small- to big-valued data.
Further, since our DCT initializes every four iterations and the accumulator state dictates
global performance, the upper bound (UB) is arranged differently. By ordering from big-
to small-valued numbers within each computation, we obtain the worst-case cycle time
for each iteration due to the worst-case bit-slice alignment in the accumulator stage. The
last 3 test cases, derived from real images, have approximately seven million input vectors
and are simulated using Verilog-XL with back-annotated timing. The energy metrics for
the last three test cases are estimated using a weighted average of the first 5 test cases.
The experimental results are depicted in Table 3.4. The first two columns for each design
show the cycle time (τ) and energy per cycle (E/cyc). The third column for each design
gives the Eτ² [105] product, for comparison with the synchronous design.
The results lead to the following conclusions. First, since the identical datapath is used
in all designs, the energy differences are due to the difference in energy consumed by
the controllers. The gated-clocking synchronous controller consumes the least energy, fol-
lowed by the asynchronous T4PFB controller, and then by the asynchronous PCFB
and ZO T4PFB controllers, which consume equivalent power. Additionally, the results show
the effectiveness of bit-slice partitioning in that a smaller bit-slice consumes less energy than a
larger one. In particular, zero input data consumes far less energy than the others.
Second, it is obvious that in the asynchronous designs a smaller bit-slice operates
faster than a larger one. However, due to its large control overhead, the PCFB controller
loses its speed advantage over the synchronous design when more than two bit-slices
are active, while the T4PFB controller is only slower when all bit-slices are active and the
ZO T4PFB runs at equal speed when all bit-slices are active. Furthermore, the results of
the bound analysis suggest that, compared to the synchronous design, the cycle times of the
T4PFB and ZO T4PFB designs are between 12-28% and 13-30% faster, respectively, and the
cycle time of the PCFB design falls somewhere between 7% slower and 12% faster. Lastly,
the simulation with the three real images indicates that the typical performance gain over the
synchronous design is approximately 30% for the ZO T4PFB-based design, 28% for the
T4PFB-based design, and 11% for the PCFB-based design.
Third, the asynchronous designs can trade off performance for low power. Without voltage scaling, our designs give 11-30% higher performance with a 4-11% energy penalty. If the power supply is scaled, energy can be quadratically reduced. We adopt the Eτ² metric to quantify this advantage. The results show that, compared to the synchronous counterpart, the PCFB-based design has an 18% Eτ² advantage while both the ZO T4PFB and T4PFB-based designs have up to a 49% Eτ² advantage.
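The role of the Eτ² metric can be made concrete with a first-order model: energy scales roughly as V² and delay roughly as 1/V, so Eτ² is (to first order) independent of the supply voltage. The sketch below encodes only this idealized scaling; the exponents are textbook first-order assumptions, not values fitted to our designs.

```python
def etau2(energy, cycle_time, v_scale=1.0):
    """Return E*tau^2 under a first-order voltage-scaling model:
    energy scales with V^2 and delay scales with 1/V."""
    e = energy * v_scale ** 2       # E ~ C * V^2
    tau = cycle_time / v_scale      # delay ~ 1/V (first order)
    return e * tau ** 2
```

Under this first-order model, scaling the supply voltage leaves Eτ² unchanged, which is why a design with a lower Eτ² can trade its advantage for either lower energy or higher speed.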
3.8 Conclusion
This chapter demonstrates the use of an efficient asynchronous bundled-data pipeline design methodology on matrix-vector multiplication for DCTs. Architectural optimizations that take advantage of zero and small-valued data, typical in DCT and IDCT, yield both high average performance and low power. Novel control circuit templates and data-dependent delay lines are proposed to create low-overhead integrated control circuits capable of handling nonlinear pipelines and enabling high average throughput. Comparisons with a comparable gated-clock synchronous counterpart suggest that the proposed asynchronous design yields 30% higher throughput with negligible energy overhead and has a 49% better Eτ² metric.
Chapter 4
Comparisons of Asynchronous Pipelines
In this chapter, we focus on the energy/performance comparisons of two well-known asynchronous design styles. The first design style, the bundled-data pipeline or micropipeline discussed in the previous chapter, uses a single-rail synchronous datapath with asynchronous controllers driving novel speculative delay lines [110], yielding low area, good average performance, and low power. The cost of this design style, however, is the significant increase in effort and risk associated with verifying all the setup and hold constraints typical of bundled-data design. The second design style, quasi-delay-insensitive (QDI) fine-grain 2-D pipelines, has the advantages of robustness and high throughput. In this design style, large functional blocks are decomposed into small cells communicating through asynchronous channels. The cells are implemented using the QDI design style, which means they will work correctly regardless of wire delays, except for very loose timing assumptions on some internal wire forks [73]. The asynchronous channels use 1-of-N
Figure 4.1: Matrix multiplier using a 4-stage bundled-data pipeline.
rail signaling providing delay-insensitive communication. The cells are arranged in a so-called 2-D pipeline that facilitates very high throughput independent of the width of the datapath [67]. The cost associated with this design style, compared to other asynchronous styles, is generally more area and higher absolute power consumption.
4.1 Bundled-data pipeline design
We review the design of a multiplier-accumulator using the T4PFB control design. Our architecture consists of a 4-stage bundled-data pipeline shown in Fig. 4.1. The first stage simply detects whether the input is zero. The second and third stages make up the multiply-accumulate logic containing three multipliers and four accumulators, respectively. These two stages dynamically operate on different input bit-widths controlled by bit-slice activation control signals referred to as mask signals. In the multiplier stage, an input is multiplied with three constant matrix coefficients a, c, and f. The accumulator stage then adds the results from the multipliers to the previously accumulated results. After the fourth computation, the accumulator's result is latched in the output pipeline stage.
Figure 4.2: Skewed pipelines: (a) bit-skewed datapath; (b) block-skewed datapath (F = functional unit, B = buffer).
Figure 4.3: Basic QDI pipeline design (CSA = carry-save arrays, MA = merging adders, ACC = accumulators, B = buffers).
Experimental results in the previous chapter demonstrate that this architecture yields a significantly better Eτ² metric than the comparable synchronous design, particularly because it significantly reduces the average cycle time τ with negligible energy overhead [110].
4.2 QDI pipeline design
4.2.1 Micro-architecture alternatives
Before describing the four QDI micro-architectures we quantify, this section reviews several design dimensions along which a QDI micro-architecture may vary.
4.2.1.1 Delay-insensitive encoding selection
The most common encodings for QDI designs are 1-of-2 (dual-rail) and 1-of-4 (quad-rail) encodings. To transmit 2 bits of data, quad-rail encoding uses five wires and four transitions, while two dual-rail encoded channels would require eight transitions with six wires. Quad-rail encoding is thus more energy- and area-efficient when transmitting multiple bits at a time. Higher-order encodings are even more energy-efficient than quad-rail encoding but incur an exponential hit in area efficiency.
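These wire and transition counts follow from simple bookkeeping. The sketch below reproduces them, assuming one acknowledge wire per channel and four-phase return-to-zero handshaking (four transitions per channel per token); both assumptions are consistent with the numbers quoted above.

```python
import math

def channel_wires(bits, n):
    """Wires needed to carry `bits` of data over 1-of-n channels,
    assuming n data rails plus one acknowledge rail per channel."""
    channels = bits // int(math.log2(n))   # each channel carries log2(n) bits
    return channels * (n + 1)

def channel_transitions(bits, n):
    """Transitions per transmitted token, assuming four-phase return-to-zero
    handshaking: rail up, ack up, rail down, ack down = 4 per channel."""
    channels = bits // int(math.log2(n))
    return channels * 4
```

For 2 bits, one quad-rail channel needs 5 wires and 4 transitions, versus 6 wires and 8 transitions for two dual-rail channels, matching the comparison above.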
4.2.1.2 Cell size choice
Asynchronous cells can operate on individual bits or groups of bits, independent of the encoding. For example, a cell operating on 4 bits, often called a nibble, may have data inputs transmitted as four dual-rail channels or two quad-rail channels. In either case, the cell must handle completion sensing of multiple 4-bit quantities, which will likely limit its achievable cycle time. In this chapter, we focus on micro-architectures that use cells operating on one-bit input tokens, enabling higher throughput than otherwise possible.
4.2.1.3 Skewed datapath choice
A skewed asynchronous datapath is necessary in 2-D pipelines when a wide datapath is decomposed into small functional units (e.g., a one-bit adder) and a non-skewed structure would require vertical communication. For example, in a ripple-carry adder, a non-skewed structure would require the carry bit to ripple downwards. This downwards ripple would
stall higher-order bits, increasing the overall cycle time. One solution, as shown in Fig. 4.2(a), is a skewed pipeline, in which the vertical communication is converted into diagonal communication. Extra buffers are used to store higher bits until they are needed, thereby removing such stall conditions and maximizing system throughput in a process called pipeline optimization or slack matching [13, 60, 67].
These buffers are not necessary between functional blocks that are both skewed, but are necessary at the boundary between non-skewed and skewed blocks or when the most significant bit of one block is needed in the computation of the least significant bit of the subsequent block (e.g., in the case of a comparison determining a subsequent token routing).
As shown in Fig. 4.2, we consider two types of skewed datapaths: bit-skewed and block-skewed. Block-skewed datapaths align logic cells within each block vertically but skew the blocks. The block size can vary from several bits to the width of the entire datapath (creating, at one extreme, a non-skewed datapath). Block-skewed datapaths require fewer slack-matching buffers than their bit-skewed alternatives but are only possible when vertical communication within the blocks is not necessary or can be avoided by appropriate choice of functional unit architectures. For example, to facilitate block-skewed datapaths for our accumulators, we adopted a Brent-Kung-based tree adder architecture for each block that does not require vertical communication within the block.
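To illustrate why a parallel-prefix adder needs no bit-to-bit ripple, the sketch below models carry-look-ahead addition in software using (generate, propagate) pairs combined over log₂(width) prefix levels. It is written in the denser Kogge-Stone form for code brevity; a Brent-Kung tree computes the same group terms with a sparser network. This is an illustrative software model, not the thesis's circuit.

```python
def prefix_add(a, b, width=8, cin=0):
    """Add two width-bit numbers with a parallel-prefix carry computation
    over (generate, propagate) pairs, as in a carry-look-ahead tree adder."""
    g = [(a >> i & 1) & (b >> i & 1) for i in range(width)]  # bit generate
    p = [(a >> i & 1) ^ (b >> i & 1) for i in range(width)]  # bit propagate
    G, P = g[:], p[:]
    d = 1
    while d < width:                           # log2(width) prefix levels
        for i in range(width - 1, d - 1, -1):  # descending: reads pre-level values
            G[i] |= P[i] & G[i - d]            # extend the group-generate span
            P[i] &= P[i - d]
        d *= 2
    # carry into bit i is the group generate of bits [0..i-1], seeded by cin
    c = [cin] + [G[i] | (P[i] & cin) for i in range(width - 1)]
    s = sum((p[i] ^ c[i]) << i for i in range(width))
    cout = G[width - 1] | (P[width - 1] & cin)
    return s, cout
```

Every carry is derived from the prefix tree rather than from the neighboring bit, which is what removes the vertical (ripple) communication within a block.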
At first glance, it may appear that the number of slack-matching buffers required is equal to the number of skewed stages. However, since buffer cells have a faster cycle time than other logic cells, fewer slack-matching buffers are actually necessary to achieve maximum system throughput [13, 105]. In our designs, buffer cells have a cycle time of 10 gate delays and logic cells have a cycle time of 14 gate delays, so the number of slack-matching buffers required is approximately half the number of skewed stages.

Figure 4.4: Loop and complete loop-unrolled designs of the ACC block: (a) ACC with loop; (b) ACC with loop-unrolling.
4.2.1.4 Complete loop-unrolling
An algorithmic loop within the architecture can sometimes limit the system throughput
by creating a pipeline stall. Also, small loops within the architecture require additional
pipeline buffers to prevent pipeline stalls associated with the delay of bubbles moving
backwards through the pipeline [105]. Both of these issues can be avoided for algorithmic
loops that have a deterministic number of iterations through a technique we call complete
loop-unrolling. Complete loop-unrolling involves implementing the iterative algorithm in
a linear structure at the cost of replicating the functional units multiple times.
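The transformation can be sketched in software terms: the looped ACC reuses one add/subtract unit through a feedback path, while the completely unrolled ACC chains dedicated units linearly. The four-element input and the three operations below are placeholders for illustration only.

```python
def acc_loop(xs, ops):
    """Iterative ACC: a single shared add/subtract unit with a feedback loop."""
    acc = xs[0]                                    # first token loads the accumulator
    for x, op in zip(xs[1:], ops):
        acc = acc + x if op == '+' else acc - x    # one unit reused per iteration
    return acc

def acc_unrolled(xs, ops):
    """Completely unrolled ACC: one dedicated unit per step, no feedback loop."""
    unit = lambda a, x, op: a + x if op == '+' else a - x
    s1 = unit(xs[0], xs[1], ops[0])                # adder/subtractor 1
    s2 = unit(s1, xs[2], ops[1])                   # adder/subtractor 2
    return unit(s2, xs[3], ops[2])                 # adder/subtractor 3
```

Both compute the same value; the unrolled form removes the feedback dependency at the cost of replicated units, which is exactly the area/throughput trade described above.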
4.2.2 Basic QDI pipeline architectures
Our basic QDI architecture is decomposed into three major functional blocks, as depicted in Fig. 4.3. The first block contains three hard-wired carry-save-array (CSA) multipliers. The second block contains the merging adders (MA), which add the results from the CSAs. The third block has four accumulators (ACC).

We assume that inputs X and outputs Y arrive in a non-skewed fashion. The CSA multipliers are non-skewed because they require the fewest slack-matching buffers. The merging adders and accumulators are skewed, either bit-skewed or block-skewed in 8-bit chunks. Slack-matching buffers (B) are also inserted between the output of the CSA (non-skewed) and the input of the MA (skewed), and between the output of the ACC (skewed) and the input of the right environment (assumed non-skewed).
4.2.3 QDI pipeline micro-architectural alternatives
Among the many combinations of micro-architecture choices discussed earlier, we implement four different QDI designs.

The first two designs, referred to as QDI-bit-loop and QDI-bit-unroll, are bit-skewed designs in which the ACC block is implemented with and without complete loop unrolling, as depicted in Fig. 4.4. Note that the complete loop-unrolled version consists of three adder/subtractor blocks that implement the 3 different add/subtract operations. In both cases the channels are implemented with dual-rail encoding and the cells operate on bit-wide operands.
Figure 4.5: Energy and cycle time statistics of the 4 QDI designs (per-block energies for CSA, BUF, MA, and ACC in nJ, total energy in nJ, and cycle time in ns, for the bit_loop, bit_unroll, block_loop, and block_unroll designs).
The third and fourth designs, referred to as QDI-block-loop and QDI-block-unroll, both use 8-bit block-skewed datapaths to reduce the number of extra buffers and quad-rail encoding to reduce energy consumption; they vary in whether or not the ACC block is unrolled. The block-skewed datapaths are applied to the MA and ACC blocks shown in Fig. 4.3. The quad-rail 8-bit block-skewed adders were implemented using a Brent-Kung carry-look-ahead structure. This adder has 4 quad-rail input channels (8 data bits), 4 quad-rail sum output channels, and one dual-rail carry output channel. We construct our 22-bit adders by connecting three such adder blocks together in a ripple-carry fashion.
4.3 Experimental results
We implemented all five designs at the transistor level using manual transistor sizing in TSMC's 0.25 μm CMOS process. We simulated each design using Nanosim at a nominal 25°C and at the nominal 2.5 V supply voltage. We set up four simulations for the QDI designs and calculated the average cycle time and energy per computation for
20 random input vectors. The detailed results for each decomposed block are shown in Fig. 4.5. The results suggest that the required pipeline buffers in the block-skewed designs consume 40% less energy than those in the bit-skewed designs. However, since these extra buffers are a small portion of the whole design and the block-based designs are more complex, the overall energy consumption of the bit-skewed designs is still lower. Moreover, the data shows that the bit-skewed pipelines operate faster than the block-skewed pipelines, primarily because the logic cells in a bit-skewed pipeline are smaller. Finally, the loop-unrolling micro-architecture reduces the total energy by 25% for the bit-skewed design and 12% for the block-skewed design by eliminating the need for many slack-matching buffers in the ACC block.
We obtain the average performance of our bundled-data design by applying two sets of 20 random input vectors, one set for the lower bound and another for the upper bound of average performance. Both input sets imitate the input statistics from [59]. Since our QDI designs were never intended to take advantage of energy savings from input statistics [70], to obtain a more reasonable comparison we measured the worst-case energy of the bundled-data design, in which all bit-slices are active. Table 4.1 shows average cycle time and energy per computation. We then combined this data in the last column, where all designs are compared using the Eτ² metric. Compared to the bundled-data design, the results indicate that the best QDI design is QDI-bit-unroll, yielding a 22% better Eτ², followed by QDI-bit-loop, which yields a 4% better Eτ². The other two QDI designs had inferior Eτ² metrics.
Designs            Area (Ktrs)  Throughput (MHz)  E/cyc (nJ)  τ (ns)  Eτ² (nJ·ns²)
Bundled-data            26            184             0.48      5.43      14.15
QDI-bit-loop           125            595             4.83      1.68      13.63
QDI-bit-unroll         137            571             3.62      1.75      11.08
QDI-block-loop         123            510             4.92      1.96      18.90
QDI-block-unroll       200            510             4.31      1.96      16.55
Table 4.1: Area, cycle time, energy per cycle, and Eτ² statistics
4.4 Conclusions
This chapter presents energy-throughput comparisons of two well-known asynchronous design styles (bundled-data pipelines vs. 2-D QDI fine-grain pipelines) applied to a matrix-vector multiplication core of the DCT. We implement and compare a bundled-data design and four QDI designs with different micro-architecture choices, considering bit-/block-skewed datapaths and the impact of loop unrolling. The experimental results suggest that the best QDI design is the bit-skewed QDI design with loop-unrolling. It yields significantly better worst-case performance than the average-case performance of the bundled-data design and has a 22% better Eτ².
In summary, Chapters 3 and 4 compare three different pipelined design styles: synchronous, asynchronous bundled-data, and asynchronous QDI. Our synchronous design uses a single-rail datapath and gated-clock control, saving area and power. However, the design's performance is dictated by the delay of the longest pipeline stage. The bundled-data design, using the same datapath, can offer higher average performance with negligible area and power increases from the control complexity, compared
to the synchronous counterpart. Both designs provide medium throughput but require significant effort to verify setup and hold constraints. In contrast, the QDI design style, using quasi-delay-insensitive communication coupled with a 2-D datapath and small control overhead per pipeline stage, can offer higher performance, lower design effort, and improved energy for a given performance, at the cost of an area penalty.
Chapter 5
High-level Synthesis of Highly
Concurrent Asynchronous Systems
In the previous chapters, we discussed the performance, power, and area tradeoffs of three different pipelined design styles, with a variety of architectural and micro-architectural choices. Finding the best architecture requires the designer to spend significant effort implementing the designs at the transistor level, yielding longer design times. This motivates the development of an automated CAD tool for high-level synthesis. This chapter addresses the fundamental challenges in high-level synthesis. In particular, it addresses fundamental problems in scheduling, allocation, and binding for highly concurrent systems with multi-threading and pipelined behaviors, which are typical in asynchronous systems.
Control data flow graphs (CDFGs) are widely used as the input model for most high-level synthesis works. They describe data dependencies and flow control. The specification implicitly assumes that there is no more than one problem instance (thread) executing in a loop. Multi-threading behaviors must therefore be described explicitly, yielding a larger specification, typically on the order of the number of threads in the system. Because of this, CDFGs cannot easily express highly concurrent systems with multiple threads of execution. We propose to use marked graphs, a restricted form of Petri net, because marked graphs can naturally express highly concurrent deterministic systems, including systems with pipelined and/or multi-threading behaviors. Additionally, the computational complexity for a multi-threaded problem using marked graphs is potentially lower than that using CDFGs, due to the smaller input specification. Note also that we assume that marked graphs are manually derived by the designer. An automated translation from a behavioral specification to marked graphs is proposed in [63]. This approach, however, does not support the translation of multi-threaded problems, which requires more research effort.
A marked graph is a directed cyclic graph, coupling initial state information with well-defined execution rules. Due to the cyclic nature of marked graphs, two challenging problems are how to define a valid schedule and how to calculate a valid time frame for each operation during the scheduling process. We first define a valid schedule and propose approaches to calculate valid time frames of operations for both exact and heuristic scheduling and allocation algorithms. Our scheduling and allocation algorithms define
the mapping of operations to time slots and the allocation of corresponding functional
units. It roughly defines the performance/cost tradeoff of the design, where the cost is
estimated as the number of functional units needed, without binding operations to specific
functional units or dealing with the associated control logic.
Although the allocation defines the number of resources needed, the optimization al-
gorithms do not model routing and control complexity. In fact, due to the cyclic nature of
the input specification, the scheduling and allocation algorithms may also create sched-
ules which can only be satisfied with a many-to-many binding relation from operations
to resources. In order to meet the allocation and performance constraints, operations may
need to be bound to different resources in different iterations. Typically, classic binding
algorithms cannot easily produce binding solutions for such schedules. Because of these
issues, we propose two novel performance-driven concurrent scheduling and binding al-
gorithms that minimize the area of the design, including the cost of routing and control
complexity. The first is an exact algorithm based on an MILP formulation and the second is a heuristic algorithm based on well-known list and coloring algorithms. Both algorithms dynamically compute the valid time frames of operations using linear programming.
The proposed algorithms can control the degree of concurrency as follows. They can reduce concurrency, which enables more resource sharing at the expense of a performance penalty, or they can increase concurrency by allowing multiple instances of the same problem to execute simultaneously (multi-threading) at some
increased cost in area. To demonstrate our approaches, the algorithms are run on a variety
of single-thread and multi-threaded DSP applications and compared in terms of runtime
and quality of results.
5.1 Modeling highly concurrent systems
A Petri net, defined in the 1960s by C.A. Petri [92], is widely known as a powerful way to express concurrent systems. The definition of a Petri net is the following.
Definition 1 A Petri net is a triple N = (P, T, F) where P is the finite set of places, T the finite set of transitions, and F ⊆ (P × T) ∪ (T × P) the flow relation. For an element x ∈ (P ∪ T), •x is the preset of x, defined as •x = {y ∈ (P ∪ T) | (y, x) ∈ F}, and x• is the postset of x, defined as x• = {y ∈ (P ∪ T) | (x, y) ∈ F}.
Tokens are abstract entities within a graph that represent the current position of propagating data. A marking is a token assignment for the places, and it represents the state of the system. Formally, a marking is a mapping M : P → {0, 1, 2, ...} where the number of tokens in a place p under marking M is denoted by M(p). M₀ is the marking representing the initial state of the system.

The firing rule of Petri nets is defined as follows. A transition t is enabled at marking M if M(p) ≥ 1 for all p ∈ •t. Enabled transitions can fire. The firing of t removes one token from each place in its preset and deposits one token in each place in its postset, leading to a new marking M′ and thus a new set of enabled transitions.
From the definition above, a Petri net is a general representation that models systems with or without choice. Our work, however, is limited to a restricted class of Petri nets called marked graphs (Petri nets with no choice) using the fixed-delay model (i.e., the delays of all places are fixed).
Definition 2 A marked graph G = (N, M₀) is a tuple where N is a Petri net in which every place has at most one input and one output transition, i.e., |•p| ≤ 1 ∧ |p•| ≤ 1, ∀p ∈ P. The set of reachable markings is denoted by R(M₀). G is live if and only if all transitions will eventually be enabled from every M ∈ R(M₀). It is k-bounded if M(p) ≤ k for all M ∈ R(M₀) and all p ∈ P.
An important property of a marked graph is the cycle metric. The cycle metric of a cycle c is the sum of the delays d(c) of all associated places along the cycle, divided by the number of tokens m₀(c) that reside in the initial marking of the cycle, i.e., CM(c) = d(c)/m₀(c). The cycle time or maximum cycle metric of a marked graph is defined as its largest cycle metric, i.e., cₘ = max{CM(cᵢ) | cᵢ ∈ C}.
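For small graphs, the maximum cycle metric can be checked directly by enumerating simple cycles and taking the largest delay-to-token ratio. A sketch, using our own (src, dst, delay, tokens) encoding of places rather than notation from this thesis; token-free cycles, which cannot occur in a live marked graph, are skipped:

```python
from fractions import Fraction

def max_cycle_metric(places):
    """places: (src, dst, delay, tokens) tuples. Enumerates each simple cycle
    once (rooted at its smallest node label) and returns max of d(c)/m0(c)."""
    adj = {}
    for s, t, d, m in places:
        adj.setdefault(s, []).append((t, d, m))
    best = Fraction(0)

    def dfs(start, node, d_sum, m_sum, visited):
        nonlocal best
        for t, d, m in adj.get(node, []):
            if t == start:
                if m_sum + m > 0:                  # skip token-free cycles
                    best = max(best, Fraction(d_sum + d, m_sum + m))
            elif t not in visited and t > start:   # needs orderable node labels
                dfs(start, t, d_sum + d, m_sum + m, visited | {t})

    for n in sorted(adj):
        dfs(n, n, 0, 0, {n})
    return best
```

For a two-transition handshake with a forward delay of 2 and one token on the return place, this yields a maximum cycle metric of 2, matching the three-stage pipeline example below.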
Fig. 5.1 shows three examples of marked graphs. Places without tokens in the initial marking are represented by regular arrows. Places with tokens in the initial marking are represented by dark circles. The first example, shown in Fig. 5.1(a), models a three-stage linear pipeline with a maximum cycle metric of 2. The second example, illustrated in Fig. 5.1(b), shows a three-stage pipeline that allows for higher maximum concurrency compared to the first. In particular, this design allows at most two threads of execution per pipeline stage. The third example, depicted in Fig. 5.1(c), represents a four-stage ring
Figure 5.1: Examples of marked graphs
with a maximum cycle metric of 6 and illustrates a design with two concurrent threads of
execution.
Fig. 5.2(b) shows an equivalent CDFG of a four-stage ring with two threads running at the same time. Since CDFGs are restricted to expressing only one thread per loop, each thread in a CDFG must be specified explicitly. In this example, the first thread (a1-d1) and the second thread (c2-b2) are shown on the left and right sides of Fig. 5.2(b), respectively. Thus, a problem modeled using marked graphs yields an equivalent or smaller specification, potentially reducing the computational complexity of HLS. Additionally, this example shows that marked graphs can be used as a generalized specification that can express any closed system (single- or multi-threaded); in particular, they can express highly concurrent hardware systems more effectively than CDFGs.
Figure 5.2: A marked graph (a) and an equivalent CDFG (b) for a four-stage ring design
5.2 Scheduling concurrent systems
This section defines a valid schedule and proves the existence of a valid schedule for a given performance constraint.
Definition 3 A valid schedule S(G, τ) of a marked graph G with target cycle time τ is a set of tuples (t, ℓₜ), ∀t ∈ T, where ℓₜ is non-negative, i.e., S(G, τ) = {(t, ℓₜ) | ∀t ∈ T, ℓₜ ∈ ℕ}. Each tuple (t, ℓₜ) indicates that an operation t is scheduled at time ℓₜ. For a schedule to satisfy a cycle time of τ, the following dependency constraints must be satisfied:

∀p ∈ P [ℓₜ ≥ ℓₛ + d(p) − m(p)·τ]    (5.1)

where s, t ∈ T and (s, p) ∈ F, (p, t) ∈ F. m(p) and d(p) denote the number of tokens and the delay on a place p.
The key point here is that the term m(p)·τ in Eq. 5.1 captures the dependency constraints of a cyclic graph with multiple tokens, and this property distinguishes our work from other HLS works.
Fig. 5.3 illustrates a valid schedule of the three-stage pipeline example in Fig. 5.1(a) with a target cycle time of 2. The right side of the figure shows the list of dependency constraints. For example, the place from bg to b₁ has an initial marking token and a delay of 2, so the dependency constraint can be determined as ℓ_b₁ ≥ ℓ_bg + 2 − 2·1.
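Checking Eq. 5.1 mechanically is straightforward. The sketch below encodes places as (src, dst, delay, tokens) tuples; the pipeline instance is a hypothetical reconstruction (forward places of delay 2, zero-delay acknowledgement places with one initial token), chosen to be consistent with the example rather than copied from Fig. 5.3:

```python
def is_valid_schedule(schedule, places, tau):
    """Check Eq. 5.1: for every place p = (s, t, d, m),
    require l_t >= l_s + d(p) - m(p) * tau."""
    return all(schedule[t] >= schedule[s] + d - m * tau
               for s, t, d, m in places)

# Hypothetical three-stage pipeline: forward places carry delay 2; backward
# (acknowledgement) places carry delay 0 and one initial token.
PLACES = [('bg', 'b1', 2, 1), ('b1', 'bg', 0, 1),
          ('b1', 'b2', 2, 0), ('b2', 'b1', 0, 1),
          ('b2', 'b3', 2, 0), ('b3', 'b2', 0, 1)]
SCHED = {'bg': 0, 'b1': 0, 'b2': 2, 'b3': 4}
```

With τ = 2 every constraint holds; with τ = 1 the forward places are violated, so no schedule at that rate exists for this encoding.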
Theorem 1 Given a marked graph G with maximum cycle metric cₘ, there exists a valid schedule S(G, τ) for any τ ≥ cₘ.

Proof Several authors [34, 69] showed that the maximum cycle metric cₘ of a live and bounded marked graph can be calculated by applying linear programming to Eq. 5.1. Beerel et al. [13] further proved that, for a given marked graph with cycle time cₘ, there exists a non-negative solution satisfying Eq. 5.1, which represents a valid schedule S(G, τ), if the delay d(c) of any cycle is less than or equal to m(c)·cₘ. For any τ ≥ cₘ, we conclude that the delay of any cycle in G is less than or equal to m(c)·τ because m(c)·cₘ ≤ m(c)·τ. Thus there exists a valid schedule S(G, τ).
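Theorem 1 also suggests a constructive procedure: start all operations at time 0 and repeatedly raise each ℓₜ until every constraint of Eq. 5.1 holds. The relaxation reaches a fixed point exactly when every cycle weight d(c) − m(c)·τ is nonpositive, i.e., when τ ≥ cₘ; otherwise the times grow without bound. A sketch over a hypothetical (src, dst, delay, tokens) place encoding of the three-stage pipeline (our own notation):

```python
def asap_schedule(ops, places, tau, max_passes=None):
    """Relax Eq. 5.1 (l_t >= l_s + d - m*tau) to a least fixed point.
    Returns an ASAP schedule, or None when tau is below the max cycle metric."""
    l = {t: 0 for t in ops}
    passes = max_passes if max_passes is not None else len(ops) + 1
    for _ in range(passes):
        changed = False
        for s, t, d, m in places:
            bound = l[s] + d - m * tau
            if l[t] < bound:
                l[t] = bound
                changed = True
        if not changed:
            return l             # fixed point reached: a valid schedule
    return None                  # still changing: a positive cycle, tau < c_m

# Hypothetical three-stage pipeline encoding (forward delay 2, one token on
# each backward acknowledgement place).
PIPE = [('bg', 'b1', 2, 1), ('b1', 'bg', 0, 1),
        ('b1', 'b2', 2, 0), ('b2', 'b1', 0, 1),
        ('b2', 'b3', 2, 0), ('b3', 'b2', 0, 1)]
```

At τ = 2 (the cycle metric of this encoding) the relaxation converges; at τ = 1 it diverges, consistent with Theorem 1.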
Figure 5.3: A valid schedule of a three-stage pipeline with target cycle time of 2, shown with its dependency constraints
5.3 Scheduling and allocation
5.3.1 Problem definition
Given a live and bounded marked graph G with maximum cycle metric cₘ and target cycle time τ, our goal is to minimize design area subject to the target cycle time τ. In our problem, we assume that time is divided into discrete time steps. Assuming that each time step takes u time units, we need L = τ/u discrete time steps to determine resource allocation.
The scheduling of marked graphs using a discretized-time approach is flexible enough to target both synchronous and asynchronous systems. For synchronous systems the time
step should naturally be the clock cycle period. For asynchronous systems time discretiza-
tion helps ensure that a resource is assigned to only one operation at any time step. In
particular, for asynchronous systems it is not expected that a centralized control unit will
force operations to occur in the time steps scheduled, but that distributed control will be
responsible for routing tokens between blocks as needed. Note that in asynchronous systems there is a tradeoff between the size of the time step and the accuracy of the model. For both systems, however, the minimum number of resources required for each resource type is defined as the maximum number of overlapped operations over all L time steps. Note also that we assume that the latency and cycle time of resources are fixed. While this is natural for synchronous systems, extensions to handle asynchronous systems with stochastic delay models are an area of future work. Until then, stochastic delays must be approximated with fixed delays; e.g., one may estimate a stochastic delay by its mean. The only consequence of this approximation is that the measured performance of the resulting scheduled and bound systems may deviate from their target cycle times. Additionally, this work does not include latency and power metrics in the cost function; they can be easily integrated and are thus left as future work.
5.3.2 Exact algorithm
The exact scheduling and allocation algorithm is unique in that it accepts a marked graph as the input model for concurrent systems, then formulates and solves the problem using mixed integer linear programming (MILP). Additionally, it is the first to restrict the schedule times of operations in the marked graph using a linear programming (LP) preprocessing step. In particular, we define a set of valid scheduling times for each operation in G as follows.
Definition 4 Given a reference operation t₀ and an operation t, we define the as-close-as-possible time from t to t₀, denoted by ACAPₜ, as the minimum time separation from t to t₀ such that all dependency constraints are met, i.e., ACAPₜ = min(ℓₜ − ℓ_t₀) such that Eq. 5.1 is met. Similarly, the as-far-as-possible time from t to t₀ is defined as AFAPₜ = max(ℓₜ − ℓ_t₀) such that Eq. 5.1 is met. A set of valid scheduling times for t is defined as δₜ = {ℓ | ⌈ACAPₜ/u⌉ ≤ ℓ ≤ ⌊AFAPₜ/u⌋}.
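Since Eq. 5.1 is a system of difference constraints, ACAPₜ and AFAPₜ can also be obtained without a general LP solver: with edge weights d(p) − m(p)·τ, ACAPₜ is the longest path from t₀ to t, and AFAPₜ is minus the longest path from t back to t₀. The sketch below uses our own (src, dst, delay, tokens) place encoding and assumes a strongly connected graph; it is a graph reformulation of the LP, not code from this thesis:

```python
import math

def time_frames(ops, places, tau, t0, u=1):
    """Return {t: (ceil(ACAP_t/u), floor(AFAP_t/u))} relative to reference t0,
    via longest paths over the difference constraints of Eq. 5.1."""
    NEG = float('-inf')

    def longest_from(src):
        dist = {t: NEG for t in ops}
        dist[src] = 0
        for _ in range(len(ops)):          # Bellman-Ford-style relaxation
            for s, t, d, m in places:
                w = d - m * tau            # constraint weight l_t - l_s >= w
                if dist[s] != NEG and dist[t] < dist[s] + w:
                    dist[t] = dist[s] + w
        return dist

    fwd = longest_from(t0)
    out = {}
    for t in ops:
        acap = fwd[t]                      # min feasible l_t - l_t0
        afap = -longest_from(t)[t0]        # max feasible l_t - l_t0
        out[t] = (math.ceil(acap / u), math.floor(afap / u))
    return out
```

For a two-operation handshake (forward delay 2, one token on the return place) with τ = 4, this yields a frame of (2, 4), i.e., δ = {2, 3, 4}, mirroring δ_b₂ in the example below.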
The ACAPₜ and AFAPₜ values can be computed using linear programming. For example, the valid scheduling times for the three-stage pipeline with a target cycle time of 4 and reference operation b₁ are the following: δ_bg = {0, 1, 2}, δ_b₁ = {0}, δ_b₂ = {2, 3, 4}, δ_b₃ = {4, 5, 6, 7, 8}, δ_bb = {6, 7, 8, 9, 10, 11, 12}. We now present the MILP formulation.
Objective function: The objective function is to minimize the overall area, approximated as the summation of resource areas. Let mₖ be an integer variable that represents the minimum number of resources of type k ∈ R needed for a valid schedule. Let aₖ be the area cost of each resource type k. The objective function is

min Σ_{k ∈ R} aₖ·mₖ    (5.2)
Assignment constraint: x_il is a binary variable in which x_il = 1 if an operation i is scheduled at time step l, and x_il = 0 otherwise. Each operation must be scheduled in exactly one time step during its valid scheduling times δᵢ, i.e., ∀i ∈ T,

Σ_{l ∈ δᵢ} x_il = 1    (5.3)
Dependency constraint: All dependency constraints in Eq. 5.1 must be satisfied, i.e., ∀p ∈ P where (s, p), (p, t) ∈ F,

Σ_{l ∈ δₜ} x_tl·l ≥ Σ_{l ∈ δₛ} x_sl·l + d(p) − m(p)·τ    (5.4)
Resource allocation constraint: For each resource type k ∈ R, pₖ is the processing time for which the resource must remain occupied in order to complete the operation. Because of this, the number of operations currently occupying the same resource type at each time step l ∈ L must be no greater than mₖ, i.e., ∀l ∈ L, ∀k ∈ R,

Σ_{i ∈ T, rt(i) = k} Σ_{j ∈ δᵢ, l ∈ J(i, j)} x_ij ≤ mₖ    (5.5)

where J(i, j) is the set of time steps during which operation i, scheduled at time step j, utilizes the resource until the end of pₖ, i.e., J(i, j) = {l ∈ L | l = (k′ mod τ), j ≤ k′ ≤ j + pₖ − 1}. In other words, J(i, j) represents a window of pₖ time steps starting from time step j, where
Figure 5.4: Resource allocation for a 3-stage linear pipeline with target cycle time of 4
operation i requires a resource to be allocated for it. Lastly, rt is a mapping function from an operation i ∈ T to a resource type k ∈ R, i.e., rt : T → R.
The above formulation can be solved with available public-domain or commercial MILP solvers. As an example, the optimal schedule for the three-stage pipeline in Fig. 5.1(a) with cₘ = 2 and τ = 4 is S(G, 4) = {(bg, 0), (b₁, 0), (b₂, 2), (b₃, 4), (bb, 9)}. To evaluate the resource allocation, all operations are mapped within τ time steps as shown in Fig. 5.4: {(bg, (0 mod 4) = 0), (b₁, (0 mod 4) = 0), (b₂, (2 mod 4) = 2), (b₃, (4 mod 4) = 0), (bb, (9 mod 4) = 1)}. The figure shows that 2 buffers are required because b₁ and b₃ cannot use the same resource, while one of them can share its resource with b₂.
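The two-buffer result can be confirmed by brute force over the valid scheduling times listed earlier (δ_b₂ = {2, 3, 4}, δ_b₃ = {4, ..., 8}, with ℓ_b₁ = 0): keep only schedules satisfying the forward dependencies and count the worst-case overlap of occupancy windows modulo τ. The processing time of two time steps per buffer is our assumption for illustration, not a value stated in the text:

```python
from itertools import product

def min_buffers(tau=4, p=2):
    """Exhaustive search over the valid scheduling times of b1, b2, b3;
    returns the fewest buffers any dependency-respecting schedule needs."""
    best = None
    for l2, l3 in product(range(2, 5), range(4, 9)):   # delta_b2, delta_b3
        if l2 < 0 + 2 or l3 < l2 + 2:                  # forward deps, l_b1 = 0
            continue
        usage = [0] * tau
        for l in (0, l2, l3):                          # b1, b2, b3
            for k in range(l, l + p):
                usage[k % tau] += 1                    # occupancy window mod tau
        best = max(usage) if best is None else min(best, max(usage))
    return best
```

Under these assumptions, no valid schedule needs fewer than two buffers, consistent both with the MILP result and with the pigeonhole bound ⌈3·2/4⌉ = 2.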
5.3.3 Heuristic list algorithm
Unfortunately, solving MILP problems is NP-complete and is generally limited to relatively small problem instances. Consequently, we propose an alternative, computationally efficient heuristic algorithm. There are several existing heuristic scheduling algorithms, such as list scheduling, force-directed scheduling, and path-based scheduling. We chose
an iterative list scheduling and allocation algorithm because it is generally simple, efficient, and produces results close to an optimal solution.
Algorithm 5.3.1: LISTSCHEDULING(G, τ)
    ξ ← ESTIMATEMINRESOURCE(G, τ)
    cₘ ← CALMAXCYCLEMETRIC(G)
    if (cₘ > τ)
        then return (0)
    while true
        do {S, κ} ← TRYTOSCHEDULE(G, ξ, τ)
           if (S ≠ 0 (S is valid))
               then return (S)
               else ξ ← UPDATERESOURCEVECTOR(κ)
The top-level algorithm is shown in Algorithm 5.3.1. First, the minimum number of
resources needed for each resource type k (ξ_k) is calculated as follows:

    ξ_k = ⌈((number of operations of type k) · p_k) / L⌉                        (5.6)
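Eq. 5.6 is a simple busy-time lower bound: the total occupancy of all operations of one type, divided by the number of time steps in the schedule. A minimal sketch (function name is ours):

```python
import math

def estimate_min_resources(num_ops, p_k, num_steps):
    """Lower-bound resource estimate of Eq. 5.6: total busy time of all
    operations of one type divided by the schedule length."""
    return math.ceil(num_ops * p_k / num_steps)

# Three buffer operations with processing time 2 in a 4-step cycle need
# at least ceil(6/4) = 2 buffers:
print(estimate_min_resources(3, 2, 4))  # -> 2
```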
Second, c_m is calculated using the linear programming formulation in [69] and compared
to τ. According to Theorem 1, if τ < c_m, a valid schedule does not exist, so the program is
terminated. Third, an iterative loop that tries to schedule G is executed until a valid schedule
is found. The tryToSchedule function contains the key portion of the list scheduling
algorithm; it returns a valid schedule S if one exists, and otherwise determines the list of
resource types (κ) that violate resource constraints. The updateResourceVector function
conservatively increments the minimum number of resources ξ for each resource type in
list κ, because no valid schedule was found with the previous amount of resources allocated.
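The retry loop above can be sketched as follows. This is an illustrative skeleton with invented names (the graph encoding and the `fake_try` stub are our assumptions); the real body of tryToSchedule is Algorithm 5.3.2:

```python
import math

def list_scheduling(graph, tau, try_to_schedule):
    """Top-level loop of Algorithm 5.3.1: start from a lower-bound resource
    vector and grow it until try_to_schedule succeeds."""
    # Lower-bound estimate per resource type (Eq. 5.6), with L = tau steps.
    xi = {k: math.ceil(n_ops * p_k / tau)
          for k, (n_ops, p_k) in graph["types"].items()}
    if graph["c_m"] > tau:      # Theorem 1: no valid schedule exists.
        return None
    while True:
        schedule, failed = try_to_schedule(graph, xi, tau)
        if schedule is not None:
            return schedule, xi
        for k in failed:        # Conservatively add one resource of each
            xi[k] += 1          # type that violated its constraint.

# Toy stand-in for Algorithm 5.3.2: succeed once 2 buffers are available.
def fake_try(graph, xi, tau):
    return ({"bg": 0}, []) if xi["buf"] >= 2 else (None, ["buf"])

g = {"types": {"buf": (3, 2)}, "c_m": 2}
schedule, xi = list_scheduling(g, 4, fake_try)
print(xi["buf"])  # -> 2
```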
The main computation of our algorithm is in the tryToSchedule function shown in Algorithm
5.3.2. The first four lines initialize the following parameters: a reference operation
t_0, an initial set of enabled operations T_e and their time frames computed by the
calTimeFrame function, a set of scheduled operations T_s (initialized to the empty set), and
a time step counter l (initialized to 0). Note that t_0 is initialized as an operation with the
minimum scheduled time satisfying Eq. 5.1, and that by normalizing t_0 to time step 0, the
scheduled time of any t will be non-negative.
The time frame of each enabled operation is the legal window of time in which the
operation can be scheduled while ensuring that both the dependency constraints and the
performance target of the system can still be satisfied. To represent this time frame we use
the variables ℓ_t and f_t to denote the earliest legal scheduled time and the maximum free
slack of an operation t ∈ T_e, and define the time frame of t as ζ_t = [ℓ_t, ℓ_t + f_t]. The
computation details of the time frame are described below.
Algorithm 5.3.2: TRYTOSCHEDULE(G, ξ, τ)
    t_0 ← findReferenceTran(G, τ)
    T_e ← getEnabledTrans(G); T_s ← ∅
    calTimeFrame(S, T_e, t_0, τ)
    l ← 0
    while (T − T_s ≠ ∅)
        do  for each r ∈ R
                do  T_er ← {t ∈ T_e | rt(t) = r}
                    T_srl ← {t ∈ T_s | (rt(t) = r) ∧ l ∈ [l_t, l_t + p_r]}
                    {T_negSlack, T_zeroSlack, T_posSlack} ← classifyTrans(T_er, l)
                    if (T_negSlack ≠ ∅)
                        then  l ← min_{t∈T_negSlack}(l_t + f_t)
                              repeat inner do loop
                    else if (|T_srl| + |T_zeroSlack| > ξ_r)
                        then  κ ← updateFailedResource(r)
                              continue
                    addToSchedule(S, T_zeroSlack, T_posSlack, l)
                    T_s ← updateScheduledTrans(T_zeroSlack, T_posSlack)
            if (κ ≠ ∅)
                then return (∅, κ)
            T_e ← updateNewEnableTrans(G, T_s)
            calTimeFrame(S, T_e, t_0, τ)
            l++
    return (S, ∅)
The next phase of the tryToSchedule function is iterative. The scheduler starts by assigning
t_0 to time step l = 0. At every time step l, operations for each resource type are
scheduled one at a time. The set of enabled operations (T_er) for resource type r ∈ R is first
extracted and classified into three sets ordered by their time frame end points: a set
of operations with negative slack (T_negSlack), zero slack (T_zeroSlack) and positive slack
(T_posSlack), where slack(t) = ℓ_t + f_t − l, ∀t ∈ T_er. If T_negSlack is not empty, we find the
minimum time step l_neg (i.e., l_neg = min_{t∈T_negSlack}(ℓ_t + f_t)) at which operations in T_negSlack must
be scheduled, and the algorithm returns to l_neg, recalculating the slack of all operations
in T_er. Thus, the most urgent operation(s) in T_negSlack will now have zero slack, moving
them to T_zeroSlack, and all other enabled operations will have positive slacks in the updated
T_posSlack. All T_zeroSlack operations must be scheduled at the current time step l. Resource
constraints must be checked to guarantee that there are enough available resources for all
new zero-slack operations T_zeroSlack and the operations that have already been scheduled,
T_srl. In case there are not enough available resources, a resource of type r must be added
to the κ resource list. However, if there are extra resources available, we allow enough
positive-slack operations to fill those extra resources, selecting the operations with the least
slack first. At the end of each time step, if κ is not empty, κ is returned to the top-level
program. Otherwise, a new set of enabled operations and their time frames are computed.
The same procedure repeats for every time step, incrementing l until all operations
are scheduled. Note that we choose to schedule operations in order of least slack
(scheduling the operations with the least slack first) rather than in order of earliest or latest
valid schedule times. The earliest schedule time of a non-enabled operation may decrease
between consecutive time frame calculations as a consequence of fixing the schedule time
of the currently selected operation. This can cause the latest schedule time of its predecessors,
which might be other enabled operations, to also decrease, possibly to a value
before the current time step. Thus, scheduling operations in slack order, while guaranteeing
a valid schedule, may require revisiting earlier time steps to schedule operations
with negative slack.
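The per-time-step slack classification at the heart of tryToSchedule can be sketched as follows. This is an illustrative helper (the names and the dict encoding of time frames as (ℓ_t, f_t) pairs are our assumptions):

```python
def classify_by_slack(frames, l):
    """Split enabled operations into negative-, zero- and positive-slack lists
    for the current time step l, where slack(t) = l_t + f_t - l."""
    neg, zero, pos = [], [], []
    for t, (lt, ft) in frames.items():
        slack = lt + ft - l
        (neg if slack < 0 else zero if slack == 0 else pos).append(t)
    # Positive-slack operations are considered least-slack first.
    pos.sort(key=lambda t: frames[t][0] + frames[t][1] - l)
    return neg, zero, pos

# At l = 0, b1 (frame [0,0]) has zero slack and must be scheduled now;
# bg and b2 still have positive slack.
frames = {"b2": (2, 2), "bg": (0, 2), "b1": (0, 0)}
print(classify_by_slack(frames, 0))
```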
This list scheduling algorithm is demonstrated using the same example as used to
explain the exact MILP approach: the three-stage pipeline with c_m = 2 and τ = 4. We
focus on the schedule of the buffer resource type {b1, b2, b3}. In the initialization phase,
ξ_buffer = ⌈(2·3)/4⌉ = 2 and t_0 = b1. Since b1 is the reference operation, b1 is first assigned
to time step 0. The next enabled operations are bg and b2, with time frames [0,2]
and [2,4] respectively. Thus, bg is scheduled next at time step 0, and b2 is scheduled
at time step 2, where the number of buffers used at time step 2 is 1 (≤ ξ_buffer). After this,
b3 is the only enabled operation, with time frame [4,4]. Thus, b3 is scheduled next
at time step 4, and the number of buffers used at time step 4 is 2 (≤ ξ_buffer) (see the resource
allocation in Fig. 5.4). Next, bb is enabled and scheduled at time step 6. Finally, we have the
schedule S(G,4) = {(bg,0), (b1,0), (b2,2), (b3,4), (bb,7)}, which requires only 2 buffers
in total.
The time frame computation in function calTimeFrame represents the most unique aspect of our
approach. Conventional approaches compute the time frames of all operations statically,
once, in the initialization phase. We also considered a one-pass approach in which we
would distribute the free slack over all operations; our preliminary results suggested that
this leads to poor schedules. For this reason, we present an iterative approach in which we
distribute the free slack of the entire system onto only the currently enabled operations. To
do this, the value of f_s for s ∉ T_e is set to zero, so that all s ∈ T_e are given the maximum free
slack and are allowed to be scheduled as late as possible, enabling more resource sharing
earlier in the scheduling process. The free slack of the enabled operations is then computed
using the following linear program, which guarantees that all relevant dependency
constraints are met:

    max ( Σ_{∀s∈T_e} f_s )  such that

    ∀p ∈ •T_e:                       ℓ_t ≥ ℓ_s + d(p) − m(p)·τ + f_s            (5.7)
    ∀p ∉ •T_e with s or t ∉ T_s:     ℓ_t ≥ ℓ_s + d(p) − m(p)·τ

Notice that we must re-run the above LP on every iteration, each time with a different
set of enabled operations. Fortunately, as shown later, the run-times are still quite
manageable for all our examples, and this approach leads to better results. In principle,
this iterative approach can be extended to work on any partition of operations, and different
partitioning strategies could either reduce computational complexity or improve the
quality of the results.
The complexity of our heuristic algorithm is O(T)_1 · O(T)_2 · (O(T)_3 + O(lp(T))).
The function listScheduling tries to schedule O(T)_1 times in the worst case. The
tryToSchedule function runs for at most O(T)_2 iterations, assuming that one operation
is scheduled per iteration. Within each iteration, all functions are computed in
O(T)_3 time, except for the function calTimeFrame, which is computed using linear programming
in O(lp(T)) time.¹
5.4 Many-to-many resource binding problem
The cyclic nature of marked graphs creates a unique binding complication. Since the
execution of each operation typically takes multiple time steps, an operation t_1 may
overlap with multiple operations at different time steps. Using the allocation approach
defined in Section 5.3.1, suppose resource r completes operation t_1 and begins a different
operation t_2; there is no guarantee that t_2 will not overlap with t_1 in the subsequent
iteration. Because of this, multiple operations can be assigned to the same resource and
each operation may need to be assigned to multiple resources, forming a many-to-many
resource binding relation.
A concrete example of this issue is depicted in Fig. 5.5. The optimal schedule suggests
that 3 resources are required. Assume that at time step l+4, operation a maps to
resource r_1, b maps to r_2, and c maps to r_3. At time step l+6, c finishes its operation,
¹The simplex algorithm is a classic technique for solving linear programming problems. Although the
worst-case complexity of the simplex algorithm is exponential, the simplex method is very efficient in
practice.
[Figure: a marked graph and an optimal schedule in which the maximum number of overlapped operations is 3, so at least 3 resources are required.]
Figure 5.5: Many-to-many resource binding
hence d can be assigned to r_3. As time progresses to time step l of the next iteration, c
cannot map to r_3 as before because r_3 is still occupied by d. The available resources are
r_1 and r_2, released by a and b in the previous iteration. Thus c, which initially mapped to r_3,
must now map to r_1 or r_2. The forced migration of operation c between different resources
in subsequent iterations can also be observed among the other operations.
To handle many-to-many resource binding, a controller must be able to properly direct
an operation that maps to different resources in different iterations, requiring more steering
logic and more complicated connectivity. However, for applications whose controllers
are small compared to the datapath, this additional control cost may be negligible.
5.5 Concurrent scheduling and binding
There are three general approaches to solving the scheduling and binding problem: binding
before scheduling, scheduling before binding, and concurrent scheduling and binding.
If binding is done before scheduling, either the binding must be done very conservatively
or the target performance may be very difficult to guarantee. If binding is done after
scheduling, the scheduling cannot reflect the control costs. In fact, if the scheduling approach
above is used, the control may have to implement a many-to-many binding in
which operations are bound to different resources in different iterations. For these reasons,
this section explores the third approach with a concurrent scheduling and binding algorithm,
which uses a many-to-one (surjective) resource binding in which multiple operations
are mapped to a single resource.

First, we define the time execution function of a schedule, and then we formally define
the cycle interval of an operation as the portion of the target cycle in which the operation
is performed. We prove that if the cycle intervals of two operations do not overlap, the
operations can be bound to the same resource, i.e., they can be sequenced while still
meeting the target cycle time.
Definition 5 Given a schedule S(G, τ), we define its time execution function as ϒ : T × ℕ → ℝ,
where ϒ_t^k denotes the k-th (k ∈ ℕ) instance of t ∈ T. For any t ∈ T, the execution time
of the k-th instance of t is defined as ϒ_t^k = ℓ_t + kτ, k ≥ 0.

Since we can write ℓ_t = k_0·τ + o_t, k_0 ≥ 0, we have ϒ_t^k = o_t + (k + k_0)·τ, k, k_0 ≥ 0, where
o_t denotes the offset time of operation t. In other words, operation t is instantiated on
every cycle of τ, shifted by its offset time o_t. From the equation ℓ_t = k_0·τ + o_t, k_0 ≥ 0, we
conclude that o_t = ℓ_t mod τ.
[Figure: the execution instances ϒ_t^k of each operation of the three-stage linear pipeline with τ = 2.]
Figure 5.6: An example of time execution
Fig. 5.6 shows the time execution of a schedule of the three-stage linear pipeline with
τ = 2. Notice that the first instance of t is executed at its scheduled time ℓ_t and each
subsequent instance repeats with period τ. The k-th execution time interval of t is defined as
the period starting from the execution time of the k-th instance until the time the operation
finishes, i.e., [ϒ_t^k, ϒ_t^k + p_t − 1], where p_t is the processing time of t.
Definition 6 The cycle interval of t, denoted ω_t = [o_t, o_t + p_t − 1]_c, is defined as follows.
If (o_t + p_t − 1) mod τ ≥ o_t, then ω_t = [o_t, o_t + p_t − 1]. Otherwise, ω_t = [0, (o_t + p_t −
1) mod τ] ∪ [o_t, τ − 1].
Theorem 2 Given a target cycle time τ, if the cycle intervals of a pair of operations s
and t do not overlap, then the execution time intervals of s and t do not overlap.

Proof (By contrapositive.) Assume that the execution time intervals of operations s and
t overlap. We show that the cycle intervals of s and t then overlap. Let γ be a time
at which both s and t are executing, i.e., γ ∈ [ϒ_t^m, ϒ_t^m + p_t − 1] ∧ γ ∈ [ϒ_s^n, ϒ_s^n + p_s −
1]. Since ϒ_t^k = o_t + (k + k_0)·τ, we have (γ mod τ) ∈ [o_t, o_t + p_t − 1]_c and (γ mod τ) ∈ [o_s, o_s + p_s − 1]_c.
Therefore ω_t ∩ ω_s ≠ ∅.
Definition 7 We define surjective resource binding as follows. Assuming that operations
s and t are mapped to the same resource type, a pair of operations (s, t) can be bound to the
same resource if the cycle intervals of s and t do not overlap, i.e., ω_s ∩ ω_t = ∅.
5.5.1 Problem definition
Given a live and bounded marked graph G with maximum cycle metric c_m and a target
cycle time τ, our goal is to schedule and bind the design, minimizing area subject to the
target cycle time τ.
5.5.2 Exact algorithm
We introduce several new variables. b_ir is a binary variable that maps an operation
i to a resource r: b_ir = 1 if i is assigned to resource r, and b_ir = 0 otherwise. xb_irl
is a binary variable with xb_irl = 1 if i is scheduled at time step l and bound to resource
r, and xb_irl = 0 otherwise. Lastly, fi_r and fo_r indicate the number of fanins and fanouts of
resource r, respectively.
Objective function: Our objective is to minimize the design area, including the
cost of resources and of control logic. The cost of control logic is represented by a linear
function of the number of fanins and fanouts as follows:

    min ( Σ_{∀k∈R} a_k·m_k  +  Σ_{∀r∈N} h(fi_r)  +  Σ_{∀r∈N} g(fo_r) )         (5.8)
Assignment constraint: See Eq. 5.3.
Dependency constraint: See Eq. 5.4.
Resource mapping constraint: An operation i ∈ T with resource type k must map to
exactly one resource of type k in the set of resources N, i.e., ∀i ∈ T:

    Σ_{∀r∈N : k=rt(i)=rt(r)}  b_ir  =  1                                        (5.9)

Note that rt(·) is an overloaded function that maps either an operation i or a resource r
to a resource type k.
Surjective binding constraint: For each time step l ∈ L and for all operations with the
same resource type k, only one operation occupying time step l is allowed to map to
a resource r of type k, i.e., ∀l ∈ L, ∀r ∈ N:

    Σ_{∀i∈T : k=rt(i)=rt(r)}  Σ_{∀j∈δ_i : l∈J(i,j)}  xb_ijr  ≤  1               (5.10)
Additional constraints: The following constraint relates x_il, b_ir and xb_ilr in linear
form, i.e., ∀l ∈ δ_i, ∀r ∈ N:

    ∀i ∈ T : rt(i) = rt(r),   b_ir + x_il − 1  ≤  xb_ilr                        (5.11)
Since multiple operations can map to a single resource, we introduce a new binary variable
m_r, where m_r is set to 1 if there exists an operation i that maps to resource r, and to 0
otherwise, i.e., ∀r ∈ N:

    Σ_{∀i∈T : rt(i)=rt(r)}  b_ir  ≤  |T|·m_r                                    (5.12)
The following constraint counts the number of resources of the same resource type and
forces this count to be no greater than m_k, the number of available resources
of type k, i.e., ∀k ∈ R:

    Σ_{∀r∈N : rt(r)=k}  m_r  ≤  m_k                                             (5.13)
Resource connectivity constraint: This constraint captures the connectivity of a pair of
resources. We introduce a binary variable c_{r1,r2}, where c_{r1,r2} = 1 if r_1 is connected to r_2,
and c_{r1,r2} = 0 otherwise, i.e., ∀(i,j) ∈ E, ∀r_1, r_2 ∈ N:

    b_{i,r1} + b_{j,r2} − 1  ≤  c_{r1,r2}                                       (5.14)
Fanin-fanout constraint: This constraint counts the number of fanins and fanouts of
each resource r, i.e.:

    ∀r ∈ N:   Σ_{∀r2∈N}  c_{r,r2}  =  fi_r                                      (5.15)

    ∀r ∈ N:   Σ_{∀r1∈N}  c_{r1,r}  =  fo_r                                      (5.16)
We demonstrate an example of the concurrent scheduling and binding problem in Fig.
5.5. Since every pair of operations overlaps (i.e., ω_s ∩ ω_t ≠ ∅, ∀s,t ∈ T), it is impossible
to schedule and surjectively bind all operations using only 3 buffers. In other words, with
only 3 buffers, the constraints in Eq. 5.10 are violated. The example shows that if c is
assigned to time step l, b to l+4, and d to l+6, each operation maps to a different
resource, since their cycle intervals overlap one another; this restricts a to be scheduled between
time steps 8 and 12. However, for any of those time steps, the cycle interval of a overlaps
with 3 resources, implying that 4 resources are required instead of 3.
5.5.3 Heuristic algorithm
This section presents a computationally-efficient list-based algorithm for the concurrent
scheduling and binding problem. The binding problem is solved using a heuristic coloring
approach, where each distinct resource is mapped to a color.
Algorithm 5.5.1: TRYTOSCHEDULEBIND(G, ξ, τ)
    t_0 ← findReferenceTran(G, τ)
    T_e ← getEnabledTrans(G); T_s ← ∅
    calTimeFrame(S, T_e, t_0, τ)
    l ← 0
    while (T − T_s ≠ ∅)
        do  for each r ∈ R
                do  T_er ← {t ∈ T_e | rt(t) = r}
                    {C_overlapped, C_available} ← classifyColors(S, r, l)
                    {T_negSlack, T_zeroSlack, T_posSlack} ← classifyTrans(T_er, l)
                    if (T_negSlack ≠ ∅)
                        then  l ← min_{t∈T_negSlack}(l_t + f_t)
                              repeat inner do loop
                    else if (|C_overlapped| + |T_zeroSlack| > ξ_r)
                        then  κ ← updateFailedResource(r)
                              continue
                    addToSchedule(S, T_zeroSlack, T_posSlack, l, C_available)
                    T_s ← updateScheduledTrans(T_zeroSlack, T_posSlack)
            if (κ ≠ ∅)
                then return (∅, κ)
            T_e ← updateNewEnableTrans(G, T_s)
            calTimeFrame(S, T_e, t_0, τ)
            l++
    return (S, ∅)
The top-level algorithm is the same as the top-level scheduling and allocation algorithm
shown in Algorithm 5.3.1 and is thus not repeated. The tryToScheduleBind
function shown in Algorithm 5.5.1 is similar to Algorithm 5.3.2, except that we introduce
a function classifyColors to handle the proposed coloring approach and modify the
function addToSchedule to manage the color assignment.

Additionally, to satisfy surjective binding, resource constraints must be checked to
guarantee that there are enough available resources for all new zero-slack operations
T_zeroSlack and the operations that have already been scheduled, the latter represented by the
number of overlapped colors C_overlapped computed using the coloring approach shown in Algorithm 5.5.2.
Algorithm 5.5.2: CLASSIFYCOLORS(S, r, l)
    for each t ∈ S
        do  if (rt(t) ≠ r)
                then go to next iteration
            color ← {c | (t, c) ∈ ColorMap}
            ω_t ← [l_t, l_t + p_r]
            interval_l ← [l, l + p_r]
            if ((ω_t ∩ interval_l) ≠ ∅)
                then C_overlapped ← C_overlapped ∪ color
            C_used ← C_used ∪ color
    C_available ← C_used − C_overlapped
The function classifyColors shown in Algorithm 5.5.2 returns a set of overlapped colors
and a set of available colors for a given cycle interval, indicating which resources are
available for binding to the currently enabled operations. In particular, the set of overlapped
colors corresponds to the colors of already scheduled operations whose cycle intervals overlap
with the current cycle interval. The set of available colors C_available is defined as the
current set of used colors minus the overlapped colors.
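The color classification can be sketched as follows. This is an illustrative rendering of Algorithm 5.5.2 (the data encoding, the function signature, and the example values modeled loosely on Fig. 5.7 with p_r = 4 and τ = 10 are our assumptions):

```python
def classify_colors(schedule, color_map, r, l, p_r, tau, rt):
    """Sketch of Algorithm 5.5.2: for resource type r at time step l, split the
    colors (resource instances) already in use into those whose cycle interval
    overlaps the current interval [l, l + p_r] and those free for it."""
    current = {(l + k) % tau for k in range(p_r + 1)}
    used, overlapped = set(), set()
    for t, lt in schedule.items():
        if rt[t] != r:
            continue
        color = color_map[t]
        used.add(color)
        interval_t = {(lt + k) % tau for k in range(p_r + 1)}
        if interval_t & current:
            overlapped.add(color)
    return overlapped, used - overlapped

# c, d, b already scheduled at steps 0, 2, 4 on colors 1, 2, 3; the new
# operation a at step 8 overlaps all three, so no color is available.
sched = {"c": 0, "d": 2, "b": 4}
cmap = {"c": 1, "d": 2, "b": 3}
rt = {"c": "buf", "d": "buf", "b": "buf"}
over, avail = classify_colors(sched, cmap, "buf", 8, 4, 10, rt)
print(len(over), avail)
```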
Algorithm 5.5.3: ADDTOSCHEDULE(S, T_zeroSlack, T_posSlack, l, C_available)
    for each t ∈ T_zeroSlack
        do  S ← S ∪ (t, l)
            if (C_available = ∅)
                then  color ← getNewColor()
                else  color ← c for some c ∈ C_available
                      C_available ← C_available − color
            ColorMap ← ColorMap ∪ (t, color)
    scheduleExtraTrans(T_posSlack)
The function addToSchedule, depicted in Algorithm 5.5.3, describes the procedure for
inserting an operation t into the schedule S. This function adds the pair (t, l) to the schedule
and performs a color assignment for binding purposes. C_available is the set of colors
[Figure: a marked graph with an optimal schedule that requires at least 3 resources.]
Figure 5.7: An example of the list-based concurrent scheduling and binding
available for binding. If C_available is empty, a new color must be created and assigned to t.
A non-empty C_available consists of colors that were previously assigned but are free during
this particular interval. In this case, t can be assigned to any color c in C_available, whereupon
c must be removed from C_available because each color can map to only one operation at
any time step. Additionally, the cost of steering logic and routing can be included in this
process to determine the lowest-cost choice among the colors in C_available. Note that the
complexity of this algorithm is the same as that of the heuristic scheduling and
allocation algorithm shown in the previous section.
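The color-assignment step just described can be sketched as follows (illustrative names and data encoding are our assumptions; a cost model for steering logic could replace the arbitrary `pop`):

```python
def assign_color(t, l, schedule, color_map, available, next_color):
    """Sketch of the color-assignment step of Algorithm 5.5.3: reuse a free
    color when one exists, otherwise create a new one (a new resource)."""
    schedule[t] = l
    if available:
        color = available.pop()   # any free color will do; steering/routing
    else:                         # cost could drive this choice instead
        color = next_color()
    color_map[t] = color
    return color

counter = iter(range(100))
next_color = lambda: next(counter)
schedule, cmap = {}, {}
assign_color("c", 0, schedule, cmap, set(), next_color)   # new color 0
assign_color("d", 2, schedule, cmap, set(), next_color)   # new color 1
assign_color("e", 12, schedule, cmap, {0}, next_color)    # reuses color 0
print(cmap["e"])  # -> 0
```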
Fig. 5.7 depicts an example of a schedule S(G,10) using the list-based algorithm. The
detailed operation is as follows. The estimated number of resources is ⌈(6·4)/10⌉ = 3.
Next, the reference operation is computed; for this example, operation c produces the
minimum scheduled time and is selected as the reference operation. Notice that c is one
of the originally enabled operations. Second, the maximum cycle metric is computed
(c_m = 8). Since τ ≥ c_m, a valid schedule exists. c is the first to be scheduled, at time step
0. The next enabled operations are b and d, with ζ_b = [4,4] and ζ_d = [2,6]. Thus, d is
next to be scheduled, at time step 2, and the number of resources used at time step 2 is
2 (≤ 3). After this, b is the only enabled operation, with ζ_b = [4,4]. b is then scheduled at
time step 4, and the number of resources used at time step 4 is 3 (≤ 3). Next, a and c are
enabled, with ζ_a = [8,8]. c is ignored because it has already been scheduled. However,
the number of overlapped operations is now 3, because the interval of a overlaps with
all other operations shown in the figure. Since there are not enough resources, the
program must increase the number of resources by 1, from 3 to 4. Thus
the final schedule in the figure requires 4 resources instead of 3, in order to preserve the
surjective binding condition.
5.6 Experimental results and comparisons
In this section, we discuss and compare experimental results for the four proposed algorithms:
exact scheduling and allocation (ESA), exact scheduling and binding (ESB), list scheduling
and allocation (LSA), and list scheduling and binding (LSB). These experiments were
run on a Sun Ultra-60 workstation with 1 gigabyte of memory. Our testbench consists
of two simple applications as well as ten DSP applications commonly used as
a benchmark suite in other high-level synthesis work. For ease of comparison, all experiments
use the total number of resources as the objective function. In particular, the LSB
and ESB experiments ignore the cost of steering logic and routing, even though these can
easily be included.
                  Input Spec.              Exact scheduling and alloc.       Heuristic scheduling and alloc.
Test          # of    # of     τ          # of       # of       time       # of       # of       time
Apps          Trans   Places   (c_m)      ALUs(ORG)  MULs(ORG)  (sec)      ALUs(ORG)  MULs(ORG)  (sec)
10-stage      10      20       8 (c_m)    10(10)     -          0.5        10(10)     -          2
ring I                         10         6          -          0.8        8          -          3
10-stage      10      20       8          5          -          0.4        5          -          1.7
ring II                        10         3          -          0.8        3          -          1
10-stage      10      22       12 (c_m)   10(10)     -          0.5        10(10)     -          8.5
pipeline I                     24         5          -          17.5       6          -          15.7
10-stage      10      22       12         5          -          33.6       5          -          3.8
pipeline II                    24         3          -          13         3          -          3.6
Diffeq        11      31       9 (c_m)    2(4)       3(6)       1          3(4)       3(6)       2.9
                               12         1          2          1.8        2          2          3.7
                               18         1          2          2          1          2          2
Iir           17      46       7 (c_m)    3(6)       4(6)       7.1        3(6)       6(6)       10.7
                               10         2          3          71.5       3          4          10.2
                               14         2          2          6.5        2          6          17.2
Fir           22      56       14 (c_m)   2(6)       2(6)       24.8       3(6)       5(6)       53.9
                               21         1          2          365.3      2          2          53.4
                               28         1          1          1646       2          1          49.6
Lattice       22      63       12 (c_m)   3(9)       3(9)       38.2       3(9)       5(9)       15.3
                               18         2          2          10.9       2          4          14.8
                               24         2          2          9.3        2          2          6
Volterra      33      98       15 (c_m)   4(10)      5(17)      20hr       5(10)      8(17)      50.7
                               22         2          4          1:50hr     2          4          11.9
                               30         -          -          >24hr      2          3          23.5
Ellip         41      130      28 (c_m)   4(21)      2(8)       11.3       6(21)      4(8)       100.5
                               32         -          -          >24hr      3          2          48.5
                               36         -          -          >24hr      3          2          37
Wdf7          43      112      10 (c_m)   -(16)      -(19)      >24hr      7(16)      11(19)     81.6
                               15         -          -          >24hr      7          9          89.8
                               20         -          -          >24hr      4          5          57
Dct           50      120      4 (c_m)    20(25)     16(16)     10         21(25)     16(16)     62.6
                               6          13         11         11         19         14         129.7
                               8          -          -          >24hr      14         10         90.4
Wavelet       57      167      22 (c_m)   -(26)      -(14)      >24hr      8(26)      10(14)     241.5
                               33         -          -          >24hr      3          2          56.8
                               44         -          -          >24hr      2          2          34.9
Nc            60      175      17 (c_m)   -(24)      -(25)      >24hr      11(24)     16(25)     334.7
                               25         -          -          >24hr      6          9          234.8
                               34         -          -          >24hr      3          5          95.9

Table 5.1: Experimental results of the exact and heuristic scheduling and allocation algorithms.
First, we compare the results of the exact and heuristic scheduling and allocation algorithms,
shown in Table 5.1. The leftmost column lists all tested applications,
ordered from small (10 transitions and 20 places) to large (60 transitions and 175 places).
The next three columns give the input specification: the number of transitions, the number
of places, and the target cycle time (τ) together with the maximum cycle metric (c_m).
Columns 3-5 and 6-8 show the results of the ESA and LSA algorithms respectively. For
example, the differential equation solver (Diffeq), with 11 transitions, 31 places and
c_m = 9 units of delay, is simulated with target cycle times of 9, 12, and 18 units of delay.
With a target cycle time of 9 units of delay, the schedule from the ESA algorithm requires
2 ALUs and 3 multipliers, with an algorithmic runtime of 1 second. The number of
resources needed if no resource sharing is allowed (one resource per operation) is shown
in parentheses. For this example, 4 ALUs and 6 multipliers would be required without
resource sharing, implying that the ESA algorithm saves 2 ALUs and 3 multipliers.
The percentage of area savings is computed as the difference between the number of resources
required without sharing and the number required after sharing, divided by the number
required without sharing. For the Diffeq example, the ESA yields 50% area savings for
both ALUs and multipliers.
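The area-savings metric used throughout the tables is a one-line computation; the following sketch (an illustrative helper of our own) reproduces the Diffeq numbers:

```python
def area_savings(no_share, shared):
    """Percentage area savings: resources without sharing minus resources
    after sharing, relative to the no-sharing count."""
    return 100.0 * (no_share - shared) / no_share

# Diffeq with tau = 9 under ESA: 4 -> 2 ALUs and 6 -> 3 multipliers.
print(area_savings(4, 2), area_savings(6, 3))  # -> 50.0 50.0
```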
The first two applications, the 10-stage ring and the 10-stage pipeline, show one of
the key advantages of using marked graphs to model designs with multi-threading and
pipelined behaviors. In particular, these examples compare the results when
operations are mapped to pre-charged resources² and when they are mapped to pipelined
resources, which support more concurrency. For example, a 10-stage ring consists of 5
threads of execution operating in parallel. The 10-stage ring I test shows the results when
using pre-charged resources, while the 10-stage ring II test shows the results when using
pipelined resources. For a given target cycle time and forward latency, the 10-stage ring II
test requires fewer resources than the 10-stage ring I test, because pipelined resources are able
to accept new inputs faster than pre-charged resources, enabling more resource sharing.
We explore the quality of our novel heuristic algorithms by applying them to ten single-threaded
DSP applications. Each operation maps to a pre-charged resource and has the
following timing properties: ALU resources have 2 units of forward latency, multiplier
resources have 3 units of forward latency, and both have a backward latency of 1 unit. The
results suggest that area savings can be achieved by extending the target cycle time constraint.
For some small applications, the exact algorithm may run faster than the heuristic, but for
most complex applications the exact algorithm takes much longer to compute, sometimes
even more than 24 hours. In terms of result quality in Table 5.1, schedules from the LSA
algorithm yield slightly smaller area savings than the results from the ESA algorithm: more
precisely, the area savings in ALUs for the ESA and LSA algorithms are 66% and 57%
respectively, and the area savings in multipliers for the ESA and LSA algorithms are 60%
and 46% respectively; thus the ESA algorithm saves 9% more ALUs and 14% more
multipliers on average compared to the LSA algorithm. Notice
²After processing one token, a pre-charged resource needs time to reset before being able to process a
second token.
                  Input Spec.              Exact scheduling and alloc.       Exact scheduling and binding
Test          # of    # of     τ          # of       # of       time       # of       # of       time
App.          Trans   Places   (c_m)      ALUs(ORG)  MULs(ORG)  (secs)     ALUs(ORG)  MULs(ORG)  (secs)
Diffeq        11      31       9 (c_m)    2(4)       3(6)       1          2(4)       3(6)       307.6
                               12         1          2          1.8        1          2          23
                               18         1          2          2          1          2          15
Iir           17      46       7 (c_m)    3(6)       4(6)       7.1        -(6)       -(6)       >24hr
                               10         2          3          71.5       -          -          >24hr
                               14         2          2          6.5        -          -          >24hr
Fir           22      56       14 (c_m)   2(6)       2(6)       24.8       2(6)       2(6)       1004
                               21         1          2          365.3      -          -          >24hr
                               28         1          1          1646       1          1          15:32hr
Lattice       22      63       12 (c_m)   3(9)       3(9)       38.2       -(9)       -(9)       >24hr
                               18         2          2          10.9       -          -          >24hr
                               24         2          2          9.3        2          2          322
Volterra      33      98       15 (c_m)   4(10)      5(17)      20hr       -(10)      -(17)      >24hr
                               22         2          4          1:50hr     -          -          >24hr
                               30         -          -          >24hr      -          -          >24hr
Ellip         41      130      28 (c_m)   4(21)      2(8)       11.3       -(21)      -(8)       >24hr
                               32         -          -          >24hr      -          -          >24hr
                               36         -          -          >24hr      -          -          >24hr
Wdf7          43      112      10 (c_m)   -(16)      -(19)      >24hr      -(16)      -(19)      >24hr
                               15         -          -          >24hr      -          -          >24hr
                               20         -          -          >24hr      -          -          >24hr
Dct           50      120      4 (c_m)    20(25)     16(16)     10         -(25)      -(16)      >24hr
                               6          13         11         11         -          -          >24hr
                               8          -          -          >24hr      -          -          >24hr
Wavelet       57      167      22 (c_m)   -(26)      -(14)      >24hr      -(26)      -(14)      >24hr
                               33         -          -          >24hr      -          -          >24hr
                               44         -          -          >24hr      -          -          >24hr
Nc            60      175      17 (c_m)   -(24)      -(25)      >24hr      -(24)      -(25)      >24hr
                               25         -          -          >24hr      -          -          >24hr
                               34         -          -          >24hr      -          -          >24hr

Table 5.2: Experimental results of the exact scheduling and allocation and exact scheduling
and binding algorithms.
that the applications with a lower maximum cycle metric allow less area-savings, since they
consist of smaller cycles that offer less opportunity for resource sharing. For example, the
DCT with τ = 4 does not allow any area-saving for the multiplier resource, while the Diffeq
with τ = 9 allows a 50% area-saving for the multiplier resource.
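The role of the maximum cycle metric can be made concrete with a small sketch. The function below is a toy, with hypothetical delays and token counts rather than values from the benchmarks: it enumerates the simple cycles of a marked graph and returns the maximum ratio of total delay to initial tokens on any cycle, which is what bounds the achievable target cycle τ.

```python
# Toy maximum cycle metric of a marked graph (hypothetical example, not
# from the thesis benchmarks). Each edge carries (src, dst, delay, tokens);
# the metric is the max over simple cycles of (total delay) / (total tokens).
def max_cycle_metric(nodes, edges):
    adj = {}
    for u, v, d, k in edges:
        adj.setdefault(u, []).append((v, d, k))
    best = 0.0

    def dfs(start, node, delay, tokens, visited):
        nonlocal best
        for v, d, k in adj.get(node, []):
            if v == start and tokens + k > 0:
                # Closed a cycle; token-free cycles (deadlock) are skipped.
                best = max(best, (delay + d) / (tokens + k))
            elif v != start and v not in visited:
                dfs(start, v, delay + d, tokens + k, visited | {v})

    for n in nodes:
        dfs(n, n, 0.0, 0, {n})
    return best

# A two-operation loop with delays 2 and 3 and one initial token:
# cycle metric = (2 + 3) / 1 = 5.
tau = max_cycle_metric(["a", "b"], [("a", "b", 2, 0), ("b", "a", 3, 1)])
```

Smaller cycles yield smaller ratios and hence a smaller τ, which is exactly why a design like the DCT with τ = 4 leaves little slack for sharing resources.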
Table 5.2 compares the results of the two exact algorithms, ESA and ESB. For most
applications, the ESB algorithm cannot produce results within 24 hours due to its very
high computational complexity. Since the ESA algorithm creates schedules with better
area-savings than the ESB algorithm, we use the ESA results as the baseline for the
area-saving comparisons with our heuristic algorithms.
Table 5.3 compares the results of the exact scheduling and allocation algorithm with
those of the two heuristic algorithms: list scheduling and allocation (LSA) and list
scheduling and binding (LSB). We applied the experiments to the same ten DSP applications
shown in Table 5.1, with the following results. On average, compared to the ESA algorithm,
the LSA schedules yield 10% less area-savings of ALUs and 16% less area-savings of
multipliers, while the LSB schedules yield 13% less area-savings of ALUs and 15% less
area-savings of multipliers, indicating only a modest loss of optimality. Additionally, the
differences between the runtimes and average area-savings of the two heuristic algorithms
are negligible, implying that the many-to-many binding problem has little influence on the
quality of results for most of our test cases. However, compared to the LSA algorithm, the
LSB algorithm has more
                   ESA                         LSA                         LSB
Test           # of     # of     time      # of     # of     time      # of     # of     time
App.      τ    ALUs     MULs     (secs)    ALUs     MULs     (secs)    ALUs     MULs     (secs)
               (ORG)    (ORG)              (ORG)    (ORG)              (ORG)    (ORG)
Diffeq    9    2 (4)    3 (6)    1         3 (4)    3 (6)    2.9       3 (4)    3 (6)    2.6
          12   1        2        1.8       2        2        3.7       2        2        3.5
          18   1        2        2         1        2        2         1        2        1.8
Iir       7    3 (6)    4 (6)    7.1       3 (6)    6 (6)    10.7      3 (6)    7 (6)    13.3
          10   2        3        71.5      3        4        10.2      3        4        10
          14   2        2        6.5       2        6        17.2      2        3        7.5
Fir       14   2 (6)    2 (6)    24.8      3 (6)    5 (6)    53.9      3 (6)    5 (6)    24.3
          21   1        2        365.3     2        2        53.4      2        2        9.6
          28   1        1        1646      2        1        49.6      2        1        11.3
Lattice   12   3 (9)    3 (9)    38.2      3 (9)    5 (9)    15.3      3 (9)    5 (9)    13.7
          18   2        2        10.9      2        4        14.8      2        4        15.8
          24   2        2        9.3       2        2        6         2        2        5.7
Volterra  15   4 (10)   5 (17)   20hr      5 (10)   8 (17)   50.7      5 (10)   8 (17)   50.3
          22   2        4        1:50hr    2        4        11.9      2        4        11.2
          30   -        -        >24hr     2        3        23.5      2        3        22.4
Ellip     28   4 (21)   2 (8)    11.3      6 (21)   4 (8)    100.5     6 (21)   4 (8)    96.8
          32   -        -        >24hr     3        2        48.5      3        2        47.4
          36   -        -        >24hr     3        2        37        3        2        35
Wdf7      10   - (16)   - (19)   >24hr     7 (16)   11 (19)  81.6      7 (16)   11 (19)  83
          15   -        -        >24hr     7        9        89.8      7        9        82.7
          20   -        -        >24hr     4        5        57        5        5        68.4
Dct       4    20 (25)  16 (16)  10        21 (25)  16 (16)  62.6      26 (25)  16 (16)  123
          6    13       11       11        19       14       129.7     19       16       150.5
          8    -        -        >24hr     14       10       90.4      13       9        70.3
Wavelet   22   - (26)   - (14)   >24hr     8 (26)   10 (14)  241.5     8 (26)   10 (14)  223.2
          33   -        -        >24hr     3        2        56.8      3        2        55
          44   -        -        >24hr     2        2        34.9      2        2        35.3
Nc        17   - (24)   - (25)   >24hr     11 (24)  16 (25)  334.7     11 (24)  16 (25)  370
          25   -        -        >24hr     6        9        234.8     6        9        233.3
          34   -        -        >24hr     3        5        95.9      3        5        95.9

Table 5.3: Experimental results comparing the exact scheduling and allocation (ESA)
algorithm to the two heuristic algorithms (LSA and LSB).
advantages: it performs scheduling and binding simultaneously and can include the cost
of control complexity.
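The flavor of the binding step can be illustrated with a small sketch. The code below is not the LSB algorithm from this chapter; it is a greedy interval-binding toy (all operation names and times are hypothetical) in which each scheduled operation is packed onto the first free resource, and the number of operations sharing a resource serves as a crude proxy for its multiplexing (control) cost.

```python
# Toy greedy interval binding (not the thesis LSB algorithm).
# ops: list of (name, start, end) with start < end, from a fixed schedule.
# Returns one list of operation names per allocated resource; longer lists
# mean more operations multiplexed onto one resource (higher control cost).
def bind(ops):
    resources = []  # each entry: [time the resource becomes free, bound op names]
    for name, start, end in sorted(ops, key=lambda o: o[1]):
        for r in resources:
            if r[0] <= start:            # resource is free: reuse it (share hardware)
                r[0] = end
                r[1].append(name)
                break
        else:                            # no free resource: allocate a new one
            resources.append([end, [name]])
    return [names for _, names in resources]

# Three hypothetical multiplications; 'mul1' and 'mul3' can share a multiplier.
binding = bind([("mul1", 0, 2), ("mul2", 1, 3), ("mul3", 2, 4)])
```

Here `binding` is `[['mul1', 'mul3'], ['mul2']]`: two multipliers suffice, and the first one needs a 2-input mux. A binding-aware algorithm such as LSB can weigh that mux cost against the area saved by sharing.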
Lastly, we show the application of our algorithms to examples with multiple threads.
In particular, we modified the MG specifications of the nine DSP algorithms to enable
them to run with 2 and 3 threads of execution, as shown in Table 5.4. The results indicate
that this approach is most useful for applications whose algorithmic loops are the
bottleneck of the design (i.e., the maximum cycle metric is much greater than the local
cycle of the resources). By allowing multiple threads to run simultaneously, resources are
utilized more efficiently because they can be reused by other problem instances. For
example, the Diffeq results from the ESA algorithm require 2 ALUs and 3 multipliers for a
target cycle of 14, but doubling the performance to a target cycle of 7 requires
much less than double the hardware: only 3 ALUs (instead of 4) and 5 multipliers (instead
of 6), saving 25% of the ALUs and 17% of the multipliers. In summary, compared to the
single-threaded results, the ESA algorithm applied to a multi-threaded system saves 30%
ALUs and 9% multipliers on average, and the LSB algorithm saves 32% ALUs and 41%
multipliers on average.
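The arithmetic behind this observation can be sketched with illustrative numbers (not taken from Table 5.4): each added token divides the algorithmic loop's cycle metric, but the fixed local cycle of a pre-charged resource eventually becomes the bottleneck, at which point further tokens buy nothing.

```python
# Illustrative numbers only: a single algorithmic loop with total delay D
# and t initial tokens has cycle metric D / t; a pre-charged resource
# contributes a fixed local (reset-limited) cycle that tokens cannot reduce.
D = 14.0            # total delay around the algorithmic loop
local_cycle = 5.0   # local cycle of a pre-charged resource

def achievable_cycle(tokens):
    # The achievable target cycle is bounded by both constraints.
    return max(D / tokens, local_cycle)

# One token: the loop dominates (14). Two tokens: the loop metric halves (7).
# Three tokens: the local cycle takes over (5), and a fourth token is useless.
results = [achievable_cycle(t) for t in (1, 2, 3, 4)]
```

This is why the multi-threading gains in Table 5.4 are largest when the maximum cycle metric far exceeds the local cycle of the resources.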
5.7 Conclusion and future work
This chapter presents an approach for the performance-driven high-level synthesis of
highly concurrent asynchronous systems. Marked graphs are used as the input model instead
of the well-known CDFG because they can express highly concurrent hardware systems
                           ESA                         LSB
Test        τ           # of    # of    time       # of    # of    time
App.        (# tokens)  ALUs    MULs    (secs)     ALUs    MULs    (secs)
Diffeq      5 (2 toks)   3       5       0.8        4       6       3.5
Iir         4 (2 toks)   5       7       6          6       7       5.4
Fir         7 (2 toks)   3       4       4.5        3       6       8.4
            5 (3 toks)   4       5       3.7        6       6       13.4
Lattice     6 (2 toks)   5       6       2.6        5       9       16.3
            4 (3 toks)   7       9       2.5        9       9       14.9
Volterra    8 (2 toks)   4       9       8          6       10      27.7
            5 (3 toks)   6       14      32         10      17      58.9
Ellip       14 (2 toks)  7       4       9.9        8       6       105
            10 (3 toks)  7       4       411.7      10      6       98.4
Wdf7        5 (2 toks)   10      16      1.34hr     16      19      109
Wavelet     11 (2 toks)  8       6       17hr       10      10      147
            8 (3 toks)   -       -       >24hr      14      7       107
Nc          8 (2 toks)   -       -       >24hr      13      17      210
            6 (3 toks)   -       -       >24hr      19      25      336

Table 5.4: Experimental results of the ESA and LSB algorithms applied to several multi-
threaded DSP applications.
with multiple threads of execution. We first propose exact and heuristic algorithms for
the scheduling and allocation problem, which do not address the binding problem. Then, we
propose exact and heuristic scheduling and binding algorithms, which can include the cost
of the control complexity associated with binding. The algorithms trade off area and
performance by varying the degree of concurrency. The experimental results suggest that
the heuristic algorithms have far lower runtimes than the exact algorithms, with only a
modest loss of area-savings. Additionally, adding multiple threads to a system can
achieve a performance gain higher than the factor of area increase.
Our proposed synthesis work is a starting point for the development of a complete
CAD tool framework for asynchronous systems. Future work may include the following.
First, we look forward to including a form of pipeline optimization known as slack
matching [13] in this HLS framework. Second, we have not addressed controller generation
in this work; the synthesis and optimization of the associated control circuitry is an
important aspect of future work. Third, our exact algorithms may be improved by adding
additional constraints to speed up the MILP. Moreover, alternative heuristic algorithms,
such as force-directed scheduling and path-based scheduling, may be considered for
improving the quality of results and runtimes. Fourth, our work is restricted to the use
of marked graphs as the input model and therefore cannot express systems with choices;
future work may consider approaches based on Petri nets, which can describe such systems.
Additionally, a stochastic delay model for each operation might be more accurate than the
current fixed delay model.
Bibliography
[1] V. Akella and G. Gopalakrishnan. SHILPA: A high-level synthesis system for
self-timed circuits. In Proc. International Conf. Computer-Aided Design (ICCAD),
pages 587–591. IEEE Computer Society Press, November 1992.
[2] V.H. Allan, R.B. Jones, R.M. Lee, and S.J. Allan. Software pipelining. ACM
Computing Surveys, 27(3):367–432, 1995.
[3] P. Babighian, L. Benini, and E. Macii. A scalable algorithm for RTL insertion of
gated clocks based on ODCs computation. IEEE Transactions on Computer-Aided
Design, 24(1):29–42, January 2005.
[4] B.M. Bachman, H. Zheng, and C.J. Myers. Architectural synthesis of timed asyn-
chronous systems. In Proc. International Conf. Computer Design (ICCD). IEEE
Computer Society Press, October 1999.
[5] R.M. Badia and J. Cortadella. High-level synthesis of asynchronous digital circuits:
Scheduling strategies. Technical report, Universitat Polit` ecnica de Catalunya,
1992.
[6] R.M. Badia and J. Cortadella. High-level synthesis of asynchronous systems:
Scheduling and process synchronization. In Proc. European Conference on De-
sign Automation (EDAC), pages 70–74. IEEE Computer Society Press, February
1993.
[7] A. Bardsley. Implementing Balsa Handshake Circuits. PhD thesis, Department of
Computer Science, University of Manchester, 2000.
[8] A. Bardsley and D.A. Edwards. Synthesising an asynchronous DMA controller
with Balsa. Journal of Systems Architecture, 46:1309–1319, 2000.
[9] V.A. Bartlett and E. Grass. A low-power asynchronous VLSI FIR filter. In Ad-
vanced Research in VLSI, pages 29–39, March 2001.
[10] C.H.K. Berkel. Handshake Circuits: an Asynchronous Architecture for VLSI Pro-
gramming, volume 5 of International Series on Parallel Computation. Cambridge
University Press, 1993.
[11] P.A. Beerel. CAD Tools for the Synthesis, Verification, and Testability of Robust
Asynchronous Circuits. PhD thesis, Stanford University, 1994.
[12] P.A. Beerel. Asynchronous circuits: an increasingly practical design solution. In
Quality Electronic Design, pages 367–372, March 2002.
[13] P.A. Beerel, M. Davies, A. Lines, and N. Kim. Slack matching asynchronous de-
signs. In Proc. International Symposium on Advanced Research in Asynchronous
Circuits and Systems, April 2006.
[14] C.H.K. Berkel and A. Bink. Single-track handshaking signaling with application
to micropipelines and handshake circuits. In Proc. International Symposium on
Advanced Research in Asynchronous Circuits and Systems, pages 122–133. IEEE
Computer Society Press, March 1996.
[15] C.H.K. Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, F. Schalij, and
R.V.D. Wiel. A single-rail re-implementation of a DCC error detector using a
generic standard-cell library. In Asynchronous Design Methodologies, pages 72–
79. IEEE Computer Society Press, May 1995.
[16] C.H.K. Berkel, F. Huberts, and A. Peeters. Stretching quasi delay insensitivity
by means of extended isochronic forks. In Asynchronous Design Methodologies,
pages 99–106. IEEE Computer Society Press, May 1995.
[17] C.H.K. Berkel, M.B. Josephs, and S.M. Nowick. Scanning the technology: Appli-
cations of asynchronous circuits. Proceedings of the IEEE, 87(2):223–233, Febru-
ary 1999.
[18] C.H.K. Berkel, J. Kessels, M. Roncken, R. Saeijs, and F. Schalij. The VLSI-
programming language Tangram and its translation into handshake circuits. In
Proc. European Conference on Design Automation (EDAC), pages 384–389, 1991.
[19] I. Blunno, J. Cortadella, A. Kondratyev, L. Lavagno, K. Lwin, and C. Sotiriou.
Handshake protocols for de-synchronization. In Proc. International Symposium on
Advanced Research in Asynchronous Circuits and Systems, pages 149–158, April
2004.
[20] S.M. Burns. Automated compilation of concurrent programs into self-timed cir-
cuits. Master’s thesis, California Institute of Technology, 1988.
[21] S.M. Burns. Performance Analysis and Optimization of Asynchronous Circuits.
PhD thesis, California Institute of Technology, 1991.
[22] A. Bystrov and A. Yakovlev. Ordered arbiters. Electronics Letters, 35(11):877–
879, 1999.
[23] R. Camposano. Path-based scheduling for synthesis. IEEE Transactions on
Computer-Aided Design, 10(1):85–93, January 1991.
[24] L.P. Carloni, K.L. McMillan, and A.L. Sangiovanni-Vincentelli. Theory of latency-
insensitive design. IEEE Transactions on Computer-Aided Design, 20(9):1059–
1076, Sep 2001.
[25] A.P. Chandrakasan and R.W. Brodersen. Low Power Digital CMOS Design.
Kluwer Academic Publishers, 1995.
[26] S. Chaudhuri, R.A. Walker, and J.E. Mitchell. Analyzing and exploiting the struc-
ture of the constraints in the ILP approach to the scheduling problem. IEEE Trans-
actions on VLSI Systems, 2(4):456 – 471, dec 1994.
[27] T. Chelcea, A. Bardsley, D. Edwards, and S.M. Nowick. A burst-mode oriented
back-end for the Balsa synthesis system. In Proc. Design, Automation and Test in
Europe (DATE), pages 330–337, March 2002.
[28] T. Chelcea and S.M. Nowick. Resynthesis and peephole transformations for the
optimization of large-scale asynchronous systems. In Proc. ACM/IEEE Design
Automation Conference, June 2002.
[29] T. Chelcea and S.M. Nowick. Balsa-cube: an optimising back-end for the Balsa
synthesis system. In 14th UK Async. Forum, 2003.
[30] T.-A. Chu. Synthesis of Self-Timed VLSI Circuits from Graph-Theoretic Specifica-
tions. PhD thesis, MIT Laboratory for Computer Science, June 1987.
[31] J. Cortadella and R.M. Badia. An asynchronous architecture model for behavioral
synthesis. In Proc. European Conference on Design Automation (EDAC), pages
307–311. IEEE Computer Society Press, 1992.
[32] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. Yakovlev.
Petrify: a tool for manipulating concurrent specifications and synthesis of asyn-
chronous controllers. IEICE Transactions on Information and Systems, E80-
D(3):315–325, March 1997.
[33] J. Cortadella, A. Kondratyev, and L. Lavagno. Quasi-static scheduling for con-
current architectures. In Proc. of the Third International Conference on Application
of Concurrency to System Design, 2003.
[34] A. Dasdan. Experimental analysis of the fastest optimum cycle ratio and mean
algorithms. ACM Transactions on Design Automation of Electronic Systems,
9(4):385–418, October 2004.
[35] M.E. Dean, T.E. Williams, and D.L. Dill. Efficient self-timing with level-encoded
2-phase dual-rail (LEDR). In Carlo H. Séquin, editor, Advanced Research in VLSI,
pages 55–70. MIT Press, 1991.
[36] D. Edwards and A. Bardsley. Balsa: An asynchronous hardware synthesis lan-
guage. The Computer Journal, 45(1):12–18, 2002.
[37] C. Farnsworth, D.A. Edwards, J. Liu, and S.S. Sikand. A hybrid asynchronous
system design environment. In Asynchronous Design Methodologies, pages 91–
98. IEEE Computer Society Press, May 1995.
[38] M. Ferretti and P.A. Beerel. Single-track asynchronous pipeline templates using
1-of-N encoding. In Proc. Design, Automation and Test in Europe (DATE), pages
1008–1015, March 2002.
[39] M. Ferretti, R.O. Ozdag, and P.A. Beerel. High performance asynchronous ASIC
back-end design flow using single-track full-buffer standard cells. In Proc. Inter-
national Symposium on Advanced Research in Asynchronous Circuits and Systems,
April 2004.
[40] R.M. Fuhrer, S.M. Nowick, M. Theobald, N.K. Jha, B. Lin, and L. Plana. Mini-
malist: An environment for the synthesis, verification and testability of burst-mode
asynchronous machines. Technical Report TR CUCS-020-99, Columbia Univer-
sity, NY, July 1999.
[41] S.B. Furber and P. Day. Four-phase micropipeline latch control circuits. IEEE
Transactions on VLSI Systems, 4(2):247–253, June 1996.
[42] S.B. Furber, J.D. Garside, P. Riocreux, S. Temple, P. Day, J. Liu, and N.C. Paver.
AMULET2e: An asynchronous embedded controller. Proceedings of the IEEE,
87(2):243–256, February 1999.
[43] H.V. Gageldonk, D. Baumann, C.H.K. Berkel, D. Gloor, A. Peeters, and
G. Stegmann. An asynchronous low-power 80c51 microcontroller. In Proc. Inter-
national Symposium on Advanced Research in Asynchronous Circuits and Systems,
pages 96–107, 1998.
[44] C.H. Gebotys. Throughput optimized architectural synthesis. IEEE Transactions
on VLSI Systems, 1(3):254–261, September 1993.
[45] C.H. Gebotys and M.I. Elmasry. Global optimization approach for architectural
synthesis. IEEE Transactions on Computer-Aided Design, 12(10):1266–1278,
September 1993.
[46] G. Goossens, J. Vandewalle, and H.D. Man. Loop optimization in register-transfer
scheduling for DSP-systems. In Proc. ACM/IEEE Design Automation Conference,
pages 826–830, 1989.
[47] R. Govindarajan, E.R. Altman, and G.R. Gao. A framework for resource-
constrained rate-optimal software pipelining. IEEE Transactions on Parallel and
Distributed Systems, 7(11):1133–1150, November 1996.
[48] G.D. Hachtel and F. Somenzi. Logic Synthesis and Verification Algorithms.
Kluwer Academic Publishers, 1996.
[49] D. Harris. Skew-Tolerant Circuit Design. Morgan Kaufmann Publishers, 2000.
[50] D. Harris and S. Naffziger. Statistical clock skew modeling with data delay varia-
tions. IEEE Transactions on VLSI Systems, 9(6):888–898, Dec 2001.
[51] C.A.R. Hoare. Communicating sequential processes. Communications of the
ACM, 21(8):666–677, August 1978.
[52] C.T. Hwang, Y.C. Hsu, and Y.L. Lin. PLS: A scheduler for pipeline synthesis. IEEE
Transactions on Computer-Aided Design, September 1993.
[53] C.T. Hwang, J.H. Lee, and Y.H. Hsu. A formal approach to the scheduling problem
in high level synthesis. IEEE Transactions on Computer-Aided Design, 10(4):464–
475, April 1991.
[54] K.S. Hwang, A.E. Casavant, and M.A. Abreu. Scheduling and hardware sharing in
pipelined datapath. In Proc. International Conf. Computer-Aided Design (ICCAD),
pages 24–27, November 1989.
[55] Y.-L. Jeang, Y.-C. Hsu, J.-F. Wang, and J.-Y. Lee. High throughput pipelined data
path synthesis by conserving the regularity of nested loops. In Proc. International
Conf. Computer-Aided Design (ICCAD), pages 450–453, November 1993.
[56] H.S. Jun and S.Y. Hwang. Design of a pipelined datapath synthesis system for
digital signal processing. IEEE Transactions on VLSI Systems, 2(3):292–303,
Sep 1994.
[57] J. Kessels and P. Marston. Designing asynchronous standby circuits for a low-
power pager. In Proc. International Symposium on Advanced Research in Asyn-
chronous Circuits and Systems, pages 268–278. IEEE Computer Society Press,
April 1997.
[58] H. Kim, P.A. Beerel, and K.S. Stevens. Relative timing based verification of timed
circuits and systems. In Proc. International Symposium on Advanced Research in
Asynchronous Circuits and Systems, April 2002.
[59] K. Kim, P.A. Beerel, and Y. Hong. An asynchronous matrix-vector multiplier for
discrete cosine transform. In International Symposium on Low Power Electronics
and Design, pages 256–261, July 2000.
[60] S. Kim and P.A. Beerel. Pipeline optimization for asynchronous circuits: Com-
plexity analysis and an efficient optimal algorithm. In Proc. International Conf.
Computer-Aided Design (ICCAD), November 2000.
[61] T.K. Kim, N. Yonezawa, J.W.S. Liu, and C.L. Liu. A scheduling algorithm for con-
ditional resource sharing–a hierarchical reduction approach. IEEE Transactions on
Circuits and Systems, 13(4):425–438, April 1994.
[62] A. Kondratyev and K. Lwin. Design of asynchronous circuits by synchronous CAD
tools. In Proc. ACM/IEEE Design Automation Conference, June 2002.
[63] P. Kudva, G. Gopalakrishnan, and V . Akella. High level synthesis of asynchronous
circuit targeting state machine controllers. In Asia-Pacific Conference on Hardware
Description Languages (APCHDL), pages 605–610, 1995.
[64] P. Kudva, G. Gopalakrishnan, and H. Jacobson. A technique for synthesizing dis-
tributed burst-mode circuits. In Proc. ACM/IEEE Design Automation Conference,
1996.
[65] M. Ligthart, K. Fant, R. Smith, A. Taubin, and A. Kondratyev. Asynchronous de-
sign using commercial HDL synthesis tools. In Proc. International Symposium on
Advanced Research in Asynchronous Circuits and Systems, pages 114–125. IEEE
Computer Society Press, April 2000.
[66] D.H. Linder and J.C. Harden. Phased logic: Supporting the synchronous de-
sign paradigm with delay-insensitive circuitry. IEEE Transactions on Computers,
45(9):1031–1044, September 1996.
[67] A. Lines. Pipelined asynchronous circuits. Technical Report 1998.cs-tr-95-21,
California Institute of Technology, June 1998.
[68] D.W. Lloyd and J.D. Garside. A practical comparison of asynchronous design
styles. In Proc. International Symposium on Advanced Research in Asynchronous
Circuits and Systems, pages 36–45. IEEE Computer Society Press, March 2001.
[69] J. Magott. Performance evaluation of concurrent systems using Petri Nets. Infor-
mation Processing Letters, 18(1):7–14, January 1984.
[70] R. Manohar. Width-adaptive data word architectures. In Advanced Research in
VLSI, pages 112–129, March 2001.
[71] R. Manohar, T.K. Lee, and A.J. Martin. Projection: A synthesis technique for
concurrent systems. In Proc. International Symposium on Advanced Research in
Asynchronous Circuits and Systems, pages 125–134, April 1999.
[72] A.J. Martin. Compiling communicating processes into delay-insensitive VLSI cir-
cuits. Distributed Computing, 1(4):226–234, 1986.
[73] A.J. Martin. The limitations to delay-insensitivity in asynchronous circuits. In
William J. Dally, editor, Advanced Research in VLSI, pages 263–278. MIT Press,
1990.
[74] A.J. Martin. Programming in VLSI: From communicating processes to delay-
insensitive circuits. In C. A. R. Hoare, editor, Developments in Concurrency and
Communication, UT Year of Programming Series, pages 1–64. Addison-Wesley,
1990.
[75] A.J. Martin. Towards an energy complexity of computation. Information Process-
ing Letters, 77:181–187, 2001.
[76] A.J. Martin, S.M. Burns, T.K. Lee, D. Borkovic, and P.J. Hazewindus. The
first asynchronous microprocessor: the test results. Computer Architecture News,
17(4):95–110, June 1989.
[77] M.C. McFarland, A.C. Parker, and R. Camposano. The high-level synthesis of
digital systems. Proceedings of the IEEE, 78(2):301–318, February 1990.
[78] G.D. Micheli. Synthesis and Optimization of Digital Circuits. McGraw Hill, 1994.
[79] C. Myers. Asynchronous Circuit Design. John Wiley & Sons, 2001.
[80] T. Nanya, A. Takamura, M. Kuwako, M. Imai, T. Fujii, M. Ozawa, I. Fukasaku,
Y. Ueno, F. Okamoto, H. Fujimoto, O. Fujita, M. Yamashina, and M. Fukuma.
TITAC-2: A 32-bit scalable-delay-insensitive microprocessor. In Symposium
Record of HOT Chips IX, pages 19–32, August 1997.
[81] T. Nanya, A. Takamura, M. Kuwako, M. Imai, M. Ozawa, M. Ozcan, R. Morizawa,
and H. Nakamura. Scalable-delay-insensitive design: A high-performance ap-
proach to dependable asynchronous systems. In Proc. International Symp. on Fu-
ture of Intellectual Integrated Electronics, pages 531–540, Sendai, Japan, March
1999.
[82] L.S. Nielsen, C. Niessen, J. Sparsø, and C.H.K. Berkel. Low-power operation
using self-timed and adaptive scaling of the supply voltage. IEEE Transactions on
VLSI Systems, 2(4):391–397, December 1994.
[83] S.M. Nowick. Design of a low-latency asynchronous adder using speculative com-
pletion. IEE Proceedings, Computers and Digital Techniques, 143(5):301–307,
September 1996.
[84] S.M. Nowick, M.B. Josephs, and C.H.K. Berkel. Scanning the special issue on
asynchronous circuits and systems. Proceedings of the IEEE, 87(2):219–222,
February 1999.
[85] M. Nyström and A. Martin. Asynchronous Pulse Logic. Kluwer Academic Pub-
lishers, 2002.
[86] R.O. Ozdag and P.A. Beerel. High-speed QDI asynchronous pipelines. In Proc.
International Symposium on Advanced Research in Asynchronous Circuits and Sys-
tems, pages 13–22, April 2002.
[87] R.O. Ozdag, M. Singh, P.A. Beerel, and S.M. Nowick. High-speed non-linear
asynchronous pipelines. In Proc. Design, Automation and Test in Europe (DATE),
pages 1000–1007, March 2002.
[88] N. Park and A.C. Parker. Sehwa: A software package for synthesis of pipelines
from behavioral specifications. IEEE Transactions on Computer-Aided Design,
7(3):356–370, March 1988.
[89] N.C. Paver, P. Day, C. Farnsworth, D.L. Jackson, W.A. Lien, and J. Liu. A low-
power, low-noise configurable self-timed DSP. In Proc. International Symposium
on Advanced Research in Asynchronous Circuits and Systems, pages 32–42, 1998.
[90] A. Peeters. Single-Rail Handshake Circuits. PhD thesis, Eindhoven University of
Technology, June 1996.
[91] M.A. Pena, J. Cortadella, E. Pastor, and A. Smirnov. A case study for the verifica-
tion of complex timed circuits: IPCMOS. In Proc. Design, Automation and Test in
Europe (DATE), March 2002.
[92] C.A. Petri. Fundamentals of a theory of asynchronous information flow. In Proc. of
IFIP, pages 386–390. Amsterdam: North Holland Publisher Company, April 1963.
[93] I. Radivojevic and F. Brewer. Analysis of conditional resource sharing using a
guard-based control representation. In Proc. International Conf. Computer Design
(ICCD), pages 434–439, October
1995.
[94] S.M. Sait and H. Youssef. VLSI Physical Design Automation. IEEE Press, 1995.
[95] C.L. Seitz. System timing. In Carver A. Mead and Lynn A. Conway, editors,
Introduction to VLSI Systems, chapter 7. Addison-Wesley, 1980.
[96] M. Singh and S.M. Nowick. High-throughput asynchronous pipelines for fine-
grain dynamic datapaths. In Proc. International Symposium on Advanced Research
in Asynchronous Circuits and Systems, pages 198–209. IEEE Computer Society
Press, April 2000.
[97] A. Smirnov, A. Taubin, and M. Karpovsky. Automated pipelining in ASIC synthesis
methodology: Gate transfer level. In International Workshop on Logic and Synthe-
sis, June 2004.
[98] A. Smirnov, A. Taubin, M. Karpovsky, and L. Rozenblyum. Gate transfer level
synthesis as an automated approach to fine-grain pipelining. In Workshop on Token
Based Computing, June 2004.
[99] J. Sparsø and S. Furber, editors. Principles of Asynchronous Circuit Design: A
Systems Perspective. Kluwer Academic Publishers, 2001.
[100] D.L. Springer and D.E. Thomas. Exploiting the special structure of conflict and
compatibility graphs in high-level synthesis. IEEE Transactions on Computer-
Aided Design, 13(7):843–856, July 1994.
[101] K.S. Stevens, S. Rotem, R. Ginosar, P.A. Beerel, C.J. Myers, K.Y . Yun, R. Koi,
C. Dike, and M. Roncken. An asynchronous instruction length decoder. IEEE
Journal of Solid-State Circuits, 36(2):217–228, February 2001.
[102] L. Stok. Architectural Synthesis and Optimization of Digital Systems. PhD thesis,
Eindhoven University, The Netherlands, 1991.
[103] I. Sutherland and S. Fairbanks. GasP: A minimal FIFO control. In Proc. Interna-
tional Symposium on Advanced Research in Asynchronous Circuits and Systems,
pages 46–53. IEEE Computer Society Press, March 2001.
[104] I.E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738,
June 1989.
[105] J. Teifel, D. Fang, D. Biermann, C. Kelly, and R. Manohar. Energy-efficient
pipelines. In Proc. International Symposium on Advanced Research in Asyn-
chronous Circuits and Systems, pages 23–33, April 2002.
[106] J. Teifel and R. Manohar. Static tokens: Using dataflow to automate concurrent
pipeline synthesis. In Proc. International Symposium on Advanced Research in
Asynchronous Circuits and Systems. IEEE Computer Society Press, April 2004.
[107] M. Theobald and S.M. Nowick. Transformations for the synthesis and optimiza-
tion of asynchronous distributed control. In Proc. ACM/IEEE Design Automation
Conference, June 2001.
[108] S. Tosun, O. Ozturk, E. Arvas, M. Kandemir, Y. Xie, and W-L. Hung. An ILP
formulation for reliability-oriented high-level synthesis. In Proc. International
Symposium on Quality Electronic Design (ISQED), March 2005.
[109] S. Tugsinavisut and P.A. Beerel. Control circuit templates for asynchronous
bundled-data pipelines. In Proc. Design, Automation and Test in Europe (DATE),
page 1098, March 2002.
[110] S. Tugsinavisut, Y. Hong, D. Kim, K. Kim, and P.A. Beerel. Efficient asynchronous
bundled-data pipelines for DCT matrix-vector multiplication. IEEE Transactions
on VLSI Systems, 13(4):448–461, April 2005.
[111] T.E. Williams. Self-Timed Rings and their Application to Division. PhD thesis,
Stanford University, June 1991.
[112] C.G. Wong and A.J. Martin. Data-driven process decomposition for the synthesis
of asynchronous circuits. In IEEE International Conference on Electronics, Cir-
cuits and Systems, 2001.
[113] C.G. Wong and A.J. Martin. High-level synthesis of asynchronous systems by
data-driven decomposition. In Proc. ACM/IEEE Design Automation Conference,
pages 508–513, June 2003.
[114] A. Yakovlev, M. Kishinevsky, A. Kondratyev, L. Lavagno, and M. Pietkiewicz-
Koutny. On the models for asynchronous circuit behaviour with OR causality.
Formal Methods in System Design, 9(3):189–233, 1996.
[115] A. Yakovlev, A.M. Koelmans, and L. Lavagno. High-level modeling and design
of asynchronous interface logic. IEEE Design & Test of Computers, 12(1):32–40,
Spring 1995.
[116] A. Yamada, S. Nakamura, N. Ishiura, I. Shirakawa, and T. Kambe. Optimal
scheduling for conditional resource sharing. In Proc. International Symposium on
Circuits and Systems (ISCAS), May 1995.
[117] T. Yoneda, A. Matsumoto, M. Kato, and C. Myers. High level synthesis of timed
asynchronous circuits. In Proc. International Symposium on Advanced Research in
Asynchronous Circuits and Systems. IEEE Computer Society Press, March 2005.
[118] K.Y. Yun, P.A. Beerel, V. Vakilotojar, A.E. Dooply, and J. Arceo. The design and
verification of a high-performance low-control-overhead asynchronous differential
equation solver. IEEE Transactions on VLSI Systems, 6(4):643–655, December
1998.
[119] K.Y. Yun and D.L. Dill. Automatic synthesis of extended burst-mode circuits: Part
I (specification and hazard-free implementation). IEEE Transactions on Computer-
Aided Design, 18(2):101–117, February 1999.
[120] K.Y. Yun and D.L. Dill. Automatic synthesis of extended burst-mode circuits:
Part II (automatic synthesis). IEEE Transactions on Computer-Aided Design,
18(2):118–132, February 1999.
[121] Z. Zhang, Y. Fan, M. Potkonjak, and J. Cong. Gradual relaxation techniques with
applications to behavioral synthesis. In Proc. International Conf. Computer-Aided
Design (ICCAD), pages 529–535, November 2003.