CLUSTERING TECHNIQUES FOR COARSE-GRAINED, ANTIFUSE-BASED
FPGAS
by
Chang Woo Kang
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2006
Copyright 2006 Chang Woo Kang
UMI Number: 3237159
UMI Microform 3237159
Copyright 2007 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
DEDICATION
To my parents and family, my fiancée Eunju Lee, and my best friend Sung-Hoon
Kang, thanks for their unconditional support and love.
ACKNOWLEDGEMENTS
I would like to offer my humble acknowledgment to Professor Massoud Pedram,
who supervised and guided me through this achievement. From him, I received not
only knowledge on my research but also emotional support whenever I encountered
frustration. I also thank Professor Jeff Draper and Professor Roger Zimmermann
for being on my thesis committee.
I would like to extend my deep gratitude to Professor Jeff Draper, who supported
me at the Information Sciences Institute for three years. His support has been an
enormous encouragement during my studies at USC.
I would like to thank all of the SPORT group members who have given freely of
their time, hearts, and resources to support this research. A partial list includes:
Chanseok Hwang, Kihwan Choi, Ali Iranli, Yazdan Aghahiri, Peng Rong, Yu Hou,
Afshin Abdollahi, Wonbok Lee, Maryam Soltan, Morteza Maleki, Hanif Fatemi,
Soroush Abbaspour, Hwisung Jung, and Behnam Amelifard.
Finally, I would like to express my deep affection to Yunjung Choi, Ihn Kim,
Joongseok Moon, Kisup Chong, and Kihoon Jeong.
Chang Woo Kang
USC, May 2006
TABLE OF CONTENTS
Dedication
Acknowledgements
List of Tables
List of Figures
Abstract
CHAPTER 1 Introduction
1.1 Motivation
1.2 Dissertation Outline
CHAPTER 2 Coarse-grained FPGAs and Previous Work
2.1 Overview of Coarse-grained FPGA Architecture
2.2 FPGA CAD Flow
2.2.1 Technology mapping
2.2.2 Clustering Techniques
2.2.3 Placement and Routing
2.3 Summary
CHAPTER 3 Tool Flow and Cell Library Generation
3.1 Introduction
3.2 Tool Flow
3.3 Cell Library Generation
3.4 Cost Assignment
3.5 Summary
CHAPTER 4 Area-driven Clustering Algorithm with Considerations of Interconnect Connectivity and Circuit Speed
4.1 Introduction
4.2 Lower-bound Calculation
4.2.1 Problem Statement and Dynamic Programming Approach
4.2.2 Set containment relations
4.2.3 Minimum number of pASIC3 logic cells with given base gates
4.2.4 Type distribution table
4.2.5 Problem formulation and solution
4.3 Area-driven Clustering Technique
4.3.1 Interconnect-aware Clustering
4.3.2 Timing Slack-driven Clustering
4.4 Experiment Results
4.5 Summary
CHAPTER 5 Timing-driven Clustering
5.1 Introduction
5.2 Problem statement
5.3 Multi-dimensional labeling algorithm
5.4 Signal Path Aware Slack-time relaxation
5.5 Merging algorithm
5.6 Experiment Results
5.7 Summary
CHAPTER 6 Low-power Clustering with Minimum Logic Replication
6.1 Introduction
6.2 Design Flow and Problem Description
6.3 Low Power Clustering
6.3.1 Cluster generation and power-delay curves
6.3.2 Correct accounting of logic replication
6.4 Cluster selection
6.5 Implementation and Experimental Results
6.6 Summary
CHAPTER 7 Conclusion and Future Work
7.1 Dissertation Summary
7.2 Future Work
Bibliography
LIST OF TABLES
Table 3.1: Cell distribution after cell personalization from base gates
Table 3.2: Cell distribution after identifying common primitive cells among base gates
Table 3.3: Filtered primitive cells
Table 4.1: The type distribution table for primitive cell to base-gate mapping
Table 4.2: Results of lower-bound calculation
Table 4.3: Results of different clustering objectives with the minimum area solution
Table 5.1: Results of timing-driven clustering
Table 5.2: Results of slack-time relaxation
Table 6.1: Low-power clustering results: Area and delay
Table 6.2: Low-power clustering results: Power and CPU time
LIST OF FIGURES
Figure 1.1: Virtex II CLB Element.
Figure 1.2: Coarse-grained, antifuse-based FPGA: (a) pASIC3 logic cell, (b) FPGA architecture, and (c) antifuse switch.
Figure 2.1: Coarse-grained, SRAM-based FPGA [1]
Figure 2.2: Coarse-grained, antifuse-based FPGA
Figure 2.3: FPGA CAD flow.
Figure 2.4: Input reduction by adding a BLE.
Figure 2.5: BLE criticality assignment.
Figure 2.6: Clustering: (a) before packing node B into cluster C and (b) after packing node B into cluster C.
Figure 3.1: Proposed CAD tool flow for pASIC3 family FPGA
Figure 3.2: Functions in Packer-pASIC3
Figure 3.3: pASIC3 base gates derived from the configurable logic cell
Figure 3.4: Venn’s diagram for the set of logic cells that can be personalized from the base gates
Figure 4.1: Interconnect switch architecture for two different FPGAs
Figure 4.2: One dimensional coin change problem.
Figure 4.3: Examples of local neighborhood connectivity factor computation
Figure 4.4: Clustering nodes.
Figure 4.5: Packing un-clustered nodes by using linear assignment: (a) partially clustered network; (b) bipartite graph for linear assignment
Figure 4.6: Selecting the best node for clustering: (a) greedy selection and (b) intelligent selection.
Figure 4.7: Selecting the best node for delay improvement
Figure 5.1: Multi-dimensional labeling algorithm
Figure 5.2: Clustering example
Figure 5.3: Slack-time relaxation with awareness of signal path
Figure 6.1: An example of redundant logic replication in clustering: (a) clusters and the corresponding area-delay points, (b) non-inferior clusters, (c) circuit after logic replication (i.e., n1, n2, and n3 are duplicated), and (d) a desired clustering solution.
Figure 6.2: PD curve generation for a node with a cluster
Figure 6.3: Example of logic replication prediction.
Figure 6.4: Prediction of logic replication.
Figure 6.5: Logic replication cases: (a) child node is replicated, and (b) root node is replicated.
Figure 6.6: Logic replication for cluster selection.
ABSTRACT
Coarse-grained, antifuse-based FPGAs have emerged as a compelling technology
for narrowing the performance gaps between FPGAs and ASICs in area, speed, and
power dissipation. Because these FPGA architectures favor large programmable
logic blocks, efficient clustering algorithms are vital to realizing the benefits
of these advanced architectures.
Circuit clustering is an important technique for coarse-grained FPGAs. First,
clustering can reduce the complexity of large circuit designs by a significant factor.
Second, clustering can improve the quality of the results of other operations such as
placement and routing.
In this dissertation, clustering techniques for area, delay, and power dissipation are
proposed. First, an area-driven clustering algorithm is presented to minimize the
number of macro logic cells required to cover a network. This algorithm calculates
the minimum number of logic cells by solving a multi-dimensional coin-change
problem or a linear programming formulation. Subsequently, given the minimum
number of macro logic cells, the actual clustering, which packs nodes into
clusters, is performed to improve routability and delay. Next, a timing-driven
clustering algorithm is presented to minimize the number of macro logic cells on
the longest input-output path. The algorithm optimally labels nodes for the smallest
delay and then minimizes redundant logic replication by using slack-time
relaxation during the clustering phase. Finally, a low-power clustering algorithm is
presented to minimize power dissipation with minimum logic replication. The
algorithm accurately emulates logic replication to estimate the cost that
replication incurs in meeting timing constraints. Based on this information, the
proposed algorithm substantially reduces the size of the replicated logic, resulting in
benefits in area, delay, and power dissipation.
CHAPTER 1
INTRODUCTION
1.1 MOTIVATION
Getting a product to market quickly is a pivotal success factor in today’s ever-
changing electronics market. Field programmable gate arrays (FPGAs) can create
unique advantages over application specific integrated circuits (ASICs) because of
their quick and cost-effective validation of products.
There are two basic classes of FPGA devices on the market today: SRAM-based
FPGAs and antifuse-based FPGAs. SRAM-based FPGAs utilize look-up tables
(LUTs). A LUT is a small one-bit-wide memory array, where the address lines of
the memory are the inputs of the logic cell, and the single-bit output of the memory is
the LUT output. Generally, a programmable logic cell contains two or more LUTs
connected in some manner. A K-input LUT can realize any logic function of K inputs by
writing the logic function’s truth table directly into the memory. Loading
configuration data into the internal memory cells customizes the FPGA. In
contrast, antifuse-based FPGAs are based on a structure similar to traditional gate
arrays. An antifuse resides in a high-impedance state and can be programmed into
a low-impedance or "fused" state. Antifuse technology is less expensive than
SRAM technology, but an antifuse-based device can be programmed only once. Here, a
programmable logic cell consists of simple gates and multiplexers. The logic cell
is programmed by tying its input signals to constant binary values or shorting
them together. Consequently, such a logic cell can also realize a wide range of
Boolean functions. Antifuse-based FPGAs are, therefore, smaller than
SRAM-based FPGAs with the same equivalent gate
capacity. SRAM-based interconnect contains transistor switches, while
antifuse-based interconnect can be regarded as the standard metal interconnect found in
ASIC chips.
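The LUT description above can be illustrated with a minimal Python sketch. It models a K-input LUT as a 2^K x 1-bit memory whose address lines are the logic inputs. The class name `LUT`, its methods, and the example function are hypothetical and serve only as an illustration; they are not taken from any vendor tool or from this dissertation.

```python
from itertools import product


class LUT:
    """Illustrative model of a K-input LUT as a 2**K x 1-bit memory."""

    def __init__(self, k, func):
        self.k = k
        # "Programming" the LUT: store the truth table of func,
        # one bit per input combination, in address order.
        self.memory = [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]

    def read(self, *inputs):
        # The logic inputs act as address lines (first input = most significant bit).
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.memory[address]


# Hypothetical example: a 4-input LUT programmed as (a AND b) XOR (c OR d).
lut = LUT(4, lambda a, b, c, d: (a & b) ^ (c | d))
print(len(lut.memory), lut.read(1, 1, 0, 0))
```

A four-input LUT therefore stores 16 bits, which matches the view of such a LUT as a 16x1 ROM.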
There are three primary classes of FPGA architectures: coarse-grained,
medium-grained, and fine-grained. Coarse-grained architectures contain very large
programmable logic blocks with 20-30 inputs and 4-6 outputs. One can think of
such large blocks as standard programmable logic devices (PLDs). This
architecture provides a cross-networked interconnect and therefore
supports flexible interconnection between such large blocks. Medium-grained
architectures consist of large logic blocks, often containing two or more look-up
tables and two or more flip-flops. In a majority of these architectures, a four-input
look-up table (think of it as a 16x1 ROM) implements the actual logic. A larger
logic block usually corresponds to improved speed. The third architecture type is
fine-grained. These devices contain a large number of relatively simple
logic blocks. Fine-grained architectures usually allow only local connections,
which greatly reduces interconnect flexibility. These devices are good at systolic
functions and have some benefits for designs created by logic synthesis.
Figure 1.1: Virtex II CLB Element. [Figure labels: switch matrix; slices X0Y0, X0Y1, X1Y0, and X1Y1; fast connects to neighbors; slice configuration with two LUTs.]
As interconnect has become a dominant component in deep-submicron design
technology, coarse-grained FPGA architectures have become ubiquitous in the FPGA
industry because of their benefits in area, delay, and power. Figure 1.1 shows the
organization of the Xilinx Virtex II family [69]. It has configurable logic blocks
(CLBs), which are organized in an array and may be used to build combinational
and sequential logic designs. Each CLB element comprises four similar slices.
Each slice includes two function generators, and each function generator is a
four-input LUT. Therefore, each CLB has eight LUTs. The QuickLogic pASIC3 family,
shown in Figure 1.2, is based on antifuse technology [52]. Devices in the
family are based on an array of highly flexible logic cells, which have been
optimized to implement a wide range of logic functions efficiently at high speed.
Each cell can implement one large function, four independent smaller functions, or
any combination in between. The logic cell has a fan-in of 29 and five outputs,
including the output of a flip-flop.
The structure and granularity of the logic block have a significant impact on the
area efficiency of the FPGA [1]. If the logic block is fine-grained, the circuit to be
implemented will be distributed over a larger group of logic blocks. This has a
negative impact on routability, since more blocks need to be interconnected. If
several LUTs are clustered into one logic block, signal sharing among LUTs can be
exploited. Since the interconnect inside a logic block is hardwired, local
interconnect can be made very fast and efficient. This improves
routability and significantly decreases the load on the router. On the other hand, it
is not feasible to increase the complexity of the logic blocks beyond a certain limit.
If the logic blocks become too complex, it becomes difficult to fully utilize them,
which may waste logic and hence chip area. This may lead to diminishing
returns, such as larger area and lower speed.
Figure 1.2: Coarse-grained, antifuse-based FPGA: (a) pASIC3 logic cell,
(b) FPGA architecture, and (c) antifuse switch. [Figure labels: mux1-mux4, AZ, NZ, FZ, OZ, QZ, Clock, QS, QR; pASIC3 logic cell, interconnect switches; metal, SiO2, via, antifuse.]
In a typical FPGA CAD tool flow, clustering, which follows the technology
mapping step, is an important optimization phase because it maps the target circuit
netlist onto the FPGA array. Clustering refers to the task of grouping
logic gates in the circuit netlist and assigning each group to a configurable logic
block in the FPGA array. Because poor clustering can significantly degrade
the final design in terms of area, delay, and power, clustering must be done carefully
before placement and routing.
This dissertation focuses on clustering techniques for coarse-grained,
antifuse-based FPGAs. The clustering problem for coarse-grained,
antifuse-based FPGAs is quite different from the typical clustering problems
known for SRAM-based FPGAs. The coarse-grained, antifuse-based FPGA
architecture demands highly intelligent CAD algorithms because it
provides tremendous flexibility with very little hardware overhead. The hardware
overhead of a large logic cell with many inputs and multiple outputs is
very small because an antifuse connecting two metal wires is smaller than a via
[52]. On the other hand, the number of functions that the logic cell can
implement increases exponentially. To utilize the logic cell efficiently,
CAD tools must be equipped with highly efficient algorithms.
1.2 DISSERTATION OUTLINE
There are four parts to this research: library cell generation, area-driven clustering,
timing-driven clustering, and low-power clustering.
In CHAPTER 2, an overview of coarse-grained FPGAs is provided and a brief
review of the FPGA CAD tool flow is presented. In CHAPTER 3, we present
the procedure for generating library cells from the target pASIC3 family FPGA
architecture. The library cells are used during technology mapping, before
clustering, to cover a logically optimized network. For area-driven clustering, in
CHAPTER 4, we present two approaches to calculating the minimum number of
pASIC3 logic cells needed to cover a network mapped with library cells. First, we use
multi-dimensional coin-change dynamic programming as a general solution for
general logic cell architectures. Second, by fully exploiting the architectural
characteristics of the pASIC3 logic cell, we set up two linear equations and find the
optimal packing solution. Under the constraint of the minimum-area solution, we
present an interconnect-aware clustering algorithm and a timing-driven clustering
algorithm. In CHAPTER 5, we present a timing-driven clustering algorithm that
minimizes the number of pASIC3 logic cells on the longest input-output path.
Logic replication is minimized by slack-time relaxation. A low-power clustering
algorithm is presented in CHAPTER 6. This algorithm
minimizes logic replication while meeting the timing constraint under a low-power
optimization objective. It accurately simulates the logic replication caused
by timing constraints during a post-order traversal. This technique substantially
reduces the size of the duplicated logic, resulting in benefits in area, delay, and power
dissipation. We provide a summary of this dissertation and possible extensions in
CHAPTER 7.
CHAPTER 2
COARSE-GRAINED FPGAS AND PREVIOUS WORK
This chapter gives an overview of typical coarse-grained FPGA architectures
and reviews previous work on FPGA CAD tools.
2.1 OVERVIEW OF COARSE-GRAINED FPGA
ARCHITECTURE
FPGAs usually consist of small, configurable basic elements connected by rich
programmable interconnects [4]. Since routing resources grow faster than on-chip
logic resources, routing resources account for the major portion of the device’s
overall area and delay. The speed and area efficiency of an FPGA are directly related to
the granularity of its logic block [1][47][48]. Although coarse-grained blocks have
long internal logic delays, they reduce placement and routing stress by
providing fast local routing and significantly reducing external routing.
Typically, synthesis tools prefer “gate array-like” fine-grained architectures;
however, fine-grained FPGA architectures generally yield very poor delay because of
the long delays that result from building functions with multiple levels of gates and
slow interconnect elements. A coarse-grained architecture gives tools the needed
degrees of freedom to achieve the high logic utilization of a fine-grained
architecture without sacrificing the high performance of a coarse-grained,
high fan-in architecture.
Figure 2.1 shows a typical SRAM-based FPGA. A basic logic element (BLE)
consists of a look-up table and a flip-flop, and several BLEs make up a
programmable logic block (PLB), also called a cluster. Dedicated routing is provided
inside each cluster for communication between the local BLEs. For clusters of size
greater than one, the cluster is fully connected: each BLE input can be connected to any
of the cluster inputs or to the output of any BLE within the cluster.
Examples of coarse-grained FPGAs are the Xilinx Virtex [69] and the Apex and
Flex from Altera [5]. In these architectures, groups of basic logic elements are
clustered to provide better performance.
Figure 2.2 illustrates the logic cell architecture of pASIC3 from QuickLogic [52].
The pASIC3 family FPGA is based on antifuse-based programming technology.
The pASIC3 device architecture consists of an array of user-configurable logic
building blocks, called logic cells, set beneath a grid of metal wiring channels
similar to those of a gate array. Through antifuses, located at the wire intersections,
the output(s) of any cell may be programmed to connect to the input(s) of any other
cell.
Figure 2.1: Coarse-grained, SRAM-based FPGA [1]
The pASIC3 logic cell, shown in Figure 2.2(a), is a general-purpose building block
that can implement most gate array macro library functions. Since the logic cell has
multiple outputs, it can implement one large function or multiple smaller
independent functions in parallel. The function of a logic cell is determined by the
logic levels applied to the inputs of the AND gates and multiplexers. The high logic
capacity and fan-in of the logic cell accommodate many user functions with a
single level of logic delay. Figure 2.2(b) shows some of the possible configurations
of the logic cell. Since all connections within the cell are hardwired, the various
functions are available in parallel. Thus, very wide, complex functions are
implemented at the same cell speed as the much smaller “fragment” functions.
Related and unrelated functions can be packed into the same logic cell, increasing
effective density and gate utilization.
Figure 2.2: Coarse-grained, antifuse-based FPGA: (a) logic cell and (b) some of the possible configurations of the logic cell.
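The idea that a cell's function is set by the logic levels applied to its inputs can be sketched with a single 2-to-1 multiplexer. This is a deliberately simplified stand-in, not a model of the actual pASIC3 logic cell; the helper names `mux2` and `personalize` are hypothetical.

```python
def mux2(a, b, s):
    """2-to-1 multiplexer: returns a when s == 0 and b when s == 1."""
    return b if s else a


def personalize(a, b, s):
    """Tie each mux pin either to a constant (0/1) or to a named free input (str)."""
    def cell(**signals):
        def pick(pin):
            # A string names a free input; anything else is a hardwired constant.
            return signals[pin] if isinstance(pin, str) else pin
        return mux2(pick(a), pick(b), pick(s))
    return cell


# Three different functions from the same multiplexer (illustrative only).
and_gate = personalize(a=0, b="y", s="x")   # x AND y
or_gate = personalize(a="y", b=1, s="x")    # x OR y
inverter = personalize(a=1, b=0, s="x")     # NOT x

print(and_gate(x=1, y=1), or_gate(x=0, y=1), inverter(x=0))
```

The same hardware yields AND, OR, or NOT depending only on which pins are tied to constants, which is the mechanism the text describes at a much larger scale.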
Figure 2.3: FPGA CAD flow. [Figure labels: circuits; logic optimization; technology mapping; clustering basic logic elements into programmable logic blocks; placement; routing; results.]
2.2 FPGA CAD FLOW
Figure 2.3 illustrates the CAD tool flow that is typically encountered in FPGA design.
The logic can be optimized by any RT-level optimization tool, e.g., SIS [59].
During the technology-mapping phase, the logic-optimized circuit is mapped to
basic logic elements. During clustering, closely connected basic logic elements are
packed together into programmable logic blocks. Finally, placement and routing are
performed on those programmable logic blocks. In the following sections, we
briefly review technology mapping techniques, clustering techniques, and placement
and routing algorithms.
2.2.1 TECHNOLOGY MAPPING
In a standard-cell design flow for application-specific integrated circuits
(ASICs), technology mapping maps the optimized circuit onto a target library.
FPGAs, however, have clusters of basic logic elements, and those basic logic
elements are ready to be programmed to implement certain functions. Therefore,
technology mapping locates a feasible portion of the circuit and implements
the function of that portion in a basic logic element. Various mapping techniques
have been developed for the two types of FPGAs. An extensive survey of
existing SRAM-based FPGA mapping techniques is given by Cong and Ding [19].
They also developed FlowMap, which is guaranteed to produce depth-optimal mapping
solutions [22]. CutMap [23] improves on FlowMap by considering
area minimization during delay optimization. For antifuse-based FPGAs, Boolean
matching techniques have been used for technology mapping, and research results
on technology mapping for antifuse logic cells have been reported [30]. Boolean
matching is therefore a key enabler for antifuse-based FPGA mapping. Lai et al.
[42] proposed a Boolean matching algorithm and introduced matching filters for
speedup. A more comprehensive review is provided in [6].
2.2.2 CLUSTERING TECHNIQUES
Once technology mapping is complete, the mapped netlist is provided
to a clustering algorithm, which packs multiple basic logic
elements into a logic cluster.
Many clustering techniques for SRAM-based FPGAs are
constructive. In the following sections, the major
achievements in clustering techniques are reviewed in chronological order.
2.2.2.1 Rapid System Prototyping (RASP)
The RASP system is a general synthesis and mapping system for SRAM-based
FPGAs [20]. It consists of a core with a set of synthesis and optimization algorithms
for technology-independent logic synthesis and technology mapping for generic
look-up table (LUT) network generation. It also has a set of architecture-specific
technology mapping routines that map the generic LUT network to programmable
logic blocks in various SRAM-based FPGA architectures.
The clustering algorithm is based on a sequence of maximum weighted matching
operations on a compatibility graph, which yields the proper grouping of LUTs into
programmable logic blocks (PLBs). At each step, a compatibility graph is formed
in which vertices represent the partial PLBs (initially LUTs) that will be considered
for grouping at this step. An edge is formed between two vertices if the two
corresponding partial PLBs can be grouped into one. Then, weights are
assigned to the edges to guide the matching algorithm toward the best merging of
partial PLBs. Different weights are assigned for different optimization objectives.
For delay optimization, a larger weight is given to an edge corresponding to the
grouping of two LUTs that may reduce the length of a critical path in the PLB
network. For routability, a larger weight is given to an edge that
corresponds to the grouping of two “close” LUTs, so that the grouping does not create
complex interconnection patterns in the final mapping solution. The “closeness” of
two LUTs can be measured by the overlap of their fanin subnetworks. The weight
for edge (v, w) is
$$W_p(v, w) = \frac{|N_v \cap N_w|}{|N_v| + |N_w|} \qquad (2.1)$$
where $N_v$ and $N_w$ are the sets of edges on vertices v and w, respectively. The
compatibility graph becomes a bipartite graph whose edges have weights. Finally,
the bipartite matching problem is solved to find the best packing solution.
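As a small illustration of the routability weight in Eq. (2.1), the sketch below computes the closeness of two vertices from their edge sets. It reads the equation as a ratio of set sizes (shared edges over the sum of the two set sizes), which is an assumption about the intended notation; the function name `closeness_weight` and the example edge names are hypothetical.

```python
def closeness_weight(n_v, n_w):
    """Edge weight for (v, w): shared edges divided by the sum of the set sizes.

    Assumes Eq. (2.1) is meant in terms of set cardinalities.
    """
    n_v, n_w = set(n_v), set(n_w)
    total = len(n_v) + len(n_w)
    # Two vertices with no edges at all share nothing; define their weight as 0.
    return len(n_v & n_w) / total if total else 0.0


# Hypothetical edge sets for three LUT vertices.
edges_v = {"e1", "e2", "e3"}
edges_w = {"e2", "e3", "e4"}
edges_u = {"e5"}

print(closeness_weight(edges_v, edges_w))  # v and w share two edges
print(closeness_weight(edges_v, edges_u))  # v and u share none
```

Under this reading, pairs of LUTs with more common edges receive larger weights and are therefore favored by the maximum weighted matching step.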
2.2.2.2 VPACK
VPACK [7] is a clustering algorithm that aims to minimize both the number
of logic clusters and the number of used inputs to each cluster. Minimizing the used
inputs of each cluster is important for a routable design. The algorithm constructs
each cluster sequentially. First, a seed BLE is chosen: the one with the most used
inputs among the currently un-clustered BLEs, since inputs are a scarce resource. Then,
VPACK greedily selects the BLE that shares the most inputs and outputs with the
cluster being constructed. The attraction between a BLE B and the current cluster
C is the number of nets that they share:
$$\mathrm{Attraction}(B, C) = |\mathrm{Nets}(B) \cap \mathrm{Nets}(C)| \qquad (2.2)$$
This procedure of greedily selecting a BLE to include in the cluster continues until
either the cluster is full or adding any further un-clustered BLE would
cause the number of distinct inputs needed by the cluster to exceed the number of
inputs allowed. If the cluster is full, a new seed BLE is selected and a new cluster is
generated. If, however, the cluster is not full but no BLE
can be added without exceeding the number of allowed inputs, the hill-climbing
phase is invoked to give further BLEs a chance to be packed into the cluster. If a BLE has all of its
inputs in the cluster and its output is connected to a BLE in the cluster, adding the
BLE to the cluster will reduce the number of inputs by one. Figure 2.4 illustrates an
example of reducing inputs by adding a BLE. The hill-climbing phase terminates
when the cluster is full; if the cluster is still infeasible, VPACK backs up to the last
point at which the cluster was feasible.
Figure 2.4: Input reduction by adding a BLE.
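The seed choice and the attraction of equation (2.2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the netlist `bles`, the net names, and the helper names are hypothetical, and the input-limit check and the hill-climbing phase are left out.

```python
# Hypothetical netlist: each BLE lists the nets on its inputs and its output net.
bles = {
    "b1": {"inputs": {"n1", "n2", "n3", "n4"}, "output": "n5"},
    "b2": {"inputs": {"n3", "n4", "n5"}, "output": "n6"},
    "b3": {"inputs": {"n7", "n8"}, "output": "n9"},
}

def nets(ble):
    """All nets touched by a BLE (its inputs plus its output)."""
    return ble["inputs"] | {ble["output"]}

def attraction(ble, cluster_nets):
    """Eq. (2.2): number of nets shared by the BLE and the current cluster."""
    return len(nets(ble) & cluster_nets)

def pick_seed(unclustered):
    """Seed = un-clustered BLE with the most used inputs (ties broken by name)."""
    return max(sorted(unclustered), key=lambda name: len(unclustered[name]["inputs"]))

def pick_next(unclustered, cluster_nets):
    """Greedy step: un-clustered BLE with the highest attraction to the cluster."""
    return max(sorted(unclustered),
               key=lambda name: attraction(unclustered[name], cluster_nets))

unclustered = dict(bles)
seed = pick_seed(unclustered)
cluster_nets = nets(unclustered.pop(seed))
best = pick_next(unclustered, cluster_nets)
best_attraction = attraction(unclustered[best], cluster_nets)
```

In this toy netlist the seed is the four-input BLE, and the next BLE picked is the one that shares three nets with it.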
2.2.2.3 T-VPACK
T-VPACK [49] is based on the VPACK algorithm [7]. Its optimization goal is to
minimize the number of external connections (connections between clusters) on
the critical path. Since the routing delay between clusters is much larger than that of the
local routing inside a cluster, minimizing the number of external connections on critical paths
can improve delay significantly.
The algorithm consists of two steps: static timing analysis and clustering. In the
static timing analysis step, slack is estimated. Slack [33] is defined as the amount
of delay that can be added to a connection without increasing the delay of the
entire circuit. The slack at BLE input pin i is defined as:
$$\mathrm{slack}(i) = T_{required}(i) - T_{arrival}(i) \qquad (2.3)$$
where $T_{required}(i)$ and $T_{arrival}(i)$ are the required time and arrival time at input pin i.
The criticality of the connection at input i is defined as:
$$\mathrm{Connection\_Criticality}(i) = 1 - \frac{\mathrm{slack}(i)}{MaxSlack} \qquad (2.4)$$
where MaxSlack is the largest slack among all connections.
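As a small worked example of equations (2.3) and (2.4), the following Python sketch computes slack and connection criticality for three pins. The pin names and timing values are made up for illustration, and the sketch assumes MaxSlack is non-zero.

```python
def slack(t_required, t_arrival):
    """Eq. (2.3): delay that can be added without slowing the circuit."""
    return t_required - t_arrival

def connection_criticality(slack_i, max_slack):
    """Eq. (2.4): 1 for a zero-slack connection, 0 for the largest slack.

    Assumes max_slack > 0.
    """
    return 1.0 - slack_i / max_slack

# Hypothetical (required, arrival) times in ns for three BLE input pins.
pins = {"a": (10.0, 10.0), "b": (10.0, 6.0), "c": (10.0, 2.0)}

slacks = {pin: slack(req, arr) for pin, (req, arr) in pins.items()}
max_slack = max(slacks.values())
criticality = {pin: connection_criticality(s, max_slack) for pin, s in slacks.items()}
```

Pin a has no slack and is therefore fully critical, while pin c carries the largest slack and has criticality zero.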
During the clustering phase, a seed BLE is selected and further BLEs are attracted to it.
The seed is the un-clustered BLE that has the most critical connection in the circuit.
The attraction function is extended to include timing information. The criticality of a
BLE is defined as the maximum Connection_Criticality value of all connections
that connect the BLE to BLEs within the cluster currently being packed. An
example is shown in Figure 2.5. The attraction function is defined as follows:
$$\mathrm{Attraction}(B, C) = \alpha \cdot \mathrm{Criticality}(B) + (1 - \alpha) \cdot \frac{|\mathrm{Nets}(B) \cap \mathrm{Nets}(C)|}{G} \qquad (2.5)$$
where G is a normalization factor that is set to the maximum number of nets to
which any BLE can connect. The time complexity of this algorithm is $O(n^2)$,
where n is the number of BLEs in the circuit.
Figure 2.5: BLE criticality assignment.
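The attraction function of equation (2.5) can be illustrated with a short Python sketch. The values of α and G and the two candidate BLEs below are hypothetical; α is treated here as a trade-off weight between 0 and 1, which is an assumption that the text above does not state explicitly.

```python
def tvpack_attraction(criticality, ble_nets, cluster_nets, alpha, g):
    """Eq. (2.5): blend of BLE criticality and normalized net sharing."""
    shared = len(ble_nets & cluster_nets)
    return alpha * criticality + (1.0 - alpha) * shared / g

# Hypothetical cluster and candidates; alpha and G are illustrative values only.
cluster_nets = {"n1", "n2", "n3", "n4"}
alpha, g = 0.75, 4

# Candidate X lies on a fairly critical path but shares only two nets.
attr_x = tvpack_attraction(0.5, {"n1", "n2", "n9"}, cluster_nets, alpha, g)
# Candidate Y shares every net with the cluster but is not timing critical.
attr_y = tvpack_attraction(0.0, {"n1", "n2", "n3", "n4"}, cluster_nets, alpha, g)
```

With this timing-heavy weighting, candidate X is preferred even though candidate Y shares more nets with the cluster.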
2.2.2.4 Routability-driven Packing (RPACK)
RPACK [10] is a routability-driven packing algorithm, which first identifies
routability factors, then prioritizes these factors into an improved clustering cost
function. Three factors have been introduced to achieve routable clustering:
• Ratio of pins per net
• Ratio of used pins to the total number of pins of the logic block
• Number of nets
The ratio of pins per net indicates the density of high-fanout nets in the circuit. The
ratio of used pins to the total number of pins of the logic block indicates the traffic
in and out around a logic block. The number of nets is also closely related to
routability.
In the first stage, a LUT and a register are packed into a basic logic block when
possible. After that, the logic blocks are packed into clusters using a heuristic
approach. In this approach, clusters are constructed sequentially. A seed is chosen
to generate a new cluster. The best choice for the seed of each cluster is the
un-clustered logic block with the most used inputs, as indicated by VPACK and
T-VPACK. After choosing a seed, the basic logic element that gives the highest
gain is selected to be added to the current cluster, provided that the number of external inputs
does not exceed the number of input pins of the cluster.
Figure 2.6: Clustering: (a) before packing node B into cluster C and (b)
after packing node B into cluster C.
In equation (2.2), the gain is the number of inputs and outputs that a BLE B
and the current cluster C have in common. However, the contribution of
each net between two blocks can differ significantly in terms of routability.
By considering just the number of shared inputs and outputs, the packing algorithm
cannot differentiate among candidate blocks that have different impacts on
routability. For example, the three nets N1, N2, and N3 in Figure 2.6 contribute
differently. By moving BLE B into the cluster, one terminal can be removed from
N1, one terminal and one input pin can be saved in N2, and one terminal and one
unit of output congestion can be removed from N3. The authors derived a table of
gains of a candidate block with respect to a single net, which stores the gains for the
different connections of a net. The table is used to compute the total gain of
packing a logic element into a cluster.
The results showed that the major portion of the decrease in the number of nets
is due to the decrease in the number of two-terminal nets. The complexity of this
algorithm is $O(M^2)$, where M is the number of basic logic elements (BLEs).
2.2.2.5 Interconnect Resource Aware Clustering (iRAC)
iRAC [61] reduces the external routing requirement in clustered FPGAs by packing
closely connected components together while maintaining spatial uniformity
in the clustered design according to Rent’s rule [27]. It alleviates routing congestion for
clustered FPGAs by absorbing as many small nets into clusters as possible and by
depopulating clusters according to Rent’s rule, in order to achieve spatial
uniformity in the clustered netlist.
The complexity of a cluster can be characterized with the well-known
relationship of Rent’s rule [27]:
$$N_{io} = K B^{p} \qquad (2.6)$$
where $N_{io}$ is the number of pins in a cluster, B is the number of basic logic elements
(BLEs), K is the average number of connections per BLE in the cluster, and p (0 <
p < 1) is the Rent’s parameter (exponent). Given a specific FPGA architecture, a
Rent’s parameter, Pa, can be estimated from equation (2.6), because $N_{io}$, K, and
B can be obtained from the architecture. The connectivity factor is defined by the
ratio:
$$c = \frac{separation}{degree^{2}} \qquad (2.7)$$
where degree is the number of nets incident to a BLE and
separation is the sum of all terminals of the nets incident to the BLE. The smaller the c
value of a BLE, the more BLEs are located in its neighborhood. In
the second step, a BLE with the highest degree and lowest connectivity factor is
chosen as a seed for a new cluster. Then gains are assigned to the neighbors connected to the BLE,
and the un-clustered BLE with the highest gain is absorbed into the
cluster. The gain function for BLE X with a net x to the cluster C is defined as
follows:
( ) (, , ) 2 () 1
x
GX Cx nw x α =×+ (2.8)
where $\alpha_x$ is the number of pins of net x already inside cluster C, n is the cluster
size, and w(x) is the weight of net x (w(x) = 2/r, where r is the number of pins on x).
In order to guarantee spatial uniformity of the clustered netlist, the algorithm limits
the number of available pins using Rent’s rule. By limiting it, the Rent’s parameter
of any cluster is no more than the Rent’s parameter of the FPGA architecture, Pa.
The area saved is 35% on average, compared to previously published results
[10].
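Two of the quantities used by iRAC lend themselves to a quick numerical illustration: the Rent's parameter obtained by solving equation (2.6) for p, and the connectivity factor of equation (2.7). The Python sketch below uses hypothetical architecture numbers and a hypothetical BLE, and the function names are ours rather than iRAC's.

```python
import math

def rent_parameter(n_io, k, b):
    """Solve N_io = K * B**p (eq. 2.6) for p; requires b > 1."""
    return math.log(n_io / k) / math.log(b)

def connectivity_factor(net_terminal_counts):
    """Eq. (2.7): c = separation / degree**2 for one BLE.

    net_terminal_counts holds the number of terminals of each net
    incident to the BLE.
    """
    degree = len(net_terminal_counts)
    separation = sum(net_terminal_counts)
    return separation / degree ** 2

# Hypothetical cluster architecture: 8 pins, 4 BLEs, 4 connections per BLE.
p_a = rent_parameter(n_io=8, k=4, b=4)

# Hypothetical BLE touching three nets with 2, 2 and 4 terminals.
c = connectivity_factor([2, 2, 4])
```

For these numbers the estimated Rent's parameter is 0.5 and the connectivity factor is 8/9.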
2.2.3 PLACEMENT AND ROUTING
Placement is the process by which a netlist of logic blocks is mapped into physical
locations in an FPGA. The locations where the blocks are mapped can significantly
affect the performance of the FPGA.
Simulated annealing (SA) placement has been commonly used, since the number of
logic blocks has so far been small enough to be handled within a reasonable amount of time. The
simulated annealing algorithm mimics the annealing process used to gradually cool
molten metal to produce high-quality metal structures [57][40]. A simulated
annealing placer initially places logic blocks randomly into physical locations in an
FPGA. Then the placement is iteratively improved by randomly swapping blocks
and evaluating the quality of each swap with a cost function. If a move would
reduce the placement cost, it is accepted. If a move would
increase the placement cost, it may still be accepted even
though it makes the placement worse. The purpose of accepting some bad moves is
to prevent the placer from being trapped in a local minimum.
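The acceptance rule can be sketched as follows. The text above does not give the acceptance probability, so this sketch assumes the standard Metropolis criterion exp(-ΔC/T); the function name and the temperature values are illustrative only.

```python
import math
import random

def accept_move(delta_cost, temperature, rng):
    """Decide whether a proposed swap is kept.

    Improving moves are always accepted. Worsening moves are accepted with
    probability exp(-delta_cost / temperature), which is the usual Metropolis
    rule and an assumption of this sketch.
    """
    if delta_cost < 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)

rng = random.Random(0)

improving = accept_move(-1.0, 1.0, rng)        # cost goes down: always kept
cold_uphill = accept_move(5.0, 1e-9, rng)      # nearly frozen: rejected
hot_uphill = sum(accept_move(1.0, 1e6, rng) for _ in range(100))
```

At a high temperature almost every worsening move is kept, while near zero temperature the placer behaves greedily.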
The VPR placer [8] employs a simulated annealing algorithm for FPGA placement to
minimize the wirelength of circuits. The placer uses a bounding-box based “linear
congestion” [4][8] cost function to estimate wirelength requirements. The linear
congestion cost function is expressed as follows:
$$Cost = \sum_{i=1}^{N_{nets}} q(i) \cdot \left[ bb_x(i) + bb_y(i) \right] \qquad (2.9)$$
where $N_{nets}$ is the number of nets, and $bb_x(i)$ and $bb_y(i)$ are the horizontal span and vertical
span of net i, respectively.
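Equation (2.9) can be evaluated directly from pin coordinates, as in the Python sketch below. Two assumptions are made here: the span of a net is taken as the difference between its largest and smallest coordinate, and q(i) defaults to 1 for every net because the text above does not define it. The coordinates are hypothetical.

```python
def bounding_box_cost(nets, q=lambda num_pins: 1.0):
    """Eq. (2.9): sum over nets of q(i) * (bb_x(i) + bb_y(i)).

    nets is a list of nets, each a list of (x, y) pin locations.
    q maps a net's pin count to its scaling factor (assumed 1.0 by default).
    """
    total = 0.0
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        bb_x = max(xs) - min(xs)   # horizontal span (assumed max - min)
        bb_y = max(ys) - min(ys)   # vertical span (assumed max - min)
        total += q(len(pins)) * (bb_x + bb_y)
    return total

# Hypothetical placement: a two-pin net and a three-pin net.
nets = [
    [(0, 0), (3, 4)],
    [(1, 1), (2, 1), (1, 5)],
]
cost = bounding_box_cost(nets)
```

The first net contributes 3 + 4 and the second 1 + 4, giving a total cost of 12.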
The problem of routing FPGAs can be stated simply as that of assigning signals to
routing resources, in order to successfully route all signals, while achieving a given
overall performance. The problem of routing FPGAs bears a considerable
resemblance to the problem of global routing for custom integrated circuit design.
However, the two problems differ in several fundamental respects. First, routing
resources in FPGAs are discrete and scarce, while they are reasonably continuous
in custom integrated circuits. For this reason FPGAs require an integrated approach
using both a global and a detailed router. A second difference is that the global
routing problem for custom ICs is rooted in an undirected graph, whereas in SRAM-based FPGAs the
switches are often directional. Both of these distinctions
are important, as they prevent direct application of much of the work that has been
done in custom IC routing to FPGAs [28].
PathFinder is a router developed based on an iterative maze-type router [50]. Nets
are routed sequentially, and once a track segment has been used for one net, other
nets are allowed to use that segment, but must pay a higher cost. Consequently, nets
tend to avoid overusing a segment unless it is necessary or particularly efficient. At
the end of the first iteration (after all nets have been routed), either there are no
segments overused and the routing is successful, or some segments are overused
and more routing iterations are executed to try and resolve the contention. In each
of these subsequent routing iterations, every net is ripped up and re-routed. Since
the cost of over-used track segments increases with every routing iteration, they
become more expensive and are less likely to be used by more than one net. This
gradual reduction in routing violations is a very successful routing approach.
The VPR router [63] is based on the PathFinder routing algorithm [28]. It has two
enhancements to increase the speed of the basic breadth-first search maze router.
The first is to employ a depth-first search, which directs the router to head towards
specific targets. The second is to reduce the amount of activity on the routing
expansion list for higher-fanout nets, by only placing segments on the expansion
list that are in the neighborhood of the target.
2.3 SUMMARY
In this chapter, an overview of coarse-grained FPGA architectures was provided,
covering mainly SRAM-based and antifuse-based FPGAs. The
CAD flow for FPGAs was presented, with a brief review of each step.
The flow consists of logic optimization, technology mapping, clustering, placement,
and routing.
CHAPTER 3
TOOL FLOW AND CELL LIBRARY GENERATION
In this chapter, the proposed tool flow and the procedure of generating library cells
from the pASIC3 logic cell are discussed. The cell library is used during the
technology mapping phase to cover a network.
3.1 INTRODUCTION
The cell library is the key to implementing an application-specific integrated
circuit (ASIC). Each library cell has a specified function and layout, together with
information such as pin delays from inputs to outputs, the area of the cell, and
expected power dissipation. In the computer-aided design flow for ASICs, library cells are not required until the
technology mapping phase, but they
represent the physical implementation of the functions optimized by logic synthesis.
Each cell comes in a few different sizes, which provide different characteristics with the
same functionality; thus area, drive strength, and power dissipation differ
among the differently sized cells. Similarly, the cell library for antifuse-based FPGAs is a
collection of cells, each of which has its own function, delay, and power
dissipation. However, such a library cannot contain cells of multiple sizes for the same function,
because all of the library cells are generated from a single logic cell such as the
pASIC3 cell in Figure 2.2. On the other hand, all of the library cells may have
nearly, though not exactly, the same delay. In addition, library cells derived from a logic cell
provide more complex functions than ASIC library cells.
The organization of this chapter is as follows: A brief overview of the proposed
tool flow is presented in section 3.2. In section 3.3, the procedure of constructing
the cell library is presented. The cost assignment for each library cell is discussed
in section 3.4. Finally, we conclude in section 3.5.
3.2 TOOL FLOW
Figure 3.1 shows our CAD tool flow for pASIC3 family FPGAs. We generate both
a cell library set and configuration information from the pASIC3 logic cell. A
target circuit is synthesized by SIS [59] and then the circuit is mapped with cells in
the cell library by a technology mapper in SIS [59]. Our clustering tool, called
Packer-pASIC3, packs nodes with four different functions as shown in Figure 3.2.
A cluster is assigned to a pASIC3 logic cell. VPR [8] places and routes the
clustered network with the architecture description of pASIC3 family FPGAs.
Figure 3.1: Proposed CAD tool flow for pASIC3 family FPGA.
Figure 3.2: Functions in Packer-pASIC3 (labels shown in the figure: Area, Timing, Interconnect, Timing, Low Power).
3.3 CELL LIBRARY GENERATION
After logic optimization, the technology mapper needs to map a circuit onto the
physical library cells. Every cell in the library carries the area of the cell and delay
information for each pin [59]. Based on this information, the mapper generates a
mapped circuit from those cells that meets design constraints such as area, delay, and
power dissipation. For this research, we used the pASIC3 logic cell in Figure 1.2. If
we consider only the combinational logic part, the logic cell has 26 inputs and 4
outputs. Obviously, the number of functions that can be generated from this
logic cell is enormous, and it is almost impossible to identify all of them
because the number of functions grows exponentially with the number of inputs. Furthermore,
the four outputs raise the question of how the logic cell should be utilized to achieve
the design goals.
Since the technology mapping technique relies on tree-mapping dynamic
programming, only single-output library cells can be used to obtain an optimal result for
a tree [39]. Benini and De Micheli [6] proposed a Boolean matching technique for
multiple-output functions, but its complexity is still very high since it must go
through multiple expensive BDD (binary decision diagram) operations with an
increased input space. In order to maximize the utilization of the logic cell, we have
split the logic cell into several programmable logic gate groups as shown in Figure
3.3. The gate groups can be derived by assigning values to the inputs of the
multiplexers (cf. mux1 through mux4 in Figure 1.2) in the logic cell. We thus have
four different programmable gate groups inside a pASIC3 logic cell, which are called
base gates. Notice that some of the base gates conflict with one another, in the sense that the cells
derived from them cannot be implemented in the same pASIC3 logic cell
simultaneously.
After deriving the base gates, cell generation from each base gate follows. This cell
personalization is done either by assigning 1 or 0 to some of the inputs or by
connecting some inputs together. We call the former operation “sticking” and the
latter “bridging”. Applying all possible combinations of the two operations to a
base gate yields the personalized cells; their distribution is
shown in Table 3.1.
Table 3.1: Cell distribution after cell personalization from base gates.
                  Base-gate A   Base-gate B   Base-gate C   Base-gate D   Total cells
Number of cells            14            27          4967           271          5279
Table 3.2: Cell distribution after identifying common primitive cells among
base gates.
                  S_AD   S_ACD   S_BCD    S_C    S_D   S_CD   S_ABCD
Number of cells      3       5      21   4893    194     42        6
Since SIS could not finish analyzing base-gate D in a reasonable amount of time, we
could not generate all possible primitive cells. The uncollected cells have
relatively many literals compared to the other cells; in other words, they would rarely be used
for mapping. Therefore, the uncollected cells are unlikely to affect the results
significantly. We call the personalized cells “primitive cells”. Some
primitive cells derived from different base gates may have the same function. Therefore, we
can draw a Venn diagram to depict the relation among the primitive cells from different
base gates, as shown in Figure 3.4. We define the type of a primitive cell according to
which base gates can realize its function. For example, an
inverter can be implemented by all base gates, so the type of the inverter
is ABCD. The set of primitive cells that can be realized by any of base-gates
A, B, C, and D is defined as $S_{ABCD}$. Table 3.2 shows the number of primitive
cells of each type.
Figure 3.3: pASIC3 base gates derived from the configurable logic cell.
More than five thousand primitive cells are too many for a mapper to finish its work
in a reasonable time, and storing them requires a large memory capacity. We therefore
added another step that filters out rarely used primitive cells. We selected
thirteen circuits from the MCNC91 benchmark suite and counted how many times each
primitive cell became a match for a node. The more often a primitive cell
matches, the more likely it is to be utilized by circuits. In total, 886
primitive cells were used to map all circuits in the benchmark; each of these primitive cells
has at least one match in the circuits. Depending on the required mapping
quality, the number of primitive cells can be chosen according to the
likelihood of utilization of each primitive cell.
Figure 3.4: Venn diagram for the set of logic cells that can be
personalized from the base gates (base gates A, B, C, and D; regions S_AD, S_ACD, S_BCD, S_C, S_D, S_CD, and S_ABCD).
Table 3.3: Filtered primitive cells
                  S_AD   S_ACD   S_BCD    S_C    S_D   S_CD   S_ABCD
Number of cells      3       5      20    714    110     28        6
There are several reasons for extracting only four base gates from the pASIC3 logic
cell. First, the primitive cells from those base gates cover the important, most commonly
used functions. Overly complicated functions, such as those that might be generated from two
base-gate A’s and one base-gate B, are rarely used in real circuits, while the
computational cost of pattern matching during technology mapping becomes
very high. Secondly, the number of primitive cells increases exponentially as
the input size of a base gate increases, and too many library cells hamper
the technology mapping procedure. Thirdly, having
more complicated base gates does not improve the results enough to justify the additional
computational time.
3.4 COST ASSIGNMENT
The area cost of a library cell must be assigned carefully. Three factors determine
the cost of a library cell: freedom (f), coverage (c), and space usage (s). The
freedom parameter captures the total number of places in the base gates that the
library cell can fit. The coverage parameter accounts for the complexity of the logic
that the library cell realizes. It is measured in terms of the number of literals in the
minimal factored form representation of the logic function. The space usage
parameter represents the amount of space inside a pASIC3 logic cell that is used up
by the library cell. The area cost of each library cell is calculated by the following
equation:
$$cost = \frac{s}{f \times c} \qquad (3.1)$$
A simple inverter does not have the lowest cost, because it consumes one out of four
slots yet operates only as an inverter. In other words, library cells with
higher freedom, larger coverage, and lower space usage are preferable.
The delay cost for each library cell is somewhat simple because all cells are derived
from a logic cell. For example, the delay from inputs to outputs of a logic cell
varies only by the number of fanouts. In 0.35μm technology, the delay of a pASIC3
logic cell was about 1.4ns [52].
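The area cost of equation (3.1) is easy to tabulate. In the Python sketch below, the freedom, coverage, and space usage values of the two cells are invented purely to show how the ratio behaves; they are not taken from the actual library.

```python
def area_cost(space_usage, freedom, coverage):
    """Eq. (3.1): cost = s / (f * c)."""
    return space_usage / (freedom * coverage)

# Hypothetical cells: (space usage s, freedom f, coverage c in literals).
cells = {
    "inverter": (1, 4, 1),     # fits in many places but covers a single literal
    "four_literal_cell": (1, 2, 4),  # fewer places, yet covers more logic per slot
}

costs = {name: area_cost(s, f, c) for name, (s, f, c) in cells.items()}
cheapest = min(costs, key=costs.get)
```

With these illustrative numbers the four-literal cell is cheaper than the inverter, since it covers more logic with the same space usage.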
3.5 SUMMARY
A tool flow was proposed and the procedure of constructing a library, which has
primitive cells personalized from different base gates, was presented. The pASIC3
logic cell could be split into multiple base gates by assigning values to
multiplexers. The library cells could be personalized from the base gates.
CHAPTER 4
AREA-DRIVEN CLUSTERING ALGORITHM WITH
CONSIDERATIONS OF INTERCONNECT CONNECTIVITY
AND CIRCUIT SPEED
This chapter presents area clustering techniques for coarse-grained, antifuse-based
FPGAs. For the minimum-area clustering, our algorithm minimizes the number of
required pASIC3 logic cells to cover a network. Our approach has two steps for the
goal: calculating the minimum number of required pASIC3 logic cells and physical
clustering. First, we calculate the minimum number of the logic cells. Secondly,
under the constraints of the minimum area solution, we present an interconnect-
aware clustering algorithm and a timing slack-driven clustering algorithm.
4.1 INTRODUCTION
For antifuse-based FPGAs, logic cells are an important resource, whereas switches
for interconnect are abundant. Compared with SRAM
switches, antifuse interconnect switches, as shown in Figure 4.1, are plentiful and provide rich
routing capability. The routing area is also an important factor in area estimation
for antifuse-based FPGAs. However, routing information is not available until the
placement and routing phase. Hence, minimizing the number of logic cells becomes
the best choice for antifuse-based FPGAs.
Figure 4.1: Interconnect switch architecture for two different FPGAs: (a) SRAM
interconnect switch and (b) antifuse interconnect switch (metal, via, antifuse, and SiO2 layers).
In this chapter, area-driven clustering techniques are presented that consider both
interconnect connectivity and circuit speed, the latter in terms of the number of pASIC3 logic
cells on critical paths. Even though we target a specific logic cell
architecture, our method can be applied to similar types of coarse-grained, antifuse-based
FPGAs with slight modification. We divided the logic cell into several base gates and
then mapped a network. After technology mapping, we found the minimum number
of macro logic cells needed to cover the network by setting up two linear equations and
finding the minimum value under certain ranges. After calculating
the lower bound for minimum-area clustering, we clustered the network
using the minimum number of logic cells while considering interconnect
and timing. The goal of the interconnect-aware clustering is to minimize the
number of inter-cluster interconnects while using the minimum number of logic cells.
In the timing slack-driven clustering, again with the minimum number of logic cells, we
minimize the number of logic cells on the critical path by selecting nodes on the
critical path.
This chapter is organized as follows: The lower-bound calculating algorithm for the
minimum number of logic cells is presented in section 4.2. The area-driven
clustering algorithms, with interconnect awareness and delay optimization, are
presented in section 4.3. In section 4.4, experimental results are provided. Finally,
we conclude in section 4.5.
4.2 LOWER-BOUND CALCULATION
In this section, an algorithm is provided, to find the minimum number of pASIC3
logic cells to cover a mapped network.
4.2.1 PROBLEM STATEMENT AND DYNAMIC
PROGRAMMING APPROACH
After a mapped netlist is generated with technology mapping, we must solve the
problem of clustering the primitive cells used in the mapped netlist to the pASIC3
logic cells. The routing cost of the connections in the mapped netlist tends to be
large for the FPGAs. Since the mapping is performed before placement and
routing, physical information is not available. In addition, antifuse-based FPGAs
have relatively rich routing resources since routing switches are abundant and many
layers of metal wires can cross over the pASIC3 logic cells [52]. Thus, we opted to
minimize the total area taken by the pASIC3 logic cells during the clustering and
pASIC3 assignment step.
Problem 1: Given a mapped netlist, we want to find the minimum number of
pASIC3 logic cells that can realize the network.
First, we present a dynamic programming solution for the general problem, which
does not yet exploit the architectural characteristics of the
pASIC3 logic cell to simplify it. In the following sections, a simplified problem formulation and
its solution will be presented in detail.
There are seven different primitive cell types, $S_{AD}$, …, $S_{ABCD}$, as defined in section
3.3. There is a fixed number of ways to embed (pack) a number of these cells in
the pASIC3 logic cell. For example, two type-AD primitive cells and two type-BCD
primitive cells can be packed in a single logic cell. Alternatively, two type-AD
primitive cells and one type-C primitive cell can be packed in one pASIC3
logic cell. If there are M such combinations, the i-th
combination can be expressed as:
$$LC_i = \sum_{j \in S} C_{i,j}, \qquad i = 1, \ldots, M \qquad (4.1)$$
where S is the set of sets of primitive cells, {$S_{AD}$, …, $S_{ABCD}$}, and $C_{i,j}$ is the number
of type-j primitive cells in the i-th combination.
The packing problem can be restated as follows. Given M ways of filling a pASIC3
logic cell with primitive cells derived from the base gate types, and the netlist of cells
generated by the mapper, find the minimum number of logic cells that covers all cells
in the netlist. Let $m_\Gamma$ denote the number of primitive cells of type Γ in the
mapped network. For example, $m_{AD}$ is the number of type-AD primitive cells, i.e.,
the number of those cells that belong to set $S_{AD}$. The problem can be expressed as
follows:
$$\text{Minimize } \sum_{i=1}^{M} n_i \quad \text{s.t.} \quad \bigcap_{j \in S} \left\{ \sum_{i=1}^{M} n_i \times C_{i,S_j} \ge m_j \right\} \qquad (4.2)$$
where $n_i$ is the number of logic cells filled with the i-th combination and $m_j$ is the
number of primitive cells of type $S_j$. This is the same problem as
the well-known coin change problem, as defined next.
Coin Change Problem: Let c1, c2, ..., cq be the coin types of a currency. Let $C_i$
denote the value of coin ci in cents and let K be some integer. We assume $C_1$ = 1. The
problem is to produce K cents of change by using a minimum number of coins (cf.
Figure 4.2). The recursive expression for the solution is
$$count[K] = \begin{cases} 0 & \text{if } K = 0 \\ \min_{i:\, C_i \le K} \{ count[K - C_i] + 1 \} & \text{if } K > 0 \end{cases} \qquad (4.3)$$
where count[K] is the minimum number of coins for K cents. We can compute the
optimal solution to the coin change problem by using a bottom-up approach. By
first computing the optimal solutions for values smaller than K, we can find the optimal
solution for the exact amount K by referring to the optimal solutions of the
previously solved sub-problems. The running time complexity is O(qK).
$$\text{Minimize } \sum_{i=1}^{q} n_i \quad \text{s.t.} \quad K = \sum_{i=1}^{q} n_i \times C_i$$
Figure 4.2: One dimensional coin change problem.
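The bottom-up computation of the one-dimensional coin change problem can be written compactly in Python. The sketch follows the recurrence for count[K]; the function name and the coin sets used in the examples are our own.

```python
def min_coins(coin_values, amount):
    """Bottom-up coin change: count[k] = min over coins C_i <= k of count[k - C_i] + 1."""
    infinity = float("inf")
    count = [0] + [infinity] * amount   # count[0] = 0
    for k in range(1, amount + 1):
        for value in coin_values:
            if value <= k and count[k - value] + 1 < count[k]:
                count[k] = count[k - value] + 1
    return count[amount]

us_coins = min_coins([1, 5, 10, 25], 63)   # 25 + 25 + 10 + 1 + 1 + 1
odd_coins = min_coins([1, 3, 4], 6)        # 3 + 3 beats the greedy 4 + 1 + 1
```

The two nested loops make the O(qK) running time visible: q coin types are tried for each of the K amounts.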
To solve the cell-packing problem, we must extend the coin change problem. First,
instead of a single amount K, there will be seven different amounts, each of which
is the number of primitive cells in the mapping solution that are in each base set,
$S_{AD}$ through $S_{ABCD}$. The recurrence equation for this problem is as follows:
$$count(m_{AD}, \ldots, m_{ABCD}) = \begin{cases} 0 & \text{if } m_{S_j} \le 0 \text{ for } \forall S_j \in S \\ \min_{\forall i} \{ count(m_{AD} - C_{i,S_{AD}}, \ldots, m_{ABCD} - C_{i,S_{ABCD}}) + 1 \} & \text{otherwise} \end{cases} \qquad (4.4)$$
The complexity of this algorithm is $O\left(M \prod_{j \in U} m_j\right)$, where M is the
number of compositions of the primitive cell types needed to fill up a pASIC3 logic
cell.
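A memoized version of the multi-dimensional recurrence might look like the Python sketch below. It is shown with only two cell types and three made-up combinations to keep the example small; the real problem has seven types and the combinations allowed by the pASIC3 logic cell. Clamping the remaining demand at zero plays the role of the "all amounts ≤ 0" base case.

```python
from functools import lru_cache

def min_logic_cells(combinations, demand):
    """Minimum number of logic cells whose combined contents cover the demand.

    combinations: list of tuples, entry j = cells of type j packed by that combination.
    demand: tuple, entry j = number of type-j primitive cells in the netlist.
    Returns float('inf') if the demand cannot be covered.
    """
    combos = [tuple(c) for c in combinations]

    @lru_cache(maxsize=None)
    def count(remaining):
        if all(m <= 0 for m in remaining):
            return 0
        best = float("inf")
        for combo in combos:
            # Skip combinations that cover none of the remaining demand.
            if not any(c > 0 and m > 0 for c, m in zip(combo, remaining)):
                continue
            reduced = tuple(max(m - c, 0) for m, c in zip(remaining, combo))
            best = min(best, count(reduced) + 1)
        return best

    return count(tuple(demand))

# Hypothetical two-type instance: each tuple is (type-0 cells, type-1 cells) per logic cell.
combos = [(2, 0), (1, 1), (0, 3)]
cells = min_logic_cells(combos, (3, 3))
```

Each distinct remaining-demand tuple is solved once, which is where the product of the amounts in the complexity bound comes from.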
4.2.2 SET CONTAINMENT RELATIONS
Base gates can be put into two classes: simple and complex base gates. A
complex base gate is one that consists of multiple base gates and internal
multiplexers, while a simple base gate cannot be composed of other base gates.
Base-gates C and D are complex, whereas base-gates A and B are simple. The
inclusion relationship between these base gates is expressed as follows:
$$\begin{aligned} \text{base-gate } B &\subset \text{base-gate } C \\ \text{base-gate } B &\subset \text{base-gate } D \\ \text{base-gate } A &\subset \text{base-gate } D \end{aligned} \qquad (4.5)$$
Notice that when both a simple base gate and a complex base gate can implement a
primitive cell, the simple base gate will be selected for realizing the function of the
primitive cell. Realizing the function by the complex base gate not only wastes area
of the pASIC3 logic cell but also needlessly increases the circuit delay. Therefore,
we can safely state that base-gates C and D are inferior to base-gates A and B
when they implement the same logic function.
4.2.3 MINIMUM NUMBER OF PASIC3 LOGIC CELLS
WITH GIVEN BASE GATES
Given the number of base gates for each type, the key question is how many
pASIC3 logic cells are required to contain all of the base gates. There are three
types of pASIC3: 2A + 2B, 2A+C, and A+B+D. A type 2A+2B pASIC3 logic cell
is defined as the pASIC3 logic cell that has two base-gate A’s and two base-gate
B’s in it. Other types can be defined similarly. Note that only one base-gate D can
fit in one pASIC3 logic cell from its logic architecture in Figure 1.2(c).
Theorem 1: Let $n_A$ denote the number of base-gate A’s; $n_B$, $n_C$, and $n_D$ are similarly
defined. The minimum number of pASIC3 logic cells, $N_{pASIC3}$, needed to implement
a mapped netlist containing $n_A$, $n_B$, $n_C$, and $n_D$ base gates can be calculated
analytically as follows:
$$N_{pASIC3} = MAX(N_{2A}, N_{2B}) + n_C + n_D$$
$$N_{2A} = \begin{cases} \dfrac{n_A - 2 n_C - n_D}{2} & n_A \ge 2 n_C + n_D \\ 0 & \text{otherwise} \end{cases} \qquad N_{2B} = \begin{cases} \dfrac{n_B - n_D}{2} & n_B \ge n_D \\ 0 & \text{otherwise} \end{cases} \qquad (4.6)$$
Proof: A base-gate C and a base-gate D cannot be packed together, and a base-gate
B and a base-gate C cannot be packed together. Therefore, the number of type 2A+C
pASIC3 logic cells is equal to the number of base-gate C’s. Similarly, the number
of type A+B+D pASIC3 logic cells is equal to the number of base-gate D’s. The
number of type 2A+2B pASIC3 logic cells is determined by the larger of the
remaining numbers of base-gate A’s and base-gate B’s, divided by two. A type
2A+C pASIC3 logic cell can pack two base-gate A’s as well as one base-gate C.
Thus, the actual number of base-gate A’s for type 2A+2B pASIC3 logic cells must
be calculated by subtracting the number of base-gate A’s that have been packed
by type 2A+C and A+B+D pASIC3 logic cells from the original number of base-gate
A’s. Similarly, the actual number of base-gate B’s for type 2A+2B pASIC3
logic cells can be calculated. Finally, the total number of pASIC3 logic cells
becomes the sum of the numbers of all three types of pASIC3 logic cells.
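The counting argument of the proof can be checked with a small Python function. The function name is ours, and rounding the halved counts up to a whole logic cell is an assumption of this sketch.

```python
def min_pasic3_cells(n_a, n_b, n_c, n_d):
    """Count logic cells following the proof of Theorem 1.

    Every base-gate C takes a 2A+C cell and every base-gate D takes an A+B+D cell.
    The A's and B's left over after filling those cells go into 2A+2B cells.
    Rounding up to whole cells is an assumption of this sketch.
    """
    remaining_a = max(n_a - 2 * n_c - n_d, 0)   # A's not absorbed by 2A+C or A+B+D cells
    remaining_b = max(n_b - n_d, 0)             # B's not absorbed by A+B+D cells
    cells_2a2b = max((remaining_a + 1) // 2, (remaining_b + 1) // 2)
    return cells_2a2b + n_c + n_d

# Hypothetical base-gate counts.
example = min_pasic3_cells(n_a=7, n_b=3, n_c=1, n_d=2)
```

In the example, one 2A+C cell and two A+B+D cells absorb four A's and two B's, and the remaining three A's and one B need two 2A+2B cells, for five cells in total.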
4.2.4 TYPE DISTRIBUTION TABLE
Theorem 1 can be used to radically simplify the problem. After technology
mapping, we count the number of primitive cells of specific types. The problem can
be restated as follows:
Problem 2: Given a primitive cell library, generated from the pASIC3 logic cell
structure, and a mapped network comprising the primitive cells, we want to find the
best choices of base gates A, B, C and D for realizing all of the primitive cells in
the network so as to minimize the number of required pASIC3 logic cells.
Note that after the base gate counts are known, the minimum number of logic cells
can be computed straightforwardly based on Theorem 1.
Table 4.1: The type distribution table for primitive cell to base-gate
mapping.

# of primitive    # of base gates by type
cell types        A           B                  C                 D
$m_{AD}$          $m_{AD}$    0                  0                 0
$m_{ACD}$         x           0                  $m_{ACD}$ - x     0
$m_{BCD}$         0           $m_{BCD}$          0                 0
$m_D$             0           0                  0                 $m_D$
$m_C$             0           0                  $m_C$             0
$m_{CD}$          0           0                  $m_{CD}$ - y      y
$m_{ABCD}$        z           $m_{ABCD}$ - z     0                 0
Table 4.1 shows how a primitive cell of type Γ in the mapped network is realized
with a base gate of type A, B, C, or D. Notice that many of the primitive cell types
have a unique realization in a single base-gate type. One example is the type BCD
primitive cell, which should be realized only using type B base gates because of
the inclusion relationship of equation (4.5) and the fact that complex base-gates
are always more costly than the corresponding simple base gates. Three of the
primitive cell types, however, can be realized by using either of two base gates.
For example, a type ACD primitive cell can be realized as either a type A or a
type C base gate. This table shows that, to solve Problem 2, all we
have to do is to determine variables x, y, and z, where x denotes the number of
primitive cells of type ACD that are realized as a type A base gate, y denotes the
number of primitive cells of type CD that are realized as a type D base gate, and z
denotes the number of primitive cells of type ABCD that are realized as a type A
base gate.
Problem 3: Given the occurrence count of different primitive cell types in a
mapped network, find the values of variables x, y and z so as to minimize the
number of pASIC3 logic cells required to cover the network.
4.2.5 PROBLEM FORMULATION AND SOLUTION
We formulate Problem 3 as a linear programming problem and then obtain the
optimal solution by finding either the minimum point of an intersected plane of two
equations [35] or the minimum point of an equation that is always above the other
within certain ranges of variables. Equation (4.6) can be restated as in equation
(4.7).
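\[
\begin{aligned}
N_{pASIC3} &= \mathrm{MIN}\left(\max\{N_{ACD}(x,y,z),\; N_{BCD}(x,y,z)\}\right), \quad 0 \le x \le m_{ACD};\; 0 \le y \le m_{CD};\; 0 \le z \le m_{ABCD} \\
N_{ACD} &= \begin{cases} \frac{1}{2}\left(m_{AD} + x + z + m_D + y\right), & \text{if } m_{AD} - 2\left(m_{ACD} + m_C + m_{CD}\right) - m_D + 3x + y + z \ge 0 \\ m_D + m_{ACD} - x + m_C + m_{CD}, & \text{otherwise} \end{cases} \\
N_{BCD} &= \begin{cases} \frac{1}{2}\left(m_{BCD} + m_{ABCD} - z + m_D - y\right) + m_{ACD} - x + m_C + m_{CD}, & \text{if } m_{BCD} + m_{ABCD} - m_D - y - z \ge 0 \\ m_D + m_{ACD} - x + m_C + m_{CD}, & \text{otherwise} \end{cases}
\end{aligned}
\tag{4.7}
\]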
The brute-force algorithm searches for the optimal solution by trying out every
possible combination of x, y, and z within their allowed ranges ($0 \le x \le m_{ACD}$,
$0 \le y \le m_{CD}$, $0 \le z \le m_{ABCD}$). The computational complexity, however, is
$O(m_{ACD} \times m_{CD} \times m_{ABCD})$, which can be quite high. Fortunately, equation (4.7) has an
important property that allows us to speed up the search: As x, y, and z increase,
$N_{ACD}$ increases but $N_{BCD}$ decreases. Therefore, within the allowed ranges of x, y, and z,
the equations for $N_{ACD}$ and $N_{BCD}$ either intersect in a plane or one equation lies above
the other throughout. We explain the solution for the two cases as follows.
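As a reference point for the faster search, the brute-force baseline can be sketched as below. The sketch assumes the primitive cell counts are given in a dictionary keyed by type name ("AD", "ACD", "BCD", "D", "C", "CD", "ABCD"); the function names are hypothetical. It derives the base-gate counts from the columns of Table 4.1 and applies Theorem 1 without rounding, so the returned cost can be fractional.

```python
from itertools import product


def cells_needed(m, x, y, z):
    """max(N_ACD, N_BCD) for one choice of x, y, z (Table 4.1 plus Theorem 1)."""
    n_a = m["AD"] + x + z
    n_b = m["BCD"] + m["ABCD"] - z
    n_c = m["ACD"] - x + m["C"] + m["CD"] - y
    n_d = m["D"] + y
    n_acd = max(n_a - 2 * n_c - n_d, 0) / 2 + n_c + n_d
    n_bcd = max(n_b - n_d, 0) / 2 + n_c + n_d
    return max(n_acd, n_bcd)


def brute_force(m):
    """Return (cost, x, y, z) minimizing cells_needed over all allowed values."""
    ranges = (range(m["ACD"] + 1), range(m["CD"] + 1), range(m["ABCD"] + 1))
    # Tuples compare by cost first, so min() picks the cheapest combination
    # and breaks ties toward the smallest x, then y, then z.
    return min((cells_needed(m, x, y, z), x, y, z) for x, y, z in product(*ranges))
```

The loop visits every (x, y, z) combination, which is exactly the cost the property of equation (4.7) is meant to avoid.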
Case 1: When $N_{ACD}$ and $N_{BCD}$ intersect on a plane, at the intersected plane, $N_{ACD}$
and $N_{BCD}$ become equal:

\[
F(x,y,z) = N_{ACD}(x,y,z) - N_{BCD}(x,y,z) = ax + by + cz + d = 0
\tag{4.8}
\]

where a, b, c, and d are the coefficients of the equation of the plane obtained after the subtraction.
All points on this plane guarantee that logic cells are full because $N_{ACD}$ and $N_{BCD}$
are equal, but choosing one arbitrary point on the plane may not give the optimal
solution. Therefore, we need to find the point that gives the optimal solution on this
plane. Notice that we should consider only points on the plane within the specified
ranges for x, y, and z. Furthermore, we need to check only corners of the plane
because of the property of $N_{ACD}$ and $N_{BCD}$ mentioned above.
Case 2: $N_{ACD}$ and $N_{BCD}$ may not intersect at all, resulting in one equation lying
above the other in the ranges of x, y, and z. In this case, simply, two points are
evaluated: (x = 0, y = 0, z = 0) and (x = $m_{ACD}$, y = $m_{CD}$, z = $m_{ABCD}$). If $N_{ACD}$ is larger
than $N_{BCD}$ at x = 0, y = 0, and z = 0, $N_{ACD}$(x = 0, y = 0, z = 0) is the minimum
solution. Otherwise, $N_{BCD}$(x = $m_{ACD}$, y = $m_{CD}$, z = $m_{ABCD}$) is the minimum solution.
The worst case of the above algorithm is when it requires checking all of the
candidate points. Those candidate points can be enumerated by setting minimum or
maximum values to all variables except one. Therefore, the complexity is
$O(k \cdot 2^{k-1})$, where k is the number of variables. In this case, k = 3. Notice that the
computational complexity of this algorithm is independent of the network size.
From the optimal distribution of primitive cells, we can easily find out the kind of
logic cells and how many of those logic cells are required. Notice that there are
only three types of clusters: 2A+2B, 2A+C, and A+B+D. $n_A$ indicates the number
of base-gate A’s required from the optimal distribution of primitive cells. Likewise,
$n_B$, $n_C$, and $n_D$ can be computed by adding the numbers in each column of Table 4.1 as
follows:

\[
\begin{aligned}
n_A &= m_{AD} + x + z \\
n_B &= m_{BCD} + m_{ABCD} - z \\
n_C &= m_{ACD} - x + m_C + m_{CD} - y \\
n_D &= m_D + y
\end{aligned}
\tag{4.9}
\]
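The numbers of logic cells for the different cluster types ($n_{A+B+D}$, $n_{2A+C}$, $n_{2A+2B}$) can be
computed by the following equations, where $n'_A$ and $n'_B$ denote the base gates left
after the A+B+D cells are formed and $n''_A$ denotes the base-gate A’s left after the
2A+C cells are formed as well:

\[
\begin{aligned}
n_{A+B+D} &= m_D + y \\
n'_A &= m_{AD} + x + z - n_{A+B+D} \\
n'_B &= m_{BCD} + m_{ABCD} - z - n_{A+B+D} \\
n_{2A+C} &= m_{ACD} - x + m_C + m_{CD} - y \\
n''_A &= n'_A - 2\,n_{2A+C} \\
n_{2A+2B} &= \max\left( \left\lceil \frac{n''_A}{2} \right\rceil, \left\lceil \frac{n'_B}{2} \right\rceil \right)
\end{aligned}
\tag{4.10}
\]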
Knowing the numbers of logic cells for each cluster type can be used to guide
algorithms to achieve small area. We use this information for area constraints to
build a clustering solution.
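A minimal sketch of this bookkeeping follows, assuming the primitive cell counts are passed in a dictionary keyed by type name and that x, y, and z have already been chosen. The function name cluster_type_counts is hypothetical, and clamping the leftover counts at zero is an assumption of the sketch.

```python
from math import ceil


def cluster_type_counts(m, x, y, z):
    """Number of A+B+D, 2A+C, and 2A+2B cells for a given (x, y, z)."""
    n_abd = m["D"] + y                              # one cell per base-gate D
    a_left = m["AD"] + x + z - n_abd                # A's not used by A+B+D cells
    b_left = m["BCD"] + m["ABCD"] - z - n_abd       # B's not used by A+B+D cells
    n_2ac = m["ACD"] - x + m["C"] + m["CD"] - y     # one cell per base-gate C
    a_left -= 2 * n_2ac                             # A's not used by 2A+C cells
    # Clamping at zero is an assumption: spare A/B slots need no extra cell.
    n_2a2b = max(ceil(max(a_left, 0) / 2), ceil(max(b_left, 0) / 2))
    return {"A+B+D": n_abd, "2A+C": n_2ac, "2A+2B": n_2a2b}
```

The three counts returned here are the per-type budgets that the clustering steps draw from.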
4.3 AREA-DRIVEN CLUSTERING TECHNIQUE
In this section, we present an area-driven clustering, which considers both
interconnect connectivity and circuit delay. The algorithm improves routability and
delay under the constraints of the minimum number of pASIC3 logic cells for a
given circuit.
Figure 4.3: Examples of local neighborhood connectivity factor
computation.
Knowing the minimum number of pASIC3 logic cells is not good enough to assign
nodes mapped with primitive cells into pASIC3 logic cells. In other words, we can
calculate how many clusters of different types are required for a circuit by using the
algorithm described in the previous section, but that does not achieve a complete
clustering solution because the algorithm for calculating the lower-bound does not
take into account the connectivity between nodes in a circuit. Thus, there are two
conditions for minimum area clustering. First, a node mapped to a certain type of
primitive cell can be realized only by a limited set of base-gate types, and
second, there are upper bounds on the number of base gates of each type that fit
inside a cluster (e.g., a pASIC3 logic cell). We refer to these two conditions as
resource constraints. Therefore, when we create a new cluster, we have to ensure
that the resource constraints are not violated.
4.3.1 INTERCONNECT-AWARE CLUSTERING
Since reducing the routing area is one of the primary goals, we propose a heuristic,
interconnect-aware clustering algorithm. The problem can be stated as follows:
Problem 4: Given a mapped network to primitive cells and the numbers of
different cluster types for the packing solution with the minimum number of
pASIC3 logic cells, find a clustering solution that has the minimum number of
inter-cluster interconnect wires.
A net connecting two nodes that can be clustered together (due to lack of conflicts)
is called an absorbable net. Considering conflicts between base gates in a pASIC3
logic cell, motivated by [53], we define a local neighborhood connectivity factor of
node u for base gate b as follows:
\[
c(u,b) = \frac{P_d(u)}{\left[ N_d(u,b) \right]^{\gamma}}
\tag{4.11}
\]

where $P_d(u)$ is the number of nodes within distance d from node u, $N_d(u,b)$ is the
number of absorbable nets within distance d from node u, and γ ≥ 1 is an empirical
parameter. Smaller c values signify that fewer nodes are located in a given node’s
neighborhood or that the number of connected absorbable nets is large. Now, within a
given distance from a node n, if the local neighborhood connectivity factor of the
node n is small, then there is a higher chance for the neighbor nodes to be packed
into a cluster containing n.
Figure 4.3(a) and (b) show examples of calculating local neighborhood
connectivity factors when d = 1 and γ = 2. In Figure 4.3(c), nodes $n_3$
and $n_5$ cannot be clustered because their base gates conflict within a pASIC3 logic
cell. In this case, the net connecting nodes $n_3$, $n_4$, and $n_5$ cannot be absorbed into a
cluster at all. Therefore, the c value of node $n_3$ is larger than the value in Figure
4.3(b). Figure 4.3(d) depicts a case in which node $n_3$ has two base gates, and the
two c values are calculated for each base gate as done for Figure 4.3(b) and (c).
Consequently, each node might have different c values for the different base gates
that can realize its function.
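The connectivity factor for d = 1 can be sketched as follows. Several details are assumptions of this sketch and not taken from the text: nets are given as sets of node names, $P_d(u)$ counts the neighbors of u without u itself, the base-gate dependence is folded into a caller-supplied predicate is_absorbable, and a node with no absorbable net gets an infinite factor.

```python
import math


def connectivity_factor(u, nets, is_absorbable, gamma=2.0):
    """Local neighborhood connectivity factor of node u for distance d = 1."""
    incident = [net for net in nets if u in net]
    # Nodes reachable through one net, excluding u itself (assumption).
    neighbors = set().union(*incident) - {u}
    absorbable = sum(1 for net in incident if is_absorbable(net))
    if absorbable == 0:
        return math.inf  # nothing can be absorbed around u (assumption)
    return len(neighbors) / absorbable ** gamma
```

With three nets around a node and four neighbors, the factor is 4/9 when all nets are absorbable and rises to 1 when one net cannot be absorbed, which matches the direction of the comparison made for Figure 4.3(b) and (c).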
We propose a heuristic, interconnect-aware algorithm. Clustering is done in two
steps. In the first step, the local neighborhood connectivity factor, c, for each base
gate of each node is computed, and the node with the lowest connectivity factor is
chosen as the seed for a new cluster. A new cluster is created with the seed node. The
type of the new cluster is determined based on the availability of cluster types for
the seed node: the cluster type with the highest availability is chosen as the type of the
new cluster. Once the type of the cluster is determined, the number of available
clusters of that type is decremented by one. In the second step, we find the node most
tightly connected to the new cluster and pack it into the new cluster. The
amount of a node’s attraction to a cluster can be quantified by a gain function.
Depending on which base gate realizes the function of a node, the gain of a node
can vary. In addition, if a net is a permanent inter-cluster interconnect, then the gain
of the net will be zero. This is because we do not have to calculate gain for a node
that cannot be absorbed due to resource conflicts, such as node Z in Figure 4.4.
We use a modified version of the gain function presented in [61] that considers base
gates and normalization. For node X associated with base gate b and the current
cluster C, it is defined as follows:
\[
\begin{aligned}
GAIN(X, b, C) &= \frac{1}{MaxGain} \left( \sum_{x \in Nets(C) \cap Nets(X)} G(X, b, C, x) \right) \\
G(X, b, C, x) &= 0.5 \times w(x) \times (\alpha_x + 1)
\end{aligned}
\tag{4.12}
\]

where $\alpha_x$ is the number of pins of net x that are already inside cluster C, w(x) is the
weight of net x, i.e., w(x) = 2/r where r is the number of pins on x, and MaxGain is
the possible maximum gain in a network, which is the maximum number of pins of
a node. Figure 4.4 depicts an example of gain calculation. Notice that net2 cannot
be absorbed into cluster C because it is a permanent external net. Therefore either
node X or node Y will be clustered. If we consider only gain, node Y must be
packed into cluster C because it has higher gain than node X. However, we have to
ensure that the resource constraint is satisfied. For example, when we choose node
Y with base gate A or B, the cluster type of the new cluster becomes 2A+2B. If
there is no available pASIC3 logic cell of type 2A+2B, node Y cannot be packed
into cluster C. On the other hand, when there is an available cluster for type
A+B+D, node X can be clustered into cluster C. When a node has multiple base
gates and an equal value for gains, the base gate with the highest availability is
selected. If the new cluster is not full and at least one node connected to the cluster
can still be packed, we repeat the second step. Otherwise, we return to the first step. We
continue clustering until there is no available pASIC3 logic cell for the minimum
area solution.
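The gain of equation (4.12) can be sketched as below. The data layout is an assumption: nets is a dictionary from net name to the set of nodes it connects, cluster is a set of nodes, and permanent names the nets treated as permanent inter-cluster interconnects. The dependence on the base gate b is not modeled here; in a fuller version it would decide which nets count as permanent.

```python
def gain(node, cluster, nets, max_gain, permanent=frozenset()):
    """Normalized attraction of `node` to `cluster`, following equation (4.12)."""
    total = 0.0
    for name, pins in nets.items():
        alpha = len(pins & cluster)       # pins of this net already in the cluster
        # Only nets shared by the node and the cluster contribute, and
        # permanent inter-cluster nets contribute zero gain.
        if node not in pins or alpha == 0 or name in permanent:
            continue
        weight = 2.0 / len(pins)          # w(x) = 2 / r
        total += 0.5 * weight * (alpha + 1)
    return total / max_gain
```

A two-pin net with one pin inside the cluster contributes 0.5 × 1 × 2 = 1 before normalization, while a four-pin net with two pins inside contributes 0.5 × 0.5 × 3 = 0.75.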
However, there can be un-clustered nodes when no available pASIC3 logic cells
are left. A cluster cannot find a node among the immediate neighbor nodes when all
neighbors have been clustered, or all un-clustered neighbor nodes cause resource
conflicts with nodes in the cluster. In the third step, we pack those un-clustered
nodes into clusters. In order to measure Euclidean distances between clusters and
un-clustered nodes, we place those clusters and un-clustered nodes by using a VPR
high-temperature simulated annealing placer [8]. This problem can be solved
optimally by using the linear assignment approach.
Figure 4.4: Clustering nodes.
Figure 4.5 shows an example of transforming the partially clustered network into a
bipartite graph for linear assignment. The cost of the edge in the bipartite graph is
the Euclidean distance between an un-clustered node and a cluster. Since we
guarantee that only available pASIC3 logic cells are used for the minimum area
clustering, the number of empty spaces for base gates in clusters is equal to or
larger than the number of un-clustered nodes.
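The assignment step can be illustrated with a small exhaustive search. This is a sketch for tiny instances only, not the linear assignment solver itself: it tries every injective mapping of un-clustered nodes to free base-gate slots and keeps the cheapest. The data layout, the function name, and the rule that a slot is usable only if its base-gate type is among the node’s allowed types are assumptions.

```python
from itertools import permutations
from math import dist, inf


def assign_unclustered(nodes, slots):
    """nodes: {name: ((x, y), allowed_gates)}; slots: [(cluster, gate, (x, y))]."""
    names = list(nodes)
    best_cost, best = inf, None
    for perm in permutations(range(len(slots)), len(names)):
        cost = 0.0
        for name, i in zip(names, perm):
            pos, allowed = nodes[name]
            _cluster, gate, slot_pos = slots[i]
            # Edge cost is the Euclidean distance; incompatible slots are forbidden.
            cost += dist(pos, slot_pos) if gate in allowed else inf
        if cost < best_cost:
            best_cost = cost
            best = {name: slots[i][0] for name, i in zip(names, perm)}
    return best, best_cost
```

The search is factorial in the number of slots, so a real flow would replace it with a polynomial-time linear assignment method.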
Figure 4.5: Packing un-clustered nodes by using linear assignment: (a)
partially clustered network; (b) bipartite graph for linear assignment.
4.3.2 TIMING SLACK-DRIVEN CLUSTERING
The delay caused by inter-cluster interconnect, which connects pASIC3 logic cells
through interconnect wires and antifuses, tends to be much larger than the delay
caused by intra-cluster interconnect. Therefore, we can assume that inter-cluster
delay has a unit delay while the intra-cluster delay is negligible. This assumption is
reasonable because no placement and routing information is known and the inter-
cluster interconnect delay is much longer than the intra-cluster interconnect delay.
The timing slack-driven clustering problem can be stated as follows.
Problem 5: Given a mapped network with primitive cells and the numbers of
different cluster types for the packing solution with the minimum number of
pASIC3 logic cells, find a clustering solution that has the minimum number of
pASIC3 logic cells on the timing-critical paths of the circuit.
We use the notion of criticality of a node in a network as described in [49]. The
criticality of a node X is redefined to have the range from 0 to 1 as follows:
\[
Criticality(X) = 1 - \frac{slack(X) - MinSlack}{MaxSlack - MinSlack}
\tag{4.13}
\]
where slack(X) is the slack time of node X, MinSlack and MaxSlack are the
minimum slack and the maximum slack in a network, respectively. When the
criticality of a node is higher than that of the other nodes, then the node will be on a
more critical timing path compared to the other nodes.
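Equation (4.13) translates directly into code. In this sketch the slack values are assumed to be given in a dictionary keyed by node name, and the handling of a network in which all slacks are equal (every node gets criticality 1) is an assumption, since the equation is undefined in that case.

```python
def criticalities(slack):
    """Map each node's slack to a criticality in [0, 1] per equation (4.13)."""
    lo, hi = min(slack.values()), max(slack.values())
    if hi == lo:
        # All slacks equal: treat every node as fully critical (assumption).
        return {node: 1.0 for node in slack}
    return {node: 1.0 - (s - lo) / (hi - lo) for node, s in slack.items()}
```

The node with the minimum slack gets criticality 1 and the node with the maximum slack gets criticality 0.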
In [49], a node with the highest criticality is absorbed into a cluster in a manner that
possibly reduces the number of clusters on the critical path. However, we observed
that packing nodes in such a greedy manner could increase the number of clusters
on the critical path. Figure 4.6 depicts a situation where greedily selecting a node
with the largest criticality can cause a worse clustering solution. Suppose that node
$n_5$ is a seed node. After packing nodes $n_4$ and $n_3$, in Figure 4.6(a), the greedy
algorithm will choose node $n_7$ because its criticality is greater than the criticalities
of nodes $n_1$ and $n_2$. Note that this will prevent another cluster from absorbing nodes
$n_6$, $n_7$, and $n_8$, which are on the critical paths. On the other hand, by selecting
node $n_2$ in Figure 4.6(b), nodes $n_6$, $n_7$, and $n_8$ on the critical paths can be packed into
a cluster. From this example, we notice that clustering a node with higher criticality
than the connected nodes in a cluster can lower the chance of reducing the number
of pASIC3 logic cells on the critical path.
Figure 4.6: Selecting the best node for clustering: (a) greedy selection and
(b) intelligent selection.
Neighboring nodes of a cluster, denoted as neighbor_nodes, are un-clustered nodes
that are connected directly to some node in the cluster. Let the criticality of each
node x in the network be denoted by crit(x). Consider a partially-formed cluster C
comprising nodes $x_1, \ldots, x_m$. Now let the absorbable neighbors of C be denoted
by $y_1, \ldots, y_p$. Let N($y_j$, C) denote the set of neighbors of $y_j$ in C. Furthermore, let
max_neigh_crit($y_j$, C) and min_neigh_crit($y_j$, C) denote the maximum and the
minimum criticalities of any node in N($y_j$, C), respectively. Our approach selects the
best node for packing in the following order of priority: 1) a neighbor node
$y_{j*}$ such that crit($y_{j*}$) is maximum among all $y_j$’s and crit($y_{j*}$) =
max_neigh_crit($y_{j*}$, C); 2) a neighbor node $y_{j*}$ such that crit($y_{j*}$) is maximum
among all $y_j$’s and crit($y_{j*}$) < max_neigh_crit($y_{j*}$, C); and 3) a neighbor node $y_{j*}$
such that crit($y_{j*}$) is minimum among all $y_j$’s and crit($y_{j*}$) > min_neigh_crit($y_{j*}$, C).
Figure 4.7 describes the algorithm to choose the best node to improve circuit delay.
search_nodes_cluster(u, cluster) finds nodes in cluster connected to node u. In
order to prioritize, there can be three cases: first, if criticality(u) is equal to
criticality(v), criticality(u) is set as the best_criticality when it is larger than the
current best_criticality; secondly, if criticality(u) is less than criticality(v), the
criticality(u) is set as max_criticality when it is larger than the current
max_criticality; finally, if criticality(u) is larger than criticality(v), criticality(u) is
set as the min_criticality when it is smaller than the current min_criticality. The
priority for selecting the best node is in order of best_criticality, max_criticality,
and min_criticality. By doing this, we reduce the chance that a selection interferes
with timing slack-driven clustering along the critical paths.
We use the same flow for the timing slack-driven clustering under the minimum
area constraint as for the interconnect-aware clustering. For seed selection, we give a
higher chance of being a seed to those nodes on critical paths. Since nodes on
critical paths have the same criticality, we select among them the node with the lowest
connectivity value and break any remaining ties arbitrarily. In the next step, to
select the best node for clustering, we use the algorithm based on an ordering of the
priorities.

Algorithm select_best_node_timing(neighbor_nodes, cluster)
    max_criticality = best_criticality = -∞
    min_criticality = ∞
    for each node u in neighbor_nodes
        node_list = search_nodes_cluster(u, cluster)
        for each node v in node_list
            if criticality(u) == criticality(v) and criticality(u) > best_criticality
                best_criticality = criticality(u)
                best_node = u
            else if criticality(u) > criticality(v) and criticality(u) < min_criticality
                min_criticality = criticality(u)
                min_node = u
            else if criticality(u) < criticality(v) and criticality(u) > max_criticality
                max_criticality = criticality(u)
                max_node = u
            end if
        end for
    end for
    if best_criticality != -∞
        return best_node
    else if max_criticality != -∞
        return max_node
    else
        return min_node
    end if
Figure 4.7: Selecting the best node for delay improvement.
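The priority-based selection can also be written as a short runnable sketch. The data layout is an assumption: crit maps each node to its criticality, adjacent maps each un-clustered node to the set of nodes it connects to, and cluster is a set of nodes. The three candidates correspond to best_criticality, max_criticality, and min_criticality in the description of Figure 4.7, and the third branch uses the "less than" condition stated in that description.

```python
import math


def select_best_node_timing(neighbor_nodes, cluster, crit, adjacent):
    """Pick the next node to pack, in priority order best, max, min."""
    best = (-math.inf, None)   # crit(u) equals the criticality of a cluster node
    high = (-math.inf, None)   # crit(u) is below that of a cluster node
    low = (math.inf, None)     # crit(u) is above that of a cluster node
    for u in neighbor_nodes:
        for v in adjacent[u] & cluster:
            cu, cv = crit[u], crit[v]
            if cu == cv and cu > best[0]:
                best = (cu, u)
            elif cu > cv and cu < low[0]:
                low = (cu, u)
            elif cu < cv and cu > high[0]:
                high = (cu, u)
    for _, node in (best, high, low):
        if node is not None:
            return node
    return None  # no neighbor is connected to the cluster
```

A neighbor whose criticality matches a node already in the cluster wins over a more critical neighbor, which is the behavior motivated by Figure 4.6.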
4.4 EXPERIMENT RESULTS
We have selected 18 large combinational circuits from the MCNC91 benchmark.
SIS [59] reads the circuits in blif format. To evaluate our library generation and
area-driven clustering, we compare our results to those from a commercial tool,
called QuickWorks 4.1 from QuickLogic [53]. For QuickWorks 4.1, options for
logic optimization are set for overnight, area minimization, and no buffer insertion.
The option for placement and routing is set for overnight to get the best results.
QuickWorks uses the term cell fragment to indicate a library cell generated from a
pASIC3 logic cell. The results were taken after placement. For our simulation set-
up, the library was read and script.rugged was used to optimize a circuit. SIS was
used for technology mapping with the library. We estimated the minimum number
of logic cells by using our algorithm, Packer-pASIC3-area. Table 4.2 reports the
results of the area-driven clustering. In most of the cases, Packer-pASIC3-area used fewer
primitive cells than QuickWorks. Packer-pASIC3-area reduced the number of
pASIC3 logic cells by 12.29% on average compared to QuickWorks.
Table 4.3 shows the results of clustering with different objectives such as the
minimum number of external wires and the minimum pASIC3 logic cells on the
critical path. Compared to the Packer-pASIC3-area-interconnect, the Packer-
pASIC3-area-timing generated more inter-cluster interconnects by 7%. In order to
Table 4.2: Results of lower-bound calculation

            QuickWorks4.1             Packer-pASIC3-area        Improvement (%)
Circuits    cell        pASIC3        primitive   pASIC3        primitive   pASIC3
            fragments   logic cells   cells       logic cells   cells       logic cells
i9          356         95            374         96            -5.06       -1.05
rot         398         104           350         88            12.06       15.38
i8          706         184           568         142           19.55       22.83
pair        925         243           802         213           13.30       12.35
vda         514         131           312         79            39.30       39.69
x1          176         45            165         42            6.25        6.67
C6288       1904        476           1513        448           20.54       5.88
C5315       996         264           699         196           29.82       25.76
alu4        500         125           419         113           16.20       9.60
apex6       360         124           329         84            8.61        32.26
C880        218         57            177         54            18.81       5.26
C3540       705         181           672         175           4.68        3.31
alu2        262         66            216         57            17.56       13.64
C1355       224         57            210         53            6.25        7.02
C1908       221         56            215         55            2.71        1.79
C432        121         31            120         31            0.83        0.00
C499        201         58            210         53            -4.48       8.62
Average Improvement (%)                                         12.17       12.29
Table 4.3: Results of different clustering objectives with the minimum area
solution

                 QuickWorks4.1     Packer-pASIC3-area-interconnect    Packer-pASIC3-area-timing
Circuits         Total    Max      Max.     Inter-cluster   Wire      Max.     Inter-cluster   Wire
                 Wires    Depth    Depth    Wires           Length    Depth    Wires           Length
i9               462      9        8        285             4750      6        287             4649
rot              485      15       11       338             4881      9        379             5794
i8               701      9        8        475             9159      8        478             9599
pair             975      15       18       675             12080     14       761             11272
vda              329      10       11       247             3026      8        257             3123
x1               216      6        5        140             1890      5        152             1987
C6288            1545     91       69       1167            12049     67       1323            13659
C5315            894      16       18       661             14175     15       718             14471
alu4             433      25       23       284             3540      19       320             3843
apex6            464      9        8        338             6285      7        357             5899
C880             237      20       17       183             2060      17       197             1922
C3540            722      23       26       440             6743      23       547             6887
alu2             226      32       17       134             1650      16       164             1779
C1355            251      17       11       185             1350      12       186             1382
C1908            248      25       16       174             1506      14       186             1578
C432             156      25       22       101             802       16       113             804
C499             251      13       11       186             1352      11       182             1529
Average Change            1.35     1.14     1               1         1        1.07            1.04
measure the total wirelength after placement and routing, we used VPR [8].
Packer-pASIC3-area-interconnect reduced the total wirelength by 4%
compared to Packer-pASIC3-area-timing. In terms of the number of clusters on the
critical path, Packer-pASIC3-area-timing provided a shorter critical path than
QuickWorks and Packer-pASIC3-area-interconnect by 35% and 14%, respectively.
4.5 SUMMARY
In this chapter, we presented area-driven clustering algorithms for coarse-grained,
antifuse-based FPGAs. We generated a library set from the pASIC3 logic cell and
mapped a network with the library set. For the mapped network, we presented a
dynamic programming solution as a general solution to find the minimum number
of pASIC3 logic cells. By considering the architectural characteristic of the
pASIC3 logic cell, we set up a pair of linear equations and found the optimal
solution. With this minimum area requirement, we proposed an interconnect-aware
clustering algorithm and a timing slack-driven clustering algorithm. The
interconnect-aware clustering algorithm used connectivity information among
nodes under the constraint for the minimum area. The timing slack-driven
clustering algorithm intelligently packs nodes into clusters to minimize the number
of clusters on the critical path, by avoiding false selection of critical nodes. For the
minimum number of pASIC3 logic cells, our lower-bound calculation algorithm
provided approximately a 12% reduction compared to QuickWorks from
QuickLogic. The interconnect-aware clustering also achieved a 21% reduction in
the number of inter-cluster interconnects compared to a simple clustering
algorithm based on placement and proximity among nodes. The timing slack-driven
clustering algorithm reduced the number of pASIC3 logic cells on the critical path
by 35% compared to QuickWorks.
CHAPTER 5
TIMING-DRIVEN CLUSTERING
In this chapter, a clustering algorithm is presented, to minimize the number of
pASIC3 logic cells on the longest input-output path. In addition, slack-time
relaxation minimizes redundant logic replication while timing constraints are
satisfied.
5.1 INTRODUCTION
Circuit clustering is a technique that groups the gates of a circuit into clusters
within a certain boundary, such as area and pin constraint, to optimize certain
metrics. Commonly used metrics include maximization of the connectivity within
clusters or minimization of the delay of the clustered circuits. In this chapter, we
focus on the delay minimization of the clustered circuit.
Delay caused by inter-cluster interconnect, which connects pASIC3 logic cells
through interconnect wires and antifuses, tends to be much larger than the delay
caused by intra-cluster interconnect. Therefore, we can assume that inter-cluster
delay has a unit delay while the intra-cluster delay is negligible.
For timing-driven clustering, we minimized the number of pASIC3 logic cells on
the critical path by optimal labeling and clustering. Slack-time relaxation was
applied to minimize logic replication without violating the maximum required
labels at primary outputs.
This chapter is organized as follows: Section 5.2 provides the problem
statement, section 5.3 presents an optimal labeling algorithm for the problem,
section 5.4 describes slack-time relaxation, section 5.5 discusses merging clusters,
and the experimental results and the conclusion follow in sections 5.6 and
5.7.
5.2 PROBLEM STATEMENT
The timing-driven clustering problem can be stated as follows.
Problem statement: A combinational network can be represented as a directed
acyclic graph G = (V, E), where V is the set of nodes, and E is the set of directed
edges. Each node in V represents a primitive cell in the network and each edge (u,
v) in E represents an interconnection between primitive cells u and v in the
network.
Problem: Given a network G mapped with the library generated, we want to find a
clustering solution so that the number of pASIC3 cells on the critical path is
minimal.
Notice that each cluster must be feasible in the sense that it must be realizable
with a single pASIC3 logic cell. label(u) denotes the label of node u in the
network.
5.3 MULTI-DIMENSIONAL LABELING ALGORITHM
Lawler’s algorithm is a labeling algorithm [43]. The label given to node v indicates
the maximum delay along any path from an input of the network to node v.
Lawler’s algorithm results in a clustering C of the nodes with the smallest possible
largest label subject to the capacity constraint. This largest label equals the
maximum delay in the network. Notice that a clustering constraint is monotone if
and only if any connected subset of nodes in a feasible cluster is also feasible.
The clustering constraint for our problem is monotone as well. More precisely,
when a collection of primitive cells in the mapped network can be realized in a
pASIC3 logic cell, a subset of these primitive cells can also be realized in a single
pASIC3 cell. We call this constraint a resource constraint. We propose a labeling
algorithm to guarantee the optimal solution, which is subject to the resource
constraint. The pseudo-code for the algorithm is provided in Figure 5.1. Notice that
since multiple base gates can realize a node, those base gates must be checked to
create a cluster. In addition, according to the topological containment described in
section 4.2.2, some base gates are inferior to others. Therefore, inferior base gates
can be dropped as was done for the area-driven clustering. This significantly
reduces the complexity of generating clusters during the labeling phase.
The algorithm starts by setting all labels of primary inputs to zero. In line 12,
candidate base gates for the node are found to create clusters for different base
gates. In lines 13 to 21, we create all feasible clusters comprising node v and its
fanin nodes with label l. If there exists any feasible cluster, the label of node v
remains at l. If no feasible cluster exists, the label is incremented by one.
The total number of clusters for the feasibility test in line 14 is an important factor
in determining the computational complexity of the algorithm. We denote this number
by k. In addition, up to four base-gates can fit in a pASIC3 logic cell. Therefore,
if the number of fanin nodes with label l in line 14 is larger than three,
then we do not have to generate clusters for the feasibility test. We denote the
number of fanin nodes with label l, which is less than four, by f, and the
number of base gates for node v by m. Since there can be f fanin nodes with label l, each
of which can have up to k clusters, and node v can choose one base gate out of m base
gates, the total number of clusters generated for the feasibility test in line 14 will be
at most $mk^f$, which is independent of the network size. Therefore, the complexity of this
algorithm becomes $O(|V| m k^f)$, where |V| is the number of nodes.
1.  Algorithm Multi-dimensional labeling
2.  Begin
3.    foreach primary input v do
4.      label(v) = 0;
5.    end for;
6.    Generate list T of non-primary inputs in
7.      topological order;
8.    while T is not empty do
9.      Remove node v from the head of T;
10.     l = max{label(u) | u ∈ input(v)};
11.     clusterSet(v) = ∅;
12.     Generate list M of base gates for node v;
13.     foreach base gate from M,
14.       R = form clusters from node v and clusters with
15.           label l in fanins of node v
16.       foreach cluster from R,
17.         if cluster is feasible for pASIC3 realization,
18.           clusterSet(v) = clusterSet(v) ∪ cluster;
19.         end if;
20.       end for;
21.     end for;
22.     if clusterSet(v) ≠ ∅, then
23.       label(v) = l;
24.     else
25.       label(v) = l + 1;
26.     end if;
27.   end while;
28. End
Figure 5.1: Multi-dimensional labeling algorithm.
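The labeling loop of Figure 5.1 can be sketched in Python as below. The sketch rests on several assumptions that are not in the figure: a cluster is stored as a dictionary from node to the base gate chosen for it, feasibility is checked only against the three cell capacities (2A+2B, 2A+C, A+B+D), primary inputs occupy no space in a cell, and a node whose label is incremented starts a fresh single-node cluster. The shortcut for more than three fanins with label l is omitted.

```python
from collections import Counter
from itertools import product

CELL_TYPES = [Counter(A=2, B=2), Counter(A=2, C=1), Counter(A=1, B=1, D=1)]


def feasible(cluster):
    """True if the chosen base gates fit the capacity of one cell type."""
    need = Counter(cluster.values())
    return any(all(need[g] <= cap[g] for g in need) for cap in CELL_TYPES)


def merge(parts, v, gate):
    """Union fanin clusters with {v: gate}; None if a node gets two gates."""
    merged = {v: gate}
    for part in parts:
        for node, g in part.items():
            if merged.setdefault(node, g) != g:
                return None
    return merged


def label_network(topo_order, fanins, gates):
    """Label nodes in topological order; nodes without fanins are primary inputs."""
    label, clusters = {}, {}
    for v in topo_order:
        ins = fanins.get(v, [])
        if not ins:
            label[v], clusters[v] = 0, [{}]
            continue
        lab = max(label[u] for u in ins)
        critical = [u for u in ins if label[u] == lab]
        found = []
        for gate in gates[v]:
            # One cluster choice per fanin that carries the maximum label.
            for parts in product(*(clusters[u] for u in critical)):
                merged = merge(parts, v, gate)
                if merged is not None and feasible(merged):
                    found.append(merged)
        if found:
            label[v], clusters[v] = lab, found
        else:
            label[v], clusters[v] = lab + 1, [{v: g} for g in gates[v]]
    return label
```

In a chain where a C gate feeds a D gate that feeds an A gate, the D gate cannot share a cell with the C gate and its label rises to 1, while the A gate joins the D gate in an A+B+D cell and keeps label 1.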
An example is provided in Figure 5.2. For the sake of simplicity, only two cluster
(pASIC3 logic cell) types are considered: 2A+2B and 2A+C. Each character
annotation in a node represents a candidate base gate for realization. There can be
different cluster types. Selecting the cluster type for an area minimization during
the merging phase is an open question. We will describe our strategy in section 5.5.
Figure 5.2: Clustering example.
5.4 SIGNAL PATH AWARE SLACK-TIME RELAXATION
The labeling algorithm generates a cluster solution where some clusters can cover
identical nodes. If there is no slack-time relaxation, the nodes must be duplicated in
these clusters. For example, cluster3 and cluster4 cover the same node. If a designer
specifies a required maximum label in the primary outputs, then we can compute
the slack time for each node by subtracting the label of the node from the required
label at the node. A positive slack of a node denotes the amount by which the node
can be slowed down without increasing the maximum label. Therefore, we can sort
nodes by their descending slack values. Next, we process each node from the sorted
order. If other clusters cover a fanin node of the current node and the slack of the
node is positive, this will tell us that the fanin node can be removed from the
cluster, which contains the node and its fanin node. By doing this, the current node
will decrease its slack by one. If this operation changes the slack time of the fanin
node, then the slack time of transitive fanin cone of the node is computed again.
This procedure prevents unnecessary node duplication while the required maximum
label is still met.
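The slack bookkeeping described above can be sketched as follows. This is only
a minimal illustration under stated assumptions, not the implementation used in
this work: the function name slack_order, the dictionary representation, and
the label values are hypothetical, and the required label of every node is
assumed to be given.

```python
def slack_order(label, required):
    """Return per-node slack and the nodes sorted by descending slack.

    label[v]    : label assigned to node v by the labeling algorithm
    required[v] : required label at node v (assumed to be known already)
    """
    slack = {v: required[v] - label[v] for v in label}
    # Nodes with the largest slack are processed first; ties are broken by name.
    order = sorted(slack, key=lambda v: (-slack[v], v))
    return slack, order


# Hypothetical labels for five nodes; these values are illustrative only.
label = {"n1": 1, "n2": 1, "n3": 2, "n4": 1, "n5": 2}
required = {"n1": 1, "n2": 1, "n3": 2, "n4": 2, "n5": 2}
slack, order = slack_order(label, required)
```

In this toy input only n4 has positive slack, so it is the first candidate for
removing a shared fanin node from its cluster.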
By being aware of the signal paths of nodes, we can avoid node duplication. In
Figure 5.3, node n_4 has a positive slack and it seems that we can avoid
duplicating node n_2. By duplicating the node, the label of node n_1 and n_4 is
incremented, which is 2. However, the label of node n_3 should increase because
the label of node n_1 is incremented. Since the slack of node n_3 was zero, it
will have a negative slack, which is prohibited. Observing carefully a signal
path from node n_1 to n_3, we can notice that there is no actual increase in
terms of the number of clusters on the signal path. Node n_3 still has one
cluster on its signal path.
Figure 5.3: Slack-time relaxation with awareness of signal path. (Figure not
reproduced: two views of nodes n_1 through n_5, each node annotated with a
(label, slack) pair.)
5.5 MERGING ALGORITHM
After locating all clusters in a network, we may be able to merge some of these
clusters, if the merging still results in a feasible cluster (one that can be mapped to a
single pASIC3 logic cell.) In order to gain insight into how to merge clusters, we
perform a global placement of the clustered network by using DRAGON2000 [71].
We then randomly choose a seed cluster and locate the closest cluster to it (one
with the shortest Euclidean distance from the seed cluster.) We then attempt to
merge the two clusters into one. Of course, the merged cluster must be feasible. If
the merged cluster still has room in it, then we continue to look for another nearby
cluster. This process is continued until the merged cluster is full. Next, we do the
same expansion/merging process starting with another seed cluster. This procedure
is repeated until no cluster with low area utilization is left or until no more merging
is possible.
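The seed-and-expand merging loop described above can be sketched as follows.
This is a hedged illustration rather than the actual implementation: the
function name merge_clusters is hypothetical, the placement coordinates are
assumed to be supplied by a global placer, and the pASIC3 feasibility check is
replaced by a simple capacity test on an assumed per-cluster size.

```python
import math
import random


def merge_clusters(clusters, capacity, rng=None):
    """Greedily merge placed clusters around randomly chosen seeds.

    clusters : dict name -> (x, y, size), with (x, y) from global placement
    capacity : stand-in for "fits in one logic cell" (an assumption)
    """
    rng = rng or random.Random(0)
    remaining = dict(clusters)
    groups = []
    while remaining:
        seed = rng.choice(sorted(remaining))
        sx, sy, used = remaining.pop(seed)
        group = [seed]
        # Keep absorbing the nearest cluster while the merged cluster has room.
        while remaining and capacity > used:
            near = min(
                remaining,
                key=lambda c: math.hypot(remaining[c][0] - sx, remaining[c][1] - sy),
            )
            if used + remaining[near][2] > capacity:
                break  # simplified: stop when the nearest neighbor does not fit
            used += remaining.pop(near)[2]
            group.append(near)
        groups.append(group)
    return groups


example = {"a": (0, 0, 2), "b": (1, 0, 2), "c": (10, 10, 3), "d": (11, 10, 1)}
groups = merge_clusters(example, capacity=4)
```

For this small input every seed choice pairs a with b and c with d, because
each pair is adjacent in the placement and exactly fills the assumed capacity.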
5.6 EXPERIMENT RESULTS
We have selected 18 large combinational circuits from the MCNC91 benchmark.
SIS [59] reads the circuits in blif format. Results of the timing-driven
clustering algorithm, called Packer-pASIC3-timing, are provided in Table 5.1.
QuickWorks is set to minimize delay during logic optimization, and its
placement and routing are set to the overnight mode for the best result. The
maximum depth of a circuit in the table is the number of pASIC3 logic cells on
the longest input-output path. The comparison of the numbers of primitive cells
before and after slack-time (ST) relaxation shows that our proposed method
effectively avoids logic replication. The number of clusters for each circuit
was reduced after the merging phase. In this work, we do not specify a
threshold on the distance between placed clusters, so some far-away clusters
might have been merged together. However, we can control the results by
changing the threshold. As a result, QuickWorks used many more pASIC3 logic
cells than in the results from the area-driven clustering. Because of the logic
replication, our algorithm on average uses more pASIC3 logic cells after
merging. Compared to QuickWorks, PackGen-delay reduced the maximum depth of the
circuit by 44.75% on average with a 12.05% area overhead. Table 5.2 shows the
results of the slack-time relaxation. The slack-time relaxation required only
12.08% logic replication, while 38.74% logic replication was needed without it.
Table 5.1: Results of timing-driven clustering

                         QuickWorks4.1          Packer-pASIC3-timing     Improvement (%)
Circuit   Fragment    pASIC3       Max  Clusters  Clusters       Max    pASIC3       Max
             cells     logic     depth    before     after     depth     logic     depth
                       cells             merging   merging               cells
i9             381        97        10       262       148         5    -52.58     50.00
rot            404       111        15       199        88         7     20.72     53.33
i8             771       201         9       449       210         5     -4.48     44.44
pair           900       238        13       478       231        10      2.94     23.08
vda            546       139        10       163       111         5     20.14     50.00
x1             193        50         6        88        56         3    -12.00     50.00
C6288         1872       771        82       829       496        55     35.67     32.93
C5315          992       430        18       448       284        13     33.95     27.78
alu4           521       131        25       222       170        12    -29.77     52.00
apex6          361       112         9       230        91         6     18.75     33.33
C880           219        55        18       123        65        14    -18.18     22.22
C3540          710       179        25       392       198        15    -10.61     40.00
alu2           281        71        26       128        90        10    -26.76     61.54
C1355          223        57        14       138        88         7    -54.39     50.00
C1908          223        56        25       119        75         9    -33.93     64.00
C432           127        35        25        75        46        10    -31.43     60.00
C499           200        54        13       138        88         7    -62.96     46.15
Average improvement                                                     -12.05     44.75

Clusters are pASIC3 logic cells.
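The improvement columns of Table 5.1 can be reproduced with the short
calculation below. The formula is not stated explicitly in the text; it is
inferred here from the table rows (QuickWorks value minus our value after
merging, relative to the QuickWorks value), so it should be read as an
assumption. The function name improvement is hypothetical.

```python
def improvement(baseline, ours):
    """Relative improvement over the baseline in percent (positive is better)."""
    return round(100.0 * (baseline - ours) / baseline, 2)


# Row i9: 97 logic cells and depth 10 for QuickWorks, 148 cells and depth 5 for ours.
i9_cells = improvement(97, 148)
i9_depth = improvement(10, 5)
# Row rot: 111 logic cells for QuickWorks, 88 cells for ours after merging.
rot_cells = improvement(111, 88)
```

A negative value therefore denotes an area overhead relative to QuickWorks,
which is how the -12.05 average corresponds to the 12.05% overhead quoted in
the text.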
Table 5.2: Results of slack-time relaxation (Packer-pASIC3-timing)

All cell counts are numbers of primitive cells.

                     Cells after duplication   Replication ratio (%)
Circuit  Cells after   Before ST    After ST   Before ST    After ST
        tech mapping  relaxation  relaxation  relaxation  relaxation
i9               384         626         398       63.02        3.65
rot              338         429         342       26.92        1.18
i8               607        1007         715       65.90       17.79
pair             818        1052         822       28.61        0.49
vda              304         383         333       25.99        9.54
x1               170         193         171       13.53        0.59
C6288           1506        1620        1525        7.57        1.26
C5315            671         774         692       15.35        3.13
alu4             438         570         473       30.14        7.99
apex6            330         489         334       48.18        1.21
C880             178         230         180       29.21        1.12
C3540            654         993         693       51.83        5.96
alu2             224         320         233       42.86        4.02
C1355            210         312         296       48.57       40.95
C1908            219         280         262       27.85       19.63
C432             109         201         159       84.40       45.87
C499             210         312         296       48.57       40.95
Average                                            38.74       12.08
5.7 SUMMARY
In this research, we developed timing-driven clustering algorithms for
coarse-grained, antifuse-based FPGAs. We proposed a labeling algorithm that
generates the minimum number of clusters on the critical path. Slack-time
relaxation was used to avoid redundant logic duplication without violating the
timing constraint. In addition, random merging was used to combine closely
placed, partially filled clusters. Experimental results showed that the
timing-driven clustering algorithm reduced the maximum depth by 44.75% on
average with a 12.05% area overhead.
CHAPTER 6
LOW-POWER CLUSTERING WITH MINIMUM LOGIC REPLICATION
This chapter presents a minimum area, low-power driven clustering algorithm for
coarse-grained, antifuse-based FPGAs under delay constraints. The algorithm
accurately predicts logic replication caused by a timing constraint, during low-
power driven clustering. This technique reduces the size of duplicated logic
substantially, resulting in benefits in area, delay, and power dissipation.
6.1 INTRODUCTION
As time to market becomes a key to success in the multimedia industry, power
dissipation becomes an important factor for FPGAs. Power dissipation in
SRAM-based FPGAs has been analyzed thoroughly in [44][60]. There have been
three approaches to low power in FPGAs. First, the authors in [31] proposed a
new SRAM-based FPGA architecture. They looked at the architectural optimization
process to evaluate the trade-offs between the flexibility of the architecture
and the effect on the performance metrics. Different circuit techniques were
also introduced to reduce the performance overhead of some of the dominant
components. Power dissipation for configuration was also analyzed. Second,
various logic synthesis and technology mapping techniques have been proposed
[61][41][16][4]. In [61], an efficient incremental network flow computation
method was proposed to compute and select low-power K-feasible cuts. The
authors in [16] proposed a logic transformation algorithm to reduce switching
densities of the output edges of programmable logic blocks. Third, a clustering
technique has been introduced for low power [61]. By reducing the number of
external wires, the authors achieved low-power clustering. In their approach,
low power is a byproduct of the congestion-driven clustering algorithm; they
did not use any knowledge of the expected switching activities in the circuits.
The authors in [67] addressed the leakage power breakdown by circuit type for
SRAM-based FPGAs: configuration SRAM cells, interconnect multiplexers, and LUTs
consume 88% of the total leakage power.
In addition, leakage power becomes significant as technology scales down. In
developing the configuration of logic cells, leakage power dissipation must be
considered. The leakage current in modern chips is going up at a frightening
rate [44], and the solutions to reduce it are not simple. A 1.6nm gate can leak
10nA and, with over 100 million transistors in an FPGA, that leads to amps of
leakage current. Clock gating can cut the dynamic power by not switching
portions of the circuit that are not in use. However, cutting static leakage
requires the supply to be removed from the circuits. One technique is to add
sleep transistors, which gate the supply. However, this cannot easily be done
with SRAM-based FPGAs, as the program would be lost. In [44], the authors
reported that the average leakage power percentage could be up to 59% when the
LUT size is large. Therefore, we believe that leakage power reduction is
critical for future power-efficient FPGA architectures.
In a typical flow of FPGA CAD tools, clustering, which follows the technology
mapping step, is an important optimization because it maps the target circuit
netlist into an FPGA array. Clustering, therefore, refers to the task of
grouping logic gates in the circuit netlist and assigning each group to a
configurable logic block in the FPGA array (in the case of our target
architecture, this means packing gates into pASIC3 cells). Logic replication,
which is often needed to meet the timing constraints, is an indispensable part
of the clustering step. Logic replication directly affects the area and power
dissipation of the FPGA synthesis solution. The increase is especially
significant for leakage power, since leakage is a direct function of the size
of the logic circuit implementation. The authors in [21] report that 33% logic
replication is observed as a result of timing-driven clustering in SRAM-based
FPGAs.
In this research, we present a low-power driven clustering algorithm with
minimal logic replication for coarse-grained, antifuse-based FPGAs. As stated
earlier, in the context of our problem, a cluster refers to a group of circuit
nodes that can fit in a pASIC3 logic cell. We use a dynamic programming-based
clustering technique where, starting from the circuit inputs and moving toward
the circuit outputs, we incrementally generate a set of power-delay curves for
all nodes in the network [66][68]. Each such curve, stored at some intermediate
node, describes the set of non-inferior clustering solutions for the subgraph
rooted at that node. A critical factor in determining the quality of the
clustering solution for a circuit is how accurate and complete the set of
power-delay curves is; this strongly depends on the accuracy of the incremental
cost (i.e., power dissipation) calculation at each node in the network. We have
seen that existing heuristics for this cost calculation, which divide the cost
of a multiple-fanout node equally among its fanout nodes [66][15], can result
in significant computational errors, thereby degrading the quality of the
overall solution. In this chapter, we present a new heuristic approach for cost
propagation across multiple-fanout nodes in a Boolean network that allocates
the cost of the logic cone rooted at a multiple-fanout node to its fanout nodes
proportionately. More precisely, we determine the cost allocation to each
fanout node by traversing backward in the circuit to compute the amount of
logic replication.
This chapter is organized as follows: In section 6.2, the flow of our
clustering algorithm and problematic logic duplication are introduced. We
present the low-power clustering algorithm in section 6.3. In section 6.4, the
cluster selection algorithm is discussed. Implementation and experimental
results are provided in section 6.5. We conclude in section 6.6.
Figure 6.1: An example of redundant logic replication in clustering: (a)
clusters and the corresponding area-delay points, (b) non-inferior clusters, (c)
circuit after logic replication (i.e., n1, n2, and n3 are duplicated), and (d) a
desired clustering solution.
6.2 DESIGN FLOW AND PROBLEM DESCRIPTION
A cluster i, denoted by CL_i, is defined as a group of circuit nodes that can
be realized in a single pASIC3 logic cell without any resource conflicts. The
set of nodes that drive nodes in cluster CL_i is referred to as its leaf set
and denoted by Λ_i.
The clustering algorithm is comprised of two steps: cluster generation and cluster
selection. During the cluster generation, clusters rooted at nodes in the network are
generated and power-delay curves are computed in a postorder traversal of the
network starting from primary inputs going toward the primary outputs. For cluster
selection, clusters are determined during a preorder traversal from primary outputs
back toward the primary inputs. The design flow can be described as follows:
1. Select the logic cone rooted at a primary output, which has the largest
number of un-clustered nodes.
2. Traverse the cone in postorder to create power-delay (PD) curves.
3. Select a power-delay point from the PD curve of a primary output and form
a cluster based on the point.
4. Select power-delay points from PD curves at leaf nodes of the previous
cluster and do this in preorder until all nodes in the logic cone are clustered.
5. Go to step 1 if any logic cone is not clustered yet.
A clustering solution at node u is characterized by a power-delay point
(PD-point), which is a pair {p_u, d_u}, where d_u gives the delay value (i.e.,
latest signal arrival time) associated with the PD-point, and p_u gives the
corresponding power dissipation of the clustering solution rooted at node u.
Consider intermediate nodes n_i and n_j in a Boolean network (a circuit netlist
with signal direction specified) where there exists a common multiple-fanout
node, n_k, in their transitive fanin cones. In typical timing-driven
clustering, to minimize the arrival time to n_i and/or n_j, logic replication
of logic under n_k may become necessary. An example of this scenario is shown
in Figure 6.1(a) where, when finding clustering solutions at nodes n_4 or n_5,
it may become necessary to replicate n_1, n_2 and n_3. Assume that there are
two possible clustering solutions, CL_2 and CL_3 (CL_4 and CL_5), at n_4 (n_5).
(In this example, because the area cost is easier to depict pictorially, the
area cost is used in place of the power cost.) There is also a single
clustering solution CL_1 at node n_3.
The area of each cluster is 1 whereas the delay depends on the topology of the
logic mapped to the cluster. We calculate the AD curve of n_4 as follows. For
the clustering solution CL_3, the AD value is (1,0.8) whereas for CL_2, the
area value is 1+1/2=1.5 and the delay is 1. The area cost calculation is done
in this way because the cost of cluster CL_1 is divided equally between its two
fanout nodes. This generates a new AD value of (1.5,1). Notice however that
(1.5,1) is inferior to (1,0.8), and therefore, it will be dropped, resulting in
the AD curve of {(1,0.8)} for n_4.
(A point (p', d') is inferior to (p, d) if (p' ≥ p and d' > d) or (p' > p and
d' ≥ d).) Similarly, the AD curve of n_5 will be pruned to {(1,0.9)}. However,
by dropping the two inferior points from the AD curves of n_4 and n_5, we force
a logic clustering solution whereby three nodes (n_1, n_2 and n_3) must be
replicated, as shown in Figure 6.1(b) and (c). The overall area cost of this
clustering solution is 2 and the worst-case delay is 1.1. Suppose that the
required time at nodes n_4 and n_5 is 2. Now, in fact, there is a better
solution whereby CL_2 is chosen at n_4 and CL_5 at n_5 (cf. Figure 6.1(d)). The
area cost of this solution is 2, while its worst-case delay cost is 1.1.
However, there is no logic duplication, which means that the utilization of one
of the pASIC3 logic cells in the latter solution is much lower, thereby
potentially allowing the future packing of extra logic into that pASIC3 cell.
The reason that the area cost of the solution given in part (d) is 2 is that
CL_5 can be treated as a multiple-output Boolean function providing both the
signal that goes out of n_5 and the signal that goes out of n_3 and feeds into
cluster CL_2. Therefore, there is no need to replicate n_1, n_2 and n_3 to
separately generate the signal from n_3 into CL_2, as would have been the case
if the cluster was treated as a single-output Boolean function.
We have identified the aforesaid problem as a key reason behind the significant
increase in the logic replication cost of a mapping solution to pASIC3 arrays.
Therefore, in the remainder of this chapter, we focus on developing a heuristic
solution to calculate the replication cost across multiple-fanout nodes of the
circuit during the post-order traversal.
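The pruning of inferior points used in the Figure 6.1 example can be sketched
as follows. The inferiority test is the one defined in the text; the function
names is_inferior and prune and the list-of-tuples representation are
assumptions made for this illustration.

```python
def is_inferior(q, p):
    """True if point q = (cost, delay) is inferior to point p."""
    return (q[0] >= p[0] and q[1] > p[1]) or (q[0] > p[0] and q[1] >= p[1])


def prune(points):
    """Keep only the non-inferior points of a cost-delay curve."""
    return [q for q in points if not any(is_inferior(q, p) for p in points)]


# AD points computed at n_4 in the example: CL_3 gives (1, 0.8), CL_2 gives (1.5, 1).
ad_curve_n4 = prune([(1.0, 0.8), (1.5, 1.0)])
```

With equal division of the shared cost, (1.5, 1) is dropped at n_4, which is
exactly the step that later forces the replication of n_1, n_2 and n_3.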
6.3 LOW POWER CLUSTERING
In this section, we present a clustering procedure with the accurate calculation of
logic replication cost during the forward traversal of the Boolean netlist.
6.3.1 CLUSTER GENERATION AND POWER-DELAY
CURVES
A technology-mapped network consists of primitive cells. In the cluster
generation phase, we traverse the network in postorder from the primary inputs
to the primary outputs. This ordering ensures that when a node is processed,
all of its fanin nodes have already been processed. When constructing the PD
curves for some node, n, we first invoke a matching algorithm described in [34]
to enumerate all possible cluster matches at that node. For each cluster match,
we then calculate its PD value as follows. The (dynamic programming) power
value of the cluster is the summation of the (dynamic programming) power values
of all its inputs plus the power cost of the cluster itself. Similarly, the
(dynamic programming) delay value of the cluster match is the maximum of the
(dynamic programming) delay values of its inputs plus the delay through the
cluster itself.
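The PD value of a single cluster match can be sketched as follows. This is a
simplified illustration that assumes one PD point has already been chosen for
each input of the cluster; the function name cluster_pd and the numeric values
are hypothetical.

```python
def cluster_pd(input_points, cluster_power, cluster_delay):
    """Combine the PD points of the cluster inputs into the PD value of a match.

    input_points : list of (power, delay) pairs, one per cluster input
    """
    # Power accumulates over all inputs; delay follows the slowest input.
    power = sum(p for p, _ in input_points) + cluster_power
    delay = max((d for _, d in input_points), default=0.0) + cluster_delay
    return power, delay


# Two inputs with PD points (2.0, 0.25) and (1.0, 0.5); illustrative values only.
pd_value = cluster_pd([(2.0, 0.25), (1.0, 0.5)], cluster_power=0.5, cluster_delay=0.25)
```

Repeating this for every combination of input points and every cluster match,
and then discarding inferior points, would yield the PD curve of the node.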
Figure 6.2 illustrates the PD curve generation at node n_5 with a cluster CL_n.
PD curves of leaf set nodes n_1, n_2, and n_4 have already been computed. The
PD curve for CL_n matched at node n_5 is created by PD curves from the leaf
nodes. In the conventional calculation method of [66][15], the power
dissipation at node n for cluster match CL_n is calculated as:

\[
P_{dyn}(n, CL_n) = \frac{1}{2} \times V_{dd}^2 \times f \times
  \left( \sum_{u \in \text{nodes in } CL_n} C_{fo}(u) \, sw_u \right)
  + \sum_{n_i \in \text{inputs}(CL_n)} \frac{P_{dyn}(n_i, CL_{n_i})}{fanout(n_i)}
\tag{6.1}
\]
where C_fo(u) is the capacitance driven by node u, sw_u is the transition
probability of node u, and fanout(n_i) is the number of fanouts that node n_i
drives. The arrival time at node n_5 with CL_n is simply the maximum arrival
time among arrival times from the leaf nodes plus the delay through the
cluster. (For the pASIC3 mapping problem, there is no "unknown load problem"
[15], which often complicates the calculation of the dynamic programming power
and delay values in ASIC design flows. This is because the load ahead of a node
during the post-order traversal is always the load imposed by another pASIC3
logic cell. Notice that the input pin capacitances of all inputs to a pASIC3
logic cell are the same.)
6.3.2 CORRECT ACCOUNTING OF LOGIC REPLICATION
Logic replication may be needed to meet a timing constraint at a node. It
occurs when a selected cluster, rooted at the node, covers nodes that have
already been covered by another cluster. Logic replication potentially occurs
on the boundary of logic cones associated with primary outputs.
Figure 6.2: PD curve generation for a node with a cluster.
We propose an algorithm that estimates the cost of logic duplication by
simulating the clustering procedure for each PD point during the postorder
traversal. The algorithm assumes that the delay of each PD point at a node is
close enough to the required time at the node. Notice that, given a required
time at a node, the best PD point is the one with the largest delay that is
less than or equal to the required time and with the smallest cost. Therefore,
being selected as the best PD point means that the required time at that node
is very close to the delay in the PD point, and we use the delay in the PD
point as the required time at the node.
Under this assumption, and noting that the maximum path delay from a fanin in a
cluster is the largest delay from the fanin node to the root node of the
cluster, we can calculate the required times of the fanin nodes of a cluster by
subtracting the maximum path delays from the required time at the root node. If
the required time of a fanin node that has already been covered by clusters is
equal to or larger than the arrival time of the fanin, there is no logic
duplication on the logic cone boundary. This leads to zero cost toward the
transitive fanin across the boundary, whereas the typical approach divides the
cost by the fanout count of the fanin node. If logic duplication is mandatory
to meet the timing constraint, we only add the cost caused by the duplicated
logic. Notice that the duplication operation can proceed toward the primary
inputs until no timing violation occurs. Let us assume that logic cone PO_0 has
been covered by clusters, and node d has a PD point with nodes b and e as fanin
nodes. Figure 6.3(a) depicts the case in which no duplication is necessary.
Since our approach does not account for the cost of unduplicated nodes, the
cost toward the transitive fanin of node b is zero. Therefore, we simply add
the cost at node e and the cost of node d to the total cost at node d. On the
other hand, if duplication is required as shown in Figure 6.3(b), the cost
caused by the duplicated nodes is added to the total cost.
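The cost rule for nodes b, d and e can be sketched as follows. This is a hedged
illustration only: the function name replication_aware_cost, the dictionary
fields, and the numeric values are hypothetical, and the cost of a fanin that
must be duplicated is assumed to be supplied by the caller.

```python
def replication_aware_cost(own_cost, fanins):
    """Total cost at a node when already clustered fanins may be reused.

    fanins : list of dicts with keys
             clustered (bool), required (float), arrival (float), cost (float)
    """
    total = own_cost
    for f in fanins:
        # A clustered fanin that meets its required time is reused at zero cost.
        reusable = f["clustered"] and f["required"] >= f["arrival"]
        if not reusable:
            total += f["cost"]
    return total


# Node b is already clustered, node e is not; values are illustrative only.
b_ok = {"clustered": True, "required": 1.5, "arrival": 1.0, "cost": 4.0}
b_late = {"clustered": True, "required": 0.5, "arrival": 1.0, "cost": 4.0}
e = {"clustered": False, "required": 1.5, "arrival": 0.0, "cost": 2.0}
cost_without_duplication = replication_aware_cost(1.0, [b_ok, e])
cost_with_duplication = replication_aware_cost(1.0, [b_late, e])
```

The first call corresponds to the situation of Figure 6.3(a), where only the
costs of nodes d and e are counted, and the second to Figure 6.3(b), where the
cost of the duplicated logic under b is added.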
An example in Figure 6.3(c) illustrates this notion in detail. An un-clustered
logic cone, Φ(PO_i), is defined as the set of un-clustered nodes in the
transitive fanin cone of primary output PO_i. For the moment, our focus will
only be on the solid closed curves, ignoring the dashed ones.
In Figure 6.3(c), Φ(PO_0) is clustered first. In our proposed heuristic
accounting of the logic replication cost, during the postorder traversal of the
circuit graph, when calculating the dynamic programming (DP) power cost of CL_5
at n_5, we divide the DP cost of CL_1 at n_3 by its fanout count inside the
logic cone (which is two) and add to this quantity the power cost of CL_5. Note
that when we calculate the DP cost of CL_2 at n_8, we would account for the
cost of CL_1 exactly once (1/2 contribution coming from the n_3→n_5 branch, the
other coming from the n_3→n_4 branch). (By reducing the power dissipation
contribution of CL_1, we tend to favor an overall clustering solution in which
multiple fanout nodes are preserved after mapping, which reduces logic
replication and improves the final mapped power dissipation as was done in
[66].)
Suppose that after preorder traversal of logic cone Φ(PO_0), we select a
clustering solution in which CL_1 is matched at n_3 while CL_2 is matched at
n_8. Next, we start clustering logic cone Φ(PO_1). Consider generating the PD
curve at node n_6 (having first processed node n_11, creating a cluster match
of CL_6 at that node). For cluster CL_4 matched at node n_6, we need to compute
the DP power cost of its fanin nodes n_3 and n_11. At n_3 (n_11), we have the
PD curve of all possible clustering solutions rooted there. However, we do not
know what specific clustering solution for the cone rooted at n_3 will be used
for each PD point at n_6. This is the key difficulty in the estimation of logic
replication cost. Consider two extreme cases where in one case, the CL_1 match
at n_3 is used as the best solution for Φ(PO_1), resulting in no logic
duplication; in the other case all of the cone under n_3 is replicated since no
common signals exist between the best matching solution of this subcone under
Φ(PO_0) and Φ(PO_1). The way we solve this problem is to calculate the PD curve
of CL_4 matched at node n_6 by completely ignoring the fact that cone Φ(PO_0)
has already
been processed and a mapping solution has been obtained. Suppose a PD curve of
X={x_1,x_2,…,x_m} at node n_6 is generated in this way, where x_i=(p_i,d_i).
Take any point, say x_i, corresponding to a clustering solution with CL_4
matching at n_6. We go ahead and calculate the required time at the output of
n_3 as d_i − delay(CL_4). If this required time is larger than the arrival time
at n_3 coming from the synthesis solution for Φ(PO_0), then for the calculation
of the DP power cost of cluster CL_4 at n_6, the DP power cost of the subcone
rooted at n_3 is set to zero. Otherwise (i.e., a timing violation will occur if
the solution generated for Φ(PO_0) is used), we find the optimum clustering
solution of the logic subcone rooted at n_3 and use the power cost of this
solution toward the calculation of the power cost of cluster CL_4 at n_6. The
dashed enclosed curves show a case in which the subcone rooted at n_3 must be
resynthesized in order to meet a timing requirement at PO_1. Figure 6.3(b)
shows the replicated logic netlist. Notice that in the case of logic
duplication, the duplicated copy of n_3 needs to drive only node n_6;
therefore, the PD curve at node n_3 must be updated to reflect this change in
load. The arrival time of CL_4 becomes the maximum value among arrival times of
different input paths. Arrival times through duplicated nodes can be calculated
based on the arrival times of clustered nodes.
Accounting for logic replication, the total power dissipation at node n with
cluster CL_n can be extended from equation (6.1) and is given by:
Figure 6.3: Example of logic replication prediction.
\[
P_{dyn}(n, CL_n, \Phi_n) = \frac{1}{2} \times V_{dd}^2 \times f \times
  \left( \sum_{u \in \text{nodes in } CL_n} C_{fo}(u) \, sw_u \right)
  + \sum_{n_i \in \text{inputs}(n)}
    \frac{P_{dyn}(n_i, CL_{n_i}, \Phi_n)}{fanout(n_i, \Phi_n)}
\tag{6.2}
\]

where Φ_n is a logic cone to which node n belongs, C_fo(u) is the capacitance
driven by node u, sw_u is the transition probability of node u, and
fanout(n, Φ_n) is the number of fanouts of node n inside Φ_n.
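The structure of the power calculation above can be sketched numerically as
follows. This is an illustration under stated assumptions, not the tool's
implementation: the function name dyn_power, the argument layout, and all
numeric values are hypothetical, and the DP power of each input is assumed to
be known already.

```python
def dyn_power(vdd, freq, nodes, inputs):
    """DP dynamic power of a cluster match, following the form of the equation.

    nodes  : list of (C_fo, sw) pairs for the nodes inside the cluster
    inputs : list of (P_dyn of the input, fanout count of the input inside the cone)
    """
    # Switching power of the nodes covered by the cluster itself.
    switching = 0.5 * vdd ** 2 * freq * sum(c * sw for c, sw in nodes)
    # Each input contributes its DP power divided by its fanout count in the cone.
    shared = sum(p / fo for p, fo in inputs)
    return switching + shared


# Illustrative values in arbitrary units.
power = dyn_power(
    vdd=1.0,
    freq=2.0,
    nodes=[(1.0, 0.5), (2.0, 0.25)],
    inputs=[(4.0, 2), (3.0, 1)],
)
```

Counting fanouts only inside the current logic cone is what distinguishes this
calculation from the conventional one, which divides by the total fanout count.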
Algorithm predict_logic_duplication(node, Network)
    PD_curve = read_pd_curve(node)
    for each point p in PD_curve
        Λ = leaf_set_of(p)
        if any node in Λ is not clustered
            continue
        end if
        r = p.delay
        compute_load_cluster(p.cluster)
        compute_required_time(Λ, r)
        for each node u in Λ
            cycle_time = u.required_time
            predict_cluster_selection(u, p.cluster, cycle_time)
        end for
    end for

Algorithm predict_cluster_selection(node, ParentCluster, cycle_time)
    if node is a primary input or node is not clustered
        return
    end if
    update_arrival_time(node)
    if node.arrival ≤ cycle_time
        return
    end if
    update_pd_curve(node)
    p = get_best_point(node.PD_curve)
    Λ = leaf_set_of(p)
    compute_load_cluster(p.cluster)
    compute_required_time(Λ, cycle_time)
    for each node u in Λ
        cycle_time = u.required_time
        predict_cluster_selection(u, p.cluster, cycle_time)
    end for
Figure 6.4: Prediction of logic replication.
Figure 6.4 gives pseudo code that accounts for the effect of logic replication.
When a node has to select a cluster, the function predict_logic_duplication is
executed. It first checks to see if the replication is necessary by checking if
any node in a leaf set has been clustered. In order to compute the required
times for nodes in the leaf set, the capacitance of a node is computed as if
the node has been clustered. The required times for nodes in the leaf set are
computed and passed on to the next level of logic replication prediction, in
the recursive function predict_cluster_selection.
Figure 6.5: Logic replication cases: (a) child node is replicated, and (b) root
node is replicated.
6.4 CLUSTER SELECTION
After the PD curves for all nodes in the transitive fanin cone of a primary output
are computed during the postorder traversal of the circuit, a suitable point on the
PD curve of the root node is chosen, given the required time at the root of the logic
cone. The cluster for the point at the root is identified and the required times for its
inputs are computed. The preorder traversal resumes at its child nodes, to satisfy
the new required time while minimizing the power dissipation. Our approach is
similar to the PDMAP presented in [66]. Logic replication occurs during the
preorder traversal.
Algorithm logic_duplication(Cluster, ParentCluster, Network)
    V = reverse_order_topological_sort(Cluster)
    for each node u in V
        if node u has been clustered
            u’ = duplicate_node(u)
            new_pd_curve = duplicate_pd_curve(u)
            for each fanout fo of u
                if fo is a node in Cluster or fo is in ParentCluster
                    add_fanout_to_new_node(u’, fo)
                    add_new_node_to_fanout_as_fanin(fo, u’)
                end if
            end for
            add_new_node_into_network(u’, Network)
            update_pd_curves(u, u’)
        end if
    end for
Figure 6.6: Logic replication for cluster selection.
The logic replication procedure starts by traversing from the root node of a
cluster. A duplicated node for a cluster is either a child node or a root node
in the cluster. Two cases must be considered to make the final network correct.
Figure 6.5(a) shows that the duplicated node n_1' is a child node in cluster
CL_2. In this case, we can duplicate node n_1 and the new node n_1' drives all
fanouts located in cluster CL_2. On the other hand, Figure 6.5(b) shows the
case where the root node of cluster CL_2 must be duplicated. In this case,
cluster CL_3 is a parent cluster of CL_2 because it was just selected before
CL_2. Fanouts of node n_3 in CL_3 must be driven by the duplicated root node
n_3'. Once a node is duplicated, PD curves for both the original node and the
duplicated node must be updated since output loads of both nodes change.
Because of the delay change due to logic replication, all PD curves of nodes in
transitive fanout cones rooted at input nodes driving duplicated nodes must be
updated. Figure 6.6 shows the pseudo code of the logic replication algorithm.
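The rewiring performed when a single node is duplicated can be sketched as
follows. This is a simplified illustration of the graph update only, without PD
curve updates: the function name duplicate_node, the adjacency-list
representation, and the choice to disconnect the moved fanouts from the
original node are assumptions made here.

```python
def duplicate_node(fanins, fanouts, u, cluster, parent_cluster):
    """Duplicate node u and let the copy drive the fanouts inside the clusters.

    fanins, fanouts : dicts mapping each node name to a list of node names
    cluster, parent_cluster : sets of node names
    """
    dup = u + "'"
    # Fanouts located in the cluster or its parent cluster move to the copy.
    moved = [fo for fo in fanouts[u] if fo in cluster or fo in parent_cluster]
    fanouts[u] = [fo for fo in fanouts[u] if fo not in moved]
    fanouts[dup] = moved
    # The copy is driven by the same inputs as the original node.
    fanins[dup] = list(fanins[u])
    for src in fanins[dup]:
        fanouts[src].append(dup)
    for fo in moved:
        fanins[fo] = [dup if x == u else x for x in fanins[fo]]
    return dup


# Node n1 drives x (inside the cluster) and y (outside); values are illustrative.
fanins = {"a": [], "n1": ["a"], "x": ["n1"], "y": ["n1"]}
fanouts = {"a": ["n1"], "n1": ["x", "y"], "x": [], "y": []}
dup = duplicate_node(fanins, fanouts, "n1", cluster={"x"}, parent_cluster=set())
```

After the call, the original node and its copy each drive fewer fanouts, and
the driver a sees one more load, which is why the PD curves of both nodes and
of the downstream nodes have to be updated.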
6.5 IMPLEMENTATION AND EXPERIMENTAL
RESULTS
We have implemented the clustering algorithm based on SIS [59] and used a 90nm
CMOS technology process model to estimate delay and power dissipation
information of primitive cells. Large combinational circuits were selected from the
MCNC91 benchmark. We first ran low power technology mapping by using
PDMAP [66] and then applied our low power, minimal logic replication clustering
algorithm to the network.
In FPGAs, inter-cluster interconnect capacitance interconnect is not ignorable.
Thus, we use constant values representing those capacitances in the pASIC3 family
FPGAs. However, in this research we assumed that the capacitances of intra-cluster
interconnect is ignorable. The key limitation of the present work is that of assuming
a fixed capacitance of inter-cluster interconnections. Inter-cluster interconnect is a
major component in FPGAs and accurately estimating the capacitance is crucial in
97
Table 6.1: Low-power clustering results: Area and delay
Without replication prediction With replication prediction
Ckts Nodes Clusters Delay (ns) Nodes Clusters
Delay
(ns)
i9 432 188 0.57 440 195 0.55
rot 514 189 0.46 392 175 0.43
i8 721 336 0.62 620 319 0.62
pair 1135 422 0.64 914 398 0.58
vda 563 206 0.36 408 161 0.34
x1 229 87 0.14 170 75 0.15
C5315 1016 391 0.47 883 378 0.46
alu4 672 224 0.68 545 224 0.68
apex6 546 170 0.27 419 174 0.29
C880 317 100 0.66 262 112 0.67
C3540 1085 357 0.88 858 346 0.85
alu2 343 112 0.48 276 108 0.47
C1355 306 109 0.33 271 96 0.32
C1908 301 108 0.50 241 87 0.49
C499 306 109 0.33 271 96 0.32
1 1 1 0.82 0.95 0.98
the early stages of the design, in order to increase the efficacy of this clustering
algorithm.
Table 6.1 and Table 6.2 show the experimental results. On average, our approach
reduced the total number of nodes by 18%, resulting in savings in area and
power dissipation without any sacrifice of speed. The run time increases because
the logic replication predictor is invoked repeatedly for the same node with the
same required time.

Table 6.2: Low-power clustering results: Power and CPU time

         Without replication prediction         With replication prediction
         Dynamic      Leakage      CPU          Dynamic      Leakage      CPU
Ckts     power (μW)   power (μW)   time (s)     power (μW)   power (μW)   time (s)
i9              317          362         11            314          284         17
rot             380          450          6            306          336         14
i8              492          610         18            425          521         27
pair            809          996         28            686          791         91
vda             217          514         14            149          368         16
x1              179          195          1            136          142          1
C5315           866          872          9            748          757         18
alu4            354          594         36            293          471        116
apex6           346          477          4            295          355          6
C880            226          280          6            201          230         26
C3540           692          954         49            577          735        205
alu2            212          295         13            172          234         10
C1355           229          266          3            174          237          6
C1908           203          263          3            159          210          7
C499            229          266          2            174          237          8
Ratio             1            1          1           0.84         0.80        2.8
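The final row of Tables 6.1 and 6.2 is not labeled in the text. The figures are consistent with it being the ratio of column totals (with replication prediction over without), although that reading is an inference from the data rather than something the text states. The sketch below recomputes those ratios from the transcribed rows; the names RESULTS, METRICS, and total_ratios are illustrative only.

```python
# Rows transcribed from Tables 6.1 and 6.2. Each entry is
# (circuit, without-prediction values, with-prediction values), and each
# value tuple is ordered as in METRICS below.
METRICS = ("nodes", "clusters", "delay_ns", "dynamic_uW", "leakage_uW", "cpu_s")

RESULTS = [
    ("i9",    (432, 188, 0.57, 317, 362, 11), (440, 195, 0.55, 314, 284, 17)),
    ("rot",   (514, 189, 0.46, 380, 450, 6),  (392, 175, 0.43, 306, 336, 14)),
    ("i8",    (721, 336, 0.62, 492, 610, 18), (620, 319, 0.62, 425, 521, 27)),
    ("pair",  (1135, 422, 0.64, 809, 996, 28), (914, 398, 0.58, 686, 791, 91)),
    ("vda",   (563, 206, 0.36, 217, 514, 14), (408, 161, 0.34, 149, 368, 16)),
    ("x1",    (229, 87, 0.14, 179, 195, 1),   (170, 75, 0.15, 136, 142, 1)),
    ("C5315", (1016, 391, 0.47, 866, 872, 9), (883, 378, 0.46, 748, 757, 18)),
    ("alu4",  (672, 224, 0.68, 354, 594, 36), (545, 224, 0.68, 293, 471, 116)),
    ("apex6", (546, 170, 0.27, 346, 477, 4),  (419, 174, 0.29, 295, 355, 6)),
    ("C880",  (317, 100, 0.66, 226, 280, 6),  (262, 112, 0.67, 201, 230, 26)),
    ("C3540", (1085, 357, 0.88, 692, 954, 49), (858, 346, 0.85, 577, 735, 205)),
    ("alu2",  (343, 112, 0.48, 212, 295, 13), (276, 108, 0.47, 172, 234, 10)),
    ("C1355", (306, 109, 0.33, 229, 266, 3),  (271, 96, 0.32, 174, 237, 6)),
    ("C1908", (301, 108, 0.50, 203, 263, 3),  (241, 87, 0.49, 159, 210, 7)),
    ("C499",  (306, 109, 0.33, 229, 266, 2),  (271, 96, 0.32, 174, 237, 8)),
]


def total_ratios(rows):
    """Return {metric: sum(with prediction) / sum(without prediction)}."""
    ratios = {}
    for i, name in enumerate(METRICS):
        without = sum(row[1][i] for row in rows)
        with_pred = sum(row[2][i] for row in rows)
        ratios[name] = with_pred / without
    return ratios


if __name__ == "__main__":
    for name, ratio in total_ratios(RESULTS).items():
        print(f"{name:12s} {ratio:.2f}")
```

Under this reading, the 18% node reduction quoted in the text corresponds to a node ratio of 0.82.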
6.6 SUMMARY
In this chapter, a minimal-area clustering algorithm for low power was proposed.
The proposed algorithm builds PD curves for nodes in a network by predicting the
amount of logic replication based on the timing constraint. The prediction
yields an accurate power dissipation value for each cost point on the curves.
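As a rough illustration of what a power-delay (PD) curve at a node can look like, the sketch below keeps only the non-dominated (delay, power) points and picks the lowest-power point that meets a required time. It is a generic trade-off curve under simplifying assumptions, not the dissertation's data structure: the function names and sample values are hypothetical, and the replication prediction itself is not modeled.

```python
def prune(points):
    """Keep only non-dominated (delay, power) points, sorted by delay.

    A point is dominated if another point is no slower and no more
    power-hungry. After sorting by delay, a point survives only if it
    strictly lowers the power of the last kept point.
    """
    curve = []
    for delay, power in sorted(points):
        if not curve or power < curve[-1][1]:
            curve.append((delay, power))
    return curve


def best_point(curve, required_time):
    """Return the lowest-power point with delay <= required_time, or None."""
    feasible = [p for p in curve if p[0] <= required_time]
    return min(feasible, key=lambda p: p[1]) if feasible else None


if __name__ == "__main__":
    # Hypothetical candidate solutions at one node: (delay in ns, power in uW).
    candidates = [(1.0, 9.0), (2.0, 5.0), (2.5, 6.0), (3.0, 4.0), (1.0, 10.0)]
    curve = prune(candidates)
    print(curve)
    print(best_point(curve, 2.6))
```

Pruning keeps the curve small, so a later pass that knows the required time at each node only has to scan a handful of points.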
Experimental results indicate that the proposed algorithm generates far less
duplicated logic, with less delay and power dissipation, than the traditional
cost distribution method. The algorithm achieved an 18% reduction in the total
number of nodes, which reduced dynamic and leakage power dissipation by 16% and
20%, respectively, without any sacrifice of delay.
CHAPTER 7
CONCLUSION AND FUTURE WORK
This chapter summarizes this dissertation and suggests future work on
coarse-grained, antifuse-based FPGAs.
7.1 DISSERTATION SUMMARY
As clusters have become larger and more complex, clustering algorithms that pack
multiple nodes into clusters have become a major step in improving area, speed,
and power dissipation. In this dissertation, we provided clustering techniques
targeting the pASIC3 family of FPGAs, whose architecture is coarse-grained and
based on antifuse technology.
In CHAPTER 3, a procedure for generating library cells from the pASIC3 logic
cell was presented. We divided the pASIC3 logic cell into four base gates by
applying 1 or 0 to the control inputs of internal multiplexers. Library cells
were personalized from each base gate by “sticking” and “bridging” operations on
the inputs of base gates. Since the number of library cells was not manageable,
we selected 886 cells out of 5279 cells.
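The personalization step can be illustrated with a small enumeration. The sketch below assumes that “sticking” means tying a gate input to constant 0 or 1 and that “bridging” means driving two inputs from the same signal; that reading, the 2:1 multiplexer used as a stand-in base gate, and all function names are assumptions for illustration, not the actual pASIC3 base gates or the procedure of CHAPTER 3.

```python
from itertools import product


def mux(s, a, b):
    """Hypothetical base gate: a 2:1 multiplexer (b when s is 1, else a)."""
    return b if s else a


def table_of(fn, n_signals):
    """Truth table of fn over n_signals binary inputs, as a tuple of 0/1."""
    return tuple(fn(*values) for values in product((0, 1), repeat=n_signals))


def personalize(gate, n_inputs):
    """Enumerate the distinct truth tables reachable from `gate`.

    Each gate input is tied either to a constant ("sticking") or to one of
    n_inputs external signals; two inputs that pick the same signal are
    "bridged".
    """
    sources = [0, 1] + [f"x{i}" for i in range(n_inputs)]
    tables = set()
    for choice in product(sources, repeat=n_inputs):
        table = []
        for values in product((0, 1), repeat=n_inputs):
            env = {f"x{i}": v for i, v in enumerate(values)}
            args = [c if isinstance(c, int) else env[c] for c in choice]
            table.append(gate(*args))
        tables.add(tuple(table))
    return tables


if __name__ == "__main__":
    cells = personalize(mux, 3)
    and2 = table_of(lambda x0, x1, x2: x0 & x1, 3)
    print(len(cells), and2 in cells)
```

Even this tiny gate yields many derived cells, which hints at why the full library had to be reduced to a manageable subset.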
In CHAPTER 4, we presented two algorithms to calculate the minimum number of
pASIC3 logic cells needed to cover a network. Since we knew the combinations of
base gates that fill a pASIC3 logic cell, a dynamic programming approach, which
is an extended version of the coin change problem, was utilized as a general
solution. In addition, we provided a special solution for the architecture by
setting up two linear equations, which fully exploit the architectural
characteristics of the pASIC3 logic cell. Clustering algorithms utilizing the
optimal solution were provided. An interconnect-aware clustering algorithm packs
nodes in order to minimize the number of inter-cluster interconnects. The
algorithm selects a seed node with the best connectivity and relies on a gain
function to evaluate nodes to be packed into a cluster. A timing slack-driven
clustering algorithm with the minimum area constraint was also described to
minimize the number of pASIC3 logic cells on the timing-critical paths of the
circuit. The presented algorithm prevents nodes on critical paths from being
packed into a cluster on non-critical paths.
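The dynamic programming idea mentioned above can be sketched as a multi-dimensional variant of coin change: the "amount" is a vector of base-gate counts, and each "coin" is one way to fill a single logic cell. The combinations below are invented for illustration and are not the real pASIC3 fill combinations; min_cells and COMBOS are hypothetical names.

```python
from functools import lru_cache


def min_cells(demand, combos):
    """Minimum number of cells needed to host `demand` base gates.

    demand: tuple with the number of gates required per base-gate type.
    combos: tuples of the same length; each is one way to fill one cell.
    Returns float('inf') if some gate type cannot be hosted at all.
    """
    usable = [c for c in combos if any(c)]

    @lru_cache(maxsize=None)
    def solve(remaining):
        if not any(remaining):
            return 0
        best = float("inf")
        for combo in usable:
            # A cell may be left partially unused, so clamp at zero.
            nxt = tuple(max(r - c, 0) for r, c in zip(remaining, combo))
            if nxt != remaining:  # skip cells that host nothing we still need
                best = min(best, 1 + solve(nxt))
        return best

    return solve(tuple(demand))


# Hypothetical fill combinations for four base-gate types (A, B, C, D).
COMBOS = [(1, 0, 0, 0), (0, 2, 0, 0), (0, 1, 1, 0), (0, 0, 2, 1)]

if __name__ == "__main__":
    print(min_cells((0, 3, 1, 0), COMBOS))
    print(min_cells((2, 3, 2, 1), COMBOS))
```

Counting cells one gate type at a time would need three cells for the first demand (two for the B gates, one for the C gate); the search finds two by letting one cell share a B and a C gate. The recursion depth grows with the total gate count, so this form is only suitable for small demands.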
In CHAPTER 5, we presented a timing-driven clustering algorithm, which
optimally clusters nodes in a network to minimize the number of clusters between
any input-output pair. The algorithm is based on Lawler’s labeling algorithm; it
is in fact an extended algorithm that takes the architecture of the pASIC3 logic
cell into consideration. Slack-time relaxation was used to minimize redundant
logic duplication during the clustering phase from the primary outputs.
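For readers unfamiliar with the labeling step, the following is a simplified sketch of a Lawler-style labeling pass. It assumes unit node weights and a cluster capacity counted purely in nodes; the pASIC3-specific extension and the slack-time relaxation described above are not reproduced, and the function name is hypothetical.

```python
def lawler_labels(fanins, capacity):
    """Label the nodes of a DAG in the style of Lawler's algorithm.

    fanins: dict mapping node -> list of direct predecessors; the dict's
            insertion order must be a topological order of the DAG.
    capacity: maximum number of nodes per cluster (unit weights assumed).

    A node's label counts the cluster boundaries crossed on the worst path
    from a primary input to that node.
    """
    labels = {}
    ancestors = {}
    for node, preds in fanins.items():
        anc = set(preds)
        for p in preds:
            anc |= ancestors[p]
        ancestors[node] = anc
        if not preds:
            labels[node] = 0  # primary input
            continue
        top = max(labels[p] for p in preds)
        same_level = [a for a in anc if labels[a] == top]
        # The node keeps label `top` only if it fits in one cluster together
        # with every ancestor that carries that label.
        labels[node] = top if len(same_level) + 1 <= capacity else top + 1
    return labels


if __name__ == "__main__":
    chain = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"], "e": ["d"]}
    print(lawler_labels(chain, capacity=2))
```

In the classic scheme, clusters are then formed from the primary outputs backward, grouping each root with its ancestors that share its label; that second phase is omitted here.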
Finally, in CHAPTER 6, we presented a low-power clustering algorithm, which
minimizes logic replication. The algorithm uses a power-delay curve to represent
both the power and the delay of a cluster rooted at a node. The replication cost
was predicted by simulating the replication procedure, under the assumption that
the delay at the node with the cluster is close to the possible required time.
Minimizing the logic replication resulted in reductions in area and power
dissipation without any sacrifice of speed.
This dissertation thus provided efficient clustering techniques for area, timing, and
low power.
7.2 FUTURE WORK
The pASIC3 logic cell has four outputs. We simplified the technology mapping by
dividing the pASIC3 logic cell into base gates, so that library cells could be
generated from those base gates. Dividing the logic cell forgoes some of the
cell's potential uses. There are two research issues in using the logic cell
without any simplification. First, technology mapping with multiple-output
library cells may significantly improve mapping results. Second, we will need an
efficient Boolean matching algorithm for multiple-output cells.
With such a technology mapper, we will be able to use a large portion of the
logic functions that the logic cell can realize. Because this approach also lets
us use the multiplexers in the pASIC3 logic cell, area, speed, and power
dissipation may improve significantly.
BIBLIOGRAPHY
[1] E. Ahmed, “The effect of logic block granularity on deep-submicron FPGA
performance and density,” Ph.D. Thesis, University of Toronto, 2001.
[2] E. Ahmed and J. Rose, “The effect of LUT and cluster size on deep-
submicron FPGA performance and density,” in Proc. FPGA, 2000, pp. 3 –
12.
[3] A. Aho and S. Johnson, “Optimal code generation for expression trees,”
Journal of ACM, vol. 23, no. 3, pp. 488 – 501, June 1976.
[4] J. H. Anderson and F. N. Najm, “Power-aware technology mapping for
LUT-based FPGAs,” in Proc. IEEE International Conference on Field-
Programmable Technology (FPT), 2002, pp. 211 – 218.
[5] APEX FPGA Datasheet, Altera Corporation (http://www.altera.com)
[6] L. Benini and G. De Micheli, “A survey of Boolean matching techniques
for library binding,” ACM Trans. on Design Automation of Electronic
Systems, vol. 2, no. 3, pp. 193 – 226, July 1997.
[7] V. Betz and J. Rose, “Cluster-based logic blocks for FPGAs: area-efficiency
vs. input sharing and size,” in Proc. Custom Integrated Circuits Conference,
1997, pp. 551 – 554.
[8] V. Betz and J. Rose, “VPR: a new packing, placement, and routing
tool for FPGA research,” Int’l Workshop on FPL, 1997, pp. 213 – 222.
[9] V. Betz, J. Rose, A. Marquardt, Architecture and CAD for Deep-Submicron
FPGAs, Kluwer Academic Publishers, 1999.
[10] E. Bozorgzadeh, S. Ogrenci-Memik, M. Sarrafzadeh, “Rpack: routability-
driven packing for cluster-based FPGA,” in Proc. Asia-South Pacific
Design Automation Conference, pp. 629 – 634, 2001.
[11] S. Brown and J. Rose, “Architecture of FPGAs and CPLDs: a tutorial,”
IEEE Design and Test of Computers, vol. 12, no. 2, pp. 42- 57, 1996.
[12] R. Bryant, “Graph-based algorithms for Boolean function manipulation,”
IEEE Transactions on Computers, C-35(8): 677 – 691, August 1986
[13] D. R. Chase, “An improvement to bottom-up tree pattern matching,”
Journal of the Association for Computing Machinery, pp. 168-177, 1987.
[14] K. Chaudhary and M. Pedram, “A near optimal algorithm for technology
mapping minimizing area under delay constraints,” in Proceedings of the
ACM/IEEE Design Automation Conference, 1992.
[15] K. Chaudhary and M. Pedram, “Computing the area versus delay trade-off
curves in technology mapping,” IEEE Trans. on Computer Aided Design,
Vol. 14, No. 12, 1995, pp. 1480-1489.
[16] C. Chen, T. Hwang, and C. L. Liu, “Low power FPGA design – a re-
engineering approach,” in Proc. Design Automation Conference, 1997, pp.
656 – 661.
[17] D. Chen, J. Cong, and Y. Fan, “Low-power high-level synthesis for FPGA
architectures,” in Proc. International Symposium on Low Power Electronics
and Design, 2003.
[18] D. Chen, J. Cong, F. Li, and L. He, “Low-power technology mapping for
FPGA architectures with dual supply voltages,” in FPGA, 2004.
[19] J. Cong and Y. Ding, “Combinational logic synthesis for SRAM-based
field-programmable gate arrays,” ACM Transactions on Des. Automat.
Electron. System, April, pp. 145 – 204, 1996.
[20] J. Cong, J. Peck, and Y. Ding, “RASP: a general logic synthesis system for
SRAM-based FPGAs,” in Proc. FPGA, pp. 137 – 143, 1996.
[21] J. Cong and M. Romesis, “Performance-driven multi-level clustering with
application to hierarchical FPGA mapping,” in Proc. Design Automation
Conference, 2001, pp. 389 - 394.
[22] J. Cong and Y. Ding, “FlowMap: An optimal technology mapping
algorithm for delay optimization in lookup-table based FPGA designs,”
IEEE Transactions on Computer-Aided Design, Feb. 1994, vol. 13, no. 1, pp.
1 – 12.
[23] J. Cong and Y. Hwang, “Simultaneous depth and area minimization in
SRAM-based FPGA mapping,” UCLA Computer Science Dept. Tech.
Report CSD-950001, Jan. 1995.
[24] J. Cong, “An interconnect-centric design flow for nanometer technologies,”
in Proc. Workshop on Synthesis and System Integration of Mixed
Technologies, 2001, pp. 199-205.
[25] J. Cong, and Y. Ding, “LUT-based FPGA technology mapping under
arbitrary net-delay models,” in Computer Graphics, vol. 18, no. 4, pp. 137 –
148, 1994.
[26] T. H. Cormen, C. E. Leiserson, and Ronald L. Rivest, Introduction to
Algorithms, The MIT Press, 2000.
[27] W. E. Donath, “Placement and average interconnect requirements of
computer logic,” IEEE Trans. Circuits and Systems, CAS-26:272-277, 1974.
[28] C. Ebeling, L. McMurchie, S. A. Hauck, and S. Burns, “Placement and
routing tools for the Triptych FPGA,” Transactions on VLSI, Dec. 1995, pp.
463 – 482.
[29] W. C. Elmore, “The transient response of damped linear networks with
particular regard to wide-band amplifiers,” Jour. Appl. Physics, vol. 19, no.
1, pp. 55-63, January 1948.
[30] S. Ercolani and G. De Micheli, “Technology mapping for electrically
programmable gate arrays,” in Proc. 28th ACM/IEEE Design Automation
Conference, 1991, pp. 234 – 239.
[31] V. George and J. M. Rabaey, Low-Energy FPGAs, Kluwer Academic
Publishers, 2001.
[32] V. George, H. Zhang, and J. Rabaey, “The design of a low energy
FPGA,” in Proc. International Symposium on Low Power Electronics and
Design, pp. 188 – 193, 1999.
[33] R. Hitchcock, G. Smith, and D. Cheng, “Timing analysis of computer-
hardware,” Technical report, IBM Journal of Research and Development,
Jan. 1983.
[34] C. M. Hoffman and M. J. O’Donnell, “Pattern matching in trees,” Journal of
the Association for Computing Machinery, pp. 68-95, January 1982.
[35] http://mathworld.wolfram.com.
[36] C-W. Kang and M. Pedram, “Clustering techniques for coarse-grained,
antifuse-based FPGAs,” in Proc. Asia and South Pacific Design Automation
Conference, 2005, pp. 785 - 790.
[37] C-W. Kang, A. Iranli, and M. Pedram, “Technology mapping and packing
for coarse-grained, antifuse-based FPGAs,” in Proc. Asia and South Pacific
Design Automation Conference, 2004, pp. 209 - 211.
[38] S. Kaptanoglu, Greg Bakker, Arun Kundu, and Ivan Corneillet, “A new
high density and very low cost reprogrammable FPGA architecture,” in
Proc. FPGA, 1999, pp. 3 – 12.
[39] K. Keutzer, “Dagon: technology binding and local optimisation by dag
matching,” in Proc. Design Automation Conference, 1987, pp. 341 – 347.
[40] S. Kirkpatrick, C. Gelatt and M. Vecchi, “Optimization by simulated
annealing,” Science, May 13, 1983, pp. 671 – 680.
[41] B. Kumthekar, L. Benini, E. Macii, and F. Somenzi, “Power optimization of
FPGA-based designs without rewiring,” IEE Proc. Comput. Digit. Tech., vol.
147, no. 3, May 2000, pp. 167 – 174.
[42] Y. Lai, S. Sastry, and M. Pedram, “Boolean matching using binary decision
diagrams with applications to logic synthesis and verification,” in Proc.
IEEE International Conference on Computer Design, 1992, pp. 42 – 458.
[43] E. L. Lawler, K. N. Levitt, J. Turner, “Module clustering to minimize delay
in digital networks,” IEEE Transactions on Computers, vol. C-18, no. 1,
January 1969, pp. 47 – 57.
[44] F. Li, Deming Chen, Lei He, and Jason Cong, “Architecture evaluation for
power-efficient FPGAs,” in Proc. FPGA, 2003, pp. 175-184.
[45] F. Li, Y. Lin, L. He, and J. Cong, “Low-power FPGA using pre-defined
dual-Vdd/dual-Vt fabrics,” in FPGA, 2004.
[46] H. Li, W. Mak, and S. Katkoori, “Power minimization algorithms for LUT-
based FPGA technology mapping,” ACM Transactions on Design
Automation of Electronic Systems, vol. 5, 2003, pp. 111-128.
[47] A. Marquardt, “Cluster-based architecture, timing-driven packing and
timing-driven placement for FPGAs,” Ph.D. Thesis, University of Toronto,
1999.
[48] A. Marquardt, V. Betz, and J. Rose, “Speed and area tradeoffs in cluster-
based FPGA architecture,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 8, no. 1, February 2000, pp. 84 – 93.
[49] A. Marquardt, V. Betz, and J. Rose, “Using cluster-based logic blocks and
timing-driven packing to improve FPGA speed and density,” in Proc. FPGA,
pp. 37- 46, 1999.
[50] R. Nair, “A simple yet effective technique for global wiring,” IEEE
Transactions on Computer-Aided Design, vol. 6, no. 6, March 1987, pp.
165 – 172.
[51] V. T. Paschos, “A survey of approximately optimal solutions to some
covering and packing problems,” ACM Computing Surveys, vol. 29, no. 2,
June 1997.
[52] pASIC3 FPGA Family Datasheet, QuickLogic Corporation
(http://www.quicklogic.com).
[53] M. Pedram and B. Preas, “Interconnection length analysis for standard cell
layouts,” IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, vol. 18, no. 10, Oct. 1999, pp. 1512 – 1519.
[54] QuickLogic.com, QuickWorks User Manual.
[55] R. Rajaraman and D. F. Wong, “Optimal clustering for delay
minimization,” in Proc. ACM/IEEE Design Automation Conference, 1993,
pp. 309 – 314.
[56] J. Rose and D. Hill, “Architectural and physical design challenges for one-
million gate FPGAs and beyond,” in Proc. ACM Symposium on FPGAs,
Feb. 1997, pp. 129 – 132.
[57] S. M. Sait and Habib Youssef, VLSI Physical Design Automation, World
Scientific Publishing, 1999.
[58] A. Sangiovanni-Vincentelli, A. El Gamal, and J. Rose, “Synthesis Methods
for Field Programmable Gate Arrays,” in Proc. of the IEEE, vol. 81, no. 7,
July 1993, pp. 1057 – 1083.
[59] E. Sentovich, K. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H.
Savoj, P. Stephen, R. Brayton, and A. Sangiovanni-Vincentelli, “SIS: A
system for sequential circuit synthesis,” University of California, Berkeley,
Tech. Report, May 1994.
[60] Li Shang, Alireza S. Kaviani, and Kusuma Bathala, “Dynamic power
consumption in Virtex-II FPGA family,” in Proc. FPGA, 2002, pp. 157 –
164.
[61] A. Singh, G. Parthasarathy, and M. Marek-Sadowska, “Efficient circuit
clustering for area and power reduction in FPGAs,” ACM Trans. Design
Automation of Electronic Systems, vol. 7, no. 4, pp. 643 – 663, 2002.
[62] M. Smith and J. Sebastian, Application-Specific Integrated Circuits,
Addison Wesley Longman, 1997.
[63] J. S. Swartz, V. Betz, and J. Rose, “A fast routability-driven router for
FPGAs,” in Proc. FPGA, 1998, pp. 140 – 149.
[64] The 2004 International Technology Roadmap for Semiconductors.
http://public.itrs.net
[65] M. Tom and G. Lemieux, “Logic block clustering of large designs for channel-
width constrained FPGA,” in Proc. Design Automation Conference, pp. 726-731,
2005.
[66] C. Tsui, M. Pedram, and A. M. Despain, “Technology decomposition and
mapping targeting low power dissipation,” in Proc. Design Automation
Conference, 1993, pp. 68-73.
[67] T. Tuan and B. Lai, “Leakage power analysis of a 90nm FPGA,” in Proc.
Custom Integrated Circuits Conference, pp. 21 – 23, 2003.
[68] H. Vaishnav and M. Pedram, “Delay-optimal clustering targeting low-
power VLSI circuits,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 18, no. 6, pp. 799 – 811, 1999.
[69] Virtex FPGA Datasheet, Xilinx Corporation (http://www.xilinx.com)
[70] K. Wang and T. Hwang, “Boolean matching for incompletely specified
functions,” in Proc. ACM/IEEE Design Automation Conference, 1995, pp.
48 – 53.
[71] M. Wang, X. Yang, and M. Sarrafzadeh, “Dragon2000: standard-cell
placement tool for large industry circuits,” in Proc. International
Conference on Computer Aided Design, 2000, pp. 260-263.
[72] K. Zhou, R. Kraft, J. F. McDonald, “Gigahertz FPGAs with new power
saving techniques and decoding logic,” in Proc. NASA/DOD Conference on
Evolvable Hardware, 2002.
Asset Metadata
Creator: Kang, Chang Woo (author)
Core Title: Clustering techniques for coarse-grained, antifuse-based FPGAs
School: Graduate School
Degree: Doctor of Philosophy
Degree Program: Electrical Engineering
Publisher: University of Southern California (original), University of Southern California. Libraries (digital)
Tag: engineering, electronics and electrical, OAI-PMH Harvest
Language: English
Contributor: Digitized by ProQuest (provenance)
Advisor: Pedram, Massoud (committee chair), Draper, Jeffrey (committee member), Zimmermann, Roger (committee member)
Permanent Link (DOI): https://doi.org/10.25549/usctheses-c16-443840
Unique identifier: UC11336620
Identifier: 3237159.pdf (filename), usctheses-c16-443840 (legacy record id)
Legacy Identifier: 3237159.pdf
Dmrecord: 443840
Document Type: Dissertation
Rights: Kang, Chang Woo
Type: texts
Source: University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection)
Access Conditions: The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au...
Repository Name: University of Southern California Digital Library
Repository Location: USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA