HIERARCHICAL DESIGN SPACE EXPLORATION FOR
EFFICIENT APPLICATION DESIGN USING
HETEROGENEOUS EMBEDDED SYSTEM
Copyright 2005
by
Sumit Mohanty
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING)
May 2005
Sumit Mohanty
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
UMI Number: 3180433
Copyright 2005 by
Mohanty, Sumit
All rights reserved.
INFORMATION TO USERS
The quality of this reproduction is dependent upon the quality of the copy
submitted. Broken or indistinct print, colored or poor quality illustrations and
photographs, print bleed-through, substandard margins, and improper
alignment can adversely affect reproduction.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if unauthorized
copyright material had to be removed, a note will indicate the deletion.
UMI
UMI Microform 3180433
Copyright 2005 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
Dedication
to my parents
Acknowledgments
I would like to express my gratitude to all those whose continuous support made it possible for me to complete this thesis. I am deeply indebted to my advisor, Prof. Viktor K. Prasanna, for his stimulating suggestions and encouragement, which guided me for the last five years. But for him this work would not have been possible. I would also like to thank the other members of my PhD committee, who helped me lay the foundation for my work and took the effort to read and provide valuable comments on earlier versions of this thesis: Pedro Diniz, Alice Parker, Gaurav Sukhatme, and Cauligi Raghavendra.

I would also like to thank my colleagues Egor Andreev, Seonil Choi, Vaibhav Mathur, Jingzhao Ou, Ronald Scrofano, and Yang Yu for their helpful suggestions at different stages of the project. Additionally, I thank James Davis, Akos Ledeczi, and Sandeep Neema of Vanderbilt University for their contributions to the development of the MILAN design environment. Zackary Baker and Animesh Pathak deserve special thanks for their constant support during my enjoyable stay at the university. I also thank the Defense Advanced Research Projects Agency (DARPA) for sponsoring my research.

I am very grateful for the love and support of my parents, Hareram and Bijaya, and my brothers, Amit and Binit. Finally, I am very grateful to my fiancee Sreemathi for her love, patience, and support during my PhD.
Contents

Dedication ii
Acknowledgments iii
List of Tables vii
List of Figures viii
Abstract xi

1 Introduction 1
   1.1 Approach 8
   1.2 Thesis Contributions 12
      1.2.1 Hierarchical Design Space Exploration 13
      1.2.2 Models for Design of Heterogeneous Embedded Systems 14
      1.2.3 High-level Performance Estimator (HiPerE) 15
      1.2.4 Modeling and Performance Estimation of FPGA based Kernel Design 16
      1.2.5 Design Framework and Demonstration 17
   1.3 Thesis Outline 18

2 Application Design using Heterogeneous Embedded System 20
   2.1 Candidate Devices 21
      2.1.1 ISA-based Processors 21
      2.1.2 Field Programmable Gate Arrays 25
      2.1.3 Memories 27
      2.1.4 System-on-Chips 29
   2.2 Target Applications 31
   2.3 Deployment Scenarios and Performance Requirements 34
   2.4 Design Space 36
   2.5 Design Problem 37
3 Related Work 41

4 Hierarchical Design Space Exploration 46
   4.1 The Problem Domain 46
      4.1.1 Formal Definition of the Design Problem 48
   4.2 Challenges 48
   4.3 Approach 51
      4.3.1 Hypothesis 51
      4.3.2 Our Approach Towards Hierarchical Design Space Exploration 51
   4.4 Key Ideas of Our Methodology 53
      4.4.1 Advantages of Our Methodology 56
      4.4.2 Robustness against Approximation Errors 58

5 Application Design using Heterogeneous Embedded Systems 63
   5.1 Modeling 63
      5.1.1 Kernel Level Modeling 63
      5.1.2 Application Level Modeling 66
      5.1.3 Resource Modeling 67
      5.1.4 Modeling Multiple Operating States of Target Devices 69
      5.1.5 Modeling Duty Cycle 70
      5.1.6 Modeling Constraints 72
      5.1.7 Modeling Reconfiguration 72
   5.2 Design Flow 74
   5.3 Illustration of Hierarchical Design Space Exploration 75
      5.3.1 Energy-Efficient Architecture Selection for a Personnel Detection Application 76
      5.3.2 Design of LMS-based MVDR Adaptive Beamformer 79

6 Tool and Design Flow 82
   6.1 Model Integrated Computing (MIC) 82
   6.2 Generic Modeling Environment 84
   6.3 Defining the Metamodels 85
      6.3.1 Resource Modeling 85
      6.3.2 Modeling FPGA 90
      6.3.3 Modeling Applications 94
   6.4 Integrating and Driving Tools 97
   6.5 High-level Performance Estimator (HiPerE) 98
      6.5.1 Component Specific Performance Estimation 99
      6.5.2 System-Level Performance Estimation 101
      6.5.3 Activity Report 103
      6.5.4 Performance Estimation based on Duty-Cycle 104
      6.5.5 Design Browser 105
   6.6 Dynamic Programming based N-Optimization Heuristic 106
   6.7 Design Space Exploration for FPGA based Designs 110
   6.8 Design Flow 112
   6.9 Design Reuse and Extensibility 114
      6.9.1 Extending the Design Framework 116

7 Illustrative Examples 117
   7.1 Modeling 117
      7.1.1 Application Modeling 117
      7.1.2 Resource Modeling 119
      7.1.3 Mapping and Constraint Specification 120
      7.1.4 FPGA-based Kernel Modeling 123
   7.2 Integration of Tools 125
      7.2.1 Integrating SimpleScalar 125
      7.2.2 Integrating a Dynamic Programming based DSE Tool 126
      7.2.3 Integrating XFLOW, ModelSim, and XPower 128
   7.3 Energy-Efficient Designs of Matrix Multiplication Algorithm 130
   7.4 Energy-Efficient Mapping for a Beamforming Application 132

8 Conclusions and Future Directions 136
   8.1 Future Directions 139
      8.1.1 Evolutionary Algorithms 139
      8.1.2 Extending the Application Model 139
      8.1.3 Platform FPGA 140
      8.1.4 Design Metrics such as Cost, Size, and Weight 140
      8.1.5 Integration of Battery Models 141

Reference List 142
List of Tables

2.1 Operating point descriptions for IBM PowerPC 405LP 23
4.1 Robustness against approximation errors 59
5.1 Results for personnel detection application 78
5.2 Candidate designs and max update rates supported 81
7.1 Performance estimates of the tasks at different operating frequencies 134
7.2 N-optimal results using our methodology 134
List of Figures

1.1 A model of heterogeneous embedded system 5
1.2 Application model based on directed acyclic graphs 6
1.3 Design choices and large design space 9
1.4 Hierarchical design space exploration 10
2.1 Lower energy through voltage scaling by exploiting available slack 22
2.2 ProASIC power consumption compared with SRAM FPGA 26
2.3 Low power support by Micron Mobile SDRAM 28
2.4 Automated Target Recognition (ATR) 32
2.5 Target tracking 32
2.6 LMS-based MVDR adaptive beamforming 33
2.7 Sources for energy dissipation for execution based on a duty cycle 35
2.8 Large design space due to large number of parameters 36
2.9 Two-level design space exploration 38
4.1 Different components that define our problem domain 47
4.2 Design problem overview 49
4.3 Hierarchical design space exploration 51
4.4 Optimal mapping of linear array of tasks 59
4.5 Experimental setup 60
4.6 Effect of error in identifying the optimal design 62
5.1 Domain specific modeling and performance estimation and component power state matrices 64
5.2 Hierarchical data flow graph with alternatives 66
5.3 Modeling multiple operating states 70
5.4 Modeling reconfigurations 73
5.5 Sample design flow 75
5.6 Personnel detection application and the hardware design space 76
5.7 A sample constraint for DESERT 79
6.1 Model Integrated Computing (Sztipanivits and Karsai, 1999) 83
6.2 Resource metamodel 86
6.3 Resource metamodel (additional details) 87
6.4 State transition metamodel 89
6.5 Overview of domain specific modeling approach 91
6.6 Component power state matrices 93
6.7 Mapping metamodel 96
6.8 Component specific performance estimation using MILAN 99
6.9 Sample task graph with mapping 101
6.10 Design browser 105
6.11 Linear array of tasks and state transition 107
6.12 Effect of error rate on A 109
6.13 Design flow using our framework 114
7.1 Top level application model (tasks and dependencies) 118
7.2 Modeling implementation alternatives for a task 119
7.3 Top level resource model (candidate devices) 121
7.4 Modeling dynamic voltage scaling for a device 122
7.5 Specifying constraints 123
7.6 Sample constraint 124
7.7 Library of building blocks and FPGA design 125
7.8 Invoking SimpleScalar model interpreter using our framework 127
7.9 Modeling linear array of tasks and invoking dynamic programming based tool 128
7.10 Architecture and algorithm for Matrix Multiplication (Prasanna and Tsai, 1991) 130
7.11 Analysis of Matrix Multiplication algorithm 131
7.12 Beamforming application and frequency transition costs for PXA 255 132
Abstract
Heterogeneous embedded systems integrate multiple programmable components such as microprocessors, microcontrollers, digital signal processors, and field programmable gate arrays together with dedicated hardware components such as application specific integrated circuits and memory into a single system. During application design using heterogeneous embedded systems, the availability of multiple programmable components enables the exploration of tradeoffs among key performance metrics such as energy, latency, and area, in addition to other metrics such as programmability, cost, and time-to-market. Features of the integrated devices, such as reconfiguration, voltage and frequency scaling, low power operating states, and efficient start up and shut down, among others, are exploited by the designer to meet the given latency and energy constraints. In addition, duty cycle is considered a significant design parameter while maximizing the battery life of portable devices. Duty cycle is the proportion of time during which a system is active. Thus, for systems with a low duty cycle, energy dissipation due to quiescent power while idling can be significant. Therefore, during design space exploration, candidate designs must be evaluated based on the duty cycle specification. However, the large number of choices during such exploration results in a large design space that must be explored efficiently. Traditional approaches using instruction-set, cycle-accurate, or register transfer-level simulators are extremely time-consuming (e.g. fewer than 200 kilo-instructions per second for a cycle-accurate simulator) and thus fail to perform efficient exploration. In addition, due to the lack of a common interface standard and varying simulation speeds, it is extremely difficult to integrate these simulators to simulate a heterogeneous embedded system. On the other hand, optimization heuristics based on high-level models are extremely fast. However, due to simplifying assumptions made while defining high-level models, performance estimation and design space exploration based on such models are susceptible to error. In this dissertation, we propose a hierarchical methodology that integrates design space pruning heuristics, a high-level performance estimator, and low-level simulators to enable efficient exploration of large design spaces. Through such integration, our methodology exploits the speed versus accuracy tradeoff to perform faster and more accurate evaluation of large design spaces. Toward this end, our methodology initially evaluates large design spaces using pruning heuristics to eliminate designs based on the given performance and design constraints. Following design space pruning, a high-level estimator and low-level simulators are used for relatively slow but detailed evaluation of the remaining designs to identify the design that meets the given latency and energy requirements. Use of a high-level estimator enables the integration of simulation results obtained from various simulators to generate performance estimates for the complete heterogeneous embedded system. We applied the proposed methodology to the domain of low power, high performance signal processing application design using heterogeneous embedded systems based on a given duty cycle specification. Specifically, for a given signal processing application, we evaluate several candidate processing and memory devices to identify a heterogeneous embedded system, along with the mapping and scheduling of the application tasks, such that the design meets the given latency and energy requirements. Our methodology was demonstrated to be approximately three orders of magnitude faster while producing similar results when compared with design space exploration using low-level simulators. In addition, we demonstrated robustness against approximation errors through the identification of designs with lower latency or energy dissipation compared with the results obtained through heuristic based approaches. We have also developed a unified, extensible design framework based on the model integrated computing approach that implements our hierarchical design methodology and enables low power signal processing application design using heterogeneous embedded systems.
S.M.
Los Angeles, California
May 2005.
Chapter 1
Introduction
Application specific integrated circuits (ASICs) are custom-designed processors that consolidate a number of functionalities of a given application onto a single dedicated chip. In contrast with general purpose processors executing a software implementation of the application, ASICs offer higher performance and lower energy dissipation due to dedicated optimized hardware. Thus, ASICs have been the primary choice for the design of low power, high performance embedded systems. The design of such embedded systems involves the mapping and scheduling of a given application onto a target hardware that consists of ASICs, microcontrollers or low power/performance microprocessors, and memory [8, 22]. The performance requirement for such embedded systems is to meet the given latency constraint (to sustain a given input rate) while minimizing energy dissipation. Therefore, ASICs are attractive because they allow meeting a specific and unique system-level functional and latency requirement while sustaining lower power dissipation compared with general purpose processors [70]. Design of such embedded systems requires high performance application kernels to be implemented using ASICs and the control logic and non-time-critical functionalities to be implemented using a microcontroller or a low performance microprocessor. The challenges during the design of such embedded systems include the design, verification, and testing of the ASICs, the specification of interface and communication protocols, and data management while meeting the given performance constraints. Such a design process is traditionally referred to as the hardware/software co-design methodology [8, 22, 91]. However, ASIC based embedded system design has two main disadvantages compared with design using programmable devices. The first is a long and expensive design cycle, which includes circuit specification, design, testing, and manufacturing. The second is that design modifications and corrections require a similarly large amount of time and cost, especially if they are to be performed post-manufacturing.
However, with advances in the design of general purpose processors (GPPs) and digital signal processors (DSPs), ASIC based design is no longer the only solution for high performance, low power embedded systems. GPPs such as the Intel PXA 255 [35] and IBM PowerPC 405 [66] and DSPs such as the TI C5000 series [89] support an array of features for low power design while sustaining clock speeds of up to 400 to 500 MHz. These devices are designed using low-leakage processes and thus dissipate low quiescent power. In addition, these devices support a number of low power operating and standby states as well as dynamic voltage and frequency scaling for energy efficient application design. Field programmable gate arrays (FPGAs), in contrast, are traditionally not considered suitable for low power application design because of higher quiescent power, significant energy dissipation during start-up, and higher dynamic power due to longer interconnects and the overheads of reconfigurability. However, in the recent past, there have been several key advances in FPGA manufacturing technology [1, 96]. In general, FPGAs have become denser, use a lower supply voltage, and provide more computation per Watt than the previous generation of devices. For example, the Xilinx Virtex-4 devices based on 90 nm technology reduce power consumption by as much as 50% compared with the other Virtex family devices [97]. In addition, the non-volatile Flash based configuration memory available in some FPGAs (e.g. the Actel ProASIC Plus [1]) allows rapid and power efficient startup. Some FPGAs, such as the Xilinx Virtex-II series [96], support partial reconfiguration. This feature enables the use of a smaller number of FPGAs and dynamic reconfiguration for low power and area efficient design. Furthermore, low power memory components are also available. For example, the Micron Mobile SDRAM [52] provides several low power standby modes, variable self-refresh current, and partial array self-refresh for low power design. In addition, the use of such commercial-off-the-shelf (COTS) components for the design of embedded systems allows fast prototyping and thus reduces time-to-market [47] significantly compared with ASIC design. Such an integration of multiple programmable components into a single system is referred to as a heterogeneous embedded system.
Therefore, such GPPs, DSPs, FPGAs, and memories are being considered for low power and high performance embedded system design. However, in terms of performance and flexibility, each class of components has its own advantages and disadvantages. For example, an instruction set architecture (ISA)-based embedded processor (a GPP, DSP, or microcontroller) is software programmable and possibly low power, but may not meet the high performance needs of some signal processing applications [79]. In contrast, FPGAs support a high degree of parallelism, resulting in higher performance, but are not energy efficient and may not be suitable for control intensive applications. ASICs provide both high performance and low power but are not cost effective at low volume and require a large time-to-market. Additionally, applications also enforce specific functional requirements, which may demand specific hardware capabilities. For example, there exist signal processing applications that require high precision floating point operations [85]. Such applications require either the use of floating-point processors, which may not be energy efficient, or software emulation on fixed-point processors, which may not be latency efficient. Therefore, the integration of multiple programmable components into a single system provides a tradeoff among key performance metrics such as energy, latency, and area while meeting the various unique requirements of different application tasks.
A heterogeneous embedded system can be a system-on-chip solution like the Xilinx Virtex-II Pro [96], which integrates IBM PowerPC processing cores into the reconfigurable fabric, or a multi-chip solution like the PASTA sensor stack [67], which integrates processors (a DSP and a GPP), microcontrollers, custom logic, sensors, and actuators onto a single compact low power platform consisting of several modular boards. Such heterogeneous systems support various power-aware capabilities such as reconfiguration [76], dynamic voltage and frequency scaling [28], low power operating states, and efficient start up and shut down, among others. The ability to dynamically customize the architecture or the operating modes of different components to match the computation, data flow, and quality of service (QoS) requirements of an application has demonstrated significant performance benefits such as lower latency and energy dissipation [36]. As a result, heterogeneous systems are well suited for latency and energy efficient implementation of compute intensive applications and are thus attractive for signal processing applications used in wired as well as wireless mobile devices. Figure 1.1 shows a generic high-level model for heterogeneous embedded systems. Unlike the traditional embedded systems discussed earlier, a heterogeneous embedded system allows the use of standard communication protocols (e.g. PCI) and I/O interfaces (e.g. Rocket I/O [77]) for rapid prototyping and thus reduces time-to-market [47].
Figure 1.1: A model of heterogeneous embedded system

Target applications for heterogeneous embedded systems can be of many types, such as signal processing, control, and man-machine interfaces. In this thesis, we focus on signal processing applications that process a stream of input frames while meeting a given latency constraint for the processing of a single frame [75, 78]. We assume that the application logic, and therefore the performance, is independent of the data contained in the input frames. Such signal processing applications can be modeled as data flow graphs (Figure 1.2). A data flow graph is a directed acyclic graph in which the nodes represent the tasks (or kernels) and the directed edges connecting the nodes represent the dependencies (order of execution and data flow) among the tasks [43]. The latency (and energy) constraints are specified based on the end-to-end performance of a sub-graph of the complete data flow graph, where the sub-graph has one source node and one sink node. For latency constraints, we assume the worst case latency if there exist multiple paths from the source to the sink of the sub-graph. For energy constraints, we assume the cumulative sum of the energy dissipation of each task. An additional advantage of a data flow graph is that such a representation allows the application to be statically scheduled using topological sort and greedy scheduling.
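The evaluation scheme described above can be sketched as follows. This is an illustrative sketch, not part of any tool described in this thesis: the function name, the task names, and the per-task latency and energy numbers are all hypothetical, and per-task estimates are assumed to be given (in practice they would come from simulators or estimators).

```python
from collections import deque

def schedule_and_evaluate(tasks, edges, latency, energy):
    """Greedily schedule a task DAG in topological order and compute
    worst-case source-to-sink latency and cumulative energy.

    tasks:   list of task names
    edges:   list of (u, v) pairs; u must complete before v starts
    latency: dict mapping task -> estimated latency on its mapped device
    energy:  dict mapping task -> estimated energy dissipation
    """
    succ = {t: [] for t in tasks}
    indeg = {t: 0 for t in tasks}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1

    # Kahn's algorithm: tasks become ready once all predecessors finish
    order, finish = [], {}
    ready = deque(t for t in tasks if indeg[t] == 0)
    while ready:
        t = ready.popleft()
        order.append(t)
        # worst-case start time = latest finish among predecessors
        start = max((finish[u] for u, v in edges if v == t), default=0)
        finish[t] = start + latency[t]
        for v in succ[t]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)

    worst_latency = max(finish.values())           # longest path, source to sink
    total_energy = sum(energy[t] for t in tasks)   # cumulative sum over tasks
    return order, worst_latency, total_energy
```

For a small graph such as T1 -> {T2, T3} -> T4 -> T5, the worst-case latency is the length of the longest path (here through T3 if it is the slower branch), while the energy is simply the sum over all tasks, mirroring the latency and energy constraint semantics defined above.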
Our focus is on low power embedded systems with strict latency requirements. Examples
of such systems include mobile base stations for software-defined radio [2, 24, 32],
target detection and tracking systems [44, 85], and space applications. Such special-
purpose embedded systems do not meet the volume requirements that would make an ASIC a
feasible solution. The design problem is therefore to identify a heterogeneous embedded
system based solution that uses FPGAs, DSPs, and GPPs. The tradeoff considered here is
high throughput at relatively higher average power dissipation using FPGAs versus a
lower-throughput but more power-efficient solution using general purpose
processors. Additionally, DSPs are considered, as they have the potential to provide
a meet-in-the-middle solution [21]. While ASICs do not provide much flexibility, if
suitable ASICs are available that implement one or more application tasks, we also
consider them as candidate devices. Furthermore, some applications contain tasks with
requirements, such as floating-point arithmetic or control-intensive logic, for
which an ISA-based processor may be preferred over an FPGA. In addition, the use of
FPGAs and ISA-based processors allows the design of evolvable systems that can adapt
to changing system specifications.
Figure 1.2: Application model based on directed acyclic graphs
Given a target signal processing application, the design of a heterogeneous embedded
system involves the selection of suitable processing components and memory. Following
device selection, the mapping between individual application tasks and processing
components, the appropriate operating state for each mapping, and the schedule of
execution are identified based on the performance requirements specified as input.
With the availability of multiple implementation platforms such as FPGAs, general
purpose processors, and DSPs, a designer not only needs to identify suitable platforms
but also appropriate hardware/software partitioning and mapping onto those platforms.
In addition, other capabilities that play a significant role, especially for energy efficient
design, are reconfiguration, dynamic voltage scaling, and the choice of low power
operating states. While minimizing energy dissipation, our focus is on maximizing
battery life [40]. Therefore, energy optimization with respect to the behavior of the
embedded system while processing a single input is no longer considered adequate [56].
Energy models based on the processing of a single input do not take into account the
behavior of the embedded system when it is idle between the processing of two
consecutive inputs. Due to quiescent power, energy dissipation when a system is idle
can be significant. Therefore, low power embedded system design requires design space
exploration based on a duty cycle specification to identify a suitable device
activation schedule [56]. Duty cycle is the proportion of time during which a system
is operated. Such a specification allows a period of execution to be modeled as
alternating active and inactive phases. Energy dissipation during the inactive phases
(e.g. due to leakage current) can contribute significantly to the overall energy
dissipation of the system, especially for systems with a low duty cycle. Therefore,
the tradeoff between the performance cost of shutting down and starting up a device
and the cost of leaving it idle must be considered during system design [11]. In
addition, an application is defined as multi-rate if its constituent tasks execute at
different rates. For example, an adaptive beamformer processing up to 105 mega samples
per second may need to update its weight coefficients only once every second [24]. The
tasks performing weight coefficient evaluation and update therefore execute once every
second, while the tasks that process each data sample execute 105 x 10^6 times every
second (see Chapter 5). Such multi-rate applications allow partitioning of the
application tasks onto different processing devices that can be independently shut
down and started up based on the behavior of the mapped tasks.
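The shut-down versus idle tradeoff described above reduces to a simple break-even computation. The sketch below is an illustrative first-order model that assumes constant quiescent power and a fixed one-time transition cost; the numeric values in the example are hypothetical, not measurements from the thesis.

```python
def shutdown_saves_energy(p_idle, e_transition, t_gap):
    """True if shutting a device down over an inactive gap of t_gap
    seconds dissipates less energy than leaving it idle.

    p_idle:       quiescent power while idle, in watts (assumed constant)
    e_transition: combined shut-down + start-up energy cost, in joules
    """
    return e_transition < p_idle * t_gap

def break_even_gap(p_idle, e_transition):
    """Gap length below which idling is the cheaper option."""
    return e_transition / p_idle

# e.g. 0.5 W quiescent power and a 2 J transition cost give a 4 s
# break-even gap: shut down for longer gaps, stay idle for shorter ones.
```

For a low duty cycle system the inactive gaps are long, so the transition cost is amortized and shutting devices down usually wins; for high duty cycles the opposite holds.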
Thus, many choices and tradeoffs are available during energy and latency efficient
system design. However, the large number of choices during application design results
in a large design space (Figure 1.3) that must be traversed efficiently to identify
the designs that meet the performance requirements [55]. In addition, due to complex
interdependencies, the design parameters (choice of operating state, activation
schedule, etc.) can have conflicting effects on different performance metrics, making
manual design and analysis impractical. For example, the size of the design space for
a target detection algorithm discussed in Chapter 5 is approximately 73,000. Even with
the use of a high-level estimator, it takes approximately 10 hours to evaluate all the
designs based on their latency and energy performance. In addition, based on the duty
cycle specification and on whether the processing components are shut down or left
idle, the overall energy dissipation of a single design can vary while the latency
(or the input rate sustained) remains constant.
Furthermore, one major obstacle to the widespread use of FPGAs is the lack of high-
level design methodologies and tools [26, 50, 57]. Most of the available FPGA design
tools focus on compilation of a design specified in an HDL (hardware description
language) onto a target device. This process tends to be time consuming and thus is
not suitable for efficient exploration of large design spaces. Therefore, it is
necessary to support rapid and reasonably accurate performance estimation to quickly
evaluate candidate designs [49]. While a number of implementation choices are
available for a signal processing kernel, it is not easy to manage a large number of
alternatives in a manner suitable for efficient evaluation of the design choices [39].
In addition, opportunities for run-time reconfiguration [10, 30, 48, 82] or device
shut down and start up also need to be evaluated for low power application design.
1.1 Approach
There exist many design approaches that address the issue of energy efficient system
design, both in general and specific to heterogeneous embedded systems [12, 18, 44, 45, 46,
Figure 1.3: Design choices and large design space
63, 61, 62, 72, 87]. These approaches can be classified into two major categories.
The first category is optimization heuristics. Approaches in this category are based
on a high-level abstraction (model) of the underlying system design problem and allow
the development of (provably) optimal solutions [73]. Such high-level models are a
mathematical abstraction of the application behavior and hardware characteristics
without the underlying implementation details (and therefore runtime complexities).
While such approaches can quickly identify an optimal solution, they are sensitive to
errors introduced by approximations during high-level modeling. The second category
is simulation based design space exploration. Approaches in this category use
simulators that perform cycle-accurate or register-transfer level simulation. While
simulations provide reliable results in terms of accuracy, design space exploration
using such an approach consumes a significant amount of time due to low simulation
speeds [83, 84, 98].
In contrast, our approach, a hierarchical methodology for design space exploration,
integrates the best of both worlds. Our methodology consists of two phases (Figure 1.4).
The first phase uses a pruning heuristic that evaluates the initial design space and prunes
Figure 1.4: Hierarchical design space exploration
it to a smaller set of designs based on the performance requirements. The pruning
heuristics operate on the high-level models (developed by us) that specify the target
application, the candidate processing devices and memories, and the performance
constraints. The performance constraints used by the pruning heuristics are derived
from the given latency and energy constraints (that the selected design must meet) to
ensure that a set of designs is chosen after the first phase. The derived latency and
energy constraints can be iteratively loosened or tightened to increase or reduce the
number of selected designs. Ordered binary decision diagram based design space
exploration [60], dynamic programming based N-optimization [56], Genetic Algorithms,
and Simulated Annealing are some examples of candidate pruning heuristics. The major
difference between a pruning heuristic and an optimization heuristic is that while an
optimization heuristic selects the optimal design, a pruning heuristic selects a set
of candidate designs. Selection of a
set of designs makes our methodology robust against approximation error (discussed in
Chapter 4).
The second phase uses a high-level estimation tool and low-level simulators to perform
hierarchical simulation. Hierarchical simulation uses low-level simulators to perform
component specific simulations for a given design. The component specific estimates
are combined by the high-level estimator to generate system-wide performance estimates
for a heterogeneous embedded system based on a duty cycle specification. In our
methodology, hierarchical simulation is used to evaluate the designs identified in the
first phase. The high-level estimation tool used for hierarchical simulation operates
at a higher level of abstraction than a typical low-level simulator such as a
cycle-accurate, register-transfer level, or even instruction-set simulator. For
example, the high-level estimation tool discussed in this thesis requires the
application as input in the form of a data flow graph, which is at a comparatively
higher level of abstraction than, say, "C" as required by SimpleScalar [83]. Through
the integration of simulators, the estimator, and pruning heuristics, our methodology
exploits the speed versus accuracy tradeoff to perform faster (compared to simulation
only) and more accurate (compared to optimization heuristics) evaluation of large
design spaces.
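The two-phase flow can be condensed into a few lines. The sketch below is illustrative only: the function names, the `(latency, energy)` cost tuples, and the single `slack` knob that loosens the phase-1 bounds are assumptions, not the actual MILAN interfaces.

```python
def hierarchical_explore(designs, coarse, fine, lat_max, en_max, slack=1.2):
    """Two-phase exploration: prune with a fast coarse estimate, then
    rank the survivors with a slower, more accurate one.

    coarse(d) and fine(d) each return (latency, energy) for design d.
    The slack factor loosens the bounds in phase 1 so that a *set* of
    candidates survives pruning despite high-level modeling error.
    """
    # Phase 1: coarse-grained pruning against loosened constraints.
    survivors = [d for d in designs
                 if coarse(d)[0] <= lat_max * slack
                 and coarse(d)[1] <= en_max * slack]
    # Phase 2: detailed evaluation of the surviving candidates only.
    feasible = [(fine(d), d) for d in survivors
                if fine(d)[0] <= lat_max and fine(d)[1] <= en_max]
    return min(feasible, key=lambda cd: cd[0], default=None)
```

Loosening or tightening `slack` plays the role of iteratively relaxing the derived constraints: a larger value lets more designs through to the expensive second phase, a smaller value prunes harder.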
While the approach discussed above is generic and can potentially be applied to
various design domains, in this thesis we apply the hierarchical methodology to our
target domain of signal processing application design using heterogeneous embedded
systems. We have used the model integrated computing (MIC) approach to realize our
hierarchical methodology for this domain. The key idea of the MIC approach is the
extension of the scope and usage of models such that they form the "backbone" of a
model-integrated system development process [88]. Using MIC technology, the designer
captures the information relevant to the system being designed in the form of
high-level models. The high-level models can explicitly represent the target
application, the target hardware, and the dependencies and constraints among the
different components of the models. Such models act as a repository of the information
needed for analyzing the system. MIC allows the use of model interpreters to translate
the information contained in the models into the input formats required by simulators,
estimators, or pruning heuristics. Based on the MIC approach, we have developed a set
of models to represent duty cycle, multi-rate signal processing applications,
candidate devices for heterogeneous embedded systems, and parameterized kernel designs
using FPGAs to capture the design space and drive the candidate pruning heuristics,
the high-level estimator, and various low-level simulators. The Generic Modeling
Environment (GME) is a third party graphical tool-suite supporting MIC [29]. GME
enables the development of a modeling language for a domain, provides a graphical
interface to model specific problem instances for the domain, and facilitates the
integration of tools that can be driven through the models [29]. We have utilized GME
to develop a design framework that allows us to apply our design methodology to
real-life signal processing applications [67, 85]. Our design framework allows a
designer to perform modeling of the application and target hardware, mapping and
constraint specification, and design space exploration.
We list the basic contributions of this thesis and the thesis outline in the following
sections.
1.2 Thesis Contributions
This dissertation addresses several important issues in energy and latency efficient
signal processing application design using heterogeneous embedded systems. We have
developed a hierarchical design space exploration methodology that utilizes
heuristic-based design space pruning techniques, a high-level performance estimator,
and low-level simulators to provide a rapid and efficient technique for evaluating a
large number of designs.
We have applied our design methodology to the domain of signal processing application
design using heterogeneous embedded systems. Towards this end, we have developed a
number of high-level models to describe multi-rate signal processing applications,
duty cycle, heterogeneous embedded systems, mapping, and performance and design
constraints. We have used these models to drive the pruning heuristics and the
high-level performance estimator. We have also designed and developed a framework,
based on the model integrated computing approach [54], that provides a user friendly
interface for modeling and performs hierarchical design space exploration. To the best
of our knowledge, our proposed design methodology is one of the earliest efforts to
consider duty cycle and multi-rate applications while performing device selection for
high performance low power embedded system design. The contributions of the thesis
include:
1.2.1 Hierarchical Design Space Exploration
Our hierarchical design space exploration methodology integrates pruning heuristics,
a high-level performance estimator, and low-level simulators to enable efficient
exploration of large design spaces. The methodology initially evaluates a large design
space using a heuristic based design space exploration technique, which performs a
coarse-grained evaluation based on high-level models and selects a set of candidates
based on the given performance and design constraints. Performance constraints are
based on the given latency or energy dissipation requirements; design constraints are
based on the mapping requirements of the application tasks. The candidate designs
selected by the pruning heuristic are further evaluated by a high-level performance
estimator, which performs a more detailed evaluation. While a performance estimator is
slower than a heuristic, it is much faster than a low-level simulator simulating the
target heterogeneous embedded system. The high-level estimator uses low-level
simulators to perform component specific simulations for a given design. Component specific
simulation refers to the simulation of application tasks using simulators available
for the target devices; SimpleScalar [83] and ModelSim [51] are two such simulators.
The component specific estimates are combined by the high-level estimator to generate
system-wide performance estimates for a complete heterogeneous embedded system based
on a duty cycle specification. Such a multi-step process enables a performance versus
accuracy tradeoff and thus allows rapid yet reasonably accurate evaluation of large
design spaces. Our methodology is not tied to specific simulators, estimators, or
pruning techniques. Rather, it provides a portable technique for integrating suitable
pruning heuristics, high-level estimators, and low-level simulators based on the
target applications and hardware.
1.2.2 Models for Design of Applications and Heterogeneous Embedded
Systems
We have defined several high-level models to realize our hierarchical methodology for
the target domain of signal processing application design using heterogeneous embedded
systems. Our models capture application specifications, details of the candidate
hardware for the design of a heterogeneous embedded system, performance and design
constraints, and deployment scenarios. The Generic Model (GenM) provides an
abstraction of the heterogeneous embedded system that identifies the key architectural
features that can be exploited for performance optimization. These features include
operating states, average power dissipation in each state, and state transition costs
in terms of latency and energy dissipation. The application model is developed by
enhancing the synchronous data flow graph to capture multi-rate tasks, the choice of
implementations for each task, and state transitions (e.g. reconfiguration or dynamic
voltage scaling) between task executions. Such representations allow us to define the
application design problem as a combinatorial
optimization problem that can be solved efficiently using, for example, dynamic
programming. The above models are used to drive the heuristic based pruning techniques
and the high-level performance estimator. A mapping model is also proposed to specify
the mapping between the application and resource models. Our modeling techniques for
FPGAs, applications, and heterogeneous embedded systems also allow us to develop
libraries of models, promoting reuse.
1.2.3 High-level Performance Estimator (HiPerE)
One of the major challenges in estimating the performance of a design based on a
heterogeneous embedded system is the lack of a standard interface among the component
specific simulators and their varying simulation speeds, which make it difficult to
integrate the simulators to simulate such a system. We have developed a high-level
performance estimator, HiPerE, based on the models discussed above. HiPerE addresses
the lack of a standard interface by combining component specific performance estimates
through interpretive simulation to derive system-level performance estimates.
Interpretive simulation, in the context of our high-level estimator, refers to the
simulation of a data flow graph representing the application on a software based
virtual machine representation of the target heterogeneous embedded system. The input
to HiPerE is a design, that is, an application specification with each task mapped to
a specific hardware device. In order to provide a rapid estimate, HiPerE operates at
the task-level abstraction of the application. As HiPerE integrates component specific
estimates generated by low-level simulators, the speed versus accuracy tradeoffs among
instruction-level, cycle-accurate, and RTL (register transfer level) simulators are
exploited during performance estimation. In addition to the task execution cost, other
aspects considered by HiPerE for accurate performance estimation are the data access
cost, parallelism in the system, energy dissipation when a component is idle, and
state transition costs. HiPerE also supports
performance estimation based on multi-rate applications and a duty cycle
specification. In addition, for a given design HiPerE can evaluate the most energy
efficient deployment schedule based on a given duty cycle specification.
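The combination step at the heart of this kind of estimator can be sketched as follows. This is a deliberately simplified model, not HiPerE itself: it folds per-task durations and energies (as a low-level simulator would report them) into a system latency and a period energy that charges idle power, but it omits the data access and state-transition costs that HiPerE also models.

```python
def system_estimate(schedule, idle_power, period):
    """Combine component-specific estimates into system-wide figures
    for one duty-cycle period.

    schedule:   (device, start, duration, energy) per task execution,
                with duration and energy taken from device-specific
                low-level simulators
    idle_power: device -> quiescent power in watts
    period:     duty-cycle period in seconds
    Returns (latency, total_energy).
    """
    latency = max(start + dur for _, start, dur, _ in schedule)
    active_energy = sum(e for *_, e in schedule)
    busy = {dev: 0.0 for dev in idle_power}
    for dev, _, dur, _ in schedule:
        busy[dev] += dur
    # Charge quiescent power for the time each device sits idle.
    idle_energy = sum(p * (period - busy[dev])
                      for dev, p in idle_power.items())
    return latency, active_energy + idle_energy
```

Because the idle term depends on the period, the same design yields different energy figures under different duty cycle specifications while its latency stays fixed, which is exactly the effect described in Chapter 1.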
1.2.4 Modeling and Performance Estimation of FPGA based Kernel
Design
FPGAs have been considered suitable for energy and latency efficient design of signal
processing kernels, and a number of parameterized designs have been proposed for such
kernels [19, 38, 94]. By varying parameter values, the parameterized kernels enable a
tradeoff among area, energy dissipation, and latency. The parameters we consider for
FPGA based designs are the degree of parallelism, binding options, precision, and
other parameters specific to individual signal processing kernels. For example, block
size is a parameter specific to blocked matrix multiplication [19]. We have defined a
modeling technique suitable for describing FPGA based parameterized designs of signal
processing kernels. Our modeling technique enables modeling of the data path and the
algorithm (control flow) of kernel designs at an abstraction suitable for the
representation of parameterized IP (intellectual property) cores, where the parameters
can be varied to understand performance tradeoffs. The designer uses the models to
estimate performance, analyze the effect of parameter variation on performance, and
identify optimization opportunities.
Our modeling technique is based on the observation that an FPGA based design can be
described using basic building blocks such as registers, adders, multiplexers, and
multipliers, among others [20]. Therefore, using the average power estimates for each
basic building block and the algorithm description of the design, we are able to define
energy and latency estimates of parameterized kernels as high-level functions. In
addition, the use of basic building blocks allows us to reuse the performance
estimates of these blocks. Chapter 5 discusses our modeling technique in detail. Using
our approach, the designer does not need to synthesize the FPGA designs to verify
design decisions. Once a model has been defined and its parameters have been
estimated, design decisions are verified using the high-level performance estimator.
Additionally, parameterized modeling enables exploration of a large design space in
the initial stages of the design process. The FPGA modeling technique proposed in this
thesis exposes various parameters at the algorithm level, generates performance models
for energy, area, and latency in the form of parameterized functions, and allows rapid
estimation of different performance metrics using only the information captured in
the models.
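A performance model of the kind described above can be written as a plain function of the design parameters. The sketch below is a hypothetical example, not one of the cited parameterized designs: the building-block counts, the latency formula, and all numeric inputs are illustrative assumptions for a degree-of-parallelism parameter p.

```python
def kernel_model(p, n, block_power, cycle_time):
    """Hypothetical parameterized model of an FPGA kernel: a linear
    array of p multiply-accumulate stages computing an n x n matrix
    product.

    p:           degree of parallelism (number of MAC stages)
    n:           problem size
    block_power: average power per building block instance, in watts
    cycle_time:  clock period in seconds
    Returns (latency_seconds, energy_joules).
    """
    # Building-block counts as functions of the parallelism parameter.
    blocks = {"multiplier": p, "adder": p, "register": 4 * p}
    cycles = n ** 3 // p + 2 * p          # assumed latency model
    latency = cycles * cycle_time
    power = sum(count * block_power[b] for b, count in blocks.items())
    return latency, power * latency
```

Sweeping p over such a function exposes the area/energy/latency tradeoff without synthesizing any design: more stages shorten the latency but raise the power drawn by the replicated blocks.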
1.2.5 Design Framework and Demonstration
We have developed a framework to support hierarchical design space exploration.
MILAN is a Model based Integrated simuLAtioN framework for embedded system design and
optimization through the integration of various simulators into a unified environment
[53]. The framework is developed using GME, a tool-suite supporting MIC. Our specific
contributions include the extension of its modeling capabilities to support duty cycle
specification, multi-rate applications, and heterogeneous embedded systems. We have
integrated HiPerE and several low-level simulators, such as SimpleScalar [83] and
SimplePower [98], into the MILAN framework. The framework is extensible; additional
heuristic based design space exploration techniques, performance estimators, and
simulators can be easily integrated into the unified environment it provides. Our
focus in designing the framework is not to develop new techniques for compilation of
high-level specifications onto FPGAs, but rather to provide efficient support for some of
the design steps such as modeling, design reuse, rapid coarse-grained estimation of
various performance metrics, performance tradeoff analysis, and design space
exploration. Using MILAN, we demonstrate the use of our methodology in identifying an
energy and latency efficient heterogeneous embedded system, based on user-specified
performance requirements, for a personnel detection algorithm and an adaptive
beamforming algorithm, from a set of devices that includes FPGAs, DSPs, and general
purpose processors.
1.3 Thesis Outline
Chapter 2 provides the background for the thesis. Various candidate computing and
memory devices for heterogeneous embedded systems and the features available in these
devices for low power high performance design are discussed in detail. In addition,
target signal processing applications and their deployment scenarios are discussed to
identify the performance and design issues that need to be addressed while developing
a design methodology for energy efficient mapping of applications onto heterogeneous
embedded systems and a framework implementing the methodology.
Chapter 3 describes related work from academia and industry addressing the issues
discussed in this thesis.
Chapter 4 discusses the challenges involved in low power high performance application
design using a heterogeneous embedded system and our approach towards solving some of
these challenges. A formal definition of our methodology, hierarchical design space
exploration, is also provided. This chapter provides the foundation for a qualitative
comparison of various design space exploration tools. An example illustrating the
advantages of our hierarchical approach to design space exploration in comparison to
traditional approaches is discussed.
Chapter 5 describes the use of our methodology for low power high performance signal
processing application design using heterogeneous embedded systems. Techniques to
model applications, heterogeneous embedded systems, performance constraints, and
deployment scenarios are also discussed. A modeling technique is also presented to
describe parameterized designs for signal processing kernels using FPGAs. Using the
models and the proposed hierarchical design space exploration methodology, two low
power high performance signal processing applications are designed to demonstrate the
advantages of our proposed methodology.
Chapter 6 describes MILAN (Model-based Integrated Simulation), an embedded system
design framework that integrates the modeling techniques and design methodology
discussed in this thesis. This chapter discusses the design of MILAN, the high-level
performance estimator, and the integration of low-level simulators into the MILAN
framework.
Chapter 7 illustrates the use of our hierarchical design space exploration technique
and the MILAN design framework for high performance low power design of various signal
processing applications. The examples discussed also demonstrate other advantages of
our design framework such as the creation and maintenance of libraries of models,
design reuse, and extensibility.
Chapter 8 presents the conclusions of this dissertation and provides directions for
future research related to the extension of the modeling techniques, our design
methodology, the design framework, and application design using next generation
heterogeneous embedded systems.
Chapter 2
Application Design using
Heterogeneous Embedded System
This thesis focuses on the low power design of high performance signal processing
applications using heterogeneous embedded systems. Such systems integrate ISA
(instruction set architecture) based processors, FPGAs, and memory into a single
computing platform. We assume the availability of a set of candidate devices from
which suitable devices will be chosen to create the heterogeneous embedded system.
Therefore, based on the various deployment scenarios of the application, different
combinations of the candidate devices are evaluated to identify the design that meets
the given performance constraints. In the following sections, we discuss in detail
some of the commercially available state-of-the-art candidate devices suitable for low
power high performance design, candidate target signal processing applications, and
the various deployment scenarios and performance requirements associated with the
applications. We also discuss various aspects of the design space defined by the
applications and the candidate devices, and the issues that must be addressed while
exploring this design space. In addition, we discuss where such a design problem fits
in the conventional design flow supported by various commercial tools used for
embedded system design.
2.1 Candidate Devices
We divide the candidate devices into four categories:
• ISA-based processors
• field programmable gate arrays (FPGAs)
• memories
• systems-on-chip (SoCs)
In the following, we discuss several candidate devices within each category.
2.1.1 ISA-based Processors
ISA-based processors include both general purpose processors (GPPs) and digital signal
processors (DSPs). While a general purpose processor is not optimized for a specific
computation, DSPs are optimized for signal processing applications. Recent advances in
processor design techniques have led to the development of processors that support
dynamic voltage and frequency scaling [35, 66]. Since dynamic energy dissipation
typically scales quadratically with supply voltage while execution time increases only
linearly as frequency decreases, a significant power-performance tradeoff can be
achieved based on the needs of the application. Figure 2.1 shows the potential savings
in power due to voltage scaling [33]. However, many of the low power ISA-based
processors do not support floating point arithmetic. While it is possible to emulate
floating point arithmetic, such emulation is not efficient compared with hardware
support. Nevertheless, ease of application development and low energy dissipation make
ISA-based processors a default choice for embedded system design. In the following, we
discuss some of the state-of-the-art ISA-based processors.
Figure 2.1: Lower energy through voltage scaling by exploiting available slack
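The slack-exploiting effect illustrated by Figure 2.1 follows directly from the first-order CMOS model E_dyn ~ C_eff * V^2 per cycle. The sketch below compares the energy over one duty-cycle period when a task runs fast and idles versus running slowly at a lower voltage; all numeric values in the usage are illustrative, not measured figures for any processor in this chapter.

```python
def period_energy(cycles, c_eff, volts, freq, p_idle, period):
    """Energy over one period of `period` seconds when a task of
    `cycles` cycles runs at (volts, freq) and the core idles for the
    remaining slack.

    Uses the first-order model E_dyn = C_eff * V^2 per cycle plus a
    constant idle power for the slack time.
    """
    t_run = cycles / freq
    assert t_run <= period, "operating point too slow for the period"
    return cycles * c_eff * volts ** 2 + p_idle * (period - t_run)
```

For example, stretching a 10^8-cycle task from 600 MHz at 1.6 V down to 150 MHz at 0.75 V quadruples the run time yet cuts the period energy substantially, because the quadratic voltage term dominates the linear time term.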
The Intel PXA 255 [35] is a low power, high performance 32-bit CPU based on the Intel
XScale core. It is compliant with the ARM architecture and has a 32 KByte instruction
cache and a 32 KByte data cache. To support low power application design, it has four
operating modes:
• Turbo Mode: the core runs at its peak frequency. This mode is ideal for
compute-intensive tasks with very few memory accesses.
• Run Mode: the core runs at its normal frequency. This mode provides the best
power/performance tradeoff and is the default active mode.
• Idle Mode: the core is not being clocked, but the rest of the system is fully
operational. This mode is useful when the processor needs to be inactive for a short
duration and must return to an active state quickly.
• Sleep Mode: places the processor in its lowest power state but maintains the I/O
state, the RTC, and the Clocks and Power Manager. The system must be rebooted to
return to an active mode.
The PXA 255 also supports dynamic frequency scaling. Depending on the mode (Run or
Turbo), the available operating frequencies are 99.5, 132.7, 199.1, 265.4, 298.6, 331.8,
Table 2.1: Operating point descriptions for IBM PowerPC 405LP

Operating Point       33/33   200/100   266/133
Core Voltage (V)      1.0     1.5       1.8
PLL VCO Freq. (MHz)   800     800       533
CPU Freq. (MHz)       33      200       266
PLB Freq. (MHz)       33      100       133
ECB Freq. (MHz)       33      33        33
SDRAM Timing          CAS2    CAS2      CAS3
and 389.1 MHz. Each operating frequency is also associated with a supply voltage. For
example, the device supply voltage is 1.3 V when operating frequency is 398.1 MHz
where as it is 1.0 V for 99.5 MHz. Frequency scaling is exploited to reduce energy
dissipation by taking advantage of available slack and lowering the operating voltage
and frequency. In addition, PXA 255 contains the CKEN (clock enable) register which
contains configuration bits that can disable the clocks to individual units and clock gates
the units to save power.
IBM PowerPC 405LP [13] is specifically designed for battery operated portable systems and supports dynamic power management strategies based on dynamic voltage and frequency scaling. An advantage of the scaling support in the 405LP is its extremely low latency during scaling; dynamic power management can therefore be performed without disrupting system operation during scaling events. The 405LP supports several operating points. Each operating point is described by parameters such as the core voltage, CPU and bus frequencies, and the operating states of peripheral devices such as the cache. The notion of an operating point also extends to states such as sleep and hibernate, in which the processor does not process any data. Table 2.1 provides some sample operating points as described by Brock and Rajamani [13].
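Slack-driven selection among such operating points can be sketched as a small routine. The frequencies below are the PXA 255 list quoted earlier; the intermediate voltage values are illustrative placeholders (only 1.0 V and 1.3 V are quoted in the text), and the energy model is the usual first-order CMOS approximation rather than measured data.

```python
# Sketch: pick the lowest operating frequency that still meets a frame
# deadline, as in slack-driven voltage/frequency scaling. Intermediate
# voltages are placeholders; only 1.0 V and 1.3 V come from the text.
OPERATING_POINTS = [  # (frequency in MHz, core voltage in V)
    (99.5, 1.0), (132.7, 1.05), (199.1, 1.1), (265.4, 1.15),
    (298.6, 1.2), (331.8, 1.25), (398.1, 1.3),
]

def pick_operating_point(cycles, deadline_s):
    """Return the slowest (f, V) pair that finishes `cycles` by the deadline."""
    for f_mhz, v in OPERATING_POINTS:
        if cycles / (f_mhz * 1e6) <= deadline_s:
            return f_mhz, v
    raise ValueError("deadline cannot be met even at peak frequency")

def dynamic_energy(cycles, v):
    """First-order CMOS model: dynamic energy scales with C * V^2 * cycles.
    The constant C is dropped, so the result is in arbitrary units."""
    return v * v * cycles

# A task of 3e6 cycles with a 20 ms deadline needs at least 150 MHz,
# so the 199.1 MHz point is selected instead of running at peak.
f, v = pick_operating_point(3_000_000, 0.020)
```

Running at the selected lower-voltage point dissipates less dynamic energy per frame than running the same cycle count at the peak 1.3 V point, which is exactly the slack-exploitation argument made above.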
The 405LP includes many low power optimizations to reduce both active and standby power [17]. Active power is reduced via voltage and frequency scaling, flexible clock distribution, clock gating, and hardware acceleration. During standby, clock freezing, hibernation, and voltage reduction can save power. IBM is also developing a dynamic power management framework for Linux that enables the dynamic power management features of the 405LP. As of the publication of this thesis, the 405LP has just been announced and is not available commercially. However, the 405GP, which is based on a similar architecture, is available and supports some of the same low power features.
The Texas Instruments C5000 [89] series of DSP platforms is targeted towards portable media and communication products. Features that enable low power application design are:
• low standby power (approx. 0.12 mW)
• low core and memory operating power
• dynamic frequency and voltage scaling
• multiple standby modes
• the ability to turn individual peripherals and functional units on and off
In addition to the above, there are several features that can be exploited for low power application design. If a program is configured to execute completely from internal memory, activity on the external bus can be disabled to save dynamic power. The DSP also allows shutting off all CPU execution, or only internal CPU execution without any external access. The device additionally supports three idle modes: IDLE1, IDLE2, and IDLE3. In IDLE1, only the CPU is inactive while the peripherals remain active. The CPU can be brought out of power-down mode using timer interrupts; IDLE1 is therefore useful if the CPU must enter power-down mode periodically, e.g., while processing data arriving at a predefined but low rate. IDLE2 shuts off the CPU and the peripherals, and IDLE3 shuts off the entire chip including the PLLs.
While IDLE2 and IDLE3 save more power, they require an external interrupt to wake up the processor and hence incur a longer wake-up time.
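The mode-selection logic implied by this description can be sketched as a small decision routine: timer-driven periodic wake-ups require IDLE1 (the only mode that keeps the timer peripheral alive), while longer idle intervals justify the deeper modes despite their slower external-interrupt wake-up. The break-even threshold below is a hypothetical placeholder; a real design would derive it from measured wake-up costs.

```python
def choose_idle_mode(idle_time_s, wakeup_source, deep_threshold_s=0.5):
    """Pick among the C5000 idle modes described above.

    `deep_threshold_s` is a hypothetical break-even point, not a
    datasheet value.
    """
    if wakeup_source == "timer":
        # Only IDLE1 keeps the peripherals (including timers) running,
        # so periodic wake-ups at a known low data rate must use it.
        return "IDLE1"
    if idle_time_s >= deep_threshold_s:
        return "IDLE3"   # whole chip (incl. PLLs) off: most savings, slowest wake-up
    return "IDLE2"       # CPU and peripherals off, PLLs still running
```

For example, a sensor task woken every 10 ms by a timer interrupt stays in IDLE1, whereas a long externally-triggered idle period warrants IDLE3.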
2.1.2 Field Programmable Gate Arrays
Traditionally, FPGAs are not considered suitable for low-power application design because of their higher static and dynamic power consumption compared with processors and ASICs. However, due to recent architectural advances that address power concerns, FPGAs are now being considered as an option for low power design [92]. Key power saving techniques include the use of non-volatile configuration memory, which reduces energy dissipation during configuration, and multiple clock domains, which allow a design to be partitioned using gated clocks so that a section of the design can be operated with a slow clock or clock gated to save power. In the following, we discuss some commercially available FPGAs that provide support for low power design.
Actel ProASIC [1] is a 0.22 μm Flash-based CMOS FPGA that combines reprogrammability and non-volatility. SRAM-based FPGAs lose their configuration when powered down and thus must be configured every time they are powered up. ProASIC instead uses a Flash-based configuration memory that retains the configuration, eliminating the cost of a boot PROM as well as the reconfiguration cost, which can be significant for a large FPGA. Figure 2.2 compares the power dissipation of an SRAM-based FPGA and ProASIC for 110 instances of a 16-bit binary counter operating at different frequencies.
QuickLogic Eclipse II [71] uses a proprietary technology, ViaLink, to reduce static and dynamic power compared to SRAM-based FPGAs. In addition, the Eclipse II supports several low power features:
Figure 2.2: ProASIC power consumption compared with SRAM FPGA
• the FPGA supports a low power mode that deactivates the internal charge pumps and reduces static current, and thus static power
• the I/O pins can be operated at 1.8 V, 2.5 V, or 3.3 V, which offers flexibility in accommodating low power needs
• several embedded computational units for common mathematical functions such as addition, multiplication, accumulate, multiply-and-add, and multiply-and-accumulate are available. These units consume less power than the reconfigurable fabric configured for the same functionality
In addition, QuickLogic provides design automation tools that support low power design. A power aware placement tool provides an option to match the locations of buffers and routing resources to reduce the capacitive load on the drivers. The Eclipse FPGA is divided into several quadrants, and a design can be associated with multiple frequency domains. Each frequency domain takes up one clock network in a quadrant if there is logic within that quadrant belonging to that frequency domain. Using the placer tool, a designer can lock modules of a certain clock domain to a quadrant of the FPGA, which minimizes the number of clock resources used. The FPGA also provides a configurable pull-down resistor in every I/O cell. When enabled, the pull-down resistor ensures that signals are grounded when not being driven, which reduces leakage power.
2.1.3 Memories
One of the most important issues in the development of any low power system is the interaction between the processing device and the memory. Memory dissipates a significant portion of the overall system power. Therefore, several memory vendors now provide power aware memory components to complement the processing hardware.
Micron Mobile SDRAM [52] is a low power SDRAM device with several parameters that can be tuned by the application to optimize power dissipation. Mobile SDRAM operates at 2.5 V with a low IDD current and consumes approximately 50% less power than standard SDRAM. It has several operating modes such as Active, Power-Down, and Burst. In Active mode, the memory dissipates on average 375 mW during data access and 87.5 mW while idle. In Power-Down mode the average power dissipation is 1 mW, but data cannot be accessed. The memory dissipates 225 mW in Burst mode. Thus, based on the data access pattern of the application, the memory can be moved between modes to optimize power.
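The mode powers quoted above give a quick first-order estimate of average memory power for a given access pattern. The sketch below uses the quoted figures (375, 87.5, and 1 mW) and deliberately ignores mode-transition overheads:

```python
# First-order average-power sketch for Micron Mobile SDRAM, using the
# mode powers quoted above; transitions between modes are assumed free.
P_ACTIVE_ACCESS = 375.0   # mW, Active mode during data access
P_ACTIVE_IDLE   = 87.5    # mW, Active mode while idle
P_POWER_DOWN    = 1.0     # mW, Power-Down mode (no data access possible)

def avg_power_mw(access_duty, power_down_when_idle):
    """Average power for a workload that accesses memory `access_duty`
    of the time and idles the rest, with or without entering Power-Down."""
    idle = 1.0 - access_duty
    p_idle = P_POWER_DOWN if power_down_when_idle else P_ACTIVE_IDLE
    return access_duty * P_ACTIVE_ACCESS + idle * p_idle
```

At a 20% access duty cycle, idling in Active mode averages 145 mW, while dropping to Power-Down between accesses averages about 76 mW, which illustrates why matching the mode to the access pattern matters.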
Another major factor in power dissipation is self refresh, which results in an average power consumption of 525 mW. Micron Mobile SDRAM provides two options to reduce the power consumed by self refresh (Figure 2.3).
Additional support for low power includes partial array self refresh (PASR) and temperature compensated self refresh (TCSR). Typically, the self refresh rate in an SDRAM is based on extreme environmental conditions such as a higher than average temperature. However, since the device need not always operate under extreme conditions, a high self refresh rate results in the consumption of excess power. Using temperature compensated self refresh, the self refresh rate can be adjusted to the temperature of the device, so that the memory consumes only the power needed for that temperature environment. Applications also do not always use all of the available memory; however, if the unused portion of the memory is active, it continues to draw self refresh current even if the data stored there are never accessed. Partial array self refresh lets the designer restrict self refresh to a selected portion of the memory array, so that the power consumed is proportional to the amount of memory refreshed. Both features can be exploited during application design to reduce power consumption.

Figure 2.3: Low power support by Micron Mobile SDRAM. The extended mode register (BA1 BA0 = 1 0) allocates bits A4 A3 to TCSR and bits A2 A1 A0 to PASR:

    A4 A3    Max. Case Temp
    0  0     70° C
    0  1     45° C
    1  0     15° C
    1  1     85° C

    A2 A1 A0    Self Refresh Coverage
    0  0  0     All Banks
    0  0  1     Banks 0 and 1
    0  1  0     Bank 0
    0  1  1     Reserved
    1  0  0     Reserved
    1  0  1     Lower Half Bank 0
    1  1  0     Lower Quart. Bank 0
    1  1  1     Reserved
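The bit assignments shown in Figure 2.3 can be expressed as a small encoding routine. The sketch below only composes the A4..A0 field; the mechanics of actually writing the extended mode register (bank-address bits, memory-controller access) are outside its scope.

```python
# Sketch: compose the extended mode register field (A4..A0) selecting
# TCSR and PASR settings, per the bit assignments in Figure 2.3.
TCSR_BITS = {70: 0b00, 45: 0b01, 15: 0b10, 85: 0b11}       # A4 A3: max case temp (deg C)
PASR_BITS = {                                               # A2 A1 A0: refresh coverage
    "all_banks": 0b000, "banks_0_1": 0b001, "bank_0": 0b010,
    "half_bank_0": 0b101, "quarter_bank_0": 0b110,
}

def emrs_field(max_case_temp_c, refresh_coverage):
    """Return the 5-bit A4..A0 value selecting TCSR and PASR settings."""
    return (TCSR_BITS[max_case_temp_c] << 3) | PASR_BITS[refresh_coverage]
```

For instance, refreshing only bank 0 with a 45° C temperature ceiling yields A4..A0 = 0b01010.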
2.1.4 System-on-Chips
A System-on-Chip, popularly known as an SoC, integrates a complete processing system (processor, memory, sensors, actuators, I/O, etc.) onto a single chip. Several FPGA vendors currently integrate ISA-based microprocessors, memory, and DSP blocks such as adders/subtracters, multipliers, and MACs into the reconfigurable fabric. Such devices can be classified as configurable SoCs and, in terms of processing capability, match our target class of devices, heterogeneous embedded systems. Therefore, we also consider reconfigurable SoCs as candidate devices. In the following, we discuss the Xilinx Virtex-II Pro and the ATMEL FPSLIC as examples of state-of-the-art SoC devices.
Xilinx Virtex-II Pro [96] is primarily targeted towards high-performance application design. However, it also includes several features that can be exploited for low-power application design, especially compared to general purpose processors and DSPs [19, 20, 21]:
• use of DSP blocks - the DSP blocks are ASIC components, such as multipliers and memories, that are optimized for their respective functionalities. Using these blocks instead of configured FPGA fabric allows higher performance as well as lower power dissipation
• use of embedded processors - Virtex-II Pro FPGAs integrate up to four IBM PowerPC cores [66]. These cores provide a low-power alternative to the reconfigurable fabric for several functionalities, especially control-intensive ones
• multiple clock domains - Virtex-II Pro provides multiple clock domains that can be exploited to supply a different optimum frequency to each logic segment; a lower frequency results in lower dynamic power dissipation. Frequencies between 24 MHz and 420 MHz are supported
However, the Virtex-II Pro is an SRAM-based FPGA and hence loses its configuration when powered down. It therefore needs reconfiguration during power-up, which consumes significant time and energy.
ATMEL FPSLIC [5]: Atmel's AT94K and AT94S families of Field Programmable System Level Integrated Circuits (FPSLIC devices) integrate basic blocks such as logic, memory, and a microprocessor within an SRAM-based reconfigurable fabric. A typical ATMEL device combines an FPGA with 5K to 40K gates, an AVR 8-bit RISC microprocessor core, several fixed microcontroller peripherals, and up to 36 Kbytes of program and data memory. The AT94 series supports three sleep modes for low power design: Idle, Power-down, and Power-save. In Idle mode, the CPU is stopped but the UART, Timers/Counters, and Interrupt System continue to operate; the microprocessor can therefore wake up and start program execution immediately. In Power-down mode, as opposed to Idle mode, only external interrupts are active. The device thus consumes even less power in Power-down mode than in Idle mode. However, when an external signal is used to wake up the microprocessor, there is a delay until the microprocessor is ready to execute code. The third mode, Power-save, is identical to Power-down but allows the CPU to wake up on Timer interrupts, enabling periodic sleep and wake-up of the CPU. While the processor takes about 2 instruction cycles to resume after an external interrupt, it takes 3 instruction cycles for Timer interrupts.
2.2 Target Applications
Our target applications consist of stream-based data processing of various types (e.g., text, speech, image, music, video) that are widely used in portable devices such as wireless phones, remote sensors, personal digital assistants, and laptops. High throughput and stringent real time requirements are especially important in such data-intensive applications, which demand high processing power. On the other hand, energy efficiency is as important as throughput for portable devices, as they operate in power constrained environments (batteries, solar power, etc.) [59]. The key characteristics of our target applications are as follows:
• the application is a signal processing application. Such applications are compute intensive and are candidates for deployment in low power environments such as hand-held or mobile devices. They are also suitable for modeling using data flow graphs
• the input arrives as discrete frames of similar data at a pre-specified rate. Such an input allows us to identify a single latency requirement that must be met for each frame of data. In addition, the functional behavior of the application should not change depending on the content of the frame
• the application can be multi-rate, such that depending on the serial order of the frame, some application tasks may not execute. For example, an adaptive beamformer processing up to 105 mega samples per second may need to update its weight coefficients only once every second [24]. For modeling and design space exploration purposes, however, we require that such details are completely specified
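These characteristics suggest a simple model: the application is a task data flow graph, and the per-frame latency requirement is a bound on its critical path. The sketch below uses hypothetical task names and latencies, loosely inspired by the ATR pipeline discussed next.

```python
# Sketch: model an application as a task dataflow graph (DAG) and check
# the per-frame latency requirement. Task names and latencies are
# illustrative, not measured values.
def critical_path_ms(tasks, deps):
    """Longest path through a DAG of task latencies.
    `tasks` maps name -> latency (ms); `deps` maps name -> predecessors."""
    memo = {}
    def finish(t):
        if t not in memo:
            memo[t] = tasks[t] + max((finish(p) for p in deps.get(t, [])),
                                     default=0.0)
        return memo[t]
    return max(finish(t) for t in tasks)

tasks = {"preprocess": 4.0, "fft": 6.0, "multiply": 2.0,
         "ifft": 6.0, "postprocess": 3.0}
deps = {"fft": ["preprocess"], "multiply": ["fft"],
        "ifft": ["multiply"], "postprocess": ["ifft"]}

frame_period_ms = 33.3  # e.g. one image frame arriving at roughly 30 Hz
meets_deadline = critical_path_ms(tasks, deps) <= frame_period_ms
```

Because frames are similar and arrive at a pre-specified rate, a single such check per candidate design suffices, which is what makes high-level exploration of these applications tractable.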
In the following, we discuss three representative applications to demonstrate the key
characteristics of our target applications.
Figure 2.4: Automated Target Recognition (ATR)
Embedded Automatic Target Recognition (ATR) systems, specifically those employed within missiles, are compute-intensive applications operating within strict physical and power constraints. The basic ATR algorithm is based on correlation filtering. The input consists of a stream of images, expected to arrive at a constant rate, which implies an upper limit on the latency allowed to process a single image. Each image of the input stream is sequentially preprocessed and then transformed into the frequency domain. The transformed image is compared against several classes of targets of interest in parallel. Comparison is typically done by multiplying with the filter correlation matrices associated with the different target classes. The results for each class are then processed (inverse FFT) to generate the surface maps associated with each target class. By comparing the strongest correlation peaks of each image class against the peaks of the reference classes, the class of the object in the image is identified. By processing several input images, the target can also be tracked. Figure 2.4 provides a data flow graph of a typical ATR application.
Figure 2.5: Target tracking
The major difference between ATR and the target tracking application (represented by Figure 2.5) is that in ATR the target is first detected and then tracked, whereas target tracking tracks the target through motion estimation and then performs classification. Figure 2.5 shows a sample target tracking application. The most important component of target tracking is the velocity filter, which uses three consecutive images to detect the motion (direction and velocity) of the target. The earlier stages preprocess the images to reduce noise (through whitening) and to reduce the effect of known defects in the image due to aging of the camera [85]. Because the target tracker uses sets of images to detect motion, it processes three images during each stage of processing. The input rate is thus an important criterion for the target tracker: while a high input rate allows more precise calculation of motion, it increases the processing requirement of the system and thus, potentially, the energy dissipation. The target tracker is also a good candidate for a multi-rate implementation in which the input rate is initially kept low to conserve energy and, once a target is detected, increased for efficient tracking.
Figure 2.6: LMS-based MVDR adaptive beamforming
The Least Mean Square (LMS)-based Minimum Variance Distortionless Response (MVDR) adaptive beamforming algorithm is used in base stations for software defined radio to implement adaptive (smart) antennas. The processing requirements of smart antennas are many orders of magnitude greater than those of single antenna implementations [24]. In addition, the base station can be deployed in a power-constrained environment [25, 32]. Hence, energy efficiency is an important metric when designing such base stations. With respect to performance, an adaptive beamformer may process up to 105 mega samples per second [24], which is approximately 10 nanoseconds per sample. Such a beamformer may need to update its weight coefficients only once every second [24].
The exact solution to the MVDR algorithm involves matrix inversion, which has high computational complexity and is not suitable for real-time signal processing. The LMS-based solution under consideration (shown in Figure 2.6) has a much lower computational complexity and thus is suitable for many practical telecommunication systems. The LMS-based MVDR algorithm consists of three steps. One possible class of input to this beamforming algorithm is acoustic signals captured using microphones. The first step is the calculation of the filter output. The second step is the calculation of the input signal power, the correlations between the signals, normalization, and the calculation of the error signals. The third step is the update of the weight coefficients of the adaptive filter. The incoming data rate is approximately 105 mega samples per second [24]. Based on different application requirements and the working environment, the rate of update of the coefficients differs.
2.3 Deployment Scenarios and Performance Requirements
The application design problem requires identifying an appropriate mapping and scheduling of the application tasks onto the target hardware. The identified solution is required to meet the given performance requirements while processing a single input frame. However, as energy becomes an important criterion, the behavior of the hardware even when it is not active becomes important as well. Figure 2.7 shows the average power dissipated by a system under two different schedules. Part (A) of Figure 2.7 shows the average power dissipated while the system is processing data and while it is idling. Due to large leakage currents, the quiescent power of the hardware can be significant, especially if the duration of idling is long compared to the duration of actual processing. Alternatively, a designer can choose to shut down the hardware and restart it, provided the events of interest can trigger device start-up. Recently, there have been many studies on using trip-wire sensors to wake up more expensive (in terms of energy dissipation) nodes [67]. In such situations, as shown in part (B) of Figure 2.7, there is a potential to save energy if the duration of idling is long enough to offset the cost of shutting down and starting up the hardware. Therefore, the deployment scenario plays a significant role in designing energy efficient systems. The deployment scenario defines whether events of interest can trigger the wake-up mechanism, as well as the rate of input.
Figure 2.7: Sources for energy dissipation for execution based on a duty cycle (part (A): power dissipation while processing and while idling; part (B): power dissipation while shutting down and starting up, with a potentially long idle duration in between)
The target tracking and ATR applications discussed above satisfy these criteria, as their rate of input is not very high. Therefore, it is possible to evaluate the tradeoff between shutting devices down and leaving them idle. Additional information, such as the mechanism used to start up the hardware, provides the cost of such operations, which is necessary to correctly perform design space exploration.
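The shut-down-versus-idle tradeoff of Figure 2.7 reduces to a break-even test: shutting down pays off only when the energy that would be wasted idling exceeds the combined shut-down and start-up overhead. The numbers below are hypothetical; a real comparison would use measured device costs.

```python
# Sketch: the shut-down-vs-idle tradeoff of Figure 2.7 as a break-even
# comparison. All numeric values in the example are hypothetical.
def should_shut_down(idle_s, p_idle_w, e_shutdown_j, e_startup_j):
    """Shut down only if idling would waste more energy than the
    shut-down plus start-up overhead."""
    return p_idle_w * idle_s > e_shutdown_j + e_startup_j

# With 0.5 W quiescent power and 2 J of combined transition overhead,
# shutting down pays off only for idle periods longer than 4 seconds.
```

The same comparison, evaluated per idle interval of the duty cycle, is what a design space exploration tool needs the start-up and shut-down cost information for.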
Our focus is minimizing energy while meeting a given latency constraint. For example, in a typical target tracking system where the target is moving, there is a limited window of time during which the target can be seen by the camera. Similarly, for an MPEG encoder/decoder, there is a latency constraint on processing the input images to minimize the loss of frames and maintain continuity. In addition, when these applications are deployed in power constrained environments, it is essential to minimize energy dissipation.
2.4 Design Space
Figure 2.8: Large design space due to large number of parameters (knobs such as precision, multi-rate applications, DVS, mapping, reconfiguration, and choice of algorithm on a heterogeneous reconfigurable system)
The design space that must be explored to identify a latency and energy efficient design grows enormously due to the large number of architecture knobs (Figure 2.8). For example, suppose that an application with n tasks is to be mapped onto a heterogeneous system with k processing elements. The number of possible mappings can be computed using Stirling numbers of the second kind, S(n, m), which denote the number of ways to partition a set of n elements into m non-empty sets [7], where

    S(n, m) = \frac{1}{m!} \sum_{i=0}^{m} (-1)^i \binom{m}{i} (m - i)^n    (1)
Let us assume that each of the k processing modules can have on average p different states (settings of architecture knobs). For example, if the Intel PXA processor can be operated at four different frequencies [35], it has four different states. Further, it is possible to use some or all of the processing components available on a heterogeneous system for an implementation. In addition, the number of processing components increases if we are choosing from a given set of devices. Thus, the number of possible solutions (ways an application can be mapped) is given by [58]:

    N_{mapping}(n, k, p) = \sum_{x=1}^{k} \binom{k}{x} \, x! \, S(n, x) \, p^x    (2)
Suppose we apply the above formula to a device selection problem for an application with 5 tasks. We assume that there are 3 candidate heterogeneous systems with 3 processing components each and 6 possible configurations for each processing component. Based on Equation 2, there are 26,682 different designs in the design space. This is a very large number even for the most efficient simulator, if simulation is used to evaluate all possible designs.
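A direct way to compute such counts is sketched below. `stirling2` uses the standard recurrence, and `n_mappings` implements one combinatorial reading of the discussion above (choose and order x of the k components, partition the n tasks among them, and give each used component one of its p states); the exact expression used in [58] may count the space slightly differently.

```python
# Sketch: design-space size via Stirling numbers of the second kind.
from functools import lru_cache
from math import comb, factorial

@lru_cache(maxsize=None)
def stirling2(n, m):
    """S(n, m) via the recurrence S(n, m) = m*S(n-1, m) + S(n-1, m-1)."""
    if n == m:
        return 1
    if m == 0 or m > n:
        return 0
    return m * stirling2(n - 1, m) + stirling2(n - 1, m - 1)

def n_mappings(n, k, p):
    """One reading of Equation 2: use x of the k components (chosen and
    ordered), partition the n tasks among them, and assign each used
    component one of its p states."""
    return sum(comb(k, x) * factorial(x) * stirling2(n, x) * p**x
               for x in range(1, k + 1))
```

As a sanity check, with p = 1 the count reduces to k^n, the number of ways to assign each task independently to one of k components; with p > 1 the count grows much faster, which is the point the example above makes.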
2.5 Design Problem
Many FPGA and EDA (Electronic Design Automation) vendors provide tools that address the problem of application design using FPGAs. These tools range from Xilinx System Generator for DSP [95], for high-level design specification and implementation, to Synplicity Amplify FPGA [4], for placement and logic optimization. The design tools start with a single conceptual design typically referred to as the "golden reference". The golden reference specifies the algorithm being implemented, the I/O interface, the memory requirement, and, if needed, the area and latency constraints to be satisfied. Design space exploration is performed as part of implementation or through local optimizations to address performance bottlenecks identified during the design process. The assumption, therefore, is that the golden reference is in fact a "good" design that can be further improved through local optimizations to meet the specified performance constraints. A System Architect's job is to identify the golden reference.
Figure 2.9: Two-level design space exploration (the design space of choices such as algorithms, IP-cores, hardware vs. software, devices, reconfiguration, activation schedule, and binding is narrowed to a golden reference by high-level coarse-grained evaluation using estimators, heuristics, or algorithms; low-level fine-grained evaluation, using vendor-provided tools for steps such as pipelining and placement and routing, then yields the final design)
Typical duties of a System Architect, in the domain of embedded system design, include system definition, system architecture and implementation, product architecture definition, functional requirements specification, power management strategy identification, and system performance analysis. To perform such tasks, a System Architect evaluates the tradeoffs among various design choices: different algorithms and IP-cores, the choice of target FPGAs, hardware/software co-design, run-time reconfiguration, and the activation schedule for designs in which the target hardware follows a duty cycle, among others. With such a wide array of design choices available, it is desirable to be able to rapidly explore a large number of designs and then arrive at the golden reference. In addition, as FPGAs are being considered for newer application domains, it is not always possible to assume the availability of the experience required to identify a golden reference without evaluating the available design choices. Hence it is necessary to explore techniques that allow a System Architect to efficiently specify and evaluate a large number of design choices and identify a set of candidate designs for fine-grained evaluation. By fine-grained evaluation we mean synthesis and low-level simulation (RT-level or cycle accurate) using the design tools provided by the FPGA and EDA vendors. However, there is a lack of design tools that allow high-level design space exploration (DSE). High-level DSE refers to the rapid evaluation of design choices without involving RT-level or cycle accurate simulations, which makes it challenging to predict performance during high-level DSE. The design space exploration can thus be described as a two-level process (Figure 2.9). The first level performs a coarse-grained evaluation of the design space using rapid performance estimators and algorithmic or heuristic tools. These tools allow rapid coarse-grained exploration to identify a set of candidate designs for further evaluation. The reason for identifying a set of designs will be discussed in detail in Chapter 4. Once a set of candidate designs is identified, the second level performs design space exploration in the traditional manner using the tools provided by the FPGA and EDA vendors. Our proposed design technique also addresses the above problem.
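The two-level flow of Figure 2.9 can be sketched as a simple prune-then-refine loop: a cheap high-level estimator ranks the whole space, and only the few survivors are passed to expensive low-level evaluation. The design representation, cost models, and numbers below are purely illustrative.

```python
# Sketch of the two-level exploration flow of Figure 2.9. The coarse
# estimator is cheap and approximate; the fine evaluator stands in for
# synthesis/low-level simulation and is applied only to survivors.
def two_level_dse(designs, coarse_cost, fine_cost, keep=3):
    """Rank all designs with `coarse_cost`, then re-rank the `keep`
    survivors with `fine_cost` and return the winner."""
    survivors = sorted(designs, key=coarse_cost)[:keep]
    return min(survivors, key=fine_cost)

# Toy space: designs are (frequency_MHz, use_fpga) pairs. The coarse model
# is a crude energy proxy; the fine model adds a penalty (e.g. FPGA
# configuration overhead) that the estimator missed.
designs = [(100, False), (200, False), (100, True), (200, True)]
coarse = lambda d: d[0] * (0.5 if d[1] else 1.0)
fine = lambda d: coarse(d) + (20 if d[1] else 0)
best = two_level_dse(designs, coarse, fine, keep=2)
```

Keeping a set of survivors rather than a single coarse winner hedges against estimator error, which is the rationale the text defers to Chapter 4.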
Chapter 3
Related Work
There are a number of research and commercial efforts that address the design of energy and latency efficient embedded systems. These efforts have resulted in the development of high-level models for the different components of a heterogeneous embedded system, design space exploration techniques and tools, high-level languages and associated compilers, optimization heuristics, and performance estimation techniques. In the following, we discuss several related works in each category and compare our contributions against each.
The SPADE (System Level Performance Analysis and Design Space Exploration) methodology evaluates embedded system architectures at multiple abstraction levels to perform design space exploration (DSE) [45]. The related Artemis project [65] provides a simulation environment for system architecture design space exploration. Artemis also provides an experimentation framework to identify computationally intensive tasks suitable for execution on reconfigurable hardware. The application is modeled as a Kahn Process Network, and the hardware is modeled using a large library of architecture models at different granularities. Architecture evaluation at multiple granularity levels exploits the time versus accuracy tradeoff. Hence, DSE involves quick evaluation of a large number of designs using coarse-level architecture models, followed by more accurate and detailed modeling as the number of designs decreases. In contrast, we use a hierarchical approach that integrates a pruning heuristic with high-level performance estimation. Additionally, our data flow model for applications enables analysis of operating state transition costs, such as reconfiguration and dynamic voltage scaling, between the execution of successive application tasks.
Stammermann et al. [87] present ORINOCO, a software tool for power dissipation analysis and optimization at the algorithmic level of abstraction from C/C++ and VHDL descriptions. ORINOCO allows a designer to evaluate the choice of algorithms, scheduling, and implementation for register-transfer-level design. Co-design based on hierarchical partitioning of algorithms into basic operations is presented in [63]. This methodology gradually decomposes an algorithm into processes and functions (until basic operations such as addition and multiplication are reached) and evaluates all possible partitionings to verify whether the performance constraint can be met. Peixoto et al. [64] discuss several fundamental issues involved in supporting early design space exploration. These
issues include definition of an adequate level of abstraction, system-level metrics, and
mechanisms for automating the exploration process. They proposed algorithm and
architecture-level models and defined a set of operations on these models to support
DSE. A Java-based environment, Sky, was also proposed. However, these efforts do
not exploit available capabilities, such as frequency scaling, reconfiguration, and shutting down or idling devices, that can be utilized to achieve optimal energy performance. In addition, ORINOCO focuses only on hardware synthesis. Our methodology, on the other hand, exploits the above capabilities for the design of heterogeneous embedded systems that integrate general-purpose processors and reconfigurable devices onto a single device.
Among other tools, the CODEF tool allows design space exploration based on a
complete specification of partitioning/scheduling and interconnection synthesis [6]. The focus is on time and area constraints, and it primarily targets the design of dedicated systems. Baghdadi et al. proposed a design space exploration framework based on a fast and accurate estimation approach primarily targeted towards homogeneous multiprocessors with a fixed architecture [7]. This effort uses the novel idea of performing a few cycle-accurate simulations to extract all timing constraints necessary for all possible implementations. All these efforts target specific non-programmable architectures and focus only on time optimization; thus they may not be suitable for energy-optimal application implementation using state-of-the-art heterogeneous embedded systems.
Other system-level design frameworks, such as PMOSS and COSMYA, target uniprocessor systems [27, 34]. PMOSS targets evaluation of the speed-up due to a coprocessor.
COSMYA calculates separate metrics for software, hardware, and communication based
on profiling and low-level simulation to analyze performance. POLIS combines high-level and low-level simulation but targets only a specific architecture of one microprocessor accompanied by several custom coprocessors [8]. These efforts target certain classes of embedded systems, such as application-specific integrated circuits and microcontrollers. Our proposed methodology and framework, however, are more generic and target heterogeneous embedded systems.
Several C/C++-based approaches, such as SystemC and SpecC, are primarily aimed at making the C language usable for hardware and software design and
allow efficient and early simulation, synthesis, and/or verification [15]. However, these
approaches do not facilitate modeling of applications at a level of abstraction that
enables algorithmic analysis. A top-down compilation method that compiles C programs into VHDL was proposed by Bazargan et al. [9]. However, C/C++ or VHDL descriptions do not capture the parameters affecting system-wide energy, and a designer requires complete knowledge of the final system before code can be generated in these languages. Additionally, generating source code and going through the complete synthesis process to evaluate each algorithmic decision is time consuming. Nevertheless, such high-level languages are complementary to our methodology. Our framework currently provides preliminary support for SystemC and can be augmented to support system synthesis using the other languages to implement the design identified through design space exploration [53].
Among the commercial tools, Xilinx System Generator for Simulink provides a
high-level interface for application design using pre-compiled libraries of signal processing kernels [95]. Similarly, Cadence Signal Processing Workbench (part of the Incisive Verification Platform) [14] supports development of signal processing algorithms using communications and multimedia libraries. Other tools, such as Xilinx EDK
(embedded development kit) and Mentor Graphics FPGA Advantage provide integrated
design environments that accept high-level specifications as VHDL or Verilog scripts, schematics, finite state machines, etc., and provide simulation, testing, and debugging support for the designer to implement a design using FPGAs [80, 96]. However, these tools start with a single conceptual design; design space exploration is performed as part of implementation or through local optimizations to address performance bottlenecks identified during synthesis. Additionally, these tools [80, 96] are designed for a specific target device and do not allow integration of additional simulators or design tools. In contrast, our methodology and the proposed framework allow evaluation of a wide variety of hardware devices while designing a heterogeneous embedded system.
In addition to the tools discussed above, many techniques for formal modeling and
optimization of a single metric (latency or energy) while mapping applications onto
reconfigurable computing systems have been proposed [12, 62]. An algorithm that combines exact (MILP-based) and heuristic techniques for the synthesis of large time-constrained heterogeneous adaptive systems was proposed in [81]. While these techniques are faster than our approach, some cannot be easily extended to support additional performance metrics or design choices, such as shutting down and starting up devices to reduce power.
In summary, we use a hierarchical model-based approach that integrates a pruning heuristic and high-level performance estimation. The model-based approach allows us to define a unified framework and promotes reuse. Because of the pruning heuristic, we can rapidly evaluate a much larger design space prior to estimation- (and simulation-) based design space exploration. We do not assume an underlying architecture; instead, we evaluate a set of available components, typically commercial-off-the-shelf (COTS), to
identify an architecture that minimizes energy dissipation while meeting the latency
requirements. In addition to architecture, we also identify the corresponding mapping
of the application onto the architecture. We are not aware of any commercially available
tool that can handle such a wide range of devices and considers duty cycle specifications
during system design while identifying an energy efficient design that meets a given
latency constraint.
In addition, for FPGAs, our approach enables high-level parameterized modeling
for rapid performance estimation and efficient tradeoff analysis. Using our approach, the
designer does not need to synthesize the design to verify design decisions. Once a model
has been defined and parameters have been estimated, design decisions are verified using
the high-level performance estimator. Additionally, parameterized modeling enables
exploration of a large design space in the initial stages of the design process. Our design
framework also allows integration of various design tools and thus provides a unified
extensible framework for FPGA design.
Chapter 4
Hierarchical Design Space Exploration
We propose a hierarchical design space exploration methodology for energy efficient
application mapping onto heterogeneous embedded systems. Energy-efficient design
in our context refers to a design that dissipates the minimum energy while meeting a
given latency constraint. Hierarchical design space exploration integrates two different categories of exploration techniques: heuristic-based selection and simulation-based exhaustive evaluation. While heuristic-based techniques are fast, the accuracy of the result depends on the (typically high) level of abstraction used when modeling the architecture. Conversely, due to the use of low-level (detailed) simulators, exhaustive search can be very accurate but time consuming. Our proposed methodology provides a tradeoff between these two search techniques and enables efficient design space exploration for energy-efficient application design.
4.1 The Problem Domain
Our problem domain is defined by the following four components (see Figure 4.1):
• target application
• target hardware
• performance and design constraints
• deployment scenario
[Figure: the deployment scenario, target application, constraints, and candidate hardware feed the hierarchical design space exploration methodology, which uses tools (simulators, estimators, and exploration tools) and glue logic to produce a candidate design.]
Figure 4.1: Different components that define our problem domain
Our target applications are signal processing applications that can be modeled as
data flow graphs. A data flow graph is a directed acyclic graph where the nodes represent the application tasks and the directed edges represent the dependencies among the tasks. The input to such applications is a set of frames arriving at a pre-specified rate. The functionality of the application is oblivious to the content of each frame. These applications can be multi-rate, where the subset of application tasks that executes depends on the serial order (index) of a frame.
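As an illustrative sketch (the task names, rates, and data structures below are hypothetical, not taken from the thesis), such an application model reduces to a DAG of tasks plus a per-task rate that determines which tasks run on a given frame:

```python
def topo_order(deps):
    """deps: task -> set of predecessor tasks. Returns a dependency-respecting order."""
    tasks = set(deps) | {p for ps in deps.values() for p in ps}
    indeg = {t: len(deps.get(t, ())) for t in tasks}
    order, ready = [], [t for t in tasks if indeg[t] == 0]
    while ready:
        t = ready.pop()
        order.append(t)
        for u in tasks:                    # release tasks whose last dependency just ran
            if t in deps.get(u, ()):
                indeg[u] -= 1
                if indeg[u] == 0:
                    ready.append(u)
    return order

def active_tasks(frame_idx, rate_of):
    """Multi-rate execution: task t runs only on frames whose index is a multiple of its rate."""
    return {t for t, r in rate_of.items() if frame_idx % r == 0}

deps = {"fft": {"filter"}, "detect": {"fft"}}          # filter -> fft -> detect
order = topo_order(deps)
assert order.index("filter") < order.index("fft") < order.index("detect")
assert active_tasks(4, {"filter": 1, "fft": 2, "detect": 4}) == {"filter", "fft", "detect"}
assert active_tasks(3, {"filter": 1, "fft": 2, "detect": 4}) == {"filter"}
```

Any topological order of the DAG is a legal serial execution order for one frame; the multi-rate rule simply masks out tasks on frames where they are inactive.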
Our target hardware consists of a set of candidate devices that can be processors,
DSPs, FPGAs, memories, or systems-on-chip (see Chapter 2). These devices are considered candidates for our design problem, which is a device selection problem. There is no specific requirement for these candidate devices except that they be low-power devices supporting features that can be exploited to identify an energy-efficient design while meeting the given latency constraints.
The performance constraint is specified as a hard latency constraint that the design
selected by our methodology should meet while minimizing energy dissipation. The other class of constraints we support is design constraints, which can specify whether an application task should (or should not) be mapped to a certain candidate device, or whether certain candidate devices should (or should not) be selected together.
The deployment scenario specifies the working environment for the whole system: the input rate, whether that rate is constant or variable, and whether devices can be shut down and started up based on some triggering mechanism (e.g., a tripwire). The deployment scenario can also specify the start-up time and energy cost, depending on the mechanism used to perform these tasks. Figure 4.2 provides an overview of our
design problem.
4.1.1 Formal Definition of the Design Problem
Given a target signal processing application, a set of candidate devices, performance and design constraints, and a deployment scenario, the design problem is to identify a design that specifies the mapping and scheduling of the application tasks, and the start-up and shut-down schedule of the target devices, such that the given constraints are met and energy dissipation is minimized.
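A design, as defined above, therefore bundles three pieces of information. A minimal data-structure sketch (the field names and example values are illustrative only, not the thesis's notation):

```python
from dataclasses import dataclass

@dataclass
class Design:
    mapping: dict          # task -> candidate device it executes on
    schedule: dict         # task -> start time within a frame period
    power_schedule: dict   # device -> list of (start_up, shut_down) intervals

d = Design(
    mapping={"fft": "fpga", "detect": "cpu"},
    schedule={"fft": 0.0, "detect": 1.5},
    power_schedule={"fpga": [(0.0, 1.5)], "cpu": [(1.4, 2.0)]},
)
assert d.mapping["fft"] == "fpga"
```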
Our focus is to develop a generic approach for this class of problems, realize the approach for the specific problem domain discussed above, and finally develop a proof-of-concept framework to demonstrate the validity of our approach.
4.2 Challenges
In this section, we summarize the challenges involved in performing design space exploration for the class of problems discussed above. These challenges also define the basic
requirements of the approach we have proposed in this thesis.
[Figure: candidate devices, a multi-rate, duty-cycled application model with operating-state transitions, and power dissipated while idling (or shut-down and start-up) feed the design problem: select devices and minimize energy while meeting the latency requirement from input to output.]
Figure 4.2: Design problem overview
• large design space: With the availability of multiple implementation platforms
such as FPGAs, traditional processors, and DSPs, a designer not only needs to
identify suitable platforms but also appropriate hardware/software partitioning
and mapping onto those platforms. In addition, other capabilities that play a significant role, especially for energy-efficient design, are reconfiguration, dynamic voltage scaling, choice of low-power operating states, and device activation scheduling based on the duty cycle specification. Thus, many choices and tradeoffs are available during energy- and latency-efficient system design. However, a large number of choices during application design results in a large design space.
• lack of suitable well-defined optimization problems: Heuristic-based optimization techniques are fast, as they are based on suitable high-level models representing the underlying problem. However, in order to develop high-level models, a number of simplifying assumptions are made, which affects the accuracy of the results produced.
• simulation is expensive: Simulation using a low-level simulator is typically considered the most accurate method to verify design decisions without performing physical measurements. However, simulations, especially cycle-accurate or RT-level ones, are too time consuming to be practical. In addition, as our target is a heterogeneous embedded system, it may not be possible to integrate the different simulators for the candidate devices into a suitable system-wide simulator. Moreover, implementing the designs, even with high-level code, is time consuming.
• multi-metric (energy, time, area) optimization: We focus on optimizing the time and energy metrics. If our design includes an FPGA, we can also optimize for the given area. However, it is expected that the performance metrics are prioritized to resolve conflicts due to multi-metric optimization. Weight and cost are also valid performance metrics. In addition, due to complex interdependencies, the design parameters (choice of operating state, activation schedule, etc.) can have conflicting effects on different performance metrics, making manual design and analysis impractical.
• shrinking time-to-market: Time-to-market is defined as the length of time it takes to
get a product from idea to marketplace. Recently, due to increased competition,
time-to-market has become very critical [16]. Therefore, it is imperative that the
design methodology be fast and efficient while traversing a large design space.
4.3 Approach
4.3.1 Hypothesis
It is possible to define a model at a suitable level of abstraction and, using the model,
design space pruning heuristics can be defined, high-level performance estimators can be
developed, and low-level simulators can be driven to define a methodology for efficient
application design using heterogeneous embedded systems.
4.3.2 Our Approach Towards Hierarchical Design Space Exploration
[Figure: in Phase-I, optimization heuristics prune a very large design space defined by signal processing applications, heterogeneous systems, and design parameters; in Phase-II, the designer drives simulators and a high-level estimator, supplying implementations, input stimuli, and simulator configurations, with low-level simulations providing performance estimates.]
Figure 4.3: Hierarchical design space exploration
Our hierarchical design space exploration methodology has the following components (Figure 4.3):
• define a high-level model to characterize the target application and architecture
and capture the various performance constraints and deployment scenarios given
as input
• perform design space pruning based on the high-level model to select a set of
designs based on the user-specified latency requirement and design constraints
• perform hierarchical simulation on the set of designs selected through design
space pruning to identify the design with minimum energy dissipation that meets
the latency requirements
Several techniques described in the literature use various types of design space exploration to identify a single design that best matches the performance requirements (refer to Chapter 3). The general approach for a large class of such techniques consists of two steps: modeling and design space exploration (DSE). Modeling captures
the necessary information about the architecture and application and DSE identifies the
design that meets the requirement specifications, satisfies the constraints, and minimizes
(maximizes) some cost (objective) function. In terms of design space exploration, these
techniques can be divided into two categories:
• exhaustive search based
• heuristics based
Our proposed methodology has three steps instead. These three steps are classified as modeling, heuristic-based pruning, and simulation-based exhaustive search. We combine the advantages of the two types of design space exploration techniques in our methodology. The accuracy of the result of the heuristic-based search techniques depends on the level of abstraction of the models that capture the application and architecture details.
To model state-of-the-art candidate devices for energy, several attributes must be considered in order to achieve accuracy. For example, simulators such as SimpleScalar and PowerAnalyzer expose more than 25 parameters [83, 98]. A large number of variable parameters makes it difficult to develop an appropriate heuristic-based approach. Conversely, (low-level) simulators can provide very accurate performance estimates but are time consuming. For example, SimpleScalar simulates a given MIPS architecture at the rate of 200 kilocycles per second. Therefore, it is not efficient to use simulation to search a large design space.
Our approach provides a tradeoff between the two techniques. Initially, we use a heuristic-based approach to evaluate a large design space and identify a set of designs that best meet the latency and energy requirements. In this first step, Phase-I, we rapidly evaluate a large design space and, by choosing a set of designs rather than a single one, accommodate errors due to the assumptions made during modeling. In the next step, Phase-II, we use the simulators to perform an exhaustive search on the designs selected by Phase-I.
The main concern during simulation of a heterogeneous embedded system is the lack of a system-level simulator. Therefore, we have developed a high-level performance estimator, HiPerE, that integrates component-specific simulation results to provide system-level performance estimates. We exploit HiPerE to perform hierarchical simulation. For hierarchical simulation, fast component-specific simulators are initially used to efficiently explore the large number of designs, and more accurate simulators are used later to facilitate accurate evaluation of the designs. Thus we gradually eliminate sub-optimal solutions during hierarchical simulation and finally identify the design that best meets the performance requirements.
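The two-phase flow described above can be sketched as follows. This is a hypothetical skeleton, not HiPerE itself: the cost functions, the design representation, and the `keep` cutoff are stand-ins for the thesis's actual pruning heuristic and estimator.

```python
def hierarchical_dse(designs, coarse_cost, accurate_cost, latency_bound, keep=4):
    """Phase-I: prune with a fast, approximate cost model; Phase-II: rank the
    few survivors with an accurate (expensive) estimator.
    Both cost functions return a (latency, energy) pair for a design."""
    # Phase-I: keep the `keep` lowest-energy designs meeting the bound
    # under the coarse model.
    survivors = sorted(
        (d for d in designs if coarse_cost(d)[0] <= latency_bound),
        key=lambda d: coarse_cost(d)[1],            # coarse energy estimate
    )[:keep]
    # Phase-II: expensive, accurate evaluation of the survivors only.
    feasible = [d for d in survivors if accurate_cost(d)[0] <= latency_bound]
    return min(feasible, key=lambda d: accurate_cost(d)[1], default=None)

designs = [{"lat": 10, "en": 5}, {"lat": 8, "en": 7}, {"lat": 20, "en": 1}]
coarse = lambda d: (d["lat"] * 0.9, d["en"])        # optimistic coarse model
accurate = lambda d: (d["lat"], d["en"])            # ground truth for this toy
best = hierarchical_dse(designs, coarse, accurate, latency_bound=12)
assert best == {"lat": 10, "en": 5}
```

The point of the structure is that `accurate_cost` (the slow path) is invoked only on the handful of Phase-I survivors, never on the full design space.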
4.4 Key Ideas of Our Methodology
In this section, we define some concepts and keywords that will be used throughout the thesis for the qualitative and quantitative analysis of our design approach and the various tools we have used to realize our approach.
A “design space” can be characterized by a set of parameters P = (P1, P2, . . . , PN), where each parameter Pi is associated with a set of values Ci. Each set contains Ni values, where each value is denoted Ci,j with 1 ≤ j ≤ Ni. A parameter and its corresponding values are determined based on the target system being designed. For example, if an application task is associated with 3 implementations, then the latency and energy estimates of the task can each be a parameter associated with 3 values. Therefore, in simple terms, the size of the design space can be considered the cardinality of the Cartesian product of all parameter sets Ci, where 1 ≤ i ≤ N. However, we also need to capture the constraints that specify the dependencies among the parameters. Based on the previous example, latency and energy, while considered as two parameters, can only produce 3 distinct combinations, as selection of an implementation automatically selects an energy and latency estimate.
Given such a design space, a design can be identified as a tuple of N elements (V1, V2, . . . , VN), where each Vi is an element of the set Ci, assuming that the dependency constraints among the parameters are satisfied.
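As a back-of-the-envelope illustration (parameter names and the dependency rule are hypothetical), the design-space size is the cardinality of the Cartesian product of the value sets, reduced by any dependency constraints:

```python
from itertools import product
from math import prod

value_sets = {
    "mapping": ["cpu", "dsp", "fpga"],   # 3 implementations per task
    "voltage": [1.0, 1.3, 1.8],          # 3 operating voltage settings
    "memory":  ["sram", "sdram"],        # 2 memory configurations
}
# Unconstrained size: |C1| * |C2| * |C3| = 3 * 3 * 2
upper_bound = prod(len(v) for v in value_sets.values())

# A dependency constraint shrinks the space: suppose, for illustration,
# the FPGA mapping fixes the operating voltage at 1.0.
def feasible(design):
    m, v, _ = design
    return not (m == "fpga" and v != 1.0)

space = [d for d in product(*value_sets.values()) if feasible(d)]
assert upper_bound == 18 and len(space) == 14   # 4 infeasible tuples removed
```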
Let V be a performance metric; for our target domain it can be energy, latency, or area. The performance metric can be expressed as a function of the parameters that define the design space: V = F(P1, P2, . . . , PN). The simulators and performance estimators can be viewed as implementations of such a function to generate performance estimates; thus the function may or may not be expressible as a pure mathematical relation between the parameters. In addition, even the heuristic-based design space exploration tools can be considered functions that produce an accept or reject decision for a given design, depending on whether a given set of performance constraints is met.
Based on the above definitions, an efficient design is a design that minimizes or maximizes a certain combination of performance metrics while meeting pre-specified values of another set of performance metrics. For example, in this thesis we consider meeting a latency constraint while minimizing energy dissipation. Another optimization criterion, for purely FPGA-based designs, can be minimizing the product of energy, latency, and area [19].
In order to discuss our methodology, we define problem domain to refer to a class
of system design problems. Given an application that can be modeled using a data flow
graph (DFG) and a set of target devices, problem domain refers to all system design
problems that can be defined based on the given class of applications and the given target
devices. For example, a linear array of tasks and a processor that supports dynamic
voltage scaling define a problem domain which includes the following system design
problem, “identify appropriate voltage setting for each task such that overall energy
dissipation is minimized”. Another problem may require minimization of energy while a
given latency constraint is satisfied. A set of parameters is associated with each problem
domain. A parameter can be a variable or a constant. For example, for the problem
domain discussed above, some of the parameters are task execution cost for each voltage
setting, operating voltage, voltage scaling cost, cost of data access due to the use of a
memory, choice of memory devices, etc. Among the parameters, operating voltage is a
variable parameter and task execution cost per operating voltage is constant. For a given
problem domain, there exist several optimization heuristics, high-level estimators, and
simulators that can be used to perform design space exploration. We define parameter coverage as a metric to compare different design space exploration (DSE) techniques applicable to a problem domain. For a given solution, be it an optimization heuristic or an estimation/simulation tool, parameter coverage refers to the set of parameters considered by that technique while estimating performance or performing design space exploration. Higher parameter coverage means a larger set of parameters and results in higher accuracy, but can potentially be more time consuming during DSE. A low-level simulator is an example of a performance estimation tool with high parameter coverage (e.g., SimpleScalar [83]). In contrast, optimization heuristics, due to their high level of abstraction, tend to have lower parameter coverage.
Given a problem domain, hierarchical design space exploration is defined as a two-step process (Figure 4.3). The first step uses a pruning heuristic that generates a set of designs meeting the given constraints. The second step consists of a suitable high-level performance estimator that evaluates the performance of any design that is a potential solution for the given problem domain. The high-level estimator we discuss in this thesis, HiPerE, uses an interpretive-simulation-based approach to estimate performance (see Chapter 6). Thus, the high-level performance estimator has a higher parameter coverage than the optimization or pruning heuristics. Hence, HiPerE can cover the parameters not included in the models used by the pruning heuristics. As a result, our methodology can explore a larger design space than an optimization-heuristic-only design space exploration scheme. Hierarchical design space exploration assumes the availability of appropriate simulators to estimate the model parameters required by the pruning heuristics and the high-level estimators. Examples of such parameters are the performance costs of various mappings, operating-state transition costs, etc. Some of the parameters can also be obtained from the data sheets provided by the device vendors. We also assume that appropriate inputs, such as implementations in the high-level languages supported by the simulators, input stimuli, and simulator configurations, are available to perform simulation.
4.4.1 Advantages of Our Methodology
The primary difference between our version of a pruning heuristic and the traditional version is the generation of a set of designs as opposed to a single optimal design. The set consists of the designs that meet the given performance constraints. By ensuring that we have a set of good designs rather than one optimal design, we increase the probability of finding the real-optimal design from the set even when approximate high-level models are used. An optimal design is the best design identified (based on the performance requirements) by the optimization heuristic using the underlying approximate high-level model. A real-optimal design is the design that is optimal when the designs are implemented in hardware and performance is measured. A real-optimal design can differ from the design identified by the optimization heuristic because the latter assumes a high-level approximate model with lower parameter coverage. For ease of comparison, we assume that the most detailed low-level simulator available is accurate and can be used to identify the real-optimal solution. We also assume that the errors induced by approximations are marginal when compared with the actual performance values. The advantages of our hierarchical design space exploration methodology are as follows:
• robust against approximation errors due to high-level abstractions (models) used
by the optimization or pruning heuristics
• reduces the number of simulations necessary when compared against simulation
based design space exploration
• combines the speed of heuristic based design space exploration and the higher
accuracy and parameter coverage of simulation based DSE
• a designer can potentially combine different pruning heuristics and high-level estimators to suit the needs of the target application design problem domain
Low-level detailed simulators, such as SimpleScalar [83] and ModelSim [51], provide accurate performance estimates but are time consuming. Evaluation of even a design space on the order of tens or hundreds of designs can take days. Our hierarchical design methodology does not require simulation to evaluate the complete design space. Rather,
the simulators are used only to estimate the performance of the different mappings, which the pruning heuristics and high-level estimator then use to perform DSE. Therefore, our methodology is significantly faster than simulation-based DSE. Chapter 5 compares the overhead of our methodology with that of simulation-based DSE. In comparison with optimization-heuristic-based DSE techniques, due to the use of a high-level estimator, we support a higher parameter coverage. Higher parameter coverage results in the evaluation of a larger design space, because the number of designs in a design space depends on the possible values of the various parameters associated with a problem domain. For example, for the problem domain defined by an application and a processor supporting dynamic voltage scaling, if the number of discrete voltage settings increases, or if we also choose to evaluate a set of memory configurations, the size of the design space will grow.
In the following, we perform an experiment to demonstrate the robustness of our
hierarchical design space exploration methodology against approximation errors.
4.4.2 Robustness Against Approximation Errors
Our objective in this experiment is to demonstrate the effect of errors, due to assumptions made during high-level modeling, on design space exploration, and the advantages of a hierarchical design methodology over optimization heuristics. The design space exploration problem we consider is: “given a linear array of tasks, a device with multiple operating states, a set of valid states for each task, the time taken to execute each task in each (valid) state, and the state transition cost for each pair of states, find a mapping of the tasks to operating states with the minimum latency” (Figure 4.4). An operating state can be a configuration, in the case of reconfigurable devices, or an operating voltage (frequency), in the case of processors supporting dynamic voltage (frequency) scaling.
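This baseline problem has the classic assembly-line-scheduling structure: the minimum latency through task i in state s is the task's execution time in s plus the best predecessor latency including the transition cost. A sketch of that recurrence (the numeric costs below are made up for illustration, not measured data):

```python
def min_latency(exec_cost, trans_cost):
    """exec_cost[i][s]: latency of task i in state s.
    trans_cost[p][s]: latency of switching from state p to state s.
    best[s] holds the minimum latency of tasks 0..i ending in state s."""
    n, states = len(exec_cost), range(len(exec_cost[0]))
    best = [exec_cost[0][s] for s in states]
    for i in range(1, n):
        best = [
            exec_cost[i][s] + min(best[p] + trans_cost[p][s] for p in states)
            for s in states
        ]
    return min(best)

exec_cost = [[4, 6], [5, 2], [3, 3]]       # 3 tasks, 2 operating states
trans_cost = [[0, 2], [2, 0]]              # symmetric switching penalty
# e.g. staying in state 0 costs 4 + 5 + 3 = 12; the path (0, 1, 1) costs
# 4 + 2 + 2 + 0 + 3 = 11, which is the optimum here.
assert min_latency(exec_cost, trans_cost) == 11
```

The same recurrence applies whether a “state” is an FPGA configuration (transition cost = reconfiguration time) or a voltage setting (transition cost = voltage-scaling time).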
Table 4.1: Robustness Against Approximation Errors

                                 No. of designs
                                 256     3125    46656   823543
up to 10% error
  optimal design                 4535    6363    7625    7889
  design using dynamic prog.     4535    6823    7625    8044
  design using our method        4535    6363    7625    7889
  was optimal design selected?   Yes     Yes     Yes     Yes
up to 15% error
  optimal design                 4691    6031    7406    8325
  design using dynamic prog.     4895    6075    7651    8354
  design using our method        4691    6031    7499    8424
  was optimal design selected?   Yes     Yes     No      No
up to 20% error
  optimal design                 4705    6160    6849    8669
  design using dynamic prog.     5117    6230    7109    9370
  design using our method        4988    6230    6849    9116
  was optimal design selected?   No      No      Yes     No
up to 25% error
  optimal design                 4755    5850    6680    8567
  design using dynamic prog.     4876    6305    6851    8800
  design using our method        4755    5850    6783    8790
  was optimal design selected?   Yes     Yes     No      No
[Figure: a linear array of tasks (Task 1, Task 2, Task 3, ..., Task n), each mapped to an operating state, with state transitions between successive tasks.]
Figure 4.4: Optimal mapping of linear array of tasks
We compare the well-known dynamic programming solution based on the assembly-line
scheduling algorithm [23] with our hierarchical approach. Our approach uses a
dynamic programming based N-optimization heuristic to identify a set of N solutions
based on a given latency constraint. The solutions are further evaluated using a high-level
performance estimator, HiPerE (discussed in Chapter 6), to identify the optimal design.
The assembly-line scheduling algorithm, on the other hand, identifies a single optimal solution.
In the following, we define the dynamic programming based N-optimization heuristic
followed by the results and observations based on our experiment. Figure 4.5 shows our
experiment setup.
The problem can be formally defined as: given a linear array of tasks and a device, a
set of valid operating states for each task, the time taken to execute each task in each operating state,
and the state transition cost for each pair of states, find N mappings of the tasks
to states such that the solutions constitute the set of designs with the N lowest values for
latency or energy dissipation. See Chapter 6 for the detailed description and proof of
our dynamic programming based solution to the N-optimization problem.
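The N-optimization idea can be sketched by keeping the N smallest partial latencies per ending state instead of a single best one; Chapter 6 gives the actual formulation and proof, and the names below are ours, not the dissertation's.

```python
import heapq

# Sketch of N-optimization: a dynamic program that carries the N best
# (latency, state-sequence) candidates per ending state, so that step
# one can hand a set of designs to a more accurate estimator.  For
# brevity every state is assumed valid for every task.

def n_best_mappings(exec_time, trans_time, N):
    S = len(trans_time)
    # cand[s]: N best (latency, state_sequence) pairs ending in state s
    cand = [[(exec_time[0][s], (s,))] for s in range(S)]
    for i in range(1, len(exec_time)):
        new = []
        for s in range(S):
            pool = [(lat + trans_time[p][s] + exec_time[i][s], seq + (s,))
                    for p in range(S) for lat, seq in cand[p]]
            new.append(heapq.nsmallest(N, pool))
        cand = new
    return heapq.nsmallest(N, (x for per_state in cand for x in per_state))

print(n_best_mappings([[2, 5], [4, 1]], [[0, 3], [3, 0]], N=2))
# [(6, (0, 0)), (6, (0, 1))]
```

With N = 1 this degenerates to the ordinary assembly-line scheduling recurrence; larger N trades extra bookkeeping for robustness against estimate error.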
Figure 4.5: Experimental setup (latency estimates for task execution and state transitions come from simulators and data sheets, where error is introduced; the dynamic programming based optimization and the N-optimization both use these estimates, and the reduced design space is evaluated using JouleTrack, SimpleScalar, and HiPerE before the resulting optimal designs are compared)
The estimates used for our experiment are the task execution cost per operating state
and the state transition cost for each unique pair of source and destination states. In
the experiment, we introduce a certain amount of error into the latency estimates (both
execution and state transition) and compare the optimal design identified by each
approach with the real optimal design. The real optimal design is identified using
the error-free estimates and an exhaustive search of the complete design space.
HiPerE uses the estimates without any error. However, the optimization and pruning
heuristics use estimates with error, to demonstrate the effect of error due to high-level
modeling. The latency estimates for task execution and reconfigurations can be generated
using suitable low-level simulators. By introducing random errors into these
estimates, we simulate the approximation errors encountered while modeling
at a high level of abstraction. We also verify whether the real-optimal design was present in
the set of solutions identified by step one of our methodology. We experimented using
4 different applications (linear arrays of tasks) containing 4, 5, 6, and 7 tasks, with
4, 5, 6, and 7 configurations per task respectively. Table 4.1 shows the
results. We evaluate each instance of the linear array of tasks with up to 10%, 15%, 20%,
and 25% approximation error. The latency values shown in the table are based on the
actual latency estimates for the designs; the designs themselves, however, are identified
based on the latency estimates with error.
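The error-injection step can be sketched as follows: each true estimate is perturbed by a uniformly distributed relative error, the heuristics run on the perturbed values, and the final comparison uses the true values. The seed and numbers below are illustrative, not the experiment's data.

```python
import random

# Sketch of the error-injection experiment: perturb each latency
# estimate by a uniform random factor of up to +/-err.  Designs are
# then selected using the noisy estimates but evaluated (and compared
# against the real optimum) using the true estimates.

def with_error(estimates, err, rng):
    return [[v * (1 + rng.uniform(-err, err)) for v in row]
            for row in estimates]

rng = random.Random(42)                        # fixed seed for repeatability
true_exec = [[2.0, 5.0], [4.0, 1.0]]           # illustrative true latencies
noisy_exec = with_error(true_exec, 0.15, rng)  # up to 15% relative error
```

Repeating the selection over many random draws gives the kind of hit/miss counts reported in Table 4.1.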
As shown in Table 4.1, out of the 16 experiments, the dynamic programming based
optimization identifies the real-optimal design in only 2 cases, whereas
our methodology identifies the real-optimal design in 9 cases. Even when
our methodology fails to identify the real-optimal design, the design it selects
is on average 2.7% (max 6%, min 1.2%) worse than the real-optimal
design. In contrast, the corresponding figure for the dynamic programming solution
is on average 3.8% (max 8.7%, min 0.34%). This experiment demonstrates the
robustness of our hierarchical methodology against approximation errors. While we
do not measure the time needed for simulation based DSE, the size of the design space
indicates the magnitude of the simulation time that would be needed to perform all the simulations.
In addition, we studied the effect of error during high-level modeling while determining
the optimal design using optimization heuristics. Figure 4.6 summarizes our
observations. We compared the designs selected using error-free estimates, estimates
with up to 15% error, and estimates with up to 20% error. The arrows shown in
Figure 4.6 refer to the position (in terms of latency) of the optimal design. As expected,
when we use estimates with error the optimal design identified may not be the real-
optimal design.
Figure 4.6: Effect of error in identifying the optimal design (designs selected using estimates with 0%, 15%, and 20% error; arrows mark the position of the optimal design)
Chapter 5
Application Design using
Heterogeneous Embedded Systems
In this chapter, we apply the hierarchical design space exploration methodology to
the domain of signal processing application design using heterogeneous embedded systems.
In order to apply the methodology, we have developed a set of comprehensive models
that describe the problem domain and enable the use of a pruning heuristic and a
high-level performance estimator. For the low-level simulators, we assume the availability
of implementations of the application tasks in a high-level language (e.g. "C", "VHDL").
5.1 Modeling
Our modeling technique integrates two complementary approaches: a hierarchical data
flow graph based approach for application specification [41] and a domain specific modeling
based approach for modeling the kernels [20]. In addition, we discuss how we
model performance and design constraints and reconfiguration.
5.1.1 Kernel Level Modeling
Our kernel level modeling enables the specification of parameterized designs for signal
processing kernels implemented using FPGAs. We exploit domain specific modeling,
a technique for high-level modeling of FPGAs, developed by Choi et al. [20]. This
Figure 5.1: Domain specific modeling and performance estimation (component counts, component specific parameters, power functions, and component power state matrices feed energy, latency, and area functions that yield system-wide performance parameters)
technique has been demonstrated successfully for designing energy efficient signal processing
kernels using FPGAs [38]. A domain refers to a class of architectures and the
corresponding algorithms for a particular signal processing kernel. A class of architectures
can be a uniprocessor, a linear array of processors, a 2-D array of processors, or any
other class of parameterized architectures. For example, matrix multiplication on a linear
array of processors is a domain. A model defined using this technique consists of
RModules, Interconnects, component specific parameters and power functions, component
power state matrices, and a system-wide energy function. A Relocatable Module
(RModule) is a high-level architectural abstraction of a computation or storage module.
For hardware implementations on an FPGA, a register can be an RModule if the number
of registers in the design can vary based on algorithmic level choices. An Interconnect
represents the resources used for data transfer between RModules. A component
(also referred to as a building block) can be an RModule or an Interconnect. Component
specific parameters depend on the characteristics of the component and its relationship
to the algorithm. For example, degree of parallelism, precision, size of internal memory
(on FPGA), binding options for RModules, and power states are possible component
specific parameters. Component specific power functions capture the effect of component
specific parameters on the average power dissipation of the component. For this we
assume a switching activity of 12.5%. Component Power State (CPS) matrices capture
the power state of every component in each cycle. For example, consider a design that
contains k different types of components (C_1, ..., C_k) with n_i components of type i. If
the design has a latency of T cycles, then k two-dimensional matrices are constructed,
where the i-th matrix is of size T x n_i (Figure 5.1). An entry in a CPS matrix represents
the power state (e.g. active or clock-gated) of a component during a specific cycle
and is determined by the algorithm. The system-wide energy function represents the energy
dissipation of the designs belonging to a specific domain as a function of the parameters
associated with the domain. Figure 5.1 summarizes the domain specific modeling
approach.
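As an illustration of how the CPS matrices support rapid estimation, the sketch below sums, over every component and cycle, the power of the component's state in that cycle times the cycle time. The component types, states, and power numbers are our assumptions, not values from the dissertation.

```python
# Sketch of energy estimation from Component Power State (CPS)
# matrices: one T-by-n_i matrix per component type, each entry naming
# the power state of one component in one cycle.

def system_energy(cps_matrices, power, cycle_time):
    """cps_matrices[k][t][j]: state of component j of type k in cycle t;
    power[k][state]: average power of a type-k component in that state."""
    total = 0.0
    for k, matrix in enumerate(cps_matrices):
        for row in matrix:                 # one row per cycle
            for state in row:              # one entry per component
                total += power[k][state] * cycle_time
    return total

# two registers over three cycles, clock-gated ("cg") when unused
cps = [[["on", "on"], ["on", "cg"], ["cg", "cg"]]]
power = [{"on": 2.0, "cg": 0.5}]           # assumed per-component power
print(system_energy(cps, power, 1e-6))     # (2*3 + 0.5*3) * 1e-6
```

Because only the models are consulted, such an estimate is orders of magnitude cheaper than a low-level simulation.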
Modeling based on the technique described above has the following advantages:
• various parameters are exposed at the algorithm level,
• performance models for energy, area, and latency are generated in the form of
parameterized functions,
• it is possible to rapidly estimate different performance metrics using only the
information captured in the models, and
• a parameterized model of a domain captures a set of designs (based on parameter
values) that can be analyzed for various performance tradeoffs.
Figure 5.2: Hierarchical data flow graph with alternatives (leaf, alternative, and compound nodes in a directed acyclic graph; leaf nodes link to domain specific models)
5.1.2 Application Level Modeling
Application modeling involves specifying the application as a hierarchical data flow graph
with alternatives (Figure 5.2). The functional specification of the target application
specifies the structure of the data flow graph, and the choice of algorithms/implementations
specifies the alternatives. Data flow is a widely used, expressive language for specifying
signal processing applications. Among the various flavors and variants of data flow, we
chose synchronous data flow to model the application. An advantage of synchronous
data flow is that it can be statically scheduled. The data flow graph is represented as a
directed acyclic graph (Figure 5.2). The nodes of the graph are classified
into three types: leaf, compound, and alternative. A leaf node represents an application
task (or kernel). A leaf node is associated with a model of a kernel based on the domain
specific model (discussed in Section 5.1.1). Alternatively, a leaf node can be associated
with high-level code (e.g. VHDL) that implements the task. This code is used to simulate
the design of the kernel and estimate latency, area, and energy dissipation. The
alternative nodes capture alternatives associated with a task. An alternative can only be
a leaf node or a compound node. The alternatives implement the same task but model
different algorithms or implementations on different FPGAs. A compound node contains
a hierarchical data flow graph with alternatives. Essentially, the compound nodes allow
easy management of large application models. The directed edges connecting the nodes
represent the temporal dependencies among the nodes.
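A minimal sketch of the three node types (with invented names, and compound nodes linearized for brevity): enumerating the designs of a graph expands every combination of alternatives.

```python
from dataclasses import dataclass
from itertools import product

# Sketch of a hierarchical data flow graph with alternatives.  Each
# node yields the flat task tuples it can expand to; a full design is
# one expansion of the root compound node.

@dataclass
class Leaf:                        # one application task (kernel)
    name: str
    def designs(self):
        yield (self.name,)

@dataclass
class Alternative:                 # exactly one child is chosen
    children: list
    def designs(self):
        for c in self.children:
            yield from c.designs()

@dataclass
class Compound:                    # ordered sub-graph (linearized here)
    children: list
    def designs(self):
        for combo in product(*(c.designs() for c in self.children)):
            yield tuple(t for part in combo for t in part)

app = Compound([Leaf("fft"),
                Alternative([Leaf("mm_linear_array"), Leaf("mm_uniproc")])])
print(list(app.designs()))
```

Each alternative multiplies the number of candidate designs, which is exactly why a pruning step is needed before detailed evaluation.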
5.1.3 Resource Modeling
The Generic Model (GenM) models heterogeneous embedded systems so as to capture the various
capabilities that can be exploited for performance optimization during application
mapping onto heterogeneous embedded systems. In this section, we discuss a version
of GenM that consists of three components: a processor, reconfigurable logic (an FPGA),
and a memory, all connected through an interconnect. This can easily be generalized to
other heterogeneous embedded systems. GenM models discrete dynamic voltage scaling
(DVS) for the processor, power states for the memory, and reconfiguration for the
FPGA. The GenM model consists of the following parameters:
V : array of operating states of the processor, [V_0 ... V_{v-1}], where
V_0 represents the idle state
VT_ij (VE_ij) : time (energy) costs for voltage variation from V_i to V_j
C : array of operating states of the FPGA, [C_0 ... C_{c-1}], where C_0
represents the idle state
CT_ij (CE_ij) : time (energy) costs for reconfiguration from C_i to C_j
e_id (e'_id) : processor (FPGA) energy dissipation in the idle state
M : array of memory states, [M_0 ... M_{m-1}]
MT_ij (ME_ij) : time (energy) costs for changing the memory state from M_i to M_j
α_i : memory energy consumption per time unit in state M_i
η (ζ) : unit data transfer time (energy) between the processor/FPGA
and the memory
In order to model application mapping onto heterogeneous embedded systems,
the GenM model needs some additional parameters. These parameters capture a few application
details and provide performance estimates for the various task to computing element
mappings, and are referred to as Performance Parameters. These parameters are:
T : set of tasks, T_1 ... T_n
O_i^in (O_i^out) : amount of data input (output) to (from) task T_i from (to) memory
t_ij (e_ij) : time (energy) for executing task T_i on the processor in operating
state V_j, j = 1, ..., v-1
t'_ij (e'_ij) : time (energy) for executing task T_i on the FPGA in operating state
C_j, j = 1, ..., c-1
The energy estimates (e_ij and e'_ij) refer to the average energy dissipated by a task, evaluated
by averaging estimates over a set of input data. We also assume that when there
is no task to execute, the processor is in the idle state. The idle state and the other
states of the processor, determined by specific operating voltages, are referred to as the
operating states of the processor. A similar assumption holds for the FPGA: the idle state
of the FPGA and the other states, determined by specific configurations, are referred to as the
operating states of the FPGA.
Let S denote the set of all possible system states. Each system state s, s ∈ S, is
represented by a tuple (I_v(s), I_c(s), I_m(s)), where I_v(s), I_c(s), and I_m(s) are integers
that represent the operating states of the processor, FPGA, and memory respectively. We
have 0 ≤ I_v(s) ≤ v-1, 0 ≤ I_c(s) ≤ c-1, and 0 ≤ I_m(s) ≤ m-1. Therefore, the total
number of distinct system states is v·c·m. A transition between system states incurs
time and energy costs which depend on the source and destination states of the
transition. Let q_ij (r_ij) denote the energy (time) cost for the system state transition
from s_i to s_j. q_ij is calculated as VE_{I_v(s_i),I_v(s_j)} + CE_{I_c(s_i),I_c(s_j)} + ME_{I_m(s_i),I_m(s_j)}; r_ij can
be calculated similarly. In the above formalism, we have assumed that communication
between tasks is performed via shared memory. This assumption, however, is a limitation
of the model only; it is not necessary for defining the problems.
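The system-state formalism can be illustrated as follows; the cost matrices are toy values, and the tuple layout mirrors (I_v(s), I_c(s), I_m(s)).

```python
from itertools import product

# Sketch of GenM system states: a state is a tuple of per-device state
# indices, a transition cost is the sum of the per-device transition
# costs, and with v, c, m per-device states there are v*c*m system
# states in total.

def transition_cost(si, sj, VE, CE, ME):
    return VE[si[0]][sj[0]] + CE[si[1]][sj[1]] + ME[si[2]][sj[2]]

v, c, m = 2, 2, 2
states = list(product(range(v), range(c), range(m)))   # all v*c*m states

VE = CE = ME = [[0, 1], [1, 0]]            # toy symmetric energy costs
print(len(states))                         # 8
print(transition_cost((0, 0, 0), (1, 0, 1), VE, CE, ME))  # 1 + 0 + 1 = 2
```

The time cost r_ij would use the VT/CT/MT matrices in exactly the same way.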
GenM can be used for the following:
• rapid estimation of performance in terms of energy and latency
• development of efficient application designs using a high-level abstraction that
ignores specific low-level details of the target heterogeneous embedded systems
and applications
• development of optimization techniques for mapping and scheduling applications
onto heterogeneous embedded systems
The high-level performance estimation tool HiPerE is based on the GenM model
(see Chapter 6).
5.1.4 Modeling Multiple Operating States of Target Devices
The candidate devices considered by our framework are state-of-the-art low power
devices that offer capabilities such as low power operating modes, voltage/frequency
scaling, and reconfiguration [35, 66, 89, 96]. We assume discrete voltage/frequency
scaling. We model each candidate device based on the operating states
supported by the device. An operating state can be a configuration for an FPGA or a
voltage/frequency setting for devices supporting voltage/frequency scaling. In addition,
given two operating states A and B, we assume that the transitions from A to B and
from B to A are associated with transition costs. A transition cost includes the latency and
the energy dissipated during the transition.
Figure 5.3: Modeling multiple operating states (each state S_i is annotated with its average idle power consumption P_i; one state is marked as the default state)
We model the operating states of each device using an augmented finite state
machine (FSM). Figure 5.3 shows a sample model for a device with 3 operating states.
Each node in the FSM represents one operating state. Each pair of nodes is connected
by a pair of directed edges. Each edge corresponds to a state transition from the state
represented by the source node to the state represented by the destination node. Each
edge is also associated with a 2-tuple (TL, TE), where TL is the latency cost and TE is
the energy dissipated during the transition.
Each operating state is associated with an estimate of the average power consumed while
idling (P_i in Figure 5.3). This information allows us to compute the total energy dissipated
when the device is idling in a particular state. The model also designates one state as
the default state (shown in gray). Unless specified otherwise, the default state is the operating state
of the device when it powers up. In addition, there is one operating state per
device representing power down.
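A sketch of the augmented FSM as a data structure (all powers, latencies, and energies below are invented): each state carries an idle power, each directed edge a (TL, TE) pair, and one state is the default.

```python
# Sketch of a per-device operating-state model: nodes carry average
# idle power, edges carry (latency, energy) transition costs.

class DeviceModel:
    def __init__(self, idle_power, transitions, default_state=0):
        self.idle_power = idle_power      # state -> W while idling
        self.transitions = transitions    # (src, dst) -> (TL s, TE J)
        self.state = default_state        # device powers up here

    def idle_energy(self, duration):
        """Energy dissipated idling in the current state."""
        return self.idle_power[self.state] * duration

    def go_to(self, dst):
        """Perform a transition; return its (latency, energy) cost."""
        tl, te = self.transitions[(self.state, dst)]
        self.state = dst
        return tl, te

dev = DeviceModel(idle_power={0: 0.5, 1: 0.1},
                  transitions={(0, 1): (0.01, 0.2), (1, 0): (0.02, 0.3)})
print(dev.idle_energy(2.0))      # 1.0 J idling in the default state
print(dev.go_to(1))              # (0.01, 0.2)
```

Power-down would simply be one more state in `idle_power` with near-zero power and comparatively expensive transitions.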
5.1.5 Modeling Duty Cycle
The duty cycle is the proportion of time a system is active. Based on the duty
cycle specification, system execution can therefore be modeled as alternating active and inactive
phases. The duration of each phase depends on the input rate and the latency required to
process a single input frame. The input rate can be variable. For example, a target detection
system can scan the environment and process data at a lower rate when no target is
detected and later increase the scan rate when there is a detection. Our design framework
needs the input rates to be statically defined. A variable input rate is specified as an
ordered set of 2-tuples (I_r, N_f), where I_r refers to the input rate in Hz and N_f refers
to the number of frames processed at that rate. Given two consecutive 2-tuples, (I_r^i,
N_f^i) and (I_r^{i+1}, N_f^{i+1}), application execution is modeled as "process N_f^i frames at
the rate of I_r^i Hz followed by N_f^{i+1} frames at the rate of I_r^{i+1} Hz". Our framework
assumes that the given set of 2-tuples can repeat indefinitely, to model the processing of a
large number of input frames. Similarly, the maximum allowed latency to process an input
frame can also be specified. If it is not specified, the framework derives the constraint
from the given maximum input rate.
During the inactive phases, the devices that constitute the target hardware are idle. If the
duration of the inactive phases is significantly larger than that of the active phases, minimizing
the energy dissipated during inactive phases contributes significantly to
minimizing the overall energy dissipation. The options that can be evaluated
to minimize energy dissipation are shutting down devices, transitioning to a low power
mode, and leaving devices as they are (thereby saving the transition costs). Our framework evaluates
these options to identify the energy optimal one. In addition, based on the maximum
allowed latency, our framework also evaluates the option of executing tasks in a low
performance mode (e.g. at a low operating frequency) that takes advantage of the available
slack.
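The inactive-phase trade-off can be sketched as a direct cost comparison: staying idle costs idle power times the gap, while a low power mode pays two transitions and must fit within the gap. The parameter names and numbers below are ours.

```python
# Sketch of evaluating what to do with a device during an inactive
# phase of length `gap` seconds: stay idle, or pay transition energy
# (and latency) to sit in a low-power state.

def best_idle_option(gap, p_idle, p_sleep, enter_cost, exit_cost,
                     enter_time, exit_time):
    options = {"stay-idle": p_idle * gap}
    if enter_time + exit_time <= gap:        # transitions must fit the gap
        sleep_time = gap - enter_time - exit_time
        options["sleep"] = enter_cost + exit_cost + p_sleep * sleep_time
    return min(options.items(), key=lambda kv: kv[1])

# long gap: sleeping wins despite the transition energy
print(best_idle_option(gap=10.0, p_idle=0.5, p_sleep=0.01,
                       enter_cost=0.3, exit_cost=0.3,
                       enter_time=0.05, exit_time=0.05))
```

Shutting the device down is the same comparison with a larger transition cost and near-zero resting power; for short gaps the transitions cannot fit and staying idle is the only choice.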
Duty cycle aware application design is considered non-trivial for the following reasons:
• switching on and configuring FPGAs dissipates energy, which needs to be
compared with the energy dissipated while idling,
• the latency required for switching on and configuration can be larger than the duration
of the inactive phases, and
• the use of different optimization techniques results in different energy efficient designs
for the same duty cycle specification.
5.1.6 Modeling Constraints
We also model constraints that must be satisfied by the design(s). The constraints are
divided into two types: performance constraints and design constraints. The performance
constraints specify the latency and energy requirements for the complete application
graph or any of its sub-graphs. If no energy constraint is specified, we minimize
energy dissipation.
The design constraints specify valid combinations of task mappings. A mapping
specifies the execution of a task on an FPGA operating in a specific configuration. A constraint
on mappings specifies a pairwise relation between two mappings. For example, if
multiple FPGAs are available, a design constraint of the form "task A1 implemented on
FPGA B1 operating in configuration C1 is not compatible with task A2 implemented
on FPGA B2 operating in configuration C2" will result in the selection of designs that do
not include both task A1 mapped on B1 using C1 and task A2 mapped on B2 using C2.
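A sketch of enforcing such pairwise design constraints during enumeration (the task, FPGA, and configuration names are hypothetical): a design is kept only if none of its mapping pairs appears in the incompatible set.

```python
from itertools import product

# Sketch: each mapping is a (task, fpga, configuration) triple; a
# design is one mapping per task; pairwise incompatibilities are
# stored as frozensets so ordering within a pair does not matter.

incompatible = {frozenset([("A1", "B1", "C1"), ("A2", "B2", "C2")])}

def valid(design):
    return not any(frozenset(pair) in incompatible
                   for pair in product(design, repeat=2))

choices = {"A1": [("A1", "B1", "C1"), ("A1", "B1", "C3")],
           "A2": [("A2", "B2", "C2")]}
designs = [d for d in product(*choices.values()) if valid(d)]
print(designs)    # only the C3 mapping of A1 survives alongside A2/C2
```

A BDD-based tool such as DESERT applies the same kind of constraint symbolically instead of by explicit enumeration.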
5.1.7 Modeling Reconfiguration
In our application model, each leaf node is mapped to an FPGA. We support designs
that use multiple FPGAs. Therefore, each leaf node is associated with the FPGAs that can
implement the kernel represented by the leaf node. Each leaf node is also associated
with a domain specific model that is used to predict the area requirement of the kernel.
Therefore, based on the capacity of the mapped FPGA and the area requirements of
consecutive tasks, it is possible to determine whether reconfiguration is required between the
execution of two tasks that are mapped onto the same FPGA.
Figure 5.4: Modeling reconfigurations (a pseudo task, with reconfiguration alternatives, is inserted between a source task and a destination task; design constraints tie each alternative to the corresponding task designs)
The hierarchical data flow graph discussed above does not explicitly support reconfiguration.
Therefore, we developed a technique that uses pseudo tasks and design constraints
to enable the modeling of reconfiguration. This technique introduces a pseudo task
(as an alternative node) between pairs of tasks, where a pair consists of a source and a
destination task. Given a set of tasks modeled as a data flow graph, two tasks form a
source-destination pair if both tasks are mapped onto the same FPGA and the execution
of the destination task follows the execution of the source task on the mapped
FPGA. A pseudo task models a reconfiguration. The alternatives for this pseudo task are
a set of possible reconfigurations determined by the design choices available for
the source and destination tasks. Each reconfiguration is modeled as a leaf node. As discussed
above, a domain specific model of a kernel specifies a set of designs. Therefore,
based on the designs selected for the source and destination tasks, multiple reconfigurations
are possible: one for each unique pair of designs for the source and destination
tasks. Each reconfiguration is associated with a unique performance cost
based on the latency and energy dissipation required per reconfiguration. Therefore, we
specify a set of design constraints to ensure that the appropriate reconfiguration is chosen
based on the mappings chosen for the source and destination tasks. Thus, when we perform
design space exploration and select the designs that meet the design constraints,
the correct reconfigurations are also chosen automatically. In addition, reconfiguration costs
are automatically added to the overall performance of each design when the designs are
evaluated against the performance constraints.
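Identifying where reconfiguration pseudo tasks are needed can be sketched by scanning the execution order and flagging each source-destination pair that reuses an FPGA (this simplified version ignores the area check described above; names are invented):

```python
# Sketch: given a schedule of (task, fpga) pairs in execution order,
# report each (source, destination, fpga) triple where the same FPGA
# is reused, i.e. where a reconfiguration pseudo task would be
# inserted between the two tasks.

def reconfiguration_pairs(schedule):
    last_on = {}     # fpga -> last task executed on it
    pairs = []
    for task, fpga in schedule:
        if fpga in last_on:
            pairs.append((last_on[fpga], task, fpga))
        last_on[fpga] = task
    return pairs

sched = [("fft", "fpga0"), ("fir", "fpga1"), ("mm", "fpga0")]
print(reconfiguration_pairs(sched))   # [('fft', 'mm', 'fpga0')]
```

Each reported pair would then receive an alternative node whose leaves enumerate the possible configuration-to-configuration reconfigurations and their costs.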
5.2 Design Flow
In order to use hierarchical design space exploration, the designer follows the
design flow described below (Figure 5.5). Once the application design
problem domain is identified, the designer needs to identify a suitable pruning heuristic
and a high-level performance estimation tool. Given a specific application design problem,
the designer defines models for the application and for the target hardware or
hardware choices. A mapping model, containing the feasible mappings, is defined using the
application and hardware models. In addition, performance constraints for latency and
energy dissipation are specified in the model. These performance constraints are
defined such that a set of designs can be chosen using the pruning heuristic. Following
model definition and constraint specification, the pruning heuristic is used to identify the
designs that meet the performance constraints. The selected designs are then evaluated using
the high-level performance estimator to identify the design that meets the performance
requirements. Note the difference between performance requirements and
performance constraints: a performance requirement is specified as part of the design
problem and refers to what the resulting design must satisfy, whereas performance
constraints are specified as input to certain types of pruning heuristics so that a set of
designs can be selected in the first step.
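The two-step flow can be sketched end to end with stand-ins for DESERT and HiPerE: a cheap pass prunes on loose latency/energy constraints, and only the survivors are ranked by a more detailed (here, mock) estimator. All designs, numbers, and the estimator below are invented.

```python
# Sketch of hierarchical design space exploration: step one prunes the
# design space with coarse constraints; step two applies an accurate
# but expensive estimator only to the survivors.

def hierarchical_dse(designs, lat_limit, en_limit, estimator):
    pruned = [d for d in designs
              if d["lat"] <= lat_limit and d["en"] <= en_limit]
    return min(pruned, key=estimator)     # final design among survivors

designs = [{"name": "d1", "lat": 0.9, "en": 800},
           {"name": "d2", "lat": 0.5, "en": 900},
           {"name": "d3", "lat": 0.7, "en": 600}]

# mock high-level estimator standing in for HiPerE's duty-cycle-aware
# energy estimate
duty_cycle_energy = lambda d: d["en"] * 1.2 + 50 * d["lat"]

print(hierarchical_dse(designs, 1.0, 860, duty_cycle_energy)["name"])
```

The key property is that the expensive estimator runs on the pruned set only; in the case study that follows, this reduces roughly 73,000 candidates to 16 before detailed evaluation.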
Figure 5.5: Sample design flow (a signal processing application is captured in the modeling environment; the design space exploration tool DESERT identifies a set of designs from the mapping model; a high-level performance estimator, third-party simulators operating on high-level implementations and inputs, and a human in the loop select the final design for the heterogeneous system)
5.3 Illustration of Hierarchical Design Space Exploration
We use two different application design problems to demonstrate the design flow and
the advantages of our hierarchical design methodology. The first design problem is to
identify energy-efficient hardware for a personnel detection algorithm [85] from a
set of devices that consists of traditional processors, FPGAs, and DSPs. In the second
design problem, we demonstrate the selection of an energy efficient architecture and
the corresponding mapping for an LMS (Least Mean Square)-based MVDR (Minimum
Variance Distortionless Response) adaptive beamforming algorithm [24]. Using these
examples, we demonstrate the following advantages of our approach:
• support for modeling and rapid design space exploration
• device selection
• energy efficient application mapping onto the target devices
5.3.1 Energy-Efficient Architecture Selection for a Personnel Detection Application
Figure 5.6: Personnel detection application and the hardware design space (tasks operating on the image from the camera: dmean, whiten, compute CVM, inverse, and velocity filter, producing target position and velocity; hardware space: Virtex-II Pro, Actel ProASIC, TI DSP, Intel PXA, PowerPC 405)
We consider a personnel detection application for our first example [85]. The personnel
detection algorithm is required to process its input in real time and hence has
a hard latency requirement. In addition, as the system needs to be deployed in a
power-constrained environment, energy dissipation is also an important metric. The
application consists of 5 tasks, as shown in Figure 5.6. The input comes from a sensor.
The application design problem is to select the most energy-efficient hardware and
the corresponding mapping of the tasks onto the hardware components. The hardware
needs to be selected from a set that consists of the Xilinx Virtex-II Pro, Actel ProASIC PLUS,
Intel PXA 255, PowerPC 405, and TI C6711 DSP [1, 35, 66, 90, 96]. Micron Mobile
SDRAM was chosen as the memory. In order to identify the most energy-efficient solution,
it is necessary to evaluate the designs based on a duty cycle, which was provided as
an input. For our target mapping problem, the duty cycle specification was determined
by the input rate of the camera. In addition, the application is multi-rate, as the
rates of compute CVM and inverse differ from those of the other tasks.
Before applying our hierarchical design methodology, we modeled the application
design problem. Modeling involved specifying the application as a data flow graph.
Modeling the target devices involved modeling the individual processing components
and the memory using the GenM model. Once the application and the hardware
devices are modeled, the possible mapping choices are indicated in the mapping model.
For example, inverse and whiten need to be computed using floating point arithmetic,
for which the Actel ProASIC PLUS is not a suitable choice (due to its large area requirement).
Such constraints are expressed by not allowing inverse and whiten to be mapped
onto the ProASIC PLUS during modeling. Additionally, while we specified 5 distinct
devices in the design space, the 4 valid combinations are a stand-alone Virtex-II Pro, and
the ProASIC PLUS combined with either the DSP or one of the two processors. Such
constraints are specified in the model using the Object Constraint Language (OCL) [41].
The duty cycle and multi-rate specifications are as follows: the rate of compute
CVM and inverse CVM is 2 and the input rate of the camera is 0.5 Hz. In addition, we
needed to evaluate each design for 10 input frames. We used DESERT (Design Space
Exploration Tool [60]) and HiPerE (High-level Performance Estimator [55]) to perform
hierarchical design space exploration. DESERT is a design space exploration tool based
on ordered binary decision diagrams. DESERT applies performance and design constraints
to a given design space (modeled based on our methodology) and selects the
designs that meet the given constraints [60]. DESERT is not based on a heuristic technique;
however, as it allows the selection of a set of designs based on the given constraints,
DESERT is well suited to our design space exploration methodology. As DESERT
cannot evaluate designs based on duty cycle requirements, we used DESERT
to prune the design space using the performance of each design for a single end-to-end
execution (processing one input frame). Therefore, we selected a latency constraint of < 1
second to identify the designs that can support an input rate of 1 Hz (note that we want to
select a set of designs). Through trial and error, an energy constraint of < 860 millijoules
was chosen to have DESERT select 16 designs as the output of the first step. The size of
the initial design space was approximately 73,000. Once the 16 designs were identified by
DESERT, we used HiPerE to perform duty cycle based design space exploration to identify
the best design that meets the duty cycle requirements and dissipates the minimum
energy. As the Intel PXA 255 and PowerPC 405 do not support floating point arithmetic,
designs that included these two processors were eliminated by DESERT due to the high
latency of floating point emulation. The 16 designs identified by DESERT included
designs that use only the Virtex-II Pro or a combination of the DSP and the ProASIC PLUS.
Designs that use the same device (or device combination) differed in their mappings. For
the Virtex-II Pro, the availability of different configurations for each task resulted in different
designs. Among these we chose 3 designs for analysis: the most energy efficient design using
only the Virtex-II Pro, only the TI DSP, and both the TI DSP and the ProASIC PLUS.
The results of our design space exploration are provided in Table 5.1.
Table 5.1: Results for personnel detection application

Design              | Scenario 1                | Scenario 2
                    | Latency (ms)  Energy (mJ) | Latency (ms)  Energy (mJ)
Virtex-II Pro only  | 247.33        114.06      | 17895         13938.1
TI DSP only         | 614.57        670.11      | 18286         9692.7
ProASIC + TI DSP    | 496.50        538.18      | 17166         8652.9
In Scenario 1, the application is executed only once (no start-up or shut-down cost included), and in Scenario 2, the application is executed based on the duty-cycle requirement specified to us (Table 5.1). Note that though the Virtex-II Pro based design is the most energy-efficient for Scenario 1, it is the least energy-efficient for Scenario 2. This demonstrates the advantage of hierarchical design space exploration and the usefulness of the higher parameter coverage of HiPerE compared with DESERT. On the other hand, the use of a pruning step in the first stage allows us to evaluate a design space of size 73,000 in less than a minute. Using HiPerE alone, it would take approximately 10 hours to estimate the performance of all the designs, followed by a tedious manual comparison of all the estimates to identify the most suitable design.
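The two-step flow described above can be sketched as follows. The design records, the constraint values, and the duty-cycle energy model below are illustrative assumptions, not the actual DESERT or HiPerE implementations; in particular, the idle-power numbers are invented to show why a single-execution winner can lose over a full duty cycle.

```python
# Sketch of the two-step hierarchical exploration: a fast constraint-based
# prune (the DESERT role) followed by a duty-cycle-aware evaluation of the
# survivors (the HiPerE role). All numbers are invented for illustration.

def prune(designs, max_latency_ms, max_energy_mj):
    """Step 1: keep only designs meeting the single-execution constraints."""
    return [d for d in designs
            if d["latency_ms"] <= max_latency_ms and d["energy_mj"] <= max_energy_mj]

def duty_cycle_energy(d, executions, period_s):
    """Step 2: duty-cycle energy = active energy + idle energy (mW * s = mJ)."""
    active_s = executions * d["latency_ms"] / 1000.0
    idle_s = max(period_s - active_s, 0.0)
    return executions * d["energy_mj"] + idle_s * d["idle_power_mw"]

def explore(designs, executions=10, period_s=10.0):
    survivors = prune(designs, max_latency_ms=1000.0, max_energy_mj=860.0)
    return min(survivors, key=lambda d: duty_cycle_energy(d, executions, period_s))

designs = [
    {"name": "FPGA only",  "latency_ms": 247.3,  "energy_mj": 114.1, "idle_power_mw": 900.0},
    {"name": "DSP only",   "latency_ms": 614.6,  "energy_mj": 670.1, "idle_power_mw": 40.0},
    {"name": "DSP + FPGA", "latency_ms": 496.5,  "energy_mj": 538.2, "idle_power_mw": 30.0},
    {"name": "too slow",   "latency_ms": 1400.0, "energy_mj": 300.0, "idle_power_mw": 10.0},
]
best = explore(designs)
```

With these invented idle powers, the FPGA-only design wins the single-execution comparison but loses once idle energy over the period is charged, mirroring the Scenario 1 versus Scenario 2 reversal in Table 5.1.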
5.3.2 Design of LMS-based MVDR Adaptive Beamformer
The LMS-based MVDR algorithm consists of three steps. The first step is the calculation of the filter output. The second step is the calculation of the input signal power, the correlations between the signals, normalization, and the calculation of the error signals. The third step is the update of the weight coefficients of the adaptive filter. The incoming data rate is approximately 105 x 10^6 samples per second [24]. This gives us a latency constraint of < 9.5 ns. In addition, the base station can be deployed in a power-constrained environment [25], which makes energy dissipation an important performance metric.
Typically, the rate of weight-coefficient update is once every second [24]. However, the rate of update depends on the stability of the environment and the application requirements. For example, an environment with high interference and path loss might require a higher rate of update. Therefore, in addition to device selection, we also use the framework to identify the maximum rate of update that can be supported without affecting latency. The set of devices considered includes two Xilinx Virtex-II Pro devices (xc2vp2 and xc2vp20), one Xilinx Virtex-II (xc2v1500), the Intel PXA 255, the PowerPC 405, and the TI C6711 DSP. Most of these devices are also used in the system design problem discussed in Section 5.3.1; thus, we were able to reuse the device models. The input specifications also include design constraints to ensure that a target hardware can consist only of an FPGA, or of an FPGA with a DSP or a traditional processor. The size of the design space was 243.
self.children("filter").implementedBy() =
    self.children("filter").children("dmean_ON_PXA")
implies
  (not ((self.children("coeff").implementedBy() =
             self.children("coeff").children("coeff_ON_PPC")) or
        (self.children("coeff").implementedBy() =
             self.children("coeff").children("coeff_ON_TIC64")))
   and
   not ((self.children("update").implementedBy() =
             self.children("update").children("update_ON_PPC")) or
        (self.children("update").implementedBy() =
             self.children("update").children("update_ON_TIC64"))))
Figure 5.7: A sample constraint for DESERT
For our target design problem, while there are 5 different target devices available, not all combinations are valid. The 3 valid combinations are: only a DSP, only an FPGA, and a composition of an FPGA and a traditional processor or DSP. Hence, a design that contains both processors (Intel PXA 255 and PowerPC) is not a valid design. We specified a set of constraints using OCL (Object Constraint Language) to ensure that only valid designs are chosen [41]. A sample constraint is shown in Figure 5.7.
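The validity rule above can be sketched as a simple predicate over a candidate device combination. The device-to-category table and the Python encoding are our own illustration of the rule, not the actual OCL semantics enforced by DESERT.

```python
# Sketch of the validity rule: a hardware combination is valid iff it is a
# single DSP, a single FPGA, or an FPGA paired with exactly one DSP or
# traditional processor. The category assignments are illustrative.
CATEGORY = {
    "xc2vp2": "fpga", "xc2vp20": "fpga", "xc2v1500": "fpga",
    "TI C6711": "dsp",
    "Intel PXA 255": "proc", "PowerPC 405": "proc",
}

def is_valid(devices):
    """Return True iff the device list is one of the 3 valid combinations."""
    kinds = [CATEGORY[d] for d in devices]
    if len(kinds) == 1:
        return kinds[0] in ("dsp", "fpga")      # only DSP or only FPGA
    if len(kinds) == 2:
        return kinds.count("fpga") == 1         # FPGA + one DSP/processor
    return False
```

For example, the combination of the two traditional processors (Intel PXA 255 and PowerPC 405) is rejected, as the sample OCL constraint in Figure 5.7 requires.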
HiPerE was configured to power down components when idle if doing so reduces the overall energy dissipation. Table 5.2 shows the designs selected by our framework. Scenario 1 refers to a coefficient update rate of once every 105 x 10^6 samples. In Scenario 2, the coefficient update rate is the maximum that can be sustained by a design. We evaluated all the designs based on their performance when the system executes for a period of 1 second. For each design, Table 5.2 shows the energy dissipation per scenario. The last column shows the maximum update rate that can be supported.
An additional advantage of using our framework is the savings in design time due to the reduction in simulation time. Cycle-accurate simulation of the processors and RT-level simulation of the FPGAs are time consuming. For example, simulating the task filter using the DSP simulator takes approximately 15 minutes, and using ModelSim and XPower takes about 30 minutes. Therefore, simulating all 243 designs would take 100+ hours. Using our approach, we could identify the results in minutes. There are some overheads associated with computing the performance estimates of the different mappings while modeling. However, in the worst case, the overhead is 5 simulations (one simulation of the three-stage application per device) as opposed to 243 simulations.
Table 5.2: Candidate designs and max update rates supported

No. | Design            | Total energy dissipated (mJ) | Max rate
    |                   | Scenario 1     Scenario 2    | supported
1   | two xc2vp2        | 921.6          1113.1        | 8
2   | xc2vp2 + PowerPC  | 816.6          755.6         | 12x10^3
3   | xc2vp2 + PXA 255  | 682.6          839.6         | 11x10^3
4   | xc2vp2 + TI DSP   | 490.6          1960          | 1.5x10^3
5   | xc2v1500          | 792.3          852.2         | 8
6   | xc2vp20           | 968.6          1019.6        | 8
Chapter 6
Tool and Design Flow
We have developed a design framework, Model-based Integrated Simulation (MILAN), to implement the hierarchical design space exploration technique and to demonstrate its usage through the design of low-power, high-performance signal processing applications using heterogeneous embedded systems. We used the Generic Modeling Environment [29], a tool-suite supporting MIC (model integrated computing), to develop our design framework. In this chapter we discuss the design framework in detail.
6.1 Model Integrated Computing (MIC)
The key idea of the MIC approach is the extension of the scope and usage of models such that they form the "backbone" of a model-integrated system development process [54, 88]. Using MIC technology, the designer captures the information relevant to the system being designed in the form of high-level models. The high-level models can explicitly represent the target application, the target hardware, and the dependencies and constraints among the different components of the models. Such models act as a repository of the information needed for analyzing the system. Several tools exist that analyze different performance characteristics, such as the latency and energy of a system. Therefore, MIC allows the use of model interpreters to translate the information in the models into the input languages of the analysis tools. Figure 6.1 describes the MIC concept [54].
[Figure: the MIC concept. A metaprogramming interface and meta-level translation configure the modeling environment; models of the application domain are processed through model interpretation; the arrangement supports both application evolution and environment evolution.]
Figure 6.1: Model Integrated Computing (Sztipanovits and Karsai, 1999)
Model integrated computing has the following core features [54]:
• models capture the information relevant to the system to be designed, explicitly representing the designer's understanding of the entire system
• modeling allows the explicit representation of dependencies and constraints among the different model components
• tool-specific model interpreters translate the information in the models to the format required by the tools
The above three features allow us to realize the hierarchical design space exploration methodology for the domain of signal processing application design using heterogeneous embedded systems. The models capture the application and candidate hardware specifications, capture the performance and design constraints, and define the design space. Model interpreters integrate the relevant tools to enable hierarchical design space exploration. In this chapter, we discuss the various models required to capture the relevant information for our target domain. We also discuss how the tool integration capability is utilized to realize hierarchical design space exploration.
6.2 Generic Modeling Environment
The Generic Modeling Environment, GME (GME 4), is a graphical tool-suite that enables the development of a modeling language for a domain, provides a graphical interface to model specific problem instances for the domain, and facilitates the integration of tools that can be driven through the models [29]. GME is a freely available tool-suite developed at the Institute for Software Integrated Systems, Vanderbilt University. A metamodel (modeling paradigm) is a formal description of the model construction semantics. Once the metamodel is specified by the user, it can be used to configure GME itself to present a modeling environment specific to the problem domain. MIC enables design reuse through the models, as various design problems within a domain share resources. Model interpreters are software components that translate the information captured in the models to drive integrated tools that estimate the performance (latency, energy, throughput, etc.) of a system. Model interpreters can also be configured to automatically translate the models into executable specifications. Feedback interpreters are software components that analyze the output generated by the integrated tools and update the models. These interpreters are based on the model construction semantics and thus are suitable for any model based on a given modeling paradigm. Therefore, these interpreters are essentially automation tools that, once written, are reused across several system design problems. GME supports a set of well-defined C++ APIs that provide bidirectional access to the models. Both model and feedback interpreters are developed using these APIs.
6.3 Defining the Metamodels
In the following, we discuss the several metamodels used by our design framework to provide the modeling support.
6.3.1 Resource Modeling
The MILAN resource models define the hardware platforms available for application implementation. The primary motivation for the resource model is to capture the various architectural capabilities that can be exploited to perform design space exploration, and to be able to drive a set of widely used energy and latency simulators from a single model. The resource model, along with the application model, captures the various mapping possibilities of the target system being modeled in MILAN.
The target hardware platforms are modeled in terms of hardware components and the physical connections among them. For reconfigurable hardware, the resource model captures the valid configurations possible with that hardware. Similarly, for processors supporting dynamic voltage scaling, the resource model supports specification of the various operating voltages and the voltage transition costs in terms of latency and energy dissipation. These values can be obtained through simulation or from data sheets provided by the device manufacturers. Several state-of-the-art memory components, such as the MICRON Mobile SRAM, support power saving features such as low-power states and variable self-refresh rates [52]. The designer models the hardware as a set of connected components. The building blocks provided in the MILAN resource modeling paradigm include processing elements (RISC cores, DSPs, FPGAs), memory elements, I/O elements, and interconnects, among others. The physical interconnections between the components are modeled through ports. The resource model imposes structural and compositional constraints on the hardware layout to ensure the validity of the model. A part of the resource metamodel is shown in Figure 6.2.
[Screenshot of the resource metamodel in MetaGME: a ResourceModel folder contains Component models (with attributes CompName, CompSize, Weight, Price, and CompOperFreq), Element models, Port atoms (with a PortNumber field), PhyConnection connections between ports, and a States model proxy.]
Figure 6.2: Resource metamodel
The MILAN resource model is motivated by two related aspects of embedded system design: the available target devices and the widely used simulators for those devices. The classes of target devices supported in a comprehensive manner by the resource model are general-purpose processors and memories. MILAN also provides preliminary support for reconfigurable devices, interconnects, DSPs, and ASICs. The simulators/estimators supported are SimpleScalar, SimplePower, PowerAnalyzer, and the High-level Performance Estimator (described in Chapter 7). In the following, we describe the resource model in detail and provide guidelines for using the resource model to model the target hardware and drive the simulators. The accompanying tutorials provide a more detailed discussion of the use of resource models.
The resource metamodel encompasses the composition rules that govern the modeling of the resources and configures GME for modeling the target hardware. There are several aspects of resource modeling, namely compositional, behavioral, and parametric. "Aspect" in this context differs from the visualization aspects used in GME and refers to the analytical decomposition of resource modeling.
Structural Modeling of Resources
Structural modeling refers to how a target device is composed of different compo
nents. A component might be a processor, memory, or interconnect. Structural modeling
is a high-level specification of the target device.
[Screenshot of additional resource metamodel details in MetaGME: Storage (with attributes such as Capacity, DataREnergy, DataRLat, DataWLat, DataWEnergy, NoParRead, and NoParWrite) specializes into Cache (Depth, Width, ColAccess, Associativity) and Memory; Processing (Processing_Frequency) specializes into ISAProc, Configurable, and ASIC (Functionalities); Interconnect carries AccessEnergy, AccessLatency, and Bandwidth.]
Figure 6.3: Resource metamodel (additional details)
The model Component is an abstract class with two derived sub-classes, Element and Unit. The inclusion of Component within a Unit allows the hierarchical specification of a system. Such a modeling specification allows the designer to visualize a target system as a Unit composed of various sub-Units. For example, the Xilinx Virtex-II Pro [96] can be analyzed as a Unit that consists of two sub-Units: an FPGA and a PowerPC. However, how to model a target device is the designer's choice. As the resource model is primarily used to specify mapping options for the application tasks and to drive the simulators, depending on the application characteristics the same Xilinx Virtex-II Pro can also be visualized as a single Unit with no sub-Units. Such a scenario might arise if the target application is analyzed such that each task is mapped to the complete device, without modeling the details of the interaction between the FPGA and the processor. A typical instance of such a scenario is the use of IP libraries provided by vendors, where the designer uses the IP-cores as black boxes and only the overall performance behavior is exposed during system design. Another example is the use of SimpleScalar [83] as a simulator. Typically, while analyzing a task mapped onto a processor, it is not necessary to provide the details of the cache configuration; the task can be modeled based on the performance estimates only. However, if the task is being specifically analyzed for different cache configurations, or if SimpleScalar is configured to simulate a particular processor, it is necessary to provide the cache configuration details. Therefore, the designer should have the flexibility of modeling the hardware at the required granularity.
The connectivity between the resources is described using Ports. A Port is part of an Element; therefore, any Element can be connected to any other Element. However, the resource model enforces the rule that all connections must go through an Interconnect. This is specified using OCL (Object Constraint Language) constraints [29]. The idea of such a constraint is to enforce an order in how different Elements can be connected and also to provide a place to capture the performance behavior of the interconnect resources within the target devices. Element is further classified as Storage, Interconnect, Processing, IOSpec, and ClockTree. As the names suggest, these models capture the key components of the target devices. Storage is further classified as Cache, Memory, and BranchTargetBuffer. Processing is further classified as ISAProc, Configurable, and ASIC, namely the three primary classes of processing elements (Figure 6.3). Such a classification of the target devices is by no means complete and is still evolving. The ability to evolve is one of the key aspects of MIC (Model Integrated Computing) and is fully supported by GME.
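The structural rules above can be sketched as a small class hierarchy. The class names mirror those in the metamodel, but the Python encoding, including the connection predicate, is our own illustration; the real metamodel is defined graphically in GME with OCL constraints.

```python
# Sketch of the structural resource metamodel: Component is abstract, with
# Element (leaf) and Unit (hierarchical container) as sub-classes, and a
# rule that Elements may only be linked through an Interconnect.
class Component:
    def __init__(self, name):
        self.name = name

class Element(Component):
    pass

class Processing(Element): pass
class Storage(Element): pass
class Interconnect(Element): pass

class Unit(Component):
    """A Unit may contain Elements and sub-Units, giving the hierarchy."""
    def __init__(self, name, children=()):
        super().__init__(name)
        self.children = list(children)

def valid_connection(a, b):
    """All Element-to-Element connections must involve an Interconnect."""
    return isinstance(a, Interconnect) or isinstance(b, Interconnect)

# The Virtex-II Pro modeled as a Unit of two sub-Units, as in the text.
fpga = Unit("FPGA fabric")
ppc = Unit("PowerPC")
v2pro = Unit("Virtex-II Pro", [fpga, ppc])
```

The same device could equally be built as Unit("Virtex-II Pro") with no children, reflecting the coarser-granularity modeling option discussed above.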
Modeling of Operating States
[Screenshot of the state transition metamodel in MetaGME: a States model contains State atoms (StateName, StateIdleEnergy, StEnergyUnit) connected by StateTransition connections (StateTranTime, StateTranEnergy, StLatencyUnit, StEnergyUnit); a DefaultState atom and a DefaultStateConn connection mark the initial state.]
Figure 6.4: State transition metamodel
As energy modeling is one of the major foci of the MILAN environment, the resource modeling provides specific support to model the various energy minimization capabilities provided by state-of-the-art devices. Some such capabilities are the availability of different operating states and the facility of dynamic voltage scaling, which provides a trade-off between speed and energy dissipation. In addition, dynamic reconfiguration of configurable devices is emerging as a key technique to achieve high performance. Therefore, we have added modeling support to capture the various operating states and the state-transition costs associated with different target devices. Figure 6.4 shows the metamodel used to capture operating states.
Essentially, we capture the information that there are several possible states associated with a device and that there is a certain performance cost (time and energy dissipation) associated with each possible transition between the states. For example, the Intel PXA 255 supports three operating frequencies (possibly more): 99.5, 199.1, and 298.6 MHz [35]. A different value of quiescent energy (when the processor is idle) is associated with each of these frequencies. This information is captured through the StateIdleEnergy parameter associated with the State atom. Similarly, transition costs are captured through StateTranTime and StateTranEnergy (Figure 6.4). For reconfigurable devices, the various possible configurations and the reconfiguration costs are also modeled using the above metamodel. The association of the state transition modeling with the main resource metamodel is specified using a model proxy, States, in the mapping metamodel.
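The state and transition-cost information, once captured, directly supports decisions such as the power-down-when-idle policy HiPerE applies in Section 5.3.2. The following sketch shows one way such a model could be evaluated; the state names, powers, and transition costs are invented, not taken from any device data sheet.

```python
# Sketch of the operating-state model: per-state idle power, per-transition
# time and energy costs, and the rule "power down only if the round trip
# saves energy". All values are invented for illustration.
states = {"RUN": {"idle_mw": 150.0}, "SLEEP": {"idle_mw": 1.0}}
transitions = {  # (from, to) -> (time_ms, energy_mj)
    ("RUN", "SLEEP"): (0.1, 0.5),
    ("SLEEP", "RUN"): (5.0, 2.0),
}

def idle_energy_mj(state, idle_ms):
    """Quiescent energy dissipated while sitting in a state (mW * s = mJ)."""
    return states[state]["idle_mw"] * idle_ms / 1000.0

def worth_sleeping(idle_ms):
    """Compare staying in RUN versus RUN -> SLEEP -> RUN over an idle gap."""
    stay = idle_energy_mj("RUN", idle_ms)
    t_down, e_down = transitions[("RUN", "SLEEP")]
    t_up, e_up = transitions[("SLEEP", "RUN")]
    asleep_ms = max(idle_ms - t_down - t_up, 0.0)
    sleep = e_down + e_up + idle_energy_mj("SLEEP", asleep_ms)
    return sleep < stay
```

With these numbers, a 10 ms idle gap is too short to amortize the transition energy, while a 100 ms gap is not, which is exactly the kind of break-even analysis the state-transition costs enable.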
6.3.2 Modeling FPGA
The modeling and performance estimation support for FPGAs provided in MILAN is based on domain-specific modeling [20]. The focus is on FPGA-based designs for typical signal processing algorithms that contain loops and are data oblivious; matrix multiplication and motion estimation are some such examples. There are numerous ways to map an algorithm onto an FPGA, as opposed to mapping onto a traditional processor such as a RISC processor or a DSP, for which the architecture and the components such as the ALU, data path, and memory are well defined. For FPGAs, the basic element is the lookup table (LUT), which is too low-level an entity to be considered for high-level modeling. Therefore, we use domain-specific modeling to facilitate high-level modeling of FPGAs.
[Figure: various kernels (FFT, DCT, matrix multiplication, matrix factorization, CFAR detectors, ...) combined with various architecture families define Domains 1 through n; domain-specific modeling of each domain yields a system-wide energy function.]
Figure 6.5: Overview of domain specific modeling approach
The domain-specific modeling technique facilitates high-level energy modeling for a specific domain (see Figure 6.5). A domain corresponds to a family of architectures and algorithms that implement a given kernel. For example, the set of algorithms implementing matrix multiplication on a linear array is a domain. Detailed knowledge of the domain is exploited to identify the architecture parameters for the analysis of the energy dissipation of the resulting designs in the domain. By restricting our modeling to a specific domain, we reduce the number of architecture parameters and their ranges, thereby significantly reducing the design space. A limited number of architecture parameters also facilitates the development of power functions that estimate the power dissipated by each component (a building block of a design). For a specific design, the component-specific power functions, the parameter values associated with the design, and the cycle-specific power state of each component are combined to specify a system-wide energy function. More details on domain-specific modeling can be found in [20].
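A system-wide energy function of this kind can be sketched as follows for a hypothetical linear-array domain. The power functions and parameter values below are illustrative stand-ins; the actual functions in [20] are derived from domain knowledge and low-level measurements.

```python
# Sketch of a system-wide energy function assembled from component-specific
# power functions, in the spirit of domain-specific modeling. The power
# models and parameters are invented for illustration.
def pe_power_mw(precision_bits):
    """Hypothetical power function for one processing element."""
    return 2.0 * precision_bits            # grows with operand precision

def mem_power_mw(words):
    """Hypothetical power function for one embedded memory block."""
    return 0.05 * words

def system_energy_mj(num_pes, precision_bits, mem_words, cycles, clock_mhz):
    """Sum the component powers, then multiply by the execution time."""
    total_mw = num_pes * (pe_power_mw(precision_bits) + mem_power_mw(mem_words))
    time_s = cycles / (clock_mhz * 1e6)    # cycles -> seconds at this clock
    return total_mw * time_s               # mW * s = mJ

energy = system_energy_mj(num_pes=8, precision_bits=16, mem_words=256,
                          cycles=1_000_000, clock_mhz=100.0)
```

Because the design space is parameterized by only a few values (array size, precision, memory size), sweeping such a function over the parameter ranges is far cheaper than low-level simulation of each candidate design.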
We provide hierarchical modeling support to model the datapath. The hierarchy consists of three types of components: micro, macro, and basic blocks. A basic block is target-FPGA specific. For example, the basic blocks specific to the Xilinx Virtex-II Pro are the LUT, embedded memory cell, I/O pad, embedded multiplier, and interconnects. In contrast, for the Actel ProASIC 500 series of devices, there is no embedded multiplier. Micro blocks are basic architecture components such as adders, counters, and multiplexers designed using the basic blocks. In principle, there is no difference between a basic block and a micro block; the classification is introduced to enable the logical creation of a basic library per device. A macro block is an architecture component that is used by some instance of the target class of architectures associated with the domain. For example, if a linear array of processing elements (PEs) is our target architecture, a PE is a macro block.

Each building block is associated with a set of component-specific parameters. Power states is one such parameter, referring to the various operating states of each building block. For example, we can model two states, ON and OFF, for each micro and basic block. In the ON state the component is active, and in the OFF state it is clock gated. For macro blocks it is possible to have more than two states due to the different combinations of the states of the constituent micro and basic blocks. Power is specified as a function or a constant value (in the example model, the power for the different components is specified as constants).
In addition, each block can be associated with a set of variables. Precision, the depth and width of a memory, and the size of a register or memory are some examples of variables that can be associated with a component.
While the datapath is modeled as specified above, modeling the control flow is relatively tricky. The focus of our modeling and estimation capability is rapid energy, latency, and area estimation. Area can be estimated based on the model of the datapath (the sum of the component areas). In order to model the control flow, we make use of CPS matrices. Component Power State (CPS) matrices capture the power state of every component in each cycle. For example, consider a design that contains k different types of components (C_1, ..., C_k) with n_i components of type i. If the design has a latency of T cycles, then k two-dimensional matrices are constructed, where the i-th matrix is of size T x n_i. An entry in a CPS matrix represents the power state of a component during a specific cycle and is determined by the algorithm (see Figure 6.6).
[Figure: a CPS matrix with cycles as rows and components as columns; each entry is the state of a component in a cycle.]
Figure 6.6: Component power state matrices
However, the specification of such a matrix is not easy. Hence, we take advantage of the typical loop-oriented structure of kernels such as matrix multiplication and FFT, for which the FPGA-based designs are created. If we analyze the CPS matrices, we can observe that an easier way to specify the same information is through a table. Such a table contains a number of rows, where each row is a 3-tuple (component, state, number of cycles in this state). As we are interested only in performance estimation, this much information is sufficient.
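The 3-tuple table can be evaluated directly, as the following sketch shows. The component types, their per-state powers, and the example schedule are invented for illustration; only the (component, state, cycles) table structure comes from the text above.

```python
# Sketch of energy estimation from the (component, state, cycles) table that
# replaces the full CPS matrices. Power-per-state values are invented.
POWER_MW = {  # (component type, state) -> power in mW
    ("adder", "ON"): 1.2, ("adder", "OFF"): 0.05,
    ("mult",  "ON"): 4.5, ("mult",  "OFF"): 0.10,
}

def energy_mj(cps_table, clock_mhz):
    """Sum power * time over every (component, state, cycles) row."""
    total = 0.0
    for component, state, cycles in cps_table:
        seconds = cycles / (clock_mhz * 1e6)
        total += POWER_MW[(component, state)] * seconds   # mW * s = mJ
    return total

# One adder active for 900 of 1000 cycles, one multiplier active for 500.
table = [("adder", "ON", 900), ("adder", "OFF", 100),
         ("mult", "ON", 500), ("mult", "OFF", 500)]
```

For a loop-oriented kernel, the table stays small even when T is large, because long runs of identical states collapse into single rows.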
6.3.3 Modeling Applications
The application modeling paradigm is based on a dataflow representation. The key capabilities of this modeling paradigm are:
• hierarchy to handle large dataflow graphs
• the ability to capture design and implementation alternatives
• support for pseudo tasks to model state transitions between task executions
• support for association of task implementations with the leaf nodes
A dataflow graph consists of a set of compute nodes and directed links connecting them, representing the flow of data. Each of these nodes can be a leaf, a compound, or an alternative node. A compound node can contain all three kinds of nodes, which specifies the hierarchy. A node is an alternative node if it contains the alternative implementations associated with that node. Each node can be associated with input and output ports. These ports are used to connect the nodes to model the data flow. Each leaf node represents the actual implementation and mapping of a task (or task alternative).
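The node hierarchy can be sketched as a small set of classes, and counting the alternative combinations also shows how design-space sizes such as the 243-point space of Section 5.3.2 arise. The class encoding and the five-task example are our own illustration, not the GME metamodel itself.

```python
# Sketch of the dataflow node hierarchy: leaf, alternative, and compound
# nodes. Counting leaf/alternative combinations gives the design-space size.
class Leaf:
    """An actual implementation/mapping of a task."""
    def __init__(self, name):
        self.name = name
    def count(self):
        return 1

class Alternative:
    """A node whose children are mutually exclusive implementations."""
    def __init__(self, name, options):
        self.name, self.options = name, options
    def count(self):
        return sum(o.count() for o in self.options)   # choose one option

class Compound:
    """A node whose children all execute; choices multiply."""
    def __init__(self, name, children):
        self.name, self.children = name, children
    def count(self):
        total = 1
        for c in self.children:
            total *= c.count()
        return total

# A hypothetical application of 5 tasks, each with 3 implementation
# alternatives, yields 3^5 = 243 distinct designs.
app = Compound("app", [
    Alternative(f"task{i}", [Leaf(f"task{i}_impl{j}") for j in range(3)])
    for i in range(5)
])
```

Because alternatives sum and compounds multiply, the design-space size of an arbitrarily nested graph falls out of one recursive traversal.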
In the Mapping aspect of the application models, references to the resource models can be created. These references are used to indicate that an application component can be realized on the referenced hardware platform. All Primitives need to have mapping models created. Configuration models are used to contain simulation information about specific mappings of application components to physical resources. These models contain references to all the primitives contained in the current application hierarchy and to all the resources that could be used to implement these components. A connection is made between the application primitives and the resources to indicate which application primitives were simulated on which resources. The configuration model itself captures the latency, throughput, and power characteristics of the simulation through the use of Configuration Model attributes. It is up to the user to ensure that the types of data stored are consistent. Figure 6.7 shows the mapping metamodel defined in MILAN.
[Screenshot of the mapping metamodel in MetaGME: AppComponent references (proxies for SyncComponent, AsyncComponent, and hwModule, each with fields such as AltSelection, Priority, FiringCondition, Includes, VHDLIncludes, InitScriptName, and InitFileName) connect through Mapping connections to Resource references (ResourceSelection) that point at Units; Configuration models carry Area, ConfigurationSelection, EnergyEstimate, LatencyEstimate, ThroughputEstimate, SizeOfInpData, SizeOfOutData, and unit fields; StateRef references and OperState connections link to State atoms, and FPGAModelRef references link to FPGADesign models.]
Figure 6.7: Mapping metamodel
6.4 Integrating and Driving Tools
Simulators are integrated by developing appropriate model interpreters using the APIs supported by GME. These APIs provide constructs to traverse the models, retrieve information from the models, and store information back in the models. In order to drive a simulator, the designer has to provide the necessary information in the models. For example, to drive SimpleScalar, a long list of parameters is needed to configure the simulator to match the target processor it is modeling [83]. The MILAN models provide the required place-holders (fields) for the information needed by the simulators. All these fields are initialized with the default values specified by the simulators; if two simulators specify conflicting defaults, one of the values is chosen. The designer modifies the values in these fields as the requirements dictate. A model interpreter is associated with each simulator and is responsible for driving it. The interpreter traverses the model, extracts the required information, and formats it as the simulator requires. Most simulators specify a particular configuration-file format; the model interpreter generates such a configuration file and optionally invokes the simulator with additional input, such as high-level source code and input data (typically obtained from the application models). The interpreters also capture additional information that is not specific to the device but is required by the simulators, such as the simulator scheduling policy used by SimpleScalar and PowerAnalyzer. Additionally, feedback interpreters extract the simulation results and store them back in the models. The model interpreters for the simulators and the associated feedback interpreters complete the simulation loop.
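The interpreter-to-configuration-file path described above can be sketched as follows. This is an illustrative Python sketch, not the actual GME interpreter API: the `sim_` attribute convention, the field names, and the default values are all assumptions made for the example.

```python
# Sketch of a model interpreter that walks a (here, dict-based) model tree,
# collects simulator-specific fields, and writes a SimpleScalar-style
# "-flag value" configuration file.  Field names and defaults are
# illustrative, not the actual MILAN/GME schema.

DEFAULTS = {"issue_width": 4, "l1d_size": "16KB", "branch_pred": "bimod"}

def collect_fields(node, fields=None):
    """Depth-first traversal gathering 'sim_' attributes from model nodes."""
    if fields is None:
        fields = dict(DEFAULTS)                  # start from simulator defaults
    for key, value in node.get("attrs", {}).items():
        if key.startswith("sim_"):
            fields[key[len("sim_"):]] = value    # designer-supplied override
    for child in node.get("children", []):
        collect_fields(child, fields)
    return fields

def write_config(fields):
    """Format fields in the '-name value' style many simulators expect."""
    return "\n".join(f"-{name} {value}" for name, value in sorted(fields.items()))

model = {"attrs": {"sim_issue_width": 2},
         "children": [{"attrs": {"sim_l1d_size": "32KB"}}]}
print(write_config(collect_fields(model)))
```

A feedback interpreter would perform the inverse walk, parsing the simulator's result file and writing the estimates back into the corresponding model fields.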
6.5 High-level Performance Estimator (HiPerE)
As discussed earlier, one of the major challenges in system-level performance estimation is the lack of a standard interface among the component-specific simulators, which makes it difficult to integrate them to simulate a heterogeneous embedded system. HiPerE addresses this issue by combining component-specific performance estimates through interpretive simulation to derive system-level performance values.
The GenM model describing the target heterogeneous embedded system is the primary input to HiPerE. In addition, the performance parameters are also provided as input. In our methodology, various optimizations may be performed before invoking HiPerE; in that case, the subset of designs identified by the optimization technique is evaluated by HiPerE. A designer can also choose not to perform any optimization and instead apply a brute-force technique to evaluate each possible design, exploiting the rapid estimation capability of HiPerE.
For performance estimation of a given design, HiPerE needs the mapping (specified by the design). The mapping identifies the computing element each task is mapped to and provides the operating voltage (or configuration) if that element is the processor (or the FPGA). HiPerE uses the mapping information to identify the appropriate component-specific estimates (t_ij or t'_ij for latency and e_ij or e'_ij for energy). The designer provides initial values for all the performance parameters; later, component-specific performance estimation is used to obtain more accurate values for t_ij, t'_ij, e_ij, and e'_ij. In addition to these inputs, the application task graph, which captures the dependencies among tasks, is also provided. The task graph determines the order of execution for the tasks (using a topological sort). For the memory component, the designer provides a schedule of power states. Currently, we support changing the memory power state only at task boundaries.
The output of HiPerE is system-level energy and latency estimates. Along with these estimates, HiPerE also generates an activity report for each component in the target architecture. The activity report identifies the voltage settings, configurations, and power states of the processor, FPGA, and memory component, respectively, during the course of execution. It also provides the duration of idle time (if any) between task executions for the processor and the reconfigurable component.
6.5.1 Component Specific Performance Estimation
[Figure: component-specific performance estimation flow. The MILAN application and resource models produce the source code implementing a task and a component-specific configuration for a low-level simulator; the resulting estimates feed back to update the energy and latency parameters (t_ij, t'_ij, e_ij, and e'_ij).]
Figure 6.8: Component specific performance estimation using MILAN
Component-specific performance estimation refers to the evaluation of the performance parameters t_ij, e_ij, t'_ij, and e'_ij for a task in a particular voltage setting or configuration. There are several techniques to estimate component-specific performance values, such as complexity analysis, graph interpolation [31], trace analysis [45, 86], and cycle-accurate simulation [98, 83]. While complexity analysis does not require a simulator, all the other techniques use a simulator based on an architecture model at an appropriate level of abstraction.
We exploit the isolated simulation feature of the MILAN framework to perform component-specific simulation (Figure 6.8). This feature refers to the ability to simulate a single application task on a specific hardware component. The resulting performance estimates are used to automatically update the performance parameters. The MILAN framework features an application model, a resource model, and a mapping model [58]. The resource model consists of all the parameters of the GenM model and various additional parameters that are used to drive the simulators. The application model captures the application details as a data-flow graph. The per-task data sizes θ_in and θ_out are also part of the application model; the other performance parameters are part of the mapping model. The designer also provides an implementation of each task, for example, in C or Java.
Once a task has been selected for isolated simulation, based on the computing element it is mapped to, MILAN generates an appropriate simulator-configuration file and a source file (in a high-level language) that implements the task. While modeling the application, the designer provides source and destination scripts for each task that generate input for the task and consume output from the task. These two scripts are used by MILAN during the generation of a program that implements the task. For example, if FFT is a task mapped onto a MIPS processor and SimpleScalar is the chosen simulator, MILAN generates a C program implementing FFT and a SimpleScalar configuration file. After the simulation is performed, the performance estimate is provided as feedback to MILAN, which uses it to update the initial performance estimates provided by the designer.
Component-specific performance estimation is used to improve the accuracy of the initial estimates provided in the GenM model. We assume that when a designer provides a GenM model for a specific problem, the performance estimates (initial values) are also provided.
Before moving to system-level performance estimation, we derive a composite performance estimate for each task. The composite performance estimate includes all the set-up costs of task execution in addition to the cost of the execution itself: the costs of execution, data access, memory activation, and reconfiguration or voltage variation. For example, assume that task T_i is mapped onto the FPGA with configuration C_j, and let C_k be the previous configuration. If we assume that no memory power-state transition occurred, the composite latency estimate (TC_i) can be evaluated as:

TC_i = t'_ij + (θ^i_in + θ^i_out) · t_a + R_kj    (6.1)

where t_a is the per-unit data access latency and R_kj is the latency of reconfiguring from C_k to C_j. A similar composite estimate (EC_i) is derived for the energy dissipation of task T_i. In the following subsection, the component-specific performance estimate of a task refers to the composite performance estimate for that task.

6.5.2 System-Level Performance Estimation

[Figure: a sample application task graph in which each task is mapped to either the processor or the FPGA.]
Figure 6.9: Sample task graph with mapping
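A minimal sketch of the composite estimate of Eq. (6.1) follows. The parameter names (per-unit access latency, reconfiguration latency) are illustrative symbols for the costs named in the text, not MILAN identifiers.

```python
# Composite latency for a task mapped to the FPGA, per Eq. (6.1):
# execution latency in the chosen configuration, plus a data-access cost
# proportional to the input/output sizes, plus the latency of
# reconfiguring from the previous configuration.

def composite_latency(t_exec, size_in, size_out, t_access, reconf_latency):
    """TC_i = t'_ij + (theta_in + theta_out) * t_a + R_kj."""
    return t_exec + (size_in + size_out) * t_access + reconf_latency

def composite_energy(e_exec, size_in, size_out, e_access, reconf_energy):
    """EC_i is formed the same way from the energy parameters."""
    return e_exec + (size_in + size_out) * e_access + reconf_energy

# Example: 5 ms execution, 1000 + 500 words moved at 0.001 ms/word,
# 2 ms to reconfigure from C_k to C_j.
print(composite_latency(5.0, 1000, 500, 0.001, 2.0))  # -> 8.5
```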
We employ the following technique to evaluate system-level energy and latency and to generate the activity report. Figure 6.9 shows a sample task graph with tasks mapped onto either the processor or the FPGA. Our technique is as follows.

Let MP_i denote the mapping (given as input) for task T_i, where MP_i = 1 when T_i is mapped on the processor and MP_i = 2 when T_i is mapped on the FPGA. Let σ denote the list of tasks {T_π1, T_π2, ..., T_πn} obtained by a topological sort of the original task graph. Let A_1 and A_2 denote the earliest available times for executing a task on the processor and the FPGA, respectively; initially A_1 = A_2 = 0. Let ω_πk denote the completion time of task T_πk. The following algorithm calculates the earliest start time for each task without violating the dependencies given in the application task graph; it is assumed that each computing element uses a non-preemptive scheduling policy. The pseudo-code for the algorithm is provided below.

for k ← 1 to n do
    Let β be the set of immediate predecessors of T_πk
    The earliest start time for T_πk is τ = max{max_{T_i ∈ β}{ω_i}, A_{MP_πk}}
    Set ω_πk = τ + TC_πk, and A_{MP_πk} = ω_πk

As a result, max{A_1, A_2} is the system-level latency. The idle time of the processor is calculated as IT_1 = A_1 − Σ_{MP_i = 1} TC_i. Therefore, the idle energy dissipation of the processor is IE_1 = IT_1 · e^idle_1, where e^idle_1 is the idle power of the processor. Similarly, the idle time and energy dissipation of the FPGA, IT_2 and IE_2, are also calculated. The total energy dissipation is the sum of the individual component-specific energy dissipations and the energy dissipated while the components are idle. Thus the total energy is evaluated as:
Total Energy = Σ_{i=1}^{n} EC_i + IE_1 + IE_2    (6.2)

Clearly, the complexity of the above method is O(N + Q), where N and Q are the numbers of nodes and edges in the task graph, respectively.
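The scheduling loop and the idle-energy accounting above can be sketched as follows. This is a Python sketch, not HiPerE itself (which is implemented in Java); the task list, mapping codes, and idle-power values are illustrative.

```python
# System-level estimation over a topologically sorted task list.  Each
# task carries its mapping (1 = processor, 2 = FPGA) and composite
# estimates TC_i (latency) and EC_i (energy).

def estimate(tasks, preds, idle_power):
    """tasks: list of (name, mapping, tc, ec) in topological order.
    preds: name -> list of predecessor names.  Returns (latency, energy)."""
    avail = {1: 0.0, 2: 0.0}           # A_1, A_2: earliest free time per element
    finish = {}                        # omega_i: completion time per task
    busy = {1: 0.0, 2: 0.0}            # accumulated execution time per element
    energy = 0.0
    for name, mp, tc, ec in tasks:
        start = max([finish[p] for p in preds.get(name, [])] + [avail[mp]])
        finish[name] = start + tc
        avail[mp] = finish[name]       # non-preemptive: element busy until done
        busy[mp] += tc
        energy += ec
    latency = max(avail[1], avail[2])
    for elem in (1, 2):                # IT_e = A_e - sum of TC_i on element e
        energy += (avail[elem] - busy[elem]) * idle_power[elem]
    return latency, energy

tasks = [("T1", 1, 4.0, 2.0), ("T2", 2, 3.0, 1.5), ("T3", 1, 2.0, 1.0)]
preds = {"T2": ["T1"], "T3": ["T1", "T2"]}
print(estimate(tasks, preds, {1: 0.1, 2: 0.05}))
```

The single pass over tasks and predecessor edges mirrors the O(N + Q) bound stated above.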
HiPerE is implemented in Java and can run on both Unix and Windows platforms. HiPerE is also integrated into the MILAN framework; therefore, it is possible to automatically generate input for HiPerE and execute it to obtain the performance estimates. We are currently developing the feedback mechanism to automatically store the HiPerE output in the MILAN framework.
6.5.3 Activity Report
The activity report is generated based on the processed task graph with the mapping
information and the time of completion for each task. The designer can exploit the
activity report to identify bottlenecks and optimization opportunities. One possible optimization is to take advantage of the idle time and use a lower DVS setting to execute a task slowly in order to save energy. Due to space constraints, the first two tables are truncated; the user can generate the complete activity report by invoking HiPerE for the given model. There are two sets of tables in the activity report. The first set of tables captures the details of task execution for each processing element. Each table has one row for
each task executed on the processor. The tasks are ordered based on their dependency.
Each row provides the name of the task, the operating state of the device while executing
the task, total time consumed and energy dissipated, time and energy for state transition
(if any), time and energy for the idle period (if any) before execution of the task, time
and energy for just task execution, and finally, the start time and end time for the task.
These tables summarize the activity on a device.
The second set of tables provides a list of idle periods, the length of each idle period, and its start and end times. This information can be used to identify optimization opportunities that take advantage of the available idle time to reduce energy without affecting the overall latency.
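Building the two sets of tables from a processed schedule can be sketched as follows; the record layout is an assumption for illustration, not HiPerE's actual report format.

```python
# Sketch of deriving the per-device activity tables from a processed
# schedule.  Row fields follow the description in the text (task, state,
# total time, transition time, idle time before, start, end); the input
# record layout is illustrative.

def activity_rows(schedule):
    """schedule: list of dicts with task, state, start, end, trans_time,
    idle_before for one device.  Returns (task_rows, idle_periods)."""
    rows, idle = [], []
    for rec in schedule:
        rows.append((rec["task"], rec["state"], rec["end"] - rec["start"],
                     rec["trans_time"], rec["idle_before"],
                     rec["start"], rec["end"]))
        if rec["idle_before"] > 0:   # second table: (length, start, end)
            idle.append((rec["idle_before"],
                         rec["start"] - rec["idle_before"], rec["start"]))
    return rows, idle

sched = [{"task": "FFT", "state": "V1", "start": 0.0, "end": 4.0,
          "trans_time": 0.0, "idle_before": 0.0},
         {"task": "whiten", "state": "V2", "start": 7.0, "end": 9.0,
          "trans_time": 0.5, "idle_before": 3.0}]
rows, idle = activity_rows(sched)
print(idle)  # one idle period of length 3.0, from t=4.0 to t=7.0
```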
6.5.4 Performance Estimation based on Duty-Cycle
HiPerE also supports performance estimation based on a duty-cycle specification. Duty cycle, in the context of application execution, refers to the proportion of time during which a component, device, or system is operated. Support for duty cycles includes the ability to estimate performance for a given length of time or number of execution instances while taking into account start-up and shut-down costs, idle energy dissipation, and the rate of input.

In addition, a duty-cycle aware estimator needs to support applications with multi-rate execution. An application modeled as a set of tasks is said to be multi-rate if different tasks have different rates of execution. A multi-rate application needs to adapt based on the input or environment conditions. Hence, we have enhanced HiPerE to estimate the performance of different execution instances based on the rate of execution of individual tasks.
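A duty-cycle estimate of the kind described can be sketched as follows. The cost model (one start-up and one shut-down per window, idle power charged for the remainder) and all parameter names are illustrative assumptions, not HiPerE's actual model.

```python
# Duty-cycle based estimate (sketch): energy for running `instances`
# executions of the application within a time window, charging start-up,
# shut-down, and idle energy for the unused portion of the window.

def duty_cycle_energy(window, instances, lat_per_run, e_per_run,
                      e_startup, e_shutdown, p_idle):
    active = instances * lat_per_run
    assert active <= window, "window too short for the requested instances"
    idle = window - active
    return e_startup + instances * e_per_run + idle * p_idle + e_shutdown

# Example: 10 runs of a 2 ms application inside a 100 ms window.
print(duty_cycle_energy(100.0, 10, 2.0, 5.0, 1.0, 1.0, 0.02))
```

Multi-rate execution would be handled by giving each task its own instance count and summing per-task contributions in the same way.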
6.5.5 Design Browser
[Figure: design browser screenshot. A table lists candidate designs, one row per design, showing the selected operating state of each task (e.g., ComputeCV, velFilter, ComputeInv, whiten, dmean) together with latency and energy estimates, and links to the activity report for each design number (0 through 7).]
Figure 6.10: Design browser
The MILAN design browser is a graphical front-end to HiPerE. The input to the browser is the set of designs identified by heuristic-based design space exploration tools such as DESERT. Figure 6.10 shows a snapshot of the design browser. Among the features supported are display of the mapping information for the designs identified by the pruning heuristics, invocation of HiPerE on one or more designs, specification of duty-cycle parameters, and visual comparison of the designs based on the estimates of latency and energy dissipation.
Using the design browser, the designer can perform trade-off analysis using the estimation capabilities of HiPerE. The designer can also evaluate the performance impact of allowing the processing components to idle or shutting them down when not in use. HiPerE also produces an activity report for the entire duration of simulation in a duty-cycle based scenario, which can be viewed and analyzed through the design browser.
6.6 Dynamic Programming based N-Optimization
Heuristic
We have enhanced MILAN to integrate an N-optimization heuristic that identifies N optimal solutions for a class of applications that can be modeled as a linear array of tasks. A linear array of tasks consists of an ordered set of tasks, each having at most one input and at most one output. There is only one source task (one output and no input) and one sink task (one input and no output). The application is to be mapped onto a target device that supports multiple operating states. An operating state can be a configuration in the case of reconfigurable devices or an operating voltage (frequency) in the case of processors supporting dynamic voltage (frequency) scaling. Each task can be executed in different operating states; hence, each task is associated with a unique performance estimate (energy dissipation and latency) for each operating state. Transitions between the operating states are associated with transition costs (energy dissipation and latency), which depend on the source and destination states. In this scenario, a design is a set of operating states, one operating state per task. The performance of a design is the sum of the execution cost of each task in the corresponding operating state and the state transition costs between task executions (Figure 6.11). The N-optimization problem can be formally defined as: given a linear array of tasks, a device supporting multiple operating states, and performance estimates for task executions and state transitions, find the set of designs with the N lowest values of latency or energy dissipation. The proofs of complexity and correctness are provided below.
[Figure: a linear array of tasks, each mapped to one of several operating states, with state transitions between consecutive task executions.]
Figure 6.11: Linear array of tasks and state transition
Let the set of r tasks be T'_1 T'_2 ... T'_r and the set of m possible operating states be S_1 S_2 ... S_m. Our focus is to minimize energy dissipation while solving the N-optimization problem; minimization of latency can be analyzed in the same fashion. We use dynamic programming to identify the N optimal solutions. In the solution space of the optimization problem considered, a sequence of configurations for a given sequence of tasks may occur as part of a solution to multiple larger sequences. Dynamic programming is utilized to compute the optimal solution for the complete sequence of tasks by using solutions to smaller subsequences. Once the N optimal solutions for executing up to task T'_i are determined, the energy dissipation for executing up to task T'_{i+1} can be determined. This approach is applied recursively to compute the N optimal solutions. We define min_N() as a function that, given a set of values, identifies the N smallest values; min_N() is a selection problem and can be solved in O(n) time if the number of inputs is n [23].

Theorem: Given a sequence of tasks T'_1 T'_2 ... T'_r and m possible states S_1 S_2 ... S_m, the N optimal solutions (sequences of states for executing these tasks) with minimum energy dissipation can be computed in O(r m² N) time.

Proof: We use the dynamic programming approach to compute the N optimal solutions. Let E_ijk, k = 1 ... N, denote the energy dissipation values of the N optimal solutions for execution up to task T'_i ending in state S_j, and let E_ij be the set of N values E_ij1 ... E_ijN. We initialize the values E_0jk for all j : 1 ≤ j ≤ m and all k : 1 ≤ k ≤ N.
Assume that the N optimal solutions for executing up to task T'_i ending in each possible state S_k have been computed. Now, for each of the possible states S_j ∈ S in which T'_{i+1} can be executed, we compute the N optimal solutions for sequences of states ending in S_j as follows:

E_{i+1,j} = min_N({ E_ikl + q_kj + e_{i+1,j} : ∀k, 1 ≤ k ≤ m; ∀l, 1 ≤ l ≤ N }), ∀j : 1 ≤ j ≤ m

where q_kj is the energy cost of the transition from state S_k to state S_j and e_{i+1,j} is the energy dissipated executing T'_{i+1} in state S_j. Thus we examine all possible ways to execute T'_{i+1} once we have finished executing T'_i. If each set of values E_ik is N-optimal, then the values E_{i+1,j} are also N-optimal. Computation of each N-optimal set of values takes O(mN). Since there are O(rm) sets of values to be computed, the total time complexity is O(r m² N).
Proof of correctness: Assume there exists a state sequence Π for execution up to task T'_{i+1} ending in state S_j whose energy dissipation is less than that of one of the N optimal solutions computed for execution up to task T'_{i+1} ending in state S_j. Let S_k be the state in sequence Π during the execution of task T'_i, so that Π − S_j is the sequence of states for execution up to task T'_i. If this sequence of states (solution) is part of the N optimal solutions E_ik, then there is a contradiction, as we must have evaluated state sequence Π while searching for solutions ending in S_j for execution up to task T'_{i+1}. Conversely, if the sequence Π − S_j is not part of the N optimal solutions, then the energy dissipation of Π cannot be less than E_ikl + q_kj + e_{i+1,j} for any l : 1 ≤ l ≤ N, as E_ik contains the N optimal solutions for execution up to T'_i ending in state S_k. Hence, the solutions selected are the N optimal solutions.
MILAN already supports modeling of a linear array of tasks and a reconfigurable device with a set of configurations and reconfiguration costs. Therefore, an implementation of the above heuristic can be integrated into MILAN.
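The dynamic program from the theorem can be sketched as follows; min_N is realized here with `heapq.nsmallest`, and the cost matrices are illustrative examples, not measured data.

```python
import heapq

def n_optimal(energy, trans, n_best):
    """energy[i][j]: energy of task i in state j; trans[k][j]: transition
    energy from state k to state j.  Returns the N best total energies
    over all state sequences, in O(r * m^2 * N) time."""
    m = len(energy[0])
    E = None                                   # E[j]: N best totals ending in state j
    for task_cost in energy:
        new_E = []
        for j in range(m):
            if E is None:                      # first task: no transition cost
                cands = [task_cost[j]]
            else:
                cands = [E[k][l] + trans[k][j] + task_cost[j]
                         for k in range(m) for l in range(len(E[k]))]
            new_E.append(heapq.nsmallest(n_best, cands))   # the min_N step
        E = new_E
    return sorted(heapq.nsmallest(n_best, [v for row in E for v in row]))

energy = [[3.0, 5.0], [4.0, 1.0]]              # two tasks, two states
trans = [[0.0, 2.0], [2.0, 0.0]]
print(n_optimal(energy, trans, 3))             # -> [6.0, 6.0, 7.0]
```

The two 6.0 entries are distinct designs (S1→S2 and S2→S2) with equal energy, which is exactly why a set of N candidates is carried forward rather than a single optimum.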
[Figure: bar chart, "Error rate vs. size of N". For each DSE case study (app-4 through app-7), four bars show the minimum N for error rates of 10, 15, 20, and 25.]
Figure 6.12: Effect of error rate on N
While using the N-optimization heuristic, the designer specifies the value of N. As discussed earlier, by selecting a set of designs using the pruning heuristics, we overcome the effect of errors due to high-level modeling. Figure 6.12 summarizes an experiment we conducted to study the effect of the error rate on the value of N. We considered the set of applications used in Section 4.4.2. In this experiment, we identified the lowest value of N (in multiples of 10) for which the real-optimal design was included in the selected set. The x-axis indicates the different applications (with 4, 5, 6, and 7 tasks) and the y-axis indicates the value of N. Each bar shows the minimum value of N, determined by averaging over 3 experiments per application; the four bars per application correspond to four different error rates. As expected, the required N increases as the error rate increases. In addition, the value of N affects the number of designs that need to be evaluated by high-level estimation. Therefore, the designer should also take estimation and simulation time into account while deciding the value of N.
On the other hand, DESERT does not support specification of N while performing DSE. The number of designs identified depends on the constraints specified during modeling. However, the designer can tighten or loosen the constraints to vary the number of designs selected by DESERT [41].
6.7 Design Space Exploration for FPGA based Designs
Design space exploration refers to the evaluation of the specified designs to identify candidates that meet the specified design and performance constraints.
Enumerating Kernel Designs
Given a kernel, several domains can be identified, and within each domain several designs can be identified [20]. All possible designs within a domain are analyzed as follows:

• For a given input size, the designs are analyzed to evaluate the energy dissipated by each type of component as a fraction of the total energy dissipation. This analysis helps identify candidates for energy optimization. A similar analysis is also performed for area.

• For a given input size, a single design parameter is varied while keeping the others constant, and the performance metrics are plotted as graphs to observe the effect of each parameter on each metric. Depending on the number of parameters, several such graphs are possible.

• If alternatives exist for a building block (e.g., the embedded multiplier and a configured multiplier in Virtex-II Pro), each option is considered to study the trade-off among the performance metrics.
The design framework automatically generates a set of designs and estimates the latency, energy, and area associated with each design. We discuss a uniprocessor (PE) implementing the "usual" block matrix multiplication as an example domain to demonstrate the exploration of kernel designs.
This domain uses a single multiplier and results in area- and energy-efficient designs. We consider an off-chip design where all matrix data are stored in an external memory outside the FPGA. In this design, the PE has one MAC (multiplier and accumulator), a cache (local buffer) of size c, and I/O ports. Block matrix multiplication is performed with block size √c × √c. We identified four components: the MAC, the cache, and the memory banks as RModules, and the I/O as an Interconnect. The RModules have w-bit precision. Therefore, the cache size (c) and precision (w) are the parameters that can be varied at design time. To implement the MAC in Virtex-II, there are two design choices: a CLB-based multiplier and a dedicated multiplier. A dedicated multiplier is a stand-alone ASIC-based multiplier; a CLB-based multiplier is built using CLBs and was observed to dissipate more power than a dedicated multiplier. Similarly, there are two design choices for implementing the cache using CLBs. If the cache size is small, the cache can be realized using CLBs configured as register modules; a larger cache can be realized using CLBs configured as SRAM [96]. However, an SRAM-based cache can only be configured in multiples of 16 bytes. Therefore, the model discussed above provides several design choices based on the choice of multiplier, cache, cache size, and precision.
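The enumeration for this domain can be sketched as follows; the register-cache size limit and the structure of the choice tables are illustrative placeholders, not measured Virtex-II data.

```python
# Enumerating kernel designs for the block matrix-multiply domain (sketch).
# Design choices: multiplier type, cache realization, cache size c, and
# precision w.  The numeric limits are illustrative assumptions.

from itertools import product

MULT = ("dedicated", "clb")                      # MAC implementation choices
CACHE = {"register": {"max_size": 64},           # registers: small caches only
         "sram": {"granularity": 16}}            # SRAM: multiple of 16 bytes

def enumerate_designs(cache_sizes, precisions):
    designs = []
    for mult, cache, c, w in product(MULT, CACHE, cache_sizes, precisions):
        if cache == "register" and c > CACHE["register"]["max_size"]:
            continue                             # too large for register cache
        if cache == "sram" and c % CACHE["sram"]["granularity"] != 0:
            continue                             # not a multiple of 16 bytes
        designs.append({"mult": mult, "cache": cache, "c": c, "w": w})
    return designs

designs = enumerate_designs([16, 64, 100, 256], [8, 16])
print(len(designs))  # -> 20
```

Each enumerated design would then be handed to the cost functions of Section 6.8 (step 2) to attach latency, energy, and area estimates before the trade-off analysis.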
Exploring Application Design
Exploring the application design involves the use of DESERT and HiPerE to evaluate the design space and identify the designs that meet the given performance constraints. The application design space is defined by the choice of implementations associated with the application tasks. Therefore, an application with n tasks T_1 ... T_n, where each task T_i has A_i implementation alternatives, has a design space of size ∏_{i=1}^{n} A_i, so the initial design space can be large. Our experience with DESERT shows that we can prune a design space with approximately 10^20 to 10^40 designs in the order of minutes [58]. DESERT uses symbolic methods based on Ordered Binary Decision Diagrams (OBDDs) for constraint satisfaction [41]. Therefore, using DESERT we can quickly prune a large design space to identify designs that meet the specified design and performance constraints. However, DESERT is not an optimization tool; it selects a set of designs. Therefore, we use HiPerE to evaluate the designs selected by DESERT. HiPerE evaluates system-level energy dissipation and latency. In order to provide a rapid estimate, HiPerE operates at the task-level abstraction of the application. In addition to the task execution cost, the other aspects considered by HiPerE for accurate performance estimation are the data access cost, parallelism in the system, energy dissipation when a component is idle, and state transition cost. Our results for signal processing applications show that HiPerE estimates are within 8% of the estimates obtained using low-level simulations [55, 58].
6.8 Design Flow
As shown in Figure 6.13, the design flow using our framework consists of six steps. The first three steps deal with modeling the kernel designs based on domain-specific modeling and identifying design choices. The last three steps perform application-level modeling and design space exploration. In the following, we discuss each step in detail.
Modeling Kernel Designs (1): In this step, the designer analyzes the kernels to define domain-specific models. The designer identifies the micro, macro, and library blocks and the associated component-specific parameters. The model of the data path is graphically constructed in this step using GME. The designer can also specify high-level scripts for the building blocks to be used in the next step. In addition, the CPS matrices for the algorithm are also specified.
Parameter Estimation (2): Estimation of the cost functions for power and area involves synthesis of a building block, low-level simulations, and, in the case of power, the use of confidence intervals to generate statistically significant power estimates [20]. The simulations are performed off-line or, if the required simulator is integrated, automatically using the specified high-level scripts. If a library of models is available instead, the stored performance estimates are used directly. Latency functions are estimated using the CPS matrices. System-wide energy is estimated using the latency function and the component-specific power functions.
Enumeration and Tradeoff Analysis (3): In this step, the designer chooses the candidate kernel designs that will be evaluated while designing applications. Given a domain-specific model of a kernel, a set of designs is identified based on the parameter values and binding choices. The framework also generates comparison graphs to compare the performance of the designs.
Hierarchical Dataflow Modeling (4): Once we have identified the implementation choices for each kernel, we construct the application model as a hierarchical data flow with alternatives. Compound, alternative, and leaf nodes are used to specify the application model. Each leaf node is associated with the FPGA on which the kernel will be implemented, and with performance estimates obtained using the high-level performance estimator.
Modeling Reconfiguration (5): Based on the mapping and the area estimates of the task implementations, pseudo nodes are introduced to model reconfiguration. This step is automatic within our framework. The application model is analyzed using a topological sort, and for each pair of consecutive tasks (source and destination) executing on a single FPGA, a pseudo task is introduced into the application model. Each pseudo task is automatically associated with a set of alternatives, and design constraints are introduced to ensure that the correct reconfiguration is chosen based on the alternatives selected for the source and destination tasks.
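Step 5 can be sketched as follows; the task names, FPGA identifiers, and pseudo-task naming are illustrative.

```python
# Sketch of inserting reconfiguration pseudo tasks: after a topological
# sort of the application graph, a pseudo "reconf" task is placed between
# every pair of consecutive tasks mapped to the same FPGA, so that DSE can
# charge the correct reconfiguration cost for the chosen alternatives.

def insert_pseudo_tasks(order, fpga_of):
    """order: task names in topological order; fpga_of: task -> FPGA id.
    Returns the task list with pseudo reconfiguration nodes inserted."""
    last_task_on = {}                 # FPGA id -> last task scheduled on it
    out = []
    for task in order:
        dev = fpga_of[task]
        if dev in last_task_on:       # consecutive tasks on the same FPGA
            out.append(f"reconf({last_task_on[dev]}->{task})")
        out.append(task)
        last_task_on[dev] = task
    return out

order = ["T1", "T2", "T3"]
fpga_of = {"T1": "fpga0", "T2": "fpga1", "T3": "fpga0"}
print(insert_pseudo_tasks(order, fpga_of))
# -> ['T1', 'T2', 'reconf(T1->T3)', 'T3']
```

In the framework itself, each inserted pseudo task would carry a set of alternatives (one per source/destination implementation pair) rather than a single fixed cost.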
Hierarchical DSE (6): This step uses DESERT and HiPerE to explore the design space using the application model. DESERT applies all the performance and design constraints and selects a set of designs that meet them. HiPerE evaluates the selected designs based on their performance estimates and allows the designer to choose the final design based on the given performance requirements. In the following, we discuss this step in detail.
[Figure: design flow. At the kernel level, kernels (e.g., matrix multiplication, FFT) together with algorithm and architecture descriptions pass through (1) Model Kernel Designs, (2) Parameter Estimation, and (3) Enumeration & Tradeoff Analysis to yield candidate kernel designs. At the application level, applications (e.g., beamforming, SDR, target tracking) pass through (4) Hierarchical Dataflow Modeling, (5) Model Reconfiguration, and (6) Hierarchical DSE (DESERT, HiPerE) to yield candidate application designs, using the design choices, reconfiguration costs, and performance estimates produced at the kernel level.]
Figure 6.13: Design flow using our framework
6.9 Design Reuse and Extensibility
The use of domain-specific modeling and the model-integrated computing approach allows us to create and reuse models across designs. The micro, macro, and library building blocks used to model a kernel design are stored using the model library supported by GME. Each of these models is associated with performance estimates (average power and area) for a target FPGA. Our design framework supports specification of a set of performance estimates, where each estimate corresponds to a unique target FPGA. Thus, when we model a design that uses building blocks for which models with performance estimates already exist, those models are reused, and the designer does not need to perform time-consuming low-level simulations. Such reuse is possible because:
• the micro and macro blocks are selected such that they are basic design compo
nents supported by the target FPGA independent of the functionality of the kernels
being designed
• it is common to use the same class of FPGA devices to design several different
applications
On the other hand, if the same application needs to be designed using a different
FPGA, the designer can reuse the complete application and kernel models. Only the
building blocks need to be reevaluated using low-level simulators to estimate their
performance for the new target FPGA. While such simulation is relatively
time-consuming, our design framework can completely automate the simulation and
performance estimation if a suitable low-level simulator for the new target FPGA is
integrated.
Alternatively, a designer can also create a library of widely used kernel designs.
For example, matrix multiply, FFT, and DFT are among the most widely used signal
processing kernels. A designer can therefore create domain-specific models for these
kernels and store them in the model library. These models can be reused for
application specification and subsequent design space exploration.
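The reuse scheme described in this section can be sketched as a library keyed by (building block, target FPGA), where only blocks without cached estimates need fresh low-level simulation. The class and method names are illustrative, not part of GME or MILAN.

```python
class BlockLibrary:
    """Minimal sketch of a reusable building-block library: each block
    stores (avg_power, area) estimates keyed by target FPGA, so a design
    retargeted to a new device only re-simulates blocks lacking data."""

    def __init__(self):
        self._est = {}  # (block, fpga) -> (avg_power_mW, area_slices)

    def store(self, block, fpga, avg_power, area):
        self._est[(block, fpga)] = (avg_power, area)

    def lookup(self, block, fpga):
        # None signals that a low-level simulation is still required.
        return self._est.get((block, fpga))

    def blocks_needing_simulation(self, blocks, fpga):
        return [b for b in blocks if (b, fpga) not in self._est]
```

Retargeting a design then amounts to calling blocks_needing_simulation with the new device name and simulating only the blocks it returns.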
6.9.1 Extending the Design Framework
The design framework supports two types of extension. The first involves integration
of additional simulators and design tools. Integrating a tool (or a simulator)
involves generating the required input from the models specified in the framework to
drive the tool and, once the tool generates output, analyzing the output to extract
the required information and storing it back in the models. To support tool
integration, GME enables creation of model interpreters through a set of
well-defined C++ APIs that provide bidirectional access to the models [29]. Thus the
model interpreters can extract the data required by the integrated tools from the
models and also update the models based on the data generated by the tools. For
example, integration of ModelSim would involve developing a model interpreter that
extracts the required inputs from the models, such as the high-level VHDL
implementation and the associated test bench, and invokes the compiler to compile
the design and the simulator to generate performance output. The interpreter then
analyzes the output to extract latency estimates and updates the model.
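The extract-run-parse-update cycle of a model interpreter can be sketched generically as below. The five callables and the latency log format are hypothetical placeholders; GME's real interpreters are C++ components built on its API [29].

```python
def run_interpreter(model, extract, run_tool, parse, update):
    """Generic shape of a model interpreter: pull the inputs a tool needs
    out of the model, invoke the tool, and feed its output back into the
    model. All arguments except `model` are caller-supplied callables."""
    tool_input = extract(model)        # e.g. VHDL source and test-bench paths
    raw_output = run_tool(tool_input)  # e.g. launch vcom/vsim, capture the log
    metrics = parse(raw_output)        # e.g. pull the latency figure from the log
    update(model, metrics)             # store the estimates back into the model
    return metrics
```

A ModelSim-style integration would then supply an extract function returning file paths, a run_tool wrapper around the simulator, and a parse function over its log.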
In addition to tool integration, the modeling paradigm associated with the design
framework can also be extended. Extending the modeling paradigm allows newer
building blocks to be added. For example, some current FPGAs provide embedded
high-performance DSP blocks [3], high-performance I/O such as RocketIO [96], or
Flash-based configuration memory [1]. These components can be added as micro blocks
in the modeling paradigm to support kernel design using such FPGAs. Extension of a
modeling paradigm may require modification of existing model interpreters.
Chapter 7
Illustrative Examples
In this chapter, we discuss several examples demonstrating our modeling, high-level
performance estimation, and design space exploration techniques. We also demonstrate
the use of MILAN in designing several applications.
7.1 Modeling
Modeling involves the use of the constructs provided by our framework to describe
the target applications and implementation alternatives for each application task,
the candidate devices for the heterogeneous embedded system and their capabilities,
the mapping of application tasks onto the target hardware, and the performance and
design constraints. In the following, we discuss several examples demonstrating
these capabilities of our framework.
7.1.1 Application Modeling
The application model shown in Figure 7.1 refers to the target tracking application
discussed in Chapter 2. The application is divided into 6 tasks. While 5 tasks,
dmean, comp_cvm, inverse, whiten, and velFilter, represent data processing
components within the application, the 6th task models image capture by the camera.
These tasks were identified based on the functional specification of the
application. Each task except the source task is associated with input ports that
are connected to the tasks that provide its input. Similarly, output ports are
specified for each task other than the sink task. The data flow is shown by
connecting the appropriate ports with each other. In addition, each task except the
source task is an alternative, as multiple implementations (described below) are
associated with these tasks.

[Figure omitted in transcript: MILAN screenshot of the top-level model of the target detection application, showing the tasks GenerateImage_ON_Camera, dmean, comp_cvm, inverse, whiten, and velFilter and their dataflow connections.]
Figure 7.1: Top level application model (tasks and dependencies)
For each application task, we model multiple alternatives. These alternatives refer
to the execution of the task on different devices operating in different states. In
Figure 7.2, we show the alternatives for the task whiten, which is associated with
implementation alternatives for 4 devices: the Virtex-II Pro FPGA, the TI C67 series
DSP, the Intel PXA 255, and the IBM PowerPC 405. For any design, only one option is
chosen, but the availability of multiple options models the complete design space.
Although not shown in Figure 7.2, we have modeled choices for all five tasks.
However, some tasks have specific requirements or constraints, such as floating-point
computation or large area if implemented on an FPGA, because of which not all tasks
can be mapped onto each candidate device.
[Figure omitted in transcript: MILAN screenshot showing the alternatives modeled for the task whiten: whiten_ON_VirtexIIPro, whiten_ON_TIC67, whiten_ON_IntelPXA250, and whiten_ON_PowerPC405, with ports dmeanIN, cvmIN, and whitenOUT.]
Figure 7.2: Modeling implementation alternatives for a task
7.1.2 Resource Modeling
The MILAN resource models define the hardware components available for application
implementation. The primary motivation of the resource model is to capture the
various architectural capabilities that can be exploited during design space
exploration, and to drive a set of widely used energy and latency simulators from a
single model. The resource model, along with the application model, captures the
various mapping possibilities of the target system being designed. Figure 7.3
illustrates a resource model describing 5 processing devices, a memory, and a
camera. Individual components are connected to each other via interconnects. Our
framework supports a hierarchical view of the resource model; therefore, additional
details of each component, while present, are not shown in Figure 7.3.
The target devices we consider for the model shown are the Virtex-II Pro, Actel
ProASICPLUS, Intel PXA 255, PowerPC 405, and TI C6711 DSP. For our design space
exploration, we are interested in three aspects of each device: the choice of
operating states, the state transition costs, and the idle energy dissipation per
state. Figure 7.4 shows the model for the Intel PXA 255 that captures these details.
In the top-level model (Figure 7.3) we show all the devices connected to a single
memory. However, this is a demonstration model; more than one memory and different
kinds of connections can be specified. Also, the model does not dictate the
structure of the final hardware. The model can be viewed as a test board where
different combinations of devices can be evaluated and the best combination
identified. We will discuss how we use constraints to capture the requirement that
only certain combinations of devices are allowed.
All the performance estimates we are interested in are captured in the model for the
operating states, as shown in Figure 7.4. The model is based on the idea of a finite
state machine (FSM), where the directional edges (denoted by dotted lines) capture
the performance cost of each transition. Energy dissipation while idling is
specified for each state. Each device is required to have a ShutDown state, denoting
its power-down state, and is also required to specify a default state.
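A minimal sketch of such an operating-state model is below, assuming microsecond and microjoule units as in Figure 7.4; the state names and power figures are placeholders, not measured values.

```python
class DeviceStates:
    """Operating-state FSM sketch: per-state idle power plus per-edge
    transition (time, energy) costs. A ShutDown state and a default state
    are required, mirroring the modeling rules in the text."""

    def __init__(self, default_state):
        self.default = default_state
        self.idle_power = {}   # state -> idle power (mW)
        self.trans = {}        # (src, dst) -> (time_us, energy_uJ)

    def add_state(self, name, idle_mw):
        self.idle_power[name] = idle_mw

    def add_transition(self, src, dst, time_us, energy_uj):
        self.trans[(src, dst)] = (time_us, energy_uj)

    def transition_cost(self, src, dst):
        if src == dst:
            return (0, 0)      # staying in a state costs nothing
        return self.trans[(src, dst)]

# Placeholder PXA 255-style model: one run state, one ShutDown state.
pxa = DeviceStates(default_state="run_398MHz")
pxa.add_state("run_398MHz", 411.0)
pxa.add_state("ShutDown", 0.0)
pxa.add_transition("run_398MHz", "ShutDown", 20, 5500)
```

The transition_cost lookup is exactly what a performance estimator charges when consecutive tasks leave a device in different states.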
7.1.3 Mapping and Constraint Specification
Once the application and the resources are modeled, we specify mappings. The mapping
captures the performance estimates for each valid combination of tasks and devices. The
tasks at the leaf level of the application model are mapped. Leaf level refers to tasks
without any children (choices).
[Figure omitted in transcript: MILAN screenshot of the top-level resource model, showing ProASIC, IntelPXA250, PowerPC405, VirtexIIPro, TIC67, a Camera, and MobileSDRAM connected via interconnects.]
Figure 7.3: Top level resource model (candidate devices)
There are two types of constraints specified in the model. Performance constraints
specify the latency and energy requirements of the design. Our choice of first-level
DSE tool, DESERT, does not support design space exploration based on duty-cycle
requirements. Hence we apply a two-level approach. We identify not-so-strict latency
and energy constraints to evaluate a single execution (for the application model
shown in Figure 7.1, starting with the camera and ending with the velocity filter)
of each possible design. Afterwards, we select designs using DESERT and evaluate the
selected designs against the duty-cycle requirement using HiPerE. It can be
guaranteed that the designs discarded by DESERT would not have satisfied the
duty-cycle requirement: we take advantage of the maximum end-to-end latency
requirement and specify it as a latency constraint. The designer can tighten or
loosen constraints to select a required number of designs.
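The two-level approach can be sketched as follows. The design representation, the units, and the simple idle-power duty-cycle model are assumptions for illustration; DESERT and HiPerE themselves are far more general than this.

```python
def two_level_dse(designs, max_latency_us, max_energy_uj,
                  execs_per_period, period_us, idle_power_w=0.001):
    """Two-level exploration sketch. Level 1 (DESERT's role): prune designs
    whose single-execution latency or energy exceeds the loose bounds.
    Level 2 (HiPerE's role): rank survivors by energy over one duty cycle,
    i.e. active energy for the required executions plus idle energy for
    the remainder of the period."""
    survivors = [d for d in designs
                 if d["latency_us"] <= max_latency_us
                 and d["energy_uj"] <= max_energy_uj]

    def duty_cycle_energy(d):
        busy_us = execs_per_period * d["latency_us"]
        idle_uj = idle_power_w * (period_us - busy_us)   # W * us == uJ
        return execs_per_period * d["energy_uj"] + idle_uj

    return sorted(survivors, key=duty_cycle_energy)
```

A design that passes the loose single-execution bounds can still rank poorly once idle energy over the period is charged, which is why the second level is needed.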
[Figure omitted in transcript: MILAN screenshot of the voltage-state model for the Intel PXA 255, showing operating states including a default state and ShutDown, state-transition edges, and attributes such as state transition time (20 microseconds) and state transition energy (5500 microjoules).]
Figure 7.4: Modeling dynamic voltage scaling for a device
The second class of constraints are combinational constraints. These constraints
ensure that only valid combinations are chosen by DESERT. For example, a valid
combination is a processor (PXA or PowerPC) together with the ProASIC. Figure 7.6
shows an example of such a constraint. In simple terms, it reads: if dmean is
implemented by the PXA 255, then no other task can be implemented by the PowerPC or
the TI C67, so the only choices left are the ProASIC and the Virtex. However,
another constraint (e.g., AllOnVirtex1/2/3/4/5 shown in Figure 7.5) ensures that
when the Virtex is chosen for a task, all other tasks must also be mapped onto the
Virtex. Therefore the only valid combination is the PXA 255 with the ProASIC. A
sample constraint specified in the Object Constraint Language (OCL) is shown in
Figure 7.6. Several such constraints are specified in the model.

[Figure omitted in transcript: MILAN screenshot of the constraint aspect, listing constraints such as IgnoreVariableRateAsTaskCVM, IgnoreVariableRateAsTaskInverse, ONPPCorPXAorC67_1 through _8, and AllOnVirtex1 through AllOnVirtex3.]
Figure 7.5: Specifying constraints
7.1.4 FPGA-based Kernel Modeling
FPGA-based kernel modeling consists of two steps: creating a library of components,
and using the library to create designs for the kernels. For a given FPGA, the basic
micro and macro building blocks are always available (Chapter 6). Using these
building blocks, larger design components such as library blocks can be created. In
Figure 7.7, we show a list of micro and macro building blocks. These blocks include
adders with 4-, 8-, and 16-bit precision, 4-bit 2/3/4-to-1 muxes, direct
interconnect, and a 4-bit counter, among others. Each of these building blocks is
associated
self.children("dmean").implementedBy() =
    self.children("dmean").children("dmean_ON_IntelPXA250")
implies
  (not ((self.children("comp_cvm").children("ComputeCVM").implementedBy() =
         self.children("comp_cvm").children("ComputeCVM").children("comp_cvm_ON_PowerPC405")) or
        (self.children("comp_cvm").children("ComputeCVM").implementedBy() =
         self.children("comp_cvm").children("ComputeCVM").children("comp_cvm_ON_TIC67")))
   and not ((self.children("inverse").children("ComputeInverse").implementedBy() =
             self.children("inverse").children("ComputeInverse").children("inverse_ON_PowerPC405")) or
            (self.children("inverse").children("ComputeInverse").implementedBy() =
             self.children("inverse").children("ComputeInverse").children("inverse_ON_TIC67")))
   and not ((self.children("whiten").implementedBy() =
             self.children("whiten").children("whiten_ON_PowerPC405")) or
            (self.children("whiten").implementedBy() =
             self.children("whiten").children("whiten_ON_TIC67")))
   and not ((self.children("velFilter").implementedBy() =
             self.children("velFilter").children("velFilter_ON_PowerPC405")) or
            (self.children("velFilter").implementedBy() =
             self.children("velFilter").children("velFilter_ON_TIC67"))))
Figure 7.6: Sample constraint
with performance estimates that include area and average power. Each micro and macro
building block is associated with only two power states, "ON" and "OFF". We
performed simulations using ModelSim and Xilinx XPower to estimate the performance
of each building block stored in the library. Once a library of building blocks is
available, it is used to define library blocks. Figure 7.7 shows the model of a
4-bit MAC that uses a 4-bit multiplier and an 8-bit adder connected using direct
interconnects (a Xilinx Virtex FPGA is assumed).
As the MAC integrates two building blocks that can be independently moved to power
states "ON" and "OFF", the MAC can be associated with more than two power states.
For our modeling purposes, we assumed 3 states: "full ON", "adder OFF and multiplier
ON", and "full OFF". The high-level performance estimator uses the model for the MAC
to estimate the area and average power dissipation for the three states.
Using the above blocks, a complete kernel design can be specified. Section 7.3
discusses the design of a matrix multiplication kernel specified using these
building blocks.
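How the estimator derives per-state figures for a composite block like the MAC can be sketched as follows; the area and power numbers are placeholders, and OFF-state power is simplified to zero.

```python
def composite_estimates(components, power_states):
    """Sketch of deriving a library block's per-state power and area from
    its building blocks.
    components   : dict block -> {'area': slices, 'power_on': mW}
                   (power in the OFF state is assumed 0, a simplification)
    power_states : dict state name -> set of blocks that are ON in it
    """
    area = sum(c["area"] for c in components.values())
    power = {state: sum(components[b]["power_on"] for b in on_blocks)
             for state, on_blocks in power_states.items()}
    return area, power

# Placeholder MAC: a 4-bit multiplier plus an 8-bit adder, three states
# matching the text: full ON, adder OFF / multiplier ON, full OFF.
mac_blocks = {"mult4": {"area": 40, "power_on": 6.0},
              "add8":  {"area": 16, "power_on": 2.5}}
mac_states = {"full_on":   {"mult4", "add8"},
              "adder_off": {"mult4"},
              "full_off":  set()}
```

Area simply adds, while power is summed only over the blocks that are ON in each composite state.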
[Figure omitted in transcript: MILAN screenshots of the FPGA design library (Add4, Add8, Add16, Cntr4, Direct, GenericAdder, MAC4, Memory19x16, Multiplier4, Multiplier8, Multiplier16, Mux2to1x4, Mux3to1x4, Mux4to1x4) and of the MAC4 model composed of a multiplier and an Add8 connected by direct interconnects.]
Figure 7.7: Library of building blocks and FPGA design
7.2 Integration of Tools
In our framework, tools are integrated through model interpreters. The model
interpreters are effectively translators that map the design models to executable
models or configuration files that are, in turn, used by the different simulators.
Model interpreters traverse the application and resource models and generate the
information necessary to drive the individual simulators. In the following, we
discuss an example integration of SimpleScalar [83] into our framework.
7.2.1 Integrating SimpleScalar
SimpleScalar is a cycle-accurate simulator for the MIPS architecture [83].
Simulation using SimpleScalar requires two inputs: our framework must provide the
source code in C and the configuration for SimpleScalar. We focus here on the model
interpreter that provides the configuration information. The generated configuration
can be provided as input to SimpleScalar to simulate the target processor.
SimpleScalar supports a number of low-level constructs to configure the internal
simulation engine, such as the cache configuration, cache hit/miss latency, TLB
configuration, and depth of the pipeline, among others. Typically, a designer knows
the exact configuration that needs to be specified to emulate the target processor;
otherwise, a default configuration is supported. Using our framework, a designer can
specify the configuration or choose the default option. Figure 7.8 shows the
invocation screen for the SimpleScalar model interpreter. Using this screen, a
designer can specify non-architectural parameters such as the simulator input/output
files, the random number generator seed, etc.
Additionally, we have developed a model interpreter that modifies the model based on
the output of SimpleScalar. SimpleScalar generates a number of performance metrics
such as execution time, number of cycles, cache hits/misses, TLB hits/misses, and
memory accesses, among others. Based on the designer's needs, these metrics can be
selectively filtered and stored back in the models. For example, the execution time
for a specific task mapped onto a processor modeled by SimpleScalar can be stored in
the model for use during design space exploration. Similarly, we have integrated
SimplePower and PowerAnalyzer as simulators to estimate energy dissipation.
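The feedback interpreter's filtering step can be sketched as follows; the "name value" line format is a simplification of SimpleScalar's statistics output, and the selection of metric names is illustrative.

```python
import re

def filter_metrics(sim_output, wanted):
    """Scan a simulator's textual output for 'name value' statistic lines
    and keep only the metrics the designer asked for. The line grammar
    here is a simplified stand-in for the real statistics dump."""
    metrics = {}
    for line in sim_output.splitlines():
        m = re.match(r"\s*(\w+)\s+([\d.]+)", line)
        if m and m.group(1) in wanted:
            metrics[m.group(1)] = float(m.group(2))
    return metrics
```

The resulting dictionary is what gets written back into the task's model for later use by the design space exploration step.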
7.2.2 Integrating A Dynamic Programming based DSE Tool
[Figure omitted in transcript: MILAN dialog for invoking the SimpleScalar model interpreter, with fields for the simulator version, configuration file, debug and verbosity flags, random number seed, simulator output files, scheduling priority, maximum instructions to execute, instructions to skip before timing starts, and an EIO trace to restore.]
Figure 7.8: Invoking SimpleScalar model interpreter using our framework

The dynamic programming based DSE tool solves the following problem: given a linear
array of tasks and a set of alternatives (FPGA-based implementations) for each task,
select an implementation for each task such that the overall latency or energy
dissipation is minimized. The problem is non-trivial because it is assumed that the
FPGA can execute only one task at a time and hence needs to be reconfigured between
task executions, and the cost to reconfigure depends on the choice of implementation
for each task. Dynamic programming based solutions for latency and energy
optimization have been discussed in [12] and [62], respectively. Based on these
solutions, we developed a C++ tool that solves the above design problem for latency
and energy.
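The recurrence behind such a tool can be sketched as follows, after the formulations in [12] and [62], though the cost-table layout and function interface here are assumptions: the best cost of the first i tasks ending in implementation j is the task's own cost plus the cheapest predecessor cost including reconfiguration.

```python
def min_cost_linear_array(exec_cost, reconfig_cost):
    """Dynamic-programming sketch of the linear-array problem: pick one
    implementation per task so that total execution plus reconfiguration
    cost (latency or energy, depending on what the tables hold) is minimal.

    exec_cost[i][j]       : cost of task i under implementation j
    reconfig_cost[i][k][j]: cost of reconfiguring from task i's impl k to
                            task i+1's impl j
    Runs in O(n * m^2) for n tasks with m implementations each.
    """
    n = len(exec_cost)
    dp = list(exec_cost[0])   # best cost of task 0 ending in impl j
    for i in range(1, n):
        dp = [exec_cost[i][j] +
              min(dp[k] + reconfig_cost[i - 1][k][j] for k in range(len(dp)))
              for j in range(len(exec_cost[i]))]
    return min(dp)
```

With two tasks and two implementations each, where switching implementations costs 10 and staying costs 0, the optimum keeps a consistent implementation index across both tasks.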
The first step of tool integration is to ensure that we can generate the input
required by the tool from the models specified in the design framework. A linear
array of tasks is a special case of dataflow modeling, so our application modeling
technique can be used to specify it. The implementation alternatives can be modeled
as application alternatives, and the reconfiguration cost can be modeled using the
technique discussed in Chapter 6. For each pair of consecutive tasks, design
constraints associate the correct reconfiguration cost with the task implementations
selected. In addition, each leaf node in our model is associated with performance
estimates; thus the costs of task execution and reconfiguration can be obtained from
the models. Therefore, the input required by the dynamic programming based tool can
be extracted from the application model described in our design framework.
Figure 7.9 shows a model of a 3-stage linear array of tasks with 2 pseudo tasks to
model reconfiguration.

[Figure omitted in transcript: MILAN screenshots of a linear array of tasks (Task1, Pseudo12, Task2, ...) modeled as a dataflow, and of the alternatives within Task1.]
Figure 7.9: Modeling linear array of tasks and invoking dynamic programming based tool
7.2.3 Integrating XFLOW, ModelSim, and XPower
XFLOW is a command-line tool that automates the Xilinx implementation and simulation
flow [93]. Such automation allows us to automatically simulate the micro, macro, and
library blocks (see Chapter 5) and estimate their performance if high-level
implementation scripts and input stimuli (e.g., in VHDL) are specified by the
designer. We have developed a model interpreter that allows the designer to specify
and combine flow types. The model interpreter automatically verifies whether the
flow type is supported by the target device and generates the scripts required to
execute XFLOW. The generated output is parsed and fed back into the models to store
performance estimates.
Similarly, we have integrated ModelSim and XPower into the design framework. Using
the command-line tools vcom and vsim (provided by ModelSim), we can automatically
compile and simulate FPGA-based designs and estimate latency. As the implementation
and simulation files are already provided via MILAN, we only generate the scripts
required to invoke vcom and vsim. The latency information is extracted from the
output, provided the design compiles and simulates without error. The model
interpreter also supports setting appropriate command-line options for vcom and
vsim.
XPower needs the following files: NCD files created by par or map, PCF files
produced by map, VCD files produced by a simulator (in our case ModelSim), and an
XML settings file. We use XFLOW and ModelSim to generate the necessary files prior
to invoking XPower from the command line. Using the model interpreter, the designer
can also set appropriate command-line options. The performance estimate obtained via
XPower simulation is likewise fed back into the model for use during design space
exploration.
7.3 Energy-Efficient Designs of Matrix Multiplication
Algorithm
[Figure omitted in transcript: (a) the linear array architecture with AS, BS, BF, C, and ACT data streams flowing through the PEs; (b) the organization of a PE with shift registers, multiplexers, and a MAC; (c) the per-PE matrix multiplication algorithm, which shifts data through the shift registers, reads data into the input registers, accumulates C[i] = C[i] + AS.LR * BF.T when ACT[i] = 1, and steers the multiplexers according to the control bit.]
Figure 7.10: Architecture and algorithm for Matrix Multiplication (Prasanna and Tsai, 1991)
A matrix multiplication algorithm for linear array architectures is proposed
in [68]. We use this algorithm to demonstrate the modeling, high-level performance
estimation, and performance tradeoff analysis capabilities of the design framework;
thus it uses only Steps 1, 2, and 3 of the design flow. The focus is to generate a
set of energy-efficient designs for matrix multiply on the Xilinx Virtex-II Pro.
In Step 1, the architecture and the algorithm were analyzed to define the
domain-specific model. The building blocks identified were registers, multiplexers,
a multiplier, an adder, the processing element (PE), and the interconnects between
the PEs. Among these building blocks, only the PE is a library block; the rest are
micro blocks. Component-specific parameters for the PE include the number of
registers (s) and the power states ON and OFF, where ON refers to the state in which
the multiplier (within the PE) is ON and OFF refers to the state in which the
multiplier is OFF. Additionally, for the complete kernel design, the number of
PEs (pe) is also a parameter. For N x N matrix multiplication, the range of values
for s is 1 <= s <= N and for pe it is 1 <= pe <= N⌈N/s⌉. For matrix multiplication
with larger matrices (large values of N) it is not possible to synthesize the
required number of PEs due to the area constraint. In such cases, block matrix
multiplication is used; therefore, the block size (bs) is also a parameter.
Once the data path was modeled, we generated the cost functions for power and area
for the different components. Switching activity was the only parameter for the
power functions. To define the CPS matrices, we analyzed the algorithm to identify
the operating state of each component in different cycles (Figure 7.10). As per the
algorithm [68], in each PE the multiplier is in the ON state for T/⌈n/s⌉ cycles and
in the OFF state for T × (1 − 1/⌈n/s⌉) cycles. All other components are active for
the complete duration.
[Figure omitted in transcript: plot of normalized latency, energy, and area of the matrix multiplication design against block size.]
Figure 7.11: Analysis of Matrix Multiplication algorithm
In Step 2, we performed simulations to estimate the power dissipated and the area
occupied by the building blocks. The latency (T) of this design using N⌈N/s⌉ PEs
and s storage per PE [68] is T = N² + 2N⌈N/s⌉ − ⌈N/s⌉ + 1. Using the latency
function, the component-specific power functions, and the CPS matrices, we derived
the system-wide energy function.
Finally, we analyzed the model to identify a set of designs that provide a tradeoff
between different performance metrics. Figure 7.11 shows the variation of energy,
latency, and area for different block sizes for 16 x 16 matrix multiplication. It
can be observed that energy is minimum at a block size of 4, while area and latency
are minimum at block sizes 2 and 16, respectively. This information is used to
identify a suitable design (block size) based on latency, energy, or area
requirements.
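The latency side of this tradeoff can be reproduced directly from the latency function above; the blocked-multiply count of (N/bs)³ and the omission of block load/store overhead are simplifying assumptions of this sketch.

```python
from math import ceil

def latency_cycles(n, s):
    """Latency of the n x n matrix-multiply design from the text, using
    N*ceil(N/s) PEs with s words of storage per PE:
        T = N^2 + 2N*ceil(N/s) - ceil(N/s) + 1
    """
    c = ceil(n / s)
    return n * n + 2 * n * c - c + 1

def blocked_latency(n, bs, s):
    """Total cycles for an n x n multiply done as (n/bs)^3 block multiplies
    of size bs (ignores block load/store overhead, a simplification)."""
    assert n % bs == 0
    return (n // bs) ** 3 * latency_cycles(bs, s)
```

Sweeping bs over {2, 4, 8, 16} for n = 16 and combining these cycle counts with the per-state power model yields curves of the kind plotted in Figure 7.11.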
7.4 Energy-Efficient Mapping for a Beamforming
Application
[Figure omitted in transcript: the beamforming application pipeline, in which the radio receives data and performs analog-to-digital transformation, the microcontroller performs sampling, and the processor runs the three-stage beamforming algorithm; the figure is annotated with the PXA 255 operating voltages and transition costs, most of which are negligible, while the transition involving 398 MHz costs roughly 10000 microseconds and 1100 microjoules.]
Figure 7.12: Beamforming application and frequency transition costs for PXA 255
The design problem is to identify an energy-efficient mapping of an automated target
recognition (ATR) application onto a heterogeneous embedded system while meeting a
given latency constraint [67]. The underlying architecture, from the PASTA project,
is already specified; hence our approach is used to identify an energy-efficient
mapping of the ATR application onto the PASTA hardware.
The hardware includes sensor(s), a processor, several microcontrollers (Cygnal
8051), memories, and a radio. Each component can be independently turned on or off.
In addition, the processor (Intel PXA 255) supports voltage and frequency
scaling [35].
The target application is an automated target recognition algorithm that performs
beamforming based on acoustic signals from the sensors [67]. The beamforming
application consists of a linear array of 6 tasks (Figure 7.12). The first three
tasks are receive data, mapped onto the radio; sampling, mapped onto the
microcontroller; and false-alarm detection, which can be mapped onto either the
microcontroller or the processor. The last three tasks, which compute the
beamforming, are FFT, peak-pick, and delay-sum, all mapped onto the processor.
The design problem for PASTA involves identifying the operating state of each
component for each task such that the complete ATR application dissipates the
minimum energy while satisfying the latency requirement. All the components in the
PASTA stack have at least two operating states, ON and OFF, and a constant amount of
time and energy is spent to switch each component on or off. In addition, the
processor has 6 different operating frequencies. Tasks can be mapped onto the
processor or the microcontroller. When mapped onto the processor, a task executes at
a certain operating frequency, and the performance of the mapping depends on that
frequency. Transitions between any two operating frequencies also involve time and
energy costs, which depend on the source and destination states. The various
operating frequencies supported by the processor include 99.5, 199, 298, and
398 MHz. We modeled all of the above in MILAN. The resulting design space contained
approximately 500,000 designs. However, we noticed that the transition costs between
different operating states of the processor are negligible except for one
transition, which involves changing the operating frequency of the bus (transitions
to or from 398 MHz). Ignoring the negligible transition costs reduced the design
space to 320 designs. As with the other examples, we used simulators for the Intel
PXA 255 and the microcontroller to estimate the performance of all the mappings. The
start-up costs and state transition costs were estimated based on the data sheets
provided by the vendors [35].
Table 7.1: Performance estimates of the tasks at different operating frequencies
(latency in μs, energy in μJ)

Task        Metric     99.5 MHz   199 MHz   298 MHz   398 MHz
FFT         Latency      431958    215870    143937    107962
            Energy       107989     90665     64772     68556
peak-pick   Latency        7933      3964      2643      1982
            Energy         1983      1665      1189      1259
delay-sum   Latency      231956    115919     77292     57974
            Energy        57989     46868     34782     36813
FAD         Latency       43200     21600     14400     10800
            Energy         8212      9261      8302     10556
Table 7.2: N-optimal results using our methodology

              Des. 1   Des. 2   Des. 3   Des. 4   Des. 5   Des. 6   Des. 7   Des. 8
Energy (μJ)   222422   224692   226653   226723   227306   227376   229407   231537
Latency (μs)  893632   912971   894314   893653   867657   866996   847678   868339
Figure 7.12 shows the details of the application design problem. Table 7.1 provides
the latency (in μs) and energy dissipation (in μJ) of each task at the different frequencies.
Using MILAN, we modeled the hardware and the application. The latency constraint
was specified as < 200 milliseconds for data processing after receiving data
over the radio (the radio takes 655 milliseconds on average to receive data). We applied
hierarchical design space exploration to identify an energy-efficient design that meets
the input latency constraint. As the application can be modeled as a linear array of
tasks, we chose to apply the dynamic programming based N-optimization heuristic as
the first step (N = 8). The overall size of the design space is approximately 320. Even
though the design space is not very large, we use this experiment to show that we only
need to evaluate 8 designs (in the second step of our methodology) to identify the most
energy-efficient design that meets the given latency constraint (the results shown in Table
7.2 include the cost of receiving data). The most energy-efficient solution
that meets the given latency constraint is Des. 7. In contrast,
an optimization heuristic that simply identifies the most energy-efficient solution fails to meet
the latency constraint, as the energy-optimal solution (all tasks executed at 298 MHz) has
a latency of 238 milliseconds.
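The per-task numbers in Table 7.1 are sufficient to reproduce the flavor of this search. The sketch below (an illustration, not the N-optimization heuristic itself) exhaustively enumerates one processor frequency per task and selects the minimum-energy assignment whose compute latency stays within the 200 ms budget. It deliberately ignores the start-up, state-transition, and radio-reception costs that MILAN models, so its totals differ from Table 7.2; the function and variable names are ours, not MILAN's.

```python
from itertools import product

# Latency (us) and energy (uJ) per task at each PXA 255 frequency (MHz),
# taken directly from Table 7.1.
FREQS = [99.5, 199, 298, 398]
LATENCY = {
    "FFT":       [431958, 215870, 143937, 107962],
    "peak-pick": [7933,   3964,   2643,   1982],
    "delay-sum": [231956, 115919, 77292,  57974],
    "FAD":       [43200,  21600,  14400,  10800],
}
ENERGY = {
    "FFT":       [107989, 90665, 64772, 68556],
    "peak-pick": [1983,   1665,  1189,  1259],
    "delay-sum": [57989,  46868, 34782, 36813],
    "FAD":       [8212,   9261,  8302,  10556],
}
TASKS = list(LATENCY)
BUDGET_US = 200_000  # 200 ms latency constraint on data processing

def evaluate(assignment):
    """Total (energy, latency) for one frequency index per task."""
    en = sum(ENERGY[t][i] for t, i in zip(TASKS, assignment))
    lat = sum(LATENCY[t][i] for t, i in zip(TASKS, assignment))
    return en, lat

def best_feasible():
    """Minimum-energy assignment that meets the latency budget."""
    best = None
    for assignment in product(range(len(FREQS)), repeat=len(TASKS)):
        en, lat = evaluate(assignment)
        if lat <= BUDGET_US and (best is None or en < best[0]):
            best = (en, lat, assignment)
    return best

if __name__ == "__main__":
    # All tasks at 298 MHz: lowest-energy point, but 238 ms violates the budget.
    print("all @298 MHz:", evaluate((2, 2, 2, 2)))
    print("best feasible:", best_feasible())
```

Consistent with the text, running all four tasks at 298 MHz yields 238272 μs (about 238 ms) and therefore violates the constraint; under these simplifications the cheapest feasible point runs FFT and delay-sum at 398 MHz and peak-pick and FAD at 298 MHz.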
Chapter 8
Conclusions and Future Directions
We proposed a hierarchical design space exploration methodology that integrates pruning
heuristics, a high-level estimation tool, and low-level simulators to perform efficient
design space exploration during low-power, high-performance signal processing application
design using heterogeneous embedded systems. We also addressed the challenges
associated with designing a heterogeneous embedded system through the evaluation of
COTS (commercial-off-the-shelf) hardware components when an application specification
and performance requirement are provided. In addition, we discussed energy- and
latency-efficient application design based on duty-cycle and multi-rate specifications.
Using the MILAN (Model-based Integrated Simulation) framework, we demonstrated
the use of our methodology through energy- and latency-efficient design of several signal
processing applications. The key advantages of our framework are a unified modeling
environment and the ability to seamlessly integrate tools and simulators. Additionally,
as our framework can potentially integrate any available tool for simulation, compilation,
and design space exploration, it is possible to extend the framework based on the
target devices and create a customized design environment. Therefore, our framework
has an advantage when there is a need for a specialized high-performance and low-power
implementation using a variety of COTS components.

Our methodology and the proposed framework are best suited to the early
stages of the embedded system design process, when the target hardware, algorithms
for the application tasks, mapping, and task schedule are yet to be finalized. During
the early stages, the design space tends to be extremely large, and thus an efficient design
space exploration technique is useful. In addition, during the early stages, application
specifications are constantly being modified and candidate devices are being added or
removed. Our methodology allows easy adaptation to such modifications. However,
during the later stages of the design process, when the hardware, algorithm, mapping,
and task schedule are already finalized, use of specific compilers and simulators
is more efficient. While our framework can integrate these compilers and simulators,
the framework's contribution to design decisions will be limited, as a designer will
mostly be performing local optimizations for which compilers and simulators are more
appropriate.
The framework relies on the performance estimates associated with the various components,
such as task mappings, state transitions, and building blocks for FPGA-based kernel
modeling, to evaluate the overall performance of the designs and perform design
space exploration. Therefore, the accuracy of the result depends on the accuracy of
the estimates. However, the framework also allows integration of low-level simulators
which, while time consuming, can generate accurate performance estimates. Thus the
framework provides a tradeoff between design time and accuracy of result.

In general, several simulators exist for COTS components. While the framework
allows integration of tools provided by commercial vendors, it is easiest to integrate tools
with a well-defined command-line interface. Tools with only a graphical interface cannot
be easily integrated. However, it is possible to develop a model interpreter that generates
the required input for such a tool, which is then invoked outside the design framework, with the
relevant result manually added back to the model. In a similar spirit, several
heuristics in addition to DESERT are potential candidates for integration; genetic
algorithms and simulated annealing are two such heuristics. However, the
model interpreter must be intelligent enough to perform not only a syntactic translation
but also a semantic mapping from the models to the generic tools implementing genetic
algorithms or simulated annealing. Therefore, while it is possible to integrate simulators
and heuristics into the framework, each integration requires unique translation logic
when writing the model interpreters.
In addition to tool integration, the models discussed in this thesis can also be easily
extended to capture additional capabilities that can be exploited during design space
exploration. While GME supports a well-defined process for modifying the metamodel,
which in turn modifies the modeling environment, a designer needs to
determine whether the model interpreters need modification as well. Therefore, caution must
be exercised while modifying the modeling semantics.
Another core strength of our proposed methodology and the corresponding framework
is support for reuse. The models of the application, target hardware, task mapping,
and FPGA-based kernel designs can be stored as a library of models and reused in other
designs. Once integrated, all the simulators, performance estimators, and heuristic-based
design space exploration tools are also reused. Thus the effort invested in developing
the metamodel, models, and model interpreters is compensated through reuse. In
addition, the models can potentially be expanded continually to support new capabilities,
and a large number of tools can be integrated into our framework without affecting the
existing capabilities. Therefore, our proposed framework can act as a repository
of design capabilities. Depending on the target domain, our design methodology allows
selective use of the models and tools suitable for that domain.
8.1 Future Directions
8.1.1 Evolutionary Algorithms
Evolutionary algorithms such as genetic algorithms, along with related stochastic searches
such as simulated annealing, are well-suited candidates for design space pruning. A genetic
algorithm works on a set of designs and applies selection, crossover, and mutation to generate
a new set of designs with better performance. Therefore, once the required number
of iterations is over, instead of selecting only the best design from the final set, we can
consider the complete final set of designs and apply hierarchical simulation. Similarly,
simulated annealing can be adapted to consider a number of solutions in each iteration.
Both genetic algorithms and simulated annealing use high-level models and therefore
are susceptible to estimation error. The use of a hierarchical design space
exploration technique helps overcome the error due to high-level modeling.
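As a concrete illustration, the sketch below is a minimal genetic algorithm over per-task frequency assignments. The fitness function is a toy surrogate, not a MILAN estimator, and all names and parameter values are our own assumptions; the point is only that the evolve step returns the entire final population, which can then be handed to hierarchical simulation rather than trusting the single best individual.

```python
import random

N_TASKS, N_FREQS = 4, 4          # toy design space: one frequency index per task
POP, GENS, MUT_RATE = 16, 30, 0.2

def fitness(design):
    """Toy surrogate for a high-level energy estimate (lower is better)."""
    return sum((f - 2) ** 2 + f for f in design)

def crossover(a, b):
    """Single-point crossover of two parent designs."""
    cut = random.randrange(1, N_TASKS)
    return a[:cut] + b[cut:]

def mutate(design):
    """Randomly re-draw each gene with probability MUT_RATE."""
    return tuple(random.randrange(N_FREQS) if random.random() < MUT_RATE else f
                 for f in design)

def evolve(seed=0):
    """Run the GA and return the whole final population, best first."""
    random.seed(seed)
    pop = [tuple(random.randrange(N_FREQS) for _ in range(N_TASKS))
           for _ in range(POP)]
    for _ in range(GENS):
        pop.sort(key=fitness)
        elite = pop[: POP // 4]   # elitism: keep the best quarter unchanged
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(POP - len(elite))]
        pop = elite + children
    pop.sort(key=fitness)
    return pop                    # evaluate all of these, not just pop[0]

final_population = evolve()
```

In the hierarchical flow, every design in final_population would be passed to the second, simulation-based step, which filters out candidates whose high-level estimates were optimistic.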
8.1.2 Extending the Application Model
We have used the data flow model for application specification. This model is well-suited
for signal processing applications. However, our methodology and the corresponding
framework can be easily extended to additional domains by supporting other
application models. The Kahn process network is one such model, which is well-suited
for modeling communication semantics such as shared-memory communication. Supporting
Kahn process networks will allow us to accurately model the communication
between the different components of a heterogeneous embedded system. Other application
models [42, 69], such as asynchronous data flow graphs, discrete events, and finite
state machines, can also be considered.
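The defining property of a Kahn process network is that processes communicate only over FIFO channels with blocking reads, which makes the computed streams deterministic regardless of scheduling. The hypothetical three-stage pipeline below (source, scale, sink; all names are illustrative) sketches these semantics with threads and queues.

```python
import threading
import queue

def source(out_ch, data):
    # A KPN process may only write to its output channels.
    for x in data:
        out_ch.put(x)
    out_ch.put(None)                 # end-of-stream marker

def scale(in_ch, out_ch, k):
    # Blocking read: get() waits until a token arrives on the channel.
    while (x := in_ch.get()) is not None:
        out_ch.put(k * x)
    out_ch.put(None)

def sink(in_ch, results):
    while (x := in_ch.get()) is not None:
        results.append(x)

def run_kpn(data, k=2):
    """Wire source -> scale -> sink with FIFO channels and run to completion."""
    c1, c2 = queue.Queue(), queue.Queue()   # unbounded FIFO channels
    results = []
    procs = [threading.Thread(target=source, args=(c1, data)),
             threading.Thread(target=scale, args=(c1, c2, k)),
             threading.Thread(target=sink, args=(c2, results))]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return results

print(run_kpn([1, 2, 3]))  # deterministic regardless of scheduling: [2, 4, 6]
```

In a design framework, each process would model a component (radio, microcontroller, processor) and each channel a concrete communication resource such as a shared-memory buffer.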
8.1.3 Platform FPGA
The Platform FPGA is an emerging heterogeneous embedded system that integrates general
purpose processors within the FPGA fabric [96, 97]. While our approach supports
Platform FPGAs as candidate target hardware, because the processor is embedded within the
FPGA fabric it is possible to model the communication between the
processor and the FPGA more accurately based on the information provided in data sheets. Therefore,
the current models can be extended to support Platform FPGAs more accurately. In addition,
there exist several execution models for Platform FPGAs that specify the interaction
between the processor and the FPGA. The FPGA can act as an accelerator with the processor
as the master, or the processor can act as a decelerator [37] with the FPGA as the
master. In addition, the FPGA and the processor can execute concurrently, or one may
need to wait for the other to complete. Such choices should be considered while developing
models and design space exploration techniques suitable for Platform FPGAs.
8.1.4 Design Metrics such as Cost, Size, and Weight
We have focused on three performance metrics: area, latency, and energy. However,
additional metrics such as cost, dimensions, and weight are also critical while
designing an embedded system. HiPerE can be easily extended to support these
metrics with very little modification to the existing models. However, with additional
metrics we need rules for identifying an optimal design over such a large
number of metrics. Such rules will also differ depending on the target domain and the
design requirements. Therefore, a technique needs to be developed that allows specification
of such rules so that our methodology can transparently operate on them.
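One candidate rule, sketched below under the assumption that every metric is to be minimized, is Pareto dominance: a design is kept only if no other design is at least as good on all metrics and strictly better on at least one. Domain-specific rules would then rank the surviving front; the function names and the sample (cost, size, weight) triples are purely illustrative.

```python
def dominates(a, b):
    """True if design a is at least as good as design b on every metric
    (all metrics minimized) and strictly better on at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_front(designs):
    """Keep only the non-dominated designs."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs if other != d)]

# Hypothetical (cost $, size cm^3, weight g) triples for four candidate designs.
candidates = [(10, 5, 40), (8, 7, 35), (12, 4, 50), (9, 8, 60)]
front = pareto_front(candidates)  # (9, 8, 60) is dominated by (8, 7, 35)
```

A metric that should be maximized can be folded into this rule by negating it, and a domain-specific tie-breaking rule (for example, a weighted sum) can then select one design from the front.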
8.1.5 Integration of Battery Models
A number of comprehensive models exist to analyze the power dissipation characteristics
of batteries while maximizing battery life [40, 74]. While our methodology supports
evaluation of a design over a length of time, inclusion of a battery model will help
identify optimizations that maximize battery life. Therefore, integration of a
battery model will provide a comprehensive framework for energy-efficient application
design. Several research results are also available that focus on techniques to maximize
battery life [40]. These techniques can be incorporated into our framework to guide energy
optimization.
Reference List
[1] Actel ProASIC PLUS - The Nonvolatile Reprogrammable Gate Array. http://www.actel.com/products/proasic/.

[2] R. Albright. Saving Power To Put Wi-Fi In Handsets. In Wireless Week, Aug. 5, 2002.

[3] Altera Stratix, Stratix-II, Hardcopy. http://www.altera.com/products/devices/.

[4] Amplify FPGA Physical Optimizer. http://www.synplicity.com/products/amplify/.

[5] Atmel's AT94K and AT94S family of Field Programmable System Level Integrated Circuits. http://www.atmel.com/products/FPSLIC/.

[6] M. Auguin, L. Capella, F. Cuesta, and E. Gresset. CODEF: A System Level Design Space Exploration Tool. In Proc. of Intl. Conf. on Acoustics, Speech, and Signal Processing, 2001.

[7] A. Baghdadi, N. Zergainoh, W. Cesario, T. Roudier, and A. Jerraya. Design Space Exploration for Hardware/Software Codesign of Multiprocessor Systems. In Proc. of Workshop on Rapid System Prototyping, 2000.

[8] F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara. Hardware-Software Codesign of Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, Dordrecht, The Netherlands, 1997.

[9] K. Bazargan, R. Kastner, S. Ogrenci, and M. Sarrafzadeh. A C to Hardware/Software Compiler. In Proc. of Field-Programmable Custom Computing Machines, 2000.

[10] J. Becker, A. Thomas, M. Vorbach, and V. Baumgarte. An Industrial/Academic Configurable System-on-Chip Project (CSoC): Coarse-grain XPP-/Leon-based Architecture Integration. In Proc. of Design, Automation and Test in Europe, 2003.
[11] L. Benini, A. Bogliolo, and G. Micheli. A Survey of Design Techniques for System-level Dynamic Power Management. IEEE Trans. on VLSI Systems, 8(3), 2000.

[12] K. Bondalapati and V. K. Prasanna. Loop Pipelining and Optimization for Reconfigurable Architectures. In Proc. of Reconfigurable Architectures Workshop, 2000.

[13] B. Brock and K. Rajamani. Dynamic Power Management for Embedded Systems. In Proc. of IEEE SoC Conference, 2003.

[14] Cadence Incisive Verification Platform and Signal Processing Workbench. http://www.cadence.com/products/incisive.html.

[15] L. Cai, M. Olivarez, P. Kritzinger, and D. Gajski. C/C++ Based System Design Flow Using SpecC, VCC, and SystemC. Technical Report 02-30, UC Irvine, June 2002.

[16] R. J. Calantone and C. A. Di Benedetto. Performance and Time to Market: Accelerating Cycle Time with Overlapping Stages. IEEE Trans. on Engineering Management, 47(2):232-244, 2000.

[17] G. Carpenter. Low Power SOC for IBM's PowerPC Information Appliance Platform. In Proc. of Microprocessor Forum, 2001.

[18] L. N. Chakrapani, P. Korkmaz, V. J. Mooney, K. V. Palem, K. Puttaswamy, and W. F. Wong. The Emerging Power Crisis in Embedded Processors: What Can a Poor Compiler Do? In Proc. of Compilers, Architectures and Synthesis for Embedded Systems, pages 176-180, 2001.

[19] S. Choi, G. Govindu, J. Jang, and V. K. Prasanna. Energy-efficient and Parameterized Designs for Fast Fourier Transform on FPGAs. In Proc. of International Conference on Acoustics, Speech, and Signal Processing, 2003.

[20] S. Choi, J. Jang, S. Mohanty, and V. K. Prasanna. Domain-Specific Modeling for Rapid System-Wide Energy Estimation of Reconfigurable Architectures. The Journal of Supercomputing, 26(3):259-261, November 2002.

[21] S. Choi, R. Scrofano, V. K. Prasanna, and J. Jang. Energy-Efficient Signal Processing Using FPGAs. In Proc. of International Symposium on Field Programmable Gate Arrays, pages 225-234, 2003.

[22] P. Chou, G. Borriello, R. Ortega, K. Hines, and K. Partridge. IPChinook: An Integrated IP-Based Design Framework for Distributed Embedded Systems. In Proc. of Design Automation Conference, 1999.
[23] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press and McGraw Hill, 2001.

[24] M. Devlin. Product Focus DSP: How to Make Smart Antenna Arrays. In Xcell Journal, Q1, 2003.

[25] C. Dick. FPGA: Enabling the Software/Reconfigurable Radio. In Advanced Radio Technologies, 2003.

[26] F. Doucet, M. Otsuka, S. Shukla, and R. Gupta. An Environment for Dynamic Component Composition for Efficient Co-Design. In Proc. of Design Automation and Test in Europe, 2002.

[27] H. J. Eikerling, W. Hardt, J. Gerlach, and W. Rosenstiel. A Methodology for Rapid Analysis and Optimization of Embedded Systems. In Proc. of Symposium on Engineering of Computer Based Systems, 1996.

[28] K. Flautner, S. Reinhardt, and T. Mudge. Automatic Performance Setting for Dynamic Voltage Scaling. ACM Journal on Wireless Networks, 8(5):507-520, September 2002.

[29] Generic Modeling Environment. http://www.isis.vanderbilt.edu/Projects/gme/.

[30] S. Ghiasi, A. Nahapetian, and M. Sarrafzadeh. An Optimal Algorithm for Minimizing Runtime Reconfiguration Delay. ACM Transactions on Embedded Computing Systems, 3(2):237-256, 2004.

[31] T. Givargis, F. Vahid, and J. Henkel. Evaluating Power Consumption of Parameterized Cache and Bus Architectures in System-on-a-chip Designs. IEEE Transactions on Very Large Scale Integration Systems, 9:500-508, 2001.

[32] GPS-6010 Smart Antenna. http://www.stargps.ca/GPS-6010-Manual-E.pdf.

[33] D. Grunwald. Slow Wires, Hot Chips, and Leaky Transistors: New Challenges in the New Millennium, 2000. http://www.cs.wisc.edu/~arch/www/ISCA-2000-panel/.

[34] J. Henkel and R. Ernst. High-Level Estimation Techniques for Usage in Hardware/Software Co-Design. In Proc. of Asia South Pacific Design Automation Conference, 1998.

[35] Intel PXA 255 Processor. http://www.intel.com/design/pca/prodbref/252780.htm.
[36] S. Irani, S. Shukla, and R. K. Gupta. Competitive Analysis of Dynamic Power Management Strategies for Systems with Multiple Power Saving States. In Proc. of Design Automation and Test Conference, 2002.

[37] P. James-Roxby, G. Brebner, and D. Bemmann. Time-Critical Software Deceleration in an FCCM. In Proc. of Field-Programmable Custom Computing Machines, 2004.

[38] J. Jang, S. Choi, and V. K. Prasanna. Energy-Efficient Matrix Multiplication on FPGAs. In Proc. of Field Programmable Logic and Applications, 2002.

[39] D. Kirk, M. Roper, and M. Wood. Defining the Problems of Framework Reuse. In Proc. of Computer Software and Applications Conference, 2002.

[40] K. Lahiri, A. Raghunathan, and S. Dey. Efficient Power Profiling for Battery-driven Embedded System Design. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 23(6):919-932, 2004.

[41] A. Ledeczi, J. Davis, S. Neema, and A. Agrawal. Modeling Methodology for Integrated Simulation of Embedded Systems. ACM Transactions on Modeling and Computer Simulation, 13(1):82-103, January 2003.

[42] E. A. Lee. Overview of the Ptolemy Project. Technical Report UCB/ERL M03/25, UC Berkeley, July 2003.

[43] E. A. Lee and D. G. Messerschmitt. Synchronous Data Flow. Proceedings of the IEEE, 75, 1987.

[44] B. Levine, S. Natarajan, C. Tan, D. Newport, and D. Bouldin. Mapping of an Automated Target Recognition Application from a Graphical Software Environment to FPGA-based Reconfigurable Hardware. In Proc. of IEEE Symposium on Field-programmable Custom Computing Machines, 1999.

[45] P. Lieverse, P. van der Wolf, K. Vissers, and E. Deprettere. A Methodology for Architecture Exploration of Heterogeneous Signal Processing Systems. Journal of VLSI Signal Processing for Signal, Image and Video Technology, 29(3):197-207, November 2001.

[46] J. Liu and P. H. Chou. Energy Optimization of Distributed Embedded Processors by Combined Data Compression and Functional Partitioning. In Proc. of Intl. Conference on Computer-Aided Design, 2003.

[47] B. Madahar and I. Alston. How Rapid is Rapid Prototyping? Electronics and Communication Engineering Journal, 14(4):155-164, 2002.
[48] R. Maestre, F. Kurdahi, M. Fernandez, R. Hermida, N. Bagherzadeh, and H. Singh. A Framework for Reconfigurable Computing: Task Scheduling and Context Management. IEEE Transactions on VLSI Design, 9(6):858-873, 2001.

[49] M. Mamidipaka, M. Khouri, N. Dutt, and M. Abadir. IDAP: A Tool for High Level Power Estimation of Custom Array Structures. In Proc. of Intl. Conf. on Computer Aided Design, 2003.

[50] G. McGregor, D. Robinson, and P. Lysaght. Hardware/Software Co-design Environment for Reconfigurable Logic Systems. In Proc. of Field-Programmable Logic and Applications, 1998.

[51] Mentor Graphics ModelSim. http://www.model.com/.

[52] Micron Mobile SDRAM. http://www.micron.com/products/dram/mobile/.

[53] Model-based Integrated Simulation. http://milan.usc.edu/.

[54] Model Integrated Computing. http://www.isis.vanderbilt.edu/research/mic.html.

[55] S. Mohanty and V. Prasanna. Rapid System-Level Performance Evaluation and Optimization for Application Mapping onto SoC Architectures. In Proc. of IEEE Intl. ASIC/SOC Conference, 2002.

[56] S. Mohanty and V. Prasanna. A Hierarchical Approach for Energy Efficient Application Design using Heterogeneous Embedded Systems. In Proc. of Compilers, Architectures and Synthesis for Embedded Systems, pages 243-254, 2003.

[57] S. Mohanty and V. Prasanna. An Algorithm Designer's Workbench for Platform FPGAs. In Proc. of Field Programmable Logic and its Application, 2004.

[58] S. Mohanty, V. K. Prasanna, S. Neema, and J. Davis. Rapid Design Space Exploration of Heterogeneous Embedded Systems using Symbolic Search and Multi-Granular Simulation. In Proc. of Language Compilers and Tools for Embedded Systems, 2002.

[59] T. Mudge. Power: A First Class Design Constraint. Computer, 34(4):52-58, April 2001.

[60] S. Neema. System Level Synthesis of Adaptive Computing Systems. PhD thesis, Vanderbilt University, Department of Electrical and Computer Engineering, May 2001.
[61] S. Ong, N. Kerkiz, B. Srijanto, C. Tan, M. Langston, D. Newport, and D. Bouldin. Automatic Mapping of Multiple Applications to Multiple Adaptive Computing Systems. In Proc. of IEEE Symposium on Field-programmable Custom Computing Machines, 2001.

[62] J. Ou, S. Choi, and V. K. Prasanna. Performance Modeling of Reconfigurable SoC Architectures and Energy-Efficient Mapping of a Class of Applications. In Proc. of IEEE Symposium on Field-Programmable Custom Computing Machines, 2003.

[63] H. Oudghiri, B. Kaminska, and J. Rajski. A Hardware/Software Partitioning Technique with Hierarchical Design Space Exploration. In Proc. of Custom Integrated Circuits Conference, 1997.

[64] H. P. Peixoto and M. F. Jacome. Algorithm and Architecture-level Design Space Exploration using Hierarchical Data Flows. In Proc. of Application-Specific Systems, Architectures and Processors, 1997.

[65] A. D. Pimentel, P. van der Wolf, E. F. Deprettere, L. O. Hertzberger, J. T. J. van Eijndhoven, and S. Vassiliadis. The Artemis Architecture Workbench. In Proc. of the Progress Workshop on Embedded Systems, 2000.

[66] PowerPC 405 Embedded Cores. http://www-3.ibm.com/chips/techlib/techlib.nsf/products/PowerPC_405_Embedded_Cores.

[67] Power Aware Sensing Tracking and Analysis. http://pasta.east.isi.edu/.

[68] V. K. Prasanna and Y. Tsai. On Synthesizing Optimal Family of Linear Systolic Arrays for Matrix Multiplication. IEEE Trans. on Computers, 40(6):770-774, 1991.

[69] Ptolemy Project. http://ptolemy.eecs.berkeley.edu/.

[70] R. Puri, L. Stok, J. Cohn, D. Kung, D. Pan, D. Sylvester, A. Srivastava, and S. Kulkarni. Pushing ASIC Performance in a Power Envelope. In Proc. of Design Automation Conference, 2003.

[71] QuickLogic EclipsePlus. http://www.quicklogic.com.

[72] R. M. Rabbah and K. V. Palem. Data Remapping for Design Space Optimization of Embedded Memory Systems. ACM Transactions on Embedded Computing Systems, pages 186-218, 2003.

[73] A. Raghunathan, N. K. Jha, and S. Dey. High-Level Power Analysis and Optimization. Kluwer Academic Publishers, MA, 1998.
[74] D. N. Rakhmatov and S. B. K. Vrudhula. An Analytical High-Level Battery Model for Use in Energy Management of Portable Electronic Systems. In Proc. of Intl. Conf. on Computer Aided Design, 2001.

[75] R. Riley, S. Thakkar, J. Czarnaski, and B. Schott. Power-Aware Acoustic Beamforming. In International Military Sensing Symposium, 2003.

[76] I. Robertson, J. Irvine, P. Lysaght, and D. Robinson. Improved Functional Simulation of Dynamically Reconfigurable Logic. In Proc. of Field Programmable Logic and Applications, pages 152-161, 2002.

[77] Rocket I/O Multi-Gigabit Transceivers. http://www.xilinx.com/publications/xcellonline/partners/xc_mgbt42.htm.

[78] B. Schott, P. Bellows, M. French, and R. Parker. Applications of Adaptive Computing Systems for Signal Processing Challenges. In Proceedings of the ASP-DAC, 2002.

[79] R. Scrofano, S. Choi, and V. K. Prasanna. Energy Efficiency of FPGAs and Programmable Processors for Matrix Multiplication. In Proc. of IEEE International Conference on Field Programmable Technology, 2002.

[80] Seamless Hardware/Software Co-Verification and FPGA Advantage. http://www.mentor.com/fpga-advantage/.

[81] N. Shenoy, A. Choudhary, and P. Banerjee. An Algorithm for Synthesis of Large Time-constrained Heterogeneous Adaptive Systems. ACM Transactions on the Design Automation of Electronic Systems, 2(6):207-225, 2001.

[82] N. Shirazi, W. Luk, and P. Y. K. Cheung. Automating Production of Run-time Reconfigurable Designs. In Proc. of IEEE Symposium on Field-Programmable Custom Computing Machines, 1998.

[83] SimpleScalar Tool Set. http://www.simplescalar.com/.

[84] SimpleScalar-Arm Power Modeling Project. http://www.eecs.umich.edu/~tnm/power/.

[85] P. Singer. The Optimal Detector. In Proc. of SPIE Conference: Signal and Data Processing for Small Targets, 2002.

[86] A. Sinha and A. Chandrakasan. JouleTrack - A Web Based Tool For Software Energy Profiling. In Proc. of Design Automation Conference, 2001.
[87] A. Stammermann, L. Kruse, W. Nebel, A. Pratsch, E. Schmidt, M. Schulte, and A. Schulz. System Level Optimization and Design Space Exploration for Low Power. In Proc. of Intl. Symposium on System Synthesis, 2001.

[88] J. Sztipanovits and G. Karsai. Model-integrated Computing. IEEE Computer Magazine, 30(4):110-111, April 1999.

[89] TI C5000 Series DSPs. http://dspvillage.ti.com/.

[90] TI TMS320 Series DSPs. http://dspvillage.ti.com/.

[91] F. Vahid and T. Givargis. Embedded System Design: A Unified Hardware/Software Introduction. John Wiley and Sons, 2002.

[92] M. T. Wang. FPGAs Adapt to Suit Low-power Demands. In EE Times, 2004.

[93] Xilinx XFlow Command Line Tool. http://toolbox.xilinx.com/docsan/xilinx6/books/data/docs/dev/dev0215_30.html.

[94] Xilinx IP Core. http://www.xilinx.com/ipcenter/.

[95] Xilinx System Generator for Simulink (Matlab). http://www.xilinx.com/products/software/sysgen/product_details.htm.

[96] Xilinx Virtex-II and Virtex-II Pro. http://www.xilinx.com/.

[97] Xilinx Virtex-4. http://www.xilinx.com/products/virtex4/overview.htm.

[98] W. Ye, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool. In Proc. of Design Automation Conference, 2000.