ENERGY EFFICIENT HARDWARE-SOFTWARE CO-SYNTHESIS USING
RECONFIGURABLE HARDWARE
by
Jingzhao Ou
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
( COMPUTER ENGINEERING )
May 2006
Copyright 2006 Jingzhao Ou
UMI Number: 3237157

UMI Microform 3237157
Copyright 2007 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
Dedication
To my dear parents, wife, daughter, and brother!
Thanks a lot for your love!
Acknowledgments
First and foremost, I would like to express my deepest appreciation and gratitude to my
adviser, Prof. Prasanna. He has not only taught me knowledge, skills, and experience in
research, but also mentored me in my life and career. I thank him for all his guidance,
support, inspiration, and encouragement during the past years.
I would like to thank the other members of my defense and qualifying exam committees, including
Prof. Pedro Diniz, Prof. Roger Zimmermann, Prof. Shrikanth Narayanan, and Prof. Timothy
Pinkston. They offered much useful feedback that helped bring this thesis to completion.
During these years, I have been working in Prof. Prasanna's research group, the P-Group.
The pleasant research atmosphere in the P-Group has been both helpful and enjoyable. I would like to
thank all the members of the group, including Zachary Baker, Amol Bakshi, Seonil Choi, Gokul
Govindu, Bo Hong, Sumit Mohanty, Gerald R. "Jerry" Morris, Jeoong Park, Neungsoo Park,
Animesh Pathak, Ronald Scrofano, Reetinder Sidhu, Mitali Singh, Yang Yu, Cong Zhang, and Ling
Zhuo. Of these, I want to give special thanks to Amol, who has shared an office with me
since I joined the group; Seonil, with whom I have discussed many research problems; Govindu,
who was my officemate for one year; and Yang, who is a very good and helpful friend in life.
I would like to thank my colleagues at Xilinx, Inc., including Brent Milne, Haibing Ma,
Shay P. Seng, Jim Hwang, and many others. In particular, I would like to thank Brent Milne and
Jim Hwang for offering me the internship and full-time job opportunities to work at Xilinx.
These precious opportunities were very important for the writing of this thesis as well as for my
career after graduation.
I also want to take this opportunity to thank Prof. Cauligi S. Raghavendra for his guidance
and help during the first two years of my Ph.D. study.
I want to express my wholehearted gratitude to my parents, my wife, as well as other family
members. They have supported me so much during all these five years, which has been a long and
rough journey for both me and them. I owe them too much!
Finally, I want to thank my close friends, including Fan Bai, Xuan Chen, Ruiqin He, Hongwei
Wu, Ning Xu, Yan Zhou, Xi Zhu, and Dan Zheng for their cherished friendship and all the
happiness they have brought to me.
Contents
Dedication ii
Acknowledgments iii
List Of Tables ix
List Of Figures xi
1 Introduction 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Challenges and contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Thesis organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Reconfigurable hardware 10
2.1 Reconfigurable System-on-Chips (SoCs) . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1 Field-programmable gate arrays . . . . . . . . . . . . . . . . . . . . . . 10
2.1.1.1 Classifications . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.1.2 Configurable logic blocks . . . . . . . . . . . . . . . . . . . . . 13
2.1.1.3 Routing resources . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.1.2 Pre-compiled embedded hardware components . . . . . . . . . . . . . . 18
2.1.2.1 Embedded general-purpose processors . . . . . . . . . . . . . . 18
2.1.2.2 Pre-compiled DSP blocks . . . . . . . . . . . . . . . . . . . . . 22
2.1.3 Soft processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.1.3.1 Definition and overview . . . . . . . . . . . . . . . . . . . . . . 23
2.1.3.2 Customization . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Design flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.1 Low-level design flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.2.2 High-level design flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3 A framework for high-level hardware-software application development 37
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4 An implementation based on MATLAB/Simulink . . . . . . . . . . . . . . . . . 49
3.4.1 High-level design description . . . . . . . . . . . . . . . . . . . . . . . . 49
3.4.2 Arithmetic-level co-simulation . . . . . . . . . . . . . . . . . . . . . . . . 50
3.4.2.1 Simulation of software execution on soft processors. . . . . . . 50
3.4.2.2 Simulation of customized hardware peripherals . . . . . . . . . 52
3.4.2.3 Simulation of communication interfaces . . . . . . . . . . . . . 53
3.4.2.4 Exchange of simulation data and synchronization between the
simulators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3.4.3 Rapid hardware resource estimation . . . . . . . . . . . . . . . . . . . . 57
3.5 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.5.1 Co-simulation of the processor and the hardware peripherals . . . . . . 59
3.5.1.1 Adaptive CORDIC algorithm for division . . . . . . . . . . . . 59
3.5.1.2 Block matrix multiplication . . . . . . . . . . . . . . . . . . . . 63
3.5.2 Co-simulation of a complete multi-processor platform . . . . . . . . . . 66
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4 Energy performance modeling and energy efficient mapping for a class of
applications 74
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4.2 Knobs for energy-efficient designs . . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 Performance modeling of RSoC architectures . . . . . . . . . . . . . . . . . . . 80
4.4.1 RSoC model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.4.2 A model for Virtex-II Pro . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.5 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5.1 Application model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.5.2 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.6 Algorithm for energy minimization . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6.1 Trellis creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6.2 A dynamic programming algorithm . . . . . . . . . . . . . . . . . . . . 88
4.7 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7.1 Delay-and-sum beamforming . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7.1.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . 90
4.7.1.2 Energy minimization . . . . . . . . . . . . . . . . . . . . . . . 93
4.7.2 MVDR beamforming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.7.2.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . 96
4.7.2.2 Energy minimization . . . . . . . . . . . . . . . . . . . . . . . 99
4.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5 High-level rapid energy estimation and design space exploration 102
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
5.3 Domain-specific modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.4 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.1 Step 1: Cycle-accurate arithmetic level co-simulation . . . . . . . . . . . 111
5.4.2 Step 2: Energy estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.4.2.1 Instruction-level energy estimation for software execution . . . 115
5.4.2.2 Domain-specific modeling based energy estimation for hardware
execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.5 Energy estimation for customized hardware components . . . . . . . . . . . . . 119
5.5.1 Software architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.1.1 PyGen module . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.1.2 Performance estimator . . . . . . . . . . . . . . . . . . . . . . 122
5.5.1.3 Energy profiler . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
5.5.1.4 Optimization module . . . . . . . . . . . . . . . . . . . . . . . 123
5.5.2 Overall design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.5.3 Kernel level development . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.5.3.1 Parametrized kernel development . . . . . . . . . . . . . . . . 126
5.5.3.2 Support of rapid and accurate energy estimation . . . . . . . . 127
5.5.4 Application level development . . . . . . . . . . . . . . . . . . . . . . . . 131
5.5.4.1 Application description . . . . . . . . . . . . . . . . . . . . . . 131
5.5.4.2 Support of performance optimization . . . . . . . . . . . . . . 131
5.5.4.3 Energy profiling . . . . . . . . . . . . . . . . . . . . . . . . . . 132
5.6 Instruction-level energy estimation for software programs . . . . . . . . . . . . 133
5.6.1 Arithmetic-level instruction based energy estimation . . . . . . . . . . . 133
5.6.2 An implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.6.2.1 Step 1: creation of the arithmetic-level instruction energy
look-up table . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
5.6.2.2 Step 2: creation of the energy estimator . . . . . . . . . . . . . 137
5.6.3 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.6.3.1 Creation of the arithmetic-level instruction energy look-up table 140
5.6.3.2 Energy estimation of the sample software programs . . . . . . 142
5.7 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6 Hardware-software co-design for energy efficient implementations of operat-
ing systems 153
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.2 Real-time operating systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
6.2.2 Off-the-shelf operating systems . . . . . . . . . . . . . . . . . . . . . . . 158
6.2.2.1 MicroC/OS-II . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.2.2.2 TinyOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.3 On-chip energy management mechanisms . . . . . . . . . . . . . . . . . . . . . 161
6.4 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.5 Our approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
6.6 An implementation based on MicroC/OS-II . . . . . . . . . . . . . . . . . . . . 166
6.6.1 Customization of MicroBlaze soft processor . . . . . . . . . . . . . . . . 167
6.6.2 Clock management unit . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.6.3 Auxiliary task and interrupt management unit . . . . . . . . . . . . . . 169
6.6.4 Selective wake-up and activation state management unit . . . . . . . . . 173
6.6.5 Analysis of management overhead . . . . . . . . . . . . . . . . . . . . . 174
6.6.6 Illustrative application development . . . . . . . . . . . . . . . . . . . . 176
6.6.6.1 Customization of the MicroBlaze soft processor . . . . . . . . . 177
6.6.6.2 A case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.7 An implementation based on TinyOS . . . . . . . . . . . . . . . . . . . . . . . . 184
6.7.1 Hardware architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
6.7.1.1 Hardware based task and event management (TEM) unit . . . 184
6.7.1.2 "Explicit" power management unit . . . . . . . . . . . . . . . . 186
6.7.1.3 Split-phase operations for computations using hardware periph-
erals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
6.7.2 Illustrative application development . . . . . . . . . . . . . . . . . . . . 187
6.7.2.1 Customization of the MicroBlaze soft processor . . . . . . . . . 187
6.7.2.2 A case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
6.7.3 Analysis of management overhead . . . . . . . . . . . . . . . . . . . . . 192
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7 Concluding remarks and future directions 194
7.1 Concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
7.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
8 Reference list 196
List Of Tables
2.1 Various FPL technologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2 Logic resources in a CLB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1 Resource usage of the CORDIC based division and the block matrix multipli-
cation applications as well as the simulation times using different simulation
techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.1 Maximum operating frequencies of different implementations of an 18×18-bit
multiplication on Virtex-II Pro . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Energy dissipation E_{i,s} for executing task T_i in state s . . . . . . . . . . . . 85
4.3 Energy dissipation of the tasks in the delay-and-sum beamforming application
(µJ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.4 Various implementations of the tasks on RL for the MVDR beamforming appli-
cation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.5 Energy dissipation of the tasks in the MVDR beamforming application (µJ) . 100
5.1 Energy dissipation of the FFT and matrix multiplication software programs . 144
5.2 Arithmetic level/low-level simulation time and measured/estimated energy per-
formance of the CORDIC based division application and the block matrix mul-
tiplication application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
5.3 Simulation speeds of the hardware-software simulators considered in this thesis . 148
6.1 Dynamic power consumption of the FPGA device in different activation states 180
6.2 Various activation states for the FFT application shown in Figure 6.17 and their
dynamic power consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
List Of Figures
1.1 Non-recurring engineering costs of EasyPath . . . . . . . . . . . . . . . . . . . 3
2.1 Classification of VLSI devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Arrangement of slices within a Xilinx Virtex-4 CLB . . . . . . . . . . . . . . . 14
2.3 Virtex-II Pro slices within a CLB . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 An FPGA device with island-style routing architecture . . . . . . . . . . . . . 16
2.5 Dual-port distributed RAM (RAM16x1D) . . . . . . . . . . . . . . . . . . . . 18
2.6 Cascadable shift registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.7 Pipeline flow architecture of the PowerPC 405 Auxiliary Processing Unit . . . 20
2.8 IBM CoreConnect bus architecture . . . . . . . . . . . . . . . . . . . . . . . . 22
2.9 Architecture of DSP48 slices on Xilinx Virtex-4 . . . . . . . . . . . . . . . . . 23
2.10 Architecture of CoreMP7 soft processor . . . . . . . . . . . . . . . . . . . . . . 25
2.11 Architecture of MicroBlaze soft processor . . . . . . . . . . . . . . . . . . . . . 26
2.12 Extension of MicroBlaze instruction set through FSL interfaces . . . . . . . . . 29
2.13 Extension of Nios instruction set . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.14 Low-level design flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.15 Design flow of System Generator . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.16 The high-level block set provided by System Generator . . . . . . . . . . . . . 34
3.1 Hardware architecture of the configurable multi-processor platform . . . . . . 39
3.2 Our approach for high-level hardware-software co-simulation . . . . . . . . . . 45
3.3 An implementation of the proposed arithmetic co-simulation environment based
on MATLAB/Simulink . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.4 An implementation of the proposed arithmetic co-simulation environment based
on MATLAB/Simulink . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.5 Architecture of the soft processor Simulink block . . . . . . . . . . . . . . . . . 53
3.6 Communication between MicroBlaze and customized hardware designs through
Fast Simplex Links . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.7 CORDIC algorithm for division with P = 4 . . . . . . . . . . . . . . . . . . . . 60
3.8 Time performance of the CORDIC algorithm for division (P = 0 denotes "pure"
software implementations) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.9 Matrix multiplication with customized hardware peripheral for matrix block mul-
tiplication with 2×2 blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.10 Time performance of our design of block matrix multiplication . . . . . . . . . 65
3.11 The configurable multi-processor platform with four MicroBlaze processors for
the JPEG2000 encoding application . . . . . . . . . . . . . . . . . . . . . . . . 68
3.12 Execution time speed-ups of the 2-D DWT task . . . . . . . . . . . . . . . . . 69
3.13 Utilization of the OPB bus interface when processing the 2-D DWT task . . . 69
3.14 Simulation speed-ups achieved by the arithmetic level co-simulation environment 70
4.1 The RSoC model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.2 A linear pipeline of tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.3 The trellis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.4 Task graph of the delay-and-sum beamforming application . . . . . . . . . . . 91
4.5 Energy dissipation of different implementations of the broadband delay-and-sum
beamforming application (the input data is after 2048-point FFT processing) . 93
4.6 Energy dissipation of different implementations of the broadband delay-and-sum
beamforming application (the number of output frequency points is 256) . . . . 94
4.7 Task graph of the MVDR beamforming application . . . . . . . . . . . . . . . 94
4.8 MAC architectures with various input sizes . . . . . . . . . . . . . . . . . . . . 95
4.9 Energy dissipation of task T_1 implemented using various MAC architectures . 96
4.10 Energy dissipation of various implementations of the MVDR beamforming ap-
plication (M =64) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.11 Energy dissipation of various implementations of the MVDR beamforming ap-
plication (the number of points of FFT is 256) . . . . . . . . . . . . . . . . . . 98
5.1 FPGA-based hardware-software co-design . . . . . . . . . . . . . . . . . . . . . 104
5.2 Domain-specific modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3 The two-step energy estimation approach . . . . . . . . . . . . . . . . . . . . . 109
5.4 Software architecture of our hardware-software co-simulation environment . . . 110
5.5 An implementation of the hardware-software co-simulation environment based
on MATLAB/Simulink . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.6 Flow of generating the instruction energy look-up table . . . . . . . . . . . . . 116
5.7 Python classes organized as domains . . . . . . . . . . . . . . . . . . . . . . . . 116
5.8 Python class library . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.9 High-level switching activities and power consumption of the PEs that constitute
the design shown in Figure 5.27 . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.10 High-level switching activities and power consumption of slice-based multipliers 120
5.11 Architecture of PyGen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.12 Python class library within PyGen . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.13 Design flow using the energy profiler . . . . . . . . . . . . . . . . . . . . . . . . 123
5.14 Design flow of PyGen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.15 Tree structure of the Python extended classes for parameterized FFT kernel
development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.16 Class tree organized as domains . . . . . . . . . . . . . . . . . . . . . . . . . . 128
5.17 Power consumption and average switching activities of input/output data of the
butterflies in an unfolded architecture for 8-point FFT computation . . . . . . 130
5.18 Estimation error of the butterflies when default switching activity is used . . . 130
5.19 Trellis for describing linear pipeline applications . . . . . . . . . . . . . . . . . 132
5.20 Flow of instruction energy profiling . . . . . . . . . . . . . . . . . . . . . . . . 136
5.21 Software architecture of an implementation of the proposed energy estimation
technique . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.22 Configuration of the MicroBlaze processor system . . . . . . . . . . . . . . . . 138
5.23 Energy profiling of the MicroBlaze instruction set . . . . . . . . . . . . . . . . 140
5.24 Impact of input data causing different arithmetic behavior of MicroBlaze . . . 141
5.25 Impact of different instruction addressing modes . . . . . . . . . . . . . . . . . 142
5.26 Instant average power consumption of the FFT software program . . . . . . . 144
5.27 CORDIC processor for division (P = 4) . . . . . . . . . . . . . . . . . . . . . . 147
5.28 Architecture of matrix multiplication with customized hardware for multiplying
2×2 matrix blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.1 Overall hardware architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
6.2 Configuration of the MicroBlaze soft processor with the COMA scheme . . . . 168
6.3 An implementation of the clock management unit . . . . . . . . . . . . . . . . 168
6.4 Linked list of task control blocks . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.5 Ready task list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
6.6 Interrupt management unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
6.7 Priority aliasing for selective component wake-up and activation state manage-
ment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.8 Typical context switch overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.9 Typical interrupt overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.10 Instant power consumption when processing data-input interrupt and FFT com-
putation task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
6.11 Instant power consumption when processing data-output task . . . . . . . . . 181
6.12 Energy dissipation of the FPGA device for processing one instance of the data-
out task with different operating frequencies . . . . . . . . . . . . . . . . . . . 182
6.13 Average power consumption of the FPGA device with different data in-
put/output rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
6.14 Hardware architecture of h-TinyOS . . . . . . . . . . . . . . . . . . . . . . . . 185
6.15 An implementation of h-TinyOS on MicroBlaze . . . . . . . . . . . . . . . . . 187
6.16 Implementation of an FFT module . . . . . . . . . . . . . . . . . . . . . . . . . 189
6.17 Top-level configuration ProcessC of the FFT computation application . . . . . 190
6.18 Average power consumption for different data input/output rates for h-TinyOS 192
Chapter 1
Introduction
1.1 Overview
Since the introduction of Field-Programmable Gate Arrays (FPGAs) in 1984, the popularity of
reconfigurable computing has been growing rapidly over the years. Multi-million gates of
configurable logic are available in modern FPGAs. The integration of pre-compiled heterogeneous
hardware components has offered FPGAs extraordinary computation capabilities. The
introduction of various on-chip configurable processor systems on modern FPGAs provides great
design flexibility similar to that of general-purpose processors and programmable DSP (Digital
Signal Processing) processors. In particular, compared with traditional ASICs (Application-Specific
Integrated Circuits), reconfigurable hardware provides many additional advantages, which are
discussed as follows.
• Reduced development tool costs
Designs using FPGAs can greatly reduce the development tool costs compared with ASICs.
For ASIC based application development, the average seat of the development tools costs around
200,000 dollars per engineer. In addition, engineers have to go through a lengthy low-level design
process. The low-level design process includes HDL (Hardware Description Language) based
register transfer level (RT level) design description, functional and architectural simulation,
synthesis, timing analysis, test insertion, place-and-route, design verification, floorplanning,
and design-rule checking (DRC). Due to the complicated development process, the engineers usually
need to use tools from multiple EDA (Electronic Design Automation) vendors, which
dramatically increases the costs for the development tools. In comparison, for FPGA
based application development, the average seat of the development tools costs between 2,000
and 3,000 dollars, which is much lower than that of the ASIC development tools. FPGA based
application development goes through a process similar to that of ASICs, including synthesis,
functional and architectural simulation, timing analysis, place-and-route, etc. However, due to the
reconfigurability of FPGA devices, design insertion and design verification are much simpler
for designs using reconfigurable hardware than for those using ASICs. Because of the relatively simpler
development process, adequate tools are usually provided by FPGA vendors at a very
low cost. There are also many value-added tools from other EDA vendors ranging from 20,000 to
30,000 dollars.
• No non-recurring engineering costs
Non-Recurring Engineering (NRE) costs refer to the costs of creating a new product, which
are paid up front. The NRE costs are in contrast with "production costs", which are ongoing
and based on the quantity of material produced. For example, in the semiconductor industry,
the NRE costs are the costs for developing circuit designs and photomasks while the production
costs are the costs for manufacturing each wafer.
Reconfigurable hardware is pre-fabricated and pre-tested before shipping to the customers.
Once the final designs are complete, the end users just need to download the bitstreams of the
final designs to the FPGA devices. No photomasks are required. Therefore, there are no NRE
costs for development using FPGAs. This greatly reduces the overall costs and risks of
application development.
Figure 1.1: Non-recurring engineering costs of EasyPath
Besides, many FPGA vendors provide an option for quickly converting the prototypes in
FPGAs into structured ASIC chips based on the FPGA bitstreams. Examples include the
HardCopy technology [4] from Altera and the EasyPath technology [81] from Xilinx. Taking
EasyPath as an example, there is no change to the FPGA bitstream during the conversion.
The silicon used for the EasyPath device is identical to the FPGA silicon, which guarantees
identical functionality between the two silicon devices. The EasyPath methodology uses a
unique testing procedure to isolate the die that can be programmed in the factory with the
customer's specific pattern. The customer receives Xilinx devices that are custom programmed
and marked for their specific applications. The delivery time for such FPGA-based ASIC devices
is a fraction of that for traditional ASIC devices. These customized structured ASIC
chips are guaranteed to have the same functionality as the original FPGA devices. No changes
are introduced into the design at all. Finally, there is no tape-out, no prototype cycle, and
no change to the physical device itself. The NRE costs of the EasyPath technology compared
with traditional ASIC design technologies are shown in Figure 1.1. We can see that for a large
range of production volumes, the ASIC conversion option can have smaller NRE costs than
traditional ASIC designs.
• Fast time to market
FPGAs require negligible re-spinning time and offer a fast time-to-market. Once the
design of the system is completed and verified, the whole product is ready for shipping to the
customers. This is in contrast with ASIC based designs, which can incur turn-around times of
up to a few months for producing the photomasks.
• Great product flexibility
Various functionalities can be implemented on reconfigurable hardware through the on-chip
configurable logic, programmable interconnect, and heterogeneous pre-compiled hardware components
available on modern FPGA devices. Reconfigurable hardware devices are widely used to
implement various embedded systems. Recently, there have been academic and industrial attempts to
use reconfigurable hardware for floating-point based high performance computing applications
such as molecular dynamics simulation [65].
All in all, FPGAs have evolved into reconfigurable System-on-Chip (SoC) devices and are
an attractive choice for implementing a wide range of applications. FPGAs are on the verge
of revolutionizing many traditional computation application areas, including digital signal
processing, image processing, and high-performance scientific computing.
1.2 Challenges and contributions
Paradoxically, while the integration of multi-million-gate con¯gurable logic, pre-compiled het-
erogeneous hardware components, and on-chip processor systems, o®er modern recon¯gurable
hardwarehighcomputationcapabilityandexceptionaldesign°exibilitytorecon¯gurablehard-
ware,rapidandenergye±cientapplicationsynthesisusingthesehardwaredevicesischallenging
due to the following reasons.
4
• The ever increasing design complexity makes low-level design flows and techniques
unattractive for the development of many complicated systems.
FPGAs are being used to implement many complex systems. Describing these systems
through the traditional low-level design flow can turn out to be time consuming in many design
cases and is unattractive. Synthesis and place-and-route of a single low-level implementation
usually take up to a couple of hours on a multi-million gate FPGA device. This prohibits
efficient design space exploration and performance optimization.
Besides, when we consider the development of signal and image processing systems, such low-
level design flows can introduce bottlenecks in the communication between hardware designers
and algorithm developers. People from the signal processing community are usually not familiar
with HDLs, while it is very demanding for a hardware designer to have a profound knowledge
of various complicated digital signal processing and image processing algorithms.
• State-of-the-art design and simulation techniques for general-purpose processors are
inefficient for exploring the unprecedented hardware-software design flexibility offered by
reconfigurable hardware.
As far as design description is concerned, low-level register-transfer/gate level techniques
are inefficient for constructing reconfigurable platforms containing hardware and software
components. There can be complicated communication protocols among these hardware and
software components. Software programs usually execute for millions of clock cycles, which is
far beyond the simulation capabilities of the best industrial RTL simulators. From
another perspective, the development of high-level design techniques is challenging due to the
different high-level abstraction requirements placed by hardware and software executions. A
proper high-level abstraction is crucial for effective design space exploration and performance
optimization toward some pre-defined performance metrics.
As far as simulation is concerned, there are many academic and industrial instruction set
simulators for various types of general-purpose processors. Examples of such instruction set
simulators include SimpleScalar [35] and Amulator for StrongARM processors []. A relatively
"fixed" architecture of the processor is assumed by these simulators. Some configuration options,
such as cache sizes and associativities, memory sizes, etc., are usually available for these
simulators. However, as is analyzed in Chapter 5, based on the assumption of a relatively "fixed"
processor architecture, instruction-level simulators are unable to simulate the customized
instructions and hardware peripherals attached to the processors. Hence, they are not suitable
for hardware-software co-design.
• Rapid energy estimation and energy performance optimization are challenging for systems
using reconfigurable hardware.
On the one hand, energy estimation using RTL (Register Transfer Level) simulation (which
can be accurate) is time consuming and can be overwhelming considering the fact that there
are usually many possible implementations of an application on FPGAs. In particular, the RTL
simulation based energy estimation techniques are impractical for estimating the energy
dissipation of the on-chip pre-compiled and soft processors. As we show in Chapter 5, simulating
~2.78 msec of actual execution time of a software program based on the post place-and-route
simulation model of a state-of-the-art soft processor takes around 3 hours. The total time
for estimating the energy dissipation of the software program is around 4 hours. Such an energy
estimation speed prohibits the application of such low-level techniques for software development
considering the fact that many software programs are expected to run for tens to thousands
of seconds.
On the other hand, the basic elements of FPGAs are look-up tables (LUTs), which are too
low-level an entity to be considered for high-level modeling and rapid energy estimation. No single
high-level model can capture the energy dissipation behavior of all possible implementations
on FPGAs. Using several common signal processing operations, Choi, et al. show in [15]
that different algorithms and architectures used for implementing these common operations on
reconfigurable hardware result in significantly different energy dissipation. A domain-
specific energy performance modeling technique is proposed in [14]. However, to the best of our
knowledge, there is no software framework that integrates this modeling technique for energy
efficient application development.
Considering the challenges discussed above, this thesis makes the following four major
contributions toward energy efficient application synthesis using reconfigurable hardware.
• A framework for high-level hardware-software application development
Various high-level abstractions for describing hardware and software platforms are mixed
and integrated into a single and consistent application development framework. Within the
proposed framework, the end users can quickly construct complete systems that consist of
both hardware and software components. Most importantly, the proposed framework supports
co-simulation and co-debugging of the high-level descriptions of the systems. By utilizing these
co-simulation and co-debugging capabilities, the end users can quickly verify the functionality
of the complete systems without involving their corresponding low-level implementations.
• Energy performance modeling for reconfigurable system-on-chip devices and energy efficient
mapping for a class of applications
An energy performance modeling technique is proposed to capture the energy dissipation
behavior of both the reconfigurable hardware platform and the target application. In particular,
the communication costs and the reconfiguration costs that are pertinent to designs using
reconfigurable hardware are accounted for by the energy performance models. Based on the energy
models for the hardware platform and the application, a dynamic programming based algorithm
is proposed to optimize the energy performance of the application running on the reconfigurable
hardware platform.
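As a rough illustration of this optimization step (a simplified sketch, not the exact formulation developed in Chapter 4), the following Python fragment chooses an execution state for each task of a linear pipeline so that the sum of execution and transition (communication/reconfiguration) energies is minimized; all task counts, states, and energy values below are hypothetical.

# Hedged sketch of a trellis-based dynamic program for energy-efficient mapping
# of a linear pipeline of tasks (illustrative only; the costs are made up).
def min_energy_mapping(exec_energy, trans_energy):
    """exec_energy[i][s]: energy of task i when executed in state s.
    trans_energy[s][t]: transition energy when task i runs in state s
    and task i+1 runs in state t (communication/reconfiguration)."""
    n = len(exec_energy)
    num_states = len(exec_energy[0])
    best = list(exec_energy[0])          # best[s]: min energy up to task 0 ending in state s
    back = []                            # back[i-1][t]: best predecessor state for task i
    for i in range(1, n):
        new_best, new_back = [], []
        for t in range(num_states):
            s_star = min(range(num_states),
                         key=lambda s: best[s] + trans_energy[s][t])
            new_back.append(s_star)
            new_best.append(best[s_star] + trans_energy[s_star][t]
                            + exec_energy[i][t])
        best, back = new_best, back + [new_back]
    # back-trace the minimum-energy state assignment along the trellis
    t = min(range(num_states), key=lambda s: best[s])
    assignment = [t]
    for i in range(n - 1, 0, -1):
        t = back[i - 1][t]
        assignment.append(t)
    return min(best), list(reversed(assignment))

exec_e = [[5.0, 2.0], [1.0, 4.0], [3.0, 3.5]]    # 3 tasks x 2 states, hypothetical energies
trans_e = [[0.0, 1.5], [1.5, 0.0]]               # hypothetical transition costs
print(min_energy_mapping(exec_e, trans_e))       # -> (7.5, [1, 0, 0])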
• A two-step rapid energy estimation technique and high-level design space exploration
Based on the high-level hardware-software co-simulation framework proposed above,
we employ an instruction-level energy estimation technique and a domain-specific modeling
technique to provide rapid and fairly accurate energy estimation of hardware-software
co-designs using reconfigurable hardware.
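As a minimal illustration of the instruction-level half of this two-step approach (the domain-specific hardware models are not shown, and the per-instruction energy values below are invented placeholders rather than measured numbers from Chapter 5), software energy can be estimated by summing look-up-table entries over the executed instruction trace:

# Hedged sketch of instruction-level energy estimation for software execution.
# In practice the per-instruction energies come from low-level profiling of the
# soft processor; the values used here are placeholders.
INSTR_ENERGY_NJ = {"add": 1.2, "mul": 3.4, "lw": 2.8, "sw": 2.6, "br": 1.0}

def estimate_software_energy(instruction_trace):
    """Sum look-up-table energies over an executed instruction trace."""
    return sum(INSTR_ENERGY_NJ[op] for op in instruction_trace)

trace = ["lw", "lw", "mul", "add", "sw", "br"]
print(f"estimated energy: {estimate_software_energy(trace):.1f} nJ")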
• Hardware-software co-design for energy efficient implementations of operating systems
By utilizing the "configurability" of soft processors, we delegate some management functionalities
of the operating systems to a set of hardware peripherals tightly integrated with the soft
processors. These hardware peripherals also manage the operating states of the various hardware
components, including the soft processor. Example designs and illustrative examples are
provided to show that a significant amount of energy can be saved using the hardware-software
co-design technique.
1.3 Thesis organization
Our contributions in the various aspects leading to energy e±cient application synthesis using
FPGAs are discussed in separate chapters of the thesis. Chapter 2 discusses the background
and up-to-date information of recon¯gurable hardware. Chapter 3 presents the framework for
high-level hardware-software development using FPGAs. The development of recon¯gurable
hardware platforms that include multiple general-purpose processors using the proposed devel-
opmentframeworkisalsodiscussedinthischapter. Anenergyperformancemodelingtechnique
for recon¯gurable hardware platforms that integrates heterogeneous hardware components as
wellasadynamic-programmingbasedperformanceoptimizationtechniquebasedontheenergy
models are proposed in Chapter 4. Based on the work in Chapter 3 and a domain-speci¯c en-
ergy modeling performance modeling technique, a two-step rapid energy estimation technique
andadesignspaceexplorationframeworkarediscussedinChapter5. Chapter6putsforwarda
8
hardware-software co-design for energy e±cient implementations of operating systems. Finally,
we conclude in Chapter 7. The future research directions are also discussed in this chapter.
9
Chapter 2
Reconfigurable hardware
2.1 Reconfigurable System-on-Chips (SoCs)
Reconfigurable hardware has evolved into reconfigurable system-on-chip devices, integrated
with multi-million gates of configurable logic and various heterogeneous hardware components.
The architectures of the configurable resources and the embedded hardware components
are discussed in the following sections.
2.1.1 Field-programmable gate arrays
Field-programmable gate arrays (FPGAs) are defined as programmable devices containing
repeated fields of small logic blocks and elements. They are the major components of
reconfigurable SoC devices. VLSI (Very Large Scale Integrated Circuits) devices can be classified as
shown in Figure 2.1.
Figure 2.1: Classification of VLSI devices
FPGAs are a member of the class of devices called Field-Programmable
Logic (FPL). It can be argued that FPGAs are an ASIC (Application-Specific Integrated
Circuit) technology. However, traditionally defined ASICs require additional semiconductor
processing steps beyond those required by FPGAs. These additional steps provide performance
advantages over FPGAs while introducing high non-recurring engineering (NRE) costs. Gate
arrays, on the other hand, typically consist of a "sea of NAND gates" whose functions are
provided by the customers in a "wire list". The wire list is used during the fabrication process
to achieve the distinct definition of the final metal layer. The designer of an FPGA solution,
however, has full control over the actual design implementation without the need (and delay)
for any physical IC (Integrated Circuit) fabrication facilities.
2.1.1.1 Classifications
Field-programmable logic is available in virtually all memory technologies: SRAM, EPROM,
E²PROM, and antifuse. The specific technology defines whether the field-programmable logic
device is re-programmable or one-time programmable, as well as the way it is programmed. Most
modern field-programmable devices support both single-bit and multi-bit configuration bitstreams.
Compared with multi-bit streams, single-bit streams can reduce the wiring requirements, but
they also increase the programming time (typically in the range of ms). The end users can
choose the format of the bitstreams based on the start-up time and reconfiguration requirements
of the specific application.
Table 2.1: Various FPL technologies

Technology               SRAM             EPROM     E²PROM    Anti-fuse   Flash
Re-programmable          yes              yes       yes       no          yes
In-system programmable   yes              no        yes       no          yes
Volatile                 yes              no        no        no          no
Copy protected           no               yes       yes       yes         yes
Example devices          Xilinx Virtex    Altera    AMD       Actel       Xilinx
                         (II/II Pro/4),   MAX5K,    MACH,     ACT,        XC9500
                         Altera           Xilinx    Altera    Cypress
SRAM devices, the dominant technology for FPGAs, are based on static CMOS memory
technology, and are re-programmable and in-system programmable. They require, however,
an external "boot" device for storing the configuration bitstreams. The greatest advantage
of SRAM-based field programmable devices is that they usually offer a higher computation
capability and a better energy efficiency than reconfigurable devices built with other memory
technologies. Electrically programmable read-only memory (EPROM) devices are usually used
in a one-time CMOS programmable mode because of the need to use ultraviolet light for
erasing the existing configuration. CMOS electrically erasable programmable read-only memory
(E²PROM) devices have the advantage of a short set-up time. Because the programming information
is not "downloaded" to the device, it is better protected against unauthorized use. A recent
innovation in the memory technologies for building reconfigurable hardware is called "Flash"
memory and is based on the EPROM technology. These "Flash" based reconfigurable devices
are usually viewed as "page-wise" in-system reprogrammable devices with physically smaller
cells equivalent to those of E²PROM devices.
Finally, the pros and cons of reconfigurable hardware based on different memory technologies
are summarized in Table 2.1.
Table 2.2: Logic resources in a CLB

Slices   LUTs   Flip-flops   MULT ANDs   Arithmetic Carry-Chains   Distributed RAM   Shift Registers
4        8      8            8           2                         64 bits           64 bits
2.1.1.2 Configurable logic blocks
For SRAM based FPGA devices, the basic units are configurable logic blocks. These
configurable logic blocks usually have a similar architecture. They contain one or more look-up tables
(LUTs), various types of multiplexers, flip-flop registers, as well as other hardware components.
A k-input LUT requires 2^k SRAM cells and a 2^k-input multiplexer. A k-input LUT can
implement any function of k inputs. Previous research has shown that LUTs with 4 inputs lead to
FPGAs with the highest area-efficiency [10]. Thus, for most commercial FPGAs, the LUTs
accept four one-bit inputs and thus can implement the truth table of any function with
four inputs.
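As an illustration of the LUT abstraction described above, the following Python sketch models a k-input LUT as a 2^k-entry table addressed by its inputs; the example configuration (a 4-input XOR) is arbitrary.

# Behavioral sketch of a k-input LUT: 2**k configuration bits addressed by the
# k input bits. Configured here (arbitrarily) to implement a 4-input XOR.
class LUT:
    def __init__(self, k, truth_table):
        assert len(truth_table) == 2 ** k   # one SRAM cell per truth-table entry
        self.k, self.table = k, truth_table

    def evaluate(self, *inputs):
        # The k inputs form the address into the 2**k-entry table
        # (i.e., they drive the select lines of a 2**k-input multiplexer).
        address = sum(bit << i for i, bit in enumerate(inputs))
        return self.table[address]

xor4 = LUT(4, [bin(a).count("1") & 1 for a in range(16)])
print(xor4.evaluate(1, 0, 1, 1))   # -> 1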
To reduce the time for synthesis and place-and-route, the configurable logic blocks are
usually evenly distributed throughout the device. Accordingly, there is a regular pattern in
the programmable interconnection network for connecting these configurable logic blocks (see
Section 2.1.1.3 for more discussion of the interconnection network).
As modern FPGAs contain up to multi-million gates of configurable logic, the configurable
logic blocks are organized in a hierarchical architecture. Taking the Xilinx Virtex series FPGAs
as an example, each CLB contains four slices and each slice contains two four-input LUTs. There
are local interconnects between the two LUTs within a slice and between the slices within
a CLB. The arrangement of the slices within a CLB is shown in Figure 2.2.
In the most recent Virtex-4 FPGA devices, as shown in Table 2.2, each CLB contains various
hardware resources. The architectures and functionalities of these various hardware resources
within a configurable logic block are discussed in detail in the following.
Figure 2.2: Arrangement of slices within a Xilinx Virtex-4 CLB
• Look-up tables: Virtex-4 function generators are implemented as 4-input look-up tables
(LUTs). There are four independent inputs for each of the two function generators in a slice
(F and G). The function generators are capable of implementing any arbitrarily defined four-
input Boolean function. The propagation delay through a LUT is independent of the function
implemented. Signals from the function generators can exit the slice (through the X or Y
output), enter the XOR dedicated gate, enter the select line of the carry-logic multiplexer, feed
the D input of the storage element, or go to the MUXF5.
As shown in Figure 2.3, the two 16-bit look-up tables within a slice can also be configured
as a 16-bit shift register or a 16-bit slice-based RAM. As is discussed in Section 2.1.1.3, dedicated
routing resources are available among the slices for cascading and extending these shift registers
and slice-based RAMs. Therefore, shift registers and slice-based RAMs of various lengths and
dimensions can be configured on FPGA devices depending on the application requirements.
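A rough behavioral sketch of the shift-register mode mentioned above is given below; it is illustrative only and does not reproduce the exact semantics of the Xilinx primitives. Cascading two 16-bit LUT shift registers through the dedicated routing yields a 32-bit shift register.

# Behavioral sketch of a LUT used as a 16-bit shift register, with a cascade
# output so that several LUTs can be chained into longer registers.
from collections import deque

class LUTShiftRegister:
    def __init__(self, depth=16):
        self.bits = deque([0] * depth, maxlen=depth)

    def clock(self, data_in):
        cascade_out = self.bits[-1]     # oldest bit feeds the next LUT in the chain
        self.bits.appendleft(data_in)
        return cascade_out

    def read(self, address):
        return self.bits[address]       # dynamically addressable tap

# Two cascaded 16-bit LUT shift registers behave as one 32-bit shift register.
first, second = LUTShiftRegister(), LUTShiftRegister()
for bit in [1, 0, 1, 1]:
    second.clock(first.clock(bit))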
Figure 2.3: Virtex-II Pro slices within a CLB
• Multiplexers: In addition to the basic LUTs, the Virtex-4 slices contain various types of
multiplexers (MUXF5 and MUXFX). These multiplexers are used to combine two LUTs so that
any function of five, six, seven, or eight inputs can be implemented in a CLB. The MUXFX
is either MUXF6, MUXF7, or MUXF8 according to the position of the slice in the CLB. The
MUXFX can also be used to map any function of six, seven, or eight inputs and selected wide
logic functions. Functions with up to nine inputs (MUXF5 multiplexer) can be implemented
in one slice. Wide function multiplexers can effectively combine LUTs within the same CLB or
across different CLBs, making logic functions with even more input variables.
• Flip-flop registers and latches: Each slice has two flip-flop registers/latches. The registers
are useful for development of pipelined designs.
2.1.1.3 Routing resources
There are three major kinds of routing architectures for commercial FPGA devices. The FPGAs
manufactured by Xilinx, Lucent, and Vantis employ an island-style routing architecture; the
FPGAs manufactured by Actel employ a row-based routing architecture; and the FPGAs from
Altera use a hierarchical routing architecture. In the following, we focus on the island-style
routing architecture to illustrate the routing resources available on modern FPGA devices.
The routing architecture of an island-style FPGA device is shown in Figure 2.4. Logic
blocks are surrounded by routing channels of pre-fabricated wiring segments on all four sides.
Figure 2.4: An FPGA device with island-style routing architecture
The input or output pin of a logic block can connect to some or all of the wiring segments in the
channel adjacent to it via a connection block of programmable switches. At every intersection
of a horizontal channel and a vertical channel, there is a switch block. The switch block is
a set of programmable switches that allow some of the wire segments incident to the switch
block to be connected to other wire segments. By turning on the appropriate switches, short
wire segments can be connected together to form longer connections. Note that there are some
wire segments which remain unbroken through multiple switch blocks. These longer wires span
multiple configurable logic blocks.
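The routing process described above can be viewed abstractly as a search over a graph whose nodes are wire segments and whose edges are programmable switches; the following toy Python sketch (with made-up segment names) illustrates how turning on the switches along a path connects a source to a sink.

# Toy sketch of island-style routing: wire segments are graph nodes and each
# programmable switch is a potential edge; "turning on" the switches along a
# shortest path connects a source pin to a sink pin. Purely illustrative.
from collections import deque

def route(segments, switches, source, sink):
    """segments: iterable of segment ids; switches: set of (seg_a, seg_b) pairs."""
    adjacency = {seg: [] for seg in segments}
    for a, b in switches:
        adjacency[a].append(b)
        adjacency[b].append(a)
    parent, frontier = {source: None}, deque([source])
    while frontier:                      # breadth-first search over segments
        seg = frontier.popleft()
        if seg == sink:
            path = []
            while seg is not None:
                path.append(seg)
                seg = parent[seg]
            return list(reversed(path))  # switches to turn on lie between pairs
        for nxt in adjacency[seg]:
            if nxt not in parent:
                parent[nxt] = seg
                frontier.append(nxt)
    return None

print(route(["A", "B", "C", "D"], {("A", "B"), ("B", "C"), ("C", "D")}, "A", "D"))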
We will take a close look at the popular Xilinx Virtex-II Pro devices to illustrate the modern
island-style routing architecture. Virtex-II Pro devices have fully buffered programmable
interconnections, with the number of routing resources counted between any two adjacent switch
matrix rows or columns. With the buffered interconnections, fanout has minimal impact on the
timing performance of each net.
There is a hierarchy of routing resources for connecting the configurable logic blocks as well
as the other embedded hardware components. The global and local routing resources are described
as follows.
• Long lines: the long lines are bidirectional wires that distribute signals across the device.
Vertical and horizontal long lines span the full height and width of the device.
• Hex lines: the hex lines route signals to every third or sixth block away in all four
directions. Organized in a staggered pattern, hex lines can only be driven from one end. Hex-line
signals can be accessed either at the endpoints or at the middle point (i.e., three blocks from
the source).
• Double lines: the double lines route signals to every first or second block away in all four
directions. Organized in a staggered pattern, double lines can be driven only at their endpoints.
Double-line signals can be accessed either at the endpoints or at the midpoint (one block away
from the source).
• Direct connect lines: the direct connect lines route signals to neighboring blocks: vertically,
horizontally, and diagonally.
• Fast connect lines: the fast connect lines are the internal CLB local interconnects from
LUT outputs to LUT inputs. As shown in Figure 2.5, a 16x1D dual-port distributed RAM can
be constructed using the fast connect lines within a CLB.
In addition to the global and local routing resources, dedicated routing signals are available.
• There are eight global clock nets per quadrant. These clock nets distribute the clock
signals to the synchronous elements within the quadrant.
• Horizontal routing resources are provided for on-chip 3-state buses. Four partitionable
bus lines are provided per CLB row, permitting multiple buses within a row.
• Two dedicated carry-chain resources per slice column (two per CLB column) propagate
carry-chain MUXCY output signals vertically to the adjacent slice.
Figure 2.5: Dual-port distributed RAM (RAM16x1D)
• One dedicated SOP (Sum of Products) chain per slice row (two per CLB row) propagates
ORCY output logic signals horizontally to the adjacent slice.
• As shown in Figure 2.6, there are dedicated routing wires which can be used to build cascadable
shift register chains. These dedicated routing wires connect the output of a LUT in shift-register
mode to the input of the next LUT in shift-register mode (vertically) inside the CLB.
2.1.2 Pre-compiled embedded hardware components
It is becoming popular to integrate pre-compiled embedded hardware components into a single
FPGA device.
2.1.2.1 Embedded general-purpose processors
As shown in Figure 2.7, the PowerPC 405 processor embedded in Virtex-4 consists of two major
components: the PowerPC 405 core and the auxiliary functional unit (APU).
Figure 2.6: Cascadable shift registers
• Pre-compiled processor cores
The PowerPC 405 core is responsible for the execution of embedded software programs. The
core contains four major components: the central processing unit (CPU), the memory management
unit (MMU), the cache unit, and the timer and debug unit. The PowerPC 405 CPU implements a 5-stage
instruction pipeline, which consists of fetch, decode, execute, write-back, and load write-back
stages. The PowerPC 405 cache unit implements separate instruction-cache and data-cache
arrays. Each of the cache arrays is 16 KB in size, two-way set-associative with 8-word (32-byte)
cache lines. Both of the cache arrays are non-blocking, allowing the PowerPC 405 processor to
overlap instruction execution with reads over the PLB (when cache misses occur).
Figure 2.7: Pipeline flow architecture of the PowerPC 405 Auxiliary Processing Unit
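A quick arithmetic check of the cache geometry quoted above (16 KB per array, two-way set-associative, 32-byte lines):

# Quick arithmetic check of the PowerPC 405 cache geometry quoted in the text.
cache_bytes = 16 * 1024
ways = 2
line_bytes = 32

lines = cache_bytes // line_bytes          # 512 cache lines per array
sets = lines // ways                       # 256 sets
print(lines, sets)                         # -> 512 256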
• Instruction extension
The architecture of the APU controller is illustrated in Figure 2.7. The FCM (Fabric Co-
processing Module) interface is a Xilinx adaptation of the native Auxiliary Processor Unit
interface implemented on the IBM processor core. The hard core APU controller bridges the
PowerPC 405 APU interface and the external FCM interface.
Using the APU controller, the user can extend the instruction set of the embedded PowerPC
405 processor through a tight integration of the processor with the surrounding configurable
logic and interconnect resources. The actual execution of the extended instructions is performed
by the FCM. There are two types of instructions that can be extended using an FCM, that is,
pre-defined instructions and user-defined instructions. A pre-defined instruction has its format
defined by the PowerPC instruction set. A floating-point unit (FPU) which communicates
with the PowerPC 405 processor core through the FCM interface is a good illustration of
APU instruction decoding. While the embedded PowerPC 405 only supports fixed-point
computation instructions, floating-point related instructions are also defined in the PowerPC
405 instruction set. The APU controller handles the decoding of all PowerPC floating-point
related instructions while the FCM is responsible for calculating the floating-point results. In
contrast, a user-defined instruction has a configurable format and is a much more flexible way of
extending the PowerPC instruction set architecture (ISA). The decoding of the two types of
FCM instructions discussed above can be done by the APU controller or by the FCM. By APU
controller decoding, we mean that the APU controller determines what CPU resources are
needed for the instruction execution and passes this information to the CPU. For example, the
APU controller will determine if an instruction is a load, a store, or if it needs source data from
the GPR, etc. In the case of FCM instruction decoding, the FCM can also perform this part of
the decoding and pass the needed information to the APU controller. In the case that an FCM needs
to determine the complete function of the instruction, the FCM must perform a full instruction
decode as well as handle all instruction execution.
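The division of decoding work described above can be summarized as a simple dispatch. The Python sketch below is schematic only; the instruction names are placeholders rather than the actual PowerPC opcode map.

# Schematic sketch of where FCM instruction decoding happens (illustrative;
# the instruction names below are placeholders, not the real opcode map).
PREDEFINED_FPU_OPS = {"fadd", "fmul", "fdiv"}      # format fixed by the PowerPC ISA
USER_DEFINED_OPS = {"udi0", "udi1"}                # configurable-format instructions

def decode_responsibility(op, fcm_decodes_udi=False):
    if op in PREDEFINED_FPU_OPS:
        # The APU controller recognizes the PowerPC-defined format and tells the
        # CPU which resources (loads, stores, GPR source data) are needed.
        return "APU controller decodes; FCM executes"
    if op in USER_DEFINED_OPS:
        return ("FCM performs full decode and execution" if fcm_decodes_udi
                else "APU controller partially decodes; FCM executes")
    return "ordinary PowerPC instruction; handled by the CPU"

print(decode_responsibility("fadd"))
print(decode_responsibility("udi0", fcm_decodes_udi=True))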
Besides, the APU controller performs clock domain synchronization between the PowerPC
clock and the relatively slow clocks of the FCM. The FCMs are realized using the configurable
logic and routing resources available on the FPGA device as well as other pre-compiled hardware
resources. Limited by the configurable hardware resources, the FCMs usually have slower
maximum operating frequencies than the pre-compiled embedded general-purpose processor. In
many practical application developments, the embedded processor and the FCM operate
at different clock frequencies. In this case, the APU controller synchronizes the operations in
the different clock domains.
Figure 2.8: IBM CoreConnect bus architecture
• Bus hierarchy: The embedded PowerPC 405 processor supports the IBM CoreConnect bus
suite, which is shown in Figure 2.8. This includes PLB (Processor Local Bus) buses for connecting
high-speed, high-performance hardware peripherals, OPB (On-chip Peripheral Bus) buses
for connecting low-speed, low-performance hardware peripherals (e.g., the UART serial
communication hardware peripheral, etc.), and DCR (Device Control Register) buses for querying the
status of the attached hardware peripherals.
2.1.2.2 Pre-compiled DSP blocks
Embedded DSP blocks are available on most modern field-programmable devices for performing
the computation intensive arithmetic operations required by many digital signal and image
processing applications. These DSP blocks are buried among the configurable logic blocks and
are usually uniformly distributed throughout the devices. The Xilinx Virtex-II/Virtex-II Pro
series and Spartan-3 series FPGAs offer up to 556 dedicated multipliers on a single chip. Each
of these embedded multipliers can finish a signed 18-bit-by-18-bit multiplication in one clock
cycle at an operating frequency of up to more than 350 MHz. This provides modern FPGA
devices with an exceptional computation capability.
One example of such DSP blocks is the DSP48 block available on Xilinx Virtex-4 FPGAs. The architecture of the DSP48 block is shown in Figure 2.9.

Figure 2.9: Architecture of DSP48 slices on Xilinx Virtex-4

These DSP48 blocks can be configured to efficiently perform a wide range of basic mathematical functions, including adders, subtracters, accumulators, MACCs (multiply-and-accumulates), multiply multiplexers, counters, dividers, square-root functions, and shifters.
There are optional pipeline stages within the DSP48 blocks to support high-performance arithmetic functions. The inputs and outputs of the DSP48 blocks can be stored in dedicated registers within the tiles. In addition, the DSP48 blocks are organized in columns, with dedicated routing resources among the DSP48 blocks within a column. These dedicated routing resources provide fast routing between DSP48 blocks while significantly reducing routing congestion in the FPGA fabric.
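For reference, the core operation that a DSP48 tile accelerates is the multiply-and-accumulate. The following plain C kernel is a functional sketch of that operation (illustration only; the operand widths loosely mirror the 18-bit signed multiplier and the 48-bit accumulator of the slice).

    /* Scalar C sketch of the multiply-and-accumulate (MACC) operation that a
     * DSP48 tile performs. Shown for illustration only; int32_t operands and
     * an int64_t accumulator loosely mirror the 18x18-bit signed multiplier
     * and the 48-bit accumulator of the slice. */
    #include <stdint.h>

    int64_t macc(const int32_t *a, const int32_t *b, int n)
    {
        int64_t acc = 0;                     /* models the wide accumulator */
        for (int i = 0; i < n; i++)
            acc += (int64_t)a[i] * b[i];     /* signed product, accumulated */
        return acc;
    }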
2.1.3 Soft processors
2.1.3.1 Definition and overview
FPGA-configured soft processors, which are RISC (Reduced Instruction-Set Computer) processors realized using the configurable resources available on FPGA devices, have become popular in recent years. Examples of such soft processors include Nios from Altera [3], the ARM7-based CoreMP7 from Actel [1], OpenRISC from OpenCores [41], the SPARC architecture based LEON3 [23], and MicroBlaze from Xilinx [79].
• 3-D design trade-offs
One advantage of designing with soft processors is that they provide new design trade-offs by time-sharing the limited hardware resources (e.g., configurable logic blocks on Xilinx FPGAs) available on the devices. Many management functionalities and computations with tightly coupled data dependencies between calculation steps (e.g., many recursive algorithms such as the Levinson-Durbin algorithm [32]) are inherently more suitable for software implementations on processors than for the corresponding customized (parallel) hardware implementations. Their software implementations are more compact and require a much smaller amount of hardware resources. Such compact designs using soft processors can effectively reduce the static energy dissipation of the complete system by fitting into smaller FPGA devices [75].
• "Configurability"
Most importantly, soft processors are "configurable": they allow the customization of the instruction set and/or the attachment of customized hardware peripherals in order to speed up the computation of algorithms with a large degree of parallelism and/or to efficiently perform auxiliary management functionalities, as described in Chapter 6. The Nios processor allows users to customize up to five instructions. The MicroBlaze processor supports various dedicated communication interfaces for attaching customized hardware peripherals.
• On-chip multi-processor development
There is a recent trend of using configurable multi-processor platforms on FPGA devices for application development. These multi-processor platforms consist of multiple FPGA-configured soft processors working together in a cooperative manner (e.g., the bus-topology multi-processor platform employed by James-Roxby et al. [37] and the 2-D mesh multi-processor platform from CrossBow Technologies [24]).

Figure 2.10: Architecture of CoreMP7 soft processor

Figure 2.11: Architecture of MicroBlaze soft processor

There are several advantages offered by configurable multi-processor platforms. One advantage is the reuse of legacy code, which can greatly reduce the development cycle. There are many existing C programs for realizing various computations and for coordinating the computations performed by multiple processors. These C programs can be easily ported to configurable multi-processor platforms with little or no change. Another advantage is that multi-processor platforms implemented using FPGAs are "configurable" and highly extensible, allowing the attachment of customized hardware peripherals. In addition to distributing the software execution among the multiple processors, the application designer can customize the hardware platform and optimize its performance for the applications running on it. Two major types of customization can be applied to configurable multi-processor platforms for performance optimization. (1) Attachment of hardware accelerators: The application designer can optimize the performance of a single soft processor by adding dedicated instructions and/or attaching customized hardware peripherals to it. These dedicated instructions and customized hardware peripherals are used as hardware accelerators to speed up the execution of the time-consuming portions of the target application (e.g., the computation of the Fast Fourier Transform and the Discrete Cosine Transform). The Nios processor allows users to customize up to five instructions. The MicroBlaze processor and the LEON3 processor support various dedicated interfaces and bus protocols for attaching customized hardware peripherals. The LEON3 processor even allows the application designer to have fine control over the cache organization. Cong et al. show that adding customized instructions to the Nios processor using a shadow register technique results in an average 2.75x execution time speed-up for several data-intensive digital signal processing applications [17]. Also, by adding customized hardware peripherals, the FPGA-configured VLIW (Very-Long Instruction Word) processor proposed in [39] achieves an average 12x speed-up for several digital signal processing applications from the MediaBench benchmark [43]. The customization of a single soft processor is further discussed in Section 2.1.3.2. (2) Application-specific communication network and task management components among the multiple processors: Based on the application requirements, the end user can instantiate multiple soft processors, connect them with specific topologies and communication protocols, and distribute the various computation tasks of the target application among these processors. James-Roxby et al. develop a configurable multi-processor platform for executing a JPEG2000 encoding application [18]. They construct the multi-processor platform by instantiating multiple MicroBlaze processors and connecting them with a bus topology. They then employ a Single-Program-Multiple-Data (SPMD) programming model to distribute the processing of the input image data among the multiple MicroBlaze processors. Their experimental results in [37] show that a significant speed-up is achieved for the JPEG2000 application on the multi-processor platform compared with execution on a single processor. Jin et al. build a configurable multi-processor platform using multiple MicroBlaze processors arranged as multiple pipelined arrays for performing IPv4 packet forwarding [38]. Their multi-processor platform achieves a 2x performance improvement compared with a state-of-the-art network processor.
2.1.3.2 Customization
There are two major ways of customizing soft processors, which are discussed as follows.
• Tight coupling through instruction customization
The instructions that are not needed by the target applications, together with the corresponding hardware components, can be removed from the soft processors in order to save hardware resources and improve timing and energy performance. Considering the architecture of the MicroBlaze processor shown in Figure 2.11, the MicroBlaze processor can be configured without the floating-point unit (FPU) and the related instructions for many image processing applications, which usually only require fixed-point operations. Removing the floating-point unit and the related instructions can reduce the hardware resources required by MicroBlaze by more than 60% [79].
There are two ways of extending the instruction set of soft processors. One way is by sharing register files between the soft processors and the customized hardware peripherals. One example is the MicroBlaze processor, as shown in Figure 2.12. MicroBlaze provides a set of reserved instructions and the corresponding bus interfaces that allow the attachment of customized hardware peripherals. The main advantage of the register-file based approach is ease of development. The design of the attached hardware peripherals is independent of the soft processors and does not affect their performance when executing the embedded software programs or interacting with other hardware peripherals.
The other way of instruction extension is to couple the customized hardware peripherals with the ALU (Arithmetic Logic Unit) of the processor in parallel. This approach is more flexible and allows a tighter integration of the hardware peripherals with the processor. However, extra attention is required to make sure that the attached hardware peripherals do not become the critical path of the processor and thus do not affect the maximum operating frequency that can be achieved by the soft processor core. The end users are responsible for properly pipelining the hardware peripherals to ensure the overall timing performance of the soft processor.
• Hierarchical coupling through shared bus interfaces
Various customized hardware peripherals with different processing capabilities and data transmission requirements may be connected to the processors. Hierarchical coupling ensures efficient attachment of the various customized hardware peripherals to the processors by grouping them into different hierarchies. The two de facto hierarchical bus architectures are discussed as follows.
• The CoreConnect bus architecture includes the Processor Local Bus (PLB), the On-chip Peripheral Bus (OPB), a bus bridge, two arbiters, and a Device Control Register (DCR) bus [36]. It is an IBM proprietary on-chip bus communication link that enables chip designs from multiple sources to be interconnected to create new chips. It is widely adopted by the PowerPC 405 processors embedded in the Xilinx Virtex-II Pro and Virtex-4 FPGAs as well as by the MicroBlaze soft processors. The PLB bus is used for tight integration of high-performance hardware peripherals with the processor core, while the OPB bus is used for connecting low-speed hardware peripherals to the processor core. Both the PLB and OPB buses form a star network for the hardware peripherals connected to them. In contrast, the DCR bus connects the hardware peripherals in a serial manner. Thus, its performance and response time degrade in proportion to the number of hardware peripherals connected to it. The DCR bus is mainly used for querying the status of the hardware peripherals.

Figure 2.12: Extension of MicroBlaze instruction set through FSL interfaces
• The Advanced Microcontroller Bus Architecture (AMBA) is an open bus architecture standard. Similar to the CoreConnect bus architecture, AMBA defines a multilevel busing system with a system bus and a lower-level peripheral bus. These include two system buses, the Advanced High-performance Bus (AHB) and the Advanced System Bus (ASB), and the Advanced Peripheral Bus (APB). The AHB bus can be used to connect embedded processors such as an ARM core to high-performance peripherals, DMA (Direct Memory Access) controllers, on-chip memory, and interfaces. The APB is designed with a simpler bus protocol and is mainly used for ancillary or general-purpose peripherals.

Figure 2.13: Extension of Nios instruction set
2.2 Design flows
The design flows for application development using reconfigurable hardware can be roughly classified into low-level register-transfer/gate-level design flows and high-level design flows, which are discussed as follows.
2.2.1 Low-level design flows
The low-level design flows for application development using reconfigurable hardware are shown in Figure 2.14. The end user uses hardware description languages (HDLs), such as VHDL and Verilog, to describe the low-level (i.e., register-transfer and/or gate-level) architecture and behavior of the target system. In order to satisfy specific application requirements and achieve higher performance, design constraints are required in addition to these low-level design descriptions. There are two kinds of design constraints. One kind is area constraints, which include the locations of the input/output pins and the hardware bindings of some operations. For example, some multiplication operations are performed using the embedded multipliers, and some storage is implemented using the embedded memory blocks. The other kind is timing constraints. The user can analyze and identify the timing-critical paths of the low-level implementations and add timing constraints that force the place-and-route tools to put higher priority and more effort on meeting the critical timing constraints.

After synthesizing and placing-and-routing the low-level designs, the end user can perform functional and architectural simulation to further verify the correctness of the system. Once the functional correctness of the system is verified, the end user can implement the low-level designs using the low-level design descriptions and the design constraints.

Figure 2.14: Low-level design flow
2.2.2 High-level design flows
As modern reconfigurable hardware is being used to implement many complex systems, low-level design flows can be inefficient in many cases. High-level design flows based on high-level modeling environments (such as MATLAB/Simulink) and high-level languages (such as C/C++ and Java) are becoming popular for application development using reconfigurable hardware.

One important kind of high-level design tool is based on MATLAB/Simulink, such as System Generator from Xilinx and DSP Builder from Altera. The design flow of System Generator is illustrated in Figure 2.15. System Generator provides an additional set of blocks, accessible in the Simulink library browser, for describing hardware designs deployed onto Xilinx FPGAs. A screenshot of the Xilinx block set is shown in Figure 2.16. The System Generator blocks can interact and be co-simulated with other Simulink blocks. Only blocks and subsystems consisting of blocks from the Xilinx block set are translated by System Generator into hardware implementations. Each System Generator block is associated with a parameterization GUI (Graphical User Interface). These GUIs allow the user to specify the high-level behavior of a block as well as its low-level implementation options (e.g., target FPGA device, target system clock period, etc.). The automatic translation from the Simulink model into a hardware realization is accomplished by mapping the Xilinx blocks onto IP (Intellectual Property) library modules, inferring control signals and circuitry from system parameters (e.g., sample periods and data types), and finally converting the Simulink hierarchy into a hierarchical VHDL netlist. In addition, System Generator creates the necessary command files to generate the IP block netlists using CORE Generator, invokes CORE Generator, and creates project and script files for HDL simulation, synthesis, technology mapping, placement, routing, and bit-stream generation. To ensure efficient compilation of multi-rate systems, System Generator creates constraint files for the physical implementation tools. System Generator also creates an HDL test bench for the generated realization, including test vectors computed during Simulink simulation. The end user can use tools such as ModelSim [46] for further verification.

Figure 2.15: Design flow of System Generator

Figure 2.16: The high-level block set provided by System Generator
Another important kind of high-level design tool is based on high-level languages, such as C/C++ and MATLAB M-code.
• ImpulseC from Impulse Accelerated Technologies, Inc.
ImpulseC extends standard C to support a modified form of the communicating sequential processes (CSP) programming model. The CSP programming model is conceptually similar to data-flow based models. Compared with data-flow programming models, CSP-based programming models put more focus on simplifying the expression of highly parallel algorithms through the use of well-defined data communication, such as message passing and synchronization mechanisms.
In ImpulseC applications, hardware and software elements are described as processes and communicate with each other primarily through buffered data streams that are implemented directly in hardware. The data buffering, which is implemented using FIFOs that are specified and configured by the ImpulseC programmer, makes it possible to write parallel applications at a relatively high level of abstraction. ImpulseC application developers are thus freed from clock-cycle-by-clock-cycle synchronization.
ImpulseC is mainly designed for dataflow-oriented applications, but it is also flexible enough to support alternate programming models, including the use of shared memory as a communication mechanism. The programming model that an ImpulseC programmer selects depends on the requirements of the application as well as on the architectural constraints of the selected programmable platform target. In particular, ImpulseC is suitable for developing applications that have the following properties:

• The application features high data rates to and from different data sources and between processing elements.

• Data sizes are fixed (typically one byte to one word), with a relatively small stream payload to prevent processes from becoming blocked.

• Multiple related but independent computations are required to be performed on the same data stream.

• The data consists of low- or fixed-precision data values, typically fixed-width integers or fixed-point fractional values.

• There are references to local or shared memories, which may be used for storage of data arrays, coefficients, and other constants, and to deposit the results of computations.

• There are multiple independent processes communicating primarily through the data being passed, with occasional synchronization being requested via messages.
The ImpulseC library provides minimal extensions to the C language (in the form of new data types and intrinsic function calls) that allow multiple parallel program segments to be described, interconnected, and synchronized. The ImpulseC compiler translates and optimizes ImpulseC programs into appropriate lower-level representations, including RTL-level VHDL that can be synthesized to FPGAs, and standard C (with associated library calls) that can be compiled onto supported microprocessors using widely available C cross-compilers.
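To make the streaming process model concrete, the following sketch shows a producer/consumer process pair written in plain C against a hypothetical stream API. The stream_t type and the stream_read/stream_write calls are placeholders introduced for illustration; they are not the actual ImpulseC intrinsic functions.

    /* Minimal sketch of the CSP-style streaming model described above.
     * stream_t, stream_read(), and stream_write() are hypothetical
     * placeholders, not the actual ImpulseC intrinsics. */
    #include <stdint.h>

    typedef struct stream stream_t;                  /* opaque FIFO channel */
    int  stream_read(stream_t *s, int32_t *value);   /* blocks until data   */
    void stream_write(stream_t *s, int32_t value);   /* blocks when full    */

    /* Producer process: pushes a block of samples into the stream. */
    void producer(stream_t *out, const int32_t *samples, int n)
    {
        for (int i = 0; i < n; i++)
            stream_write(out, samples[i]);
    }

    /* Consumer process: scales each sample; would map to hardware. */
    void consumer(stream_t *in, stream_t *out)
    {
        int32_t x;
        while (stream_read(in, &x))
            stream_write(out, x * 3);
    }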
In addition to ImpulseC, there are other C-based design tools, such as Catapult C from Mentor Graphics [47] and PICO Express from Synfora []. In contrast to ImpulseC, these design tools are based on standard C code rather than a minimal extension of it. A specific coding style is required for the C code in order to derive the desired low-level hardware implementations. The similarity with ImpulseC is that these C-based design tools can greatly speed up the development process and the simulation speed. This facilitates the design space exploration and design verification processes.
• MATLAB language based AccelDSP Synthesis Tool
The AccelDSP Synthesis Tool from Xilinx [80] is a high-level, MATLAB language based tool for designing DSP blocks for Xilinx FPGAs. The tool automates floating-point to fixed-point conversion, generates synthesizable VHDL or Verilog from the MATLAB language, and also creates test benches for verification. The user can also generate fixed-point C++ simulation models or corresponding Simulink based System Generator block models from a MATLAB algorithm.
Chapter 3
A framework for high-level hardware-software application
development
3.1 Introduction
Paradoxically, while FPGA based configurable multi-processor platforms offer a high degree of flexibility for application development, performing design space exploration on configurable multi-processor platforms is very challenging. State-of-the-art design tools rely on low-level simulation, which is based on the register-transfer level (RTL) and/or gate-level implementations of the platform, for design space exploration. These low-level simulation techniques are inefficient for exploring the various hardware and software design trade-offs offered by configurable multi-processor platforms, for two major reasons. One reason is that low-level simulation based on register-transfer/gate-level implementations is too time-consuming for evaluating the various possible configurations of the FPGA based multi-processor platform and the hardware-software partitioning and implementation possibilities. In particular, RTL and gate-level simulation is inefficient for simulating the execution of software programs running on configurable multi-processor platforms. Considering the design examples shown in Section 3.5.1.2, low-level simulation using ModelSim [46] takes more than 25 minutes to simulate an FFT computation software program running on the MicroBlaze processor with a 1.5 msec execution time. Such simulation speeds are prohibitive for development on configurable multi-processor platforms, as software programs usually take minutes or hours to complete on soft processors. The other reason is that FPGA based configurable multi-processor platforms pose a potentially vast design space and many optimization possibilities for application development. There are various hardware-software partitions of the target application and various possible mappings of the application onto the multiple processors. For customized hardware development, there are many possible realizations of the customized hardware peripherals depending on the algorithms, architectures, hardware bindings, etc., employed by these peripherals. The communication interfaces between the various hardware components (e.g., the topology for connecting the processors, and the communication protocols for exchanging data between the processors and the customized hardware peripherals as well as among the processors) also make efficient design space exploration significantly more challenging. Exploring such a large design space using time-consuming low-level simulation techniques becomes intractable.
We address the following design problem in this chapter. The target application is composed of a number of tasks with data exchange and execution precedence between them. Each task is mapped to one or more processors for execution. The hardware architecture of the configurable multi-processor platform of interest is shown in Figure 3.1.

Figure 3.1: Hardware architecture of the configurable multi-processor platform

Application development on the multi-processor platform involves both software design and hardware design. For the software design, the application designer can choose the mapping and scheduling of the tasks so as to distribute the software programs of the tasks among the processors and process the input data in parallel. For the hardware design, there are two levels of customization that can be applied to the hardware platform. On the one hand, customized hardware peripherals can be attached to the soft processors as hardware accelerators to speed up some computation steps. Different bus interfaces can be used for exchanging data between the processors and their own hardware accelerators. On the other hand, the processors are connected through a communication network for cooperative data processing. Various topologies and communication protocols can be used when constructing the communication network. Based on the above assumptions, our objective is to build an environment for design space exploration on configurable multi-processor platforms that has the following desired properties. (1) The various hardware and software design possibilities offered by the multi-processor platform can be described within the design environment. (2) For a specific realization of the application on the multi-processor platform, the hardware execution (i.e., the execution within the customized hardware peripherals and the communication interfaces) and the software execution within the soft processors are simulated rapidly in a concurrent manner so as to facilitate the exploration of the various hardware-software design flexibilities. (3) The results gathered during the hardware-software co-simulation process should facilitate the identification of candidate designs and the diagnosis of performance bottlenecks. Finally, after candidate designs are found through the co-simulation process, the corresponding low-level implementations with the desired high-level behavior can be generated automatically.
As the main contribution of this chapter, we demonstrate that a design space exploration approach based on arithmetic-level cycle-accurate hardware-software modeling and co-simulation can achieve the objective described above. As discussed in Section 6.5, on the one hand, the arithmetic-level modeling and co-simulation technique does not involve the low-level RTL and/or gate-level implementations of the multi-processor platform and thus can greatly increase the simulation speed. It enables efficient exploration of the various design trade-offs offered by the configurable multi-processor platform. On the other hand, maintaining cycle accuracy during the co-simulation process helps the application designer identify performance bottlenecks and thus facilitates the identification of candidate designs with "good" performance. We consider the performance metrics of time and hardware resource usage in this chapter. Cycle accuracy also facilitates the automatic generation of low-level implementations once the candidate designs are identified through the arithmetic-level co-simulation process. For illustrative purposes, we show an implementation of the proposed arithmetic-level co-simulation technique based on MATLAB/Simulink [44]. Through the design of three widely used numerical computation and image processing applications, we show that the proposed cycle-accurate arithmetic-level co-simulation technique achieves speed-ups in simulation time of more than 800x compared with low-level simulation techniques. For these three applications, the designs identified using our co-simulation environment achieve execution speed-ups of up to more than 5.6x compared with the other designs considered in our experiments. Finally, we have verified the low-level implementations generated from the arithmetic-level design descriptions on a commercial FPGA prototyping board.
This chapter is organized as follows. Section 3.2 discusses related work. Section 3.3 describes our approach for building an arithmetic-level cycle-accurate co-simulation environment. An implementation of the co-simulation environment based on MATLAB/Simulink is provided in Section 3.4 to illustrate the proposed co-simulation technique. Two signal processing applications and one image processing application are presented in Section 3.5 to demonstrate the effectiveness of our co-simulation technique. Finally, we conclude in Section 3.6.
3.2 Related work
Many techniques have been proposed for performing hardware-software co-design and co-simulation on FPGA based hardware platforms. Hardware-software co-design and co-simulation on FPGAs using compiler optimization techniques are proposed by Gupta et al. [27], Hall et al. [29], Palem et al. [56], and Plaks [57]. A hardware-software co-design technique for a data-driven accelerator is proposed by Becker et al. [31]. To the best of our knowledge, none of the prior work addresses the rapid design space exploration problem for configurable multi-processor platforms.
State-of-the-art hardware/software co-simulation techniques can be roughly classified into
four major categories, which are discussed as follows.
• Techniques based on low-level simulation: Since configurable multi-processor platforms are configured using FPGA resources, the hardware-software co-simulation of these platforms can use low-level hardware simulators directly. This technique is used by several commercial integrated design environments (IDEs) for application development using configurable multi-processor platforms, including SOPC Builder from Altera [3] and the Embedded Development Kit (EDK) from Xilinx [79]. When using these IDEs, the multi-processor platform is described using a simple configuration script. These IDEs then automatically generate the low-level implementations of the multi-processor platform and the low-level simulation models based on those implementations. Software programs are compiled into binary executable files, which are then used to initialize the memory blocks in the low-level simulation models. Based on the low-level simulation models, low-level hardware simulators (e.g., ModelSim [46]) can be used to simulate the behavior of the complete multi-processor platform. On the one hand, the simple configuration scripts used in these IDEs provide very limited capabilities for describing the two types of optimization possibilities offered by configurable multi-processor platforms. On the other hand, as we show in Section 3.5, such a low-level simulation based co-simulation approach is too time-consuming for simulating the various possibilities of application development on the multi-processor platforms.
• Techniques based on high-level languages: One approach to performing hardware-software co-simulation is to adopt high-level languages such as C/C++ and Java. When applying these techniques, the hardware-software co-simulation is performed by compiling the designs using the corresponding high-level language compilers and running the executable files resulting from the compilation process. There are several commercial tools based on C/C++. Examples of such co-simulation techniques include Catapult C from Mentor Graphics [47] and Impulse C, which is used by the CoDeveloper design tool from Impulse Accelerated Technologies [68]. In addition to supporting standard ANSI/ISO C, both Catapult C and Impulse C provide language extensions for specifying hardware implementation properties. The application designer describes his/her designs using these extended C/C++ languages, compiles the designs using standard C/C++ compilers, generates the binary executable files, and verifies the functional behavior of the designs by analyzing the output of the executable files. To obtain the cycle-accurate functional behavior of the designs, the application designer still needs to generate the VHDL simulation models of the designs and perform low-level simulation using cycle-accurate hardware simulators. The DK3 tool from Celoxica [12] supports development on configurable platforms using Handel-C [13] and SystemC [51], extensions of the C/C++ language. Handel-C and SystemC allow for the description of hardware and software designs at different abstraction levels. However, to make a design described using Handel-C or SystemC suitable for direct register-transfer level generation, the application designer needs to write his/her designs at nearly the same level of abstraction as handcrafted register-transfer level hardware implementations [45]. This prevents efficient design space exploration for configurable multi-processor platforms.
• Techniques based on software synthesis: In software synthesis based hardware-software co-simulation techniques, the input software programs are synthesized into Co-design Finite State Machine (CFSM) models that exhibit the same functional behavior as the software programs. These CFSM models are then integrated into the simulation models of the hardware platform for hardware-software co-simulation. One example of the software synthesis approach is the POLIS hardware-software co-design framework [8]. In POLIS, the input software programs are synthesized and translated into VHDL simulation models that exhibit the same functional behavior. These hardware simulation models are then integrated with the simulation models of the other hardware components for co-simulation of the complete hardware platform. The software synthesis approach can greatly reduce the time for co-simulating the low-level register-transfer/gate-level implementations of multi-processor platforms. One issue with the software synthesis co-simulation approach is that it is difficult to synthesize complicated software programs (e.g., operating systems, video encoding/decoding software programs) into hardware simulation models with the same functional behavior. Another issue is that the software synthesis approach is based on low-level implementations of the multi-processor platform. The application designer needs to generate the low-level implementations of the complete multi-processor platform before he/she can perform co-simulation. The large amount of effort required to generate the low-level implementations prevents efficient exploration of the various configurations of the configurable multi-processor platform.
• Techniques based on the integration of low-level simulators: Another approach to hardware-software co-simulation is through the integration of low-level hardware and software simulators. One example is the Seamless tool from Mentor Graphics [46]. It contains pre-compiled cycle-accurate behavioral processor models for simulating the execution of software programs on various processors. It also provides pre-compiled cycle-accurate bus models for simulating the communication between hardware and software executions. Using these pre-compiled processor models and bus models, the application designer can separate the simulation of hardware executions and software executions and use the corresponding hardware and software simulators for co-simulation. The Seamless tool provides detailed simulation information for verifying the correctness of hardware and software co-designs. However, it is still based on time-consuming low-level simulation techniques. Therefore, this low-level simulation based technique is inefficient for application development on configurable multi-processor platforms.
To summarize, the design approaches discussed above rely on low-level simulation to obtain the cycle-accurate functional behavior of the hardware-software execution on configurable multi-processor platforms. Such reliance on low-level simulation models prevents them from efficiently exploring the various configurations of multi-processor platforms. Compared with these approaches, the arithmetic-level co-simulation technique proposed in this chapter allows for arithmetic-level cycle-accurate hardware-software co-simulation of the multi-processor platform without involving the low-level implementations and low-level simulation models of the complete platform. Our approach achieves a significant simulation speed-up over low-level simulation techniques while still providing cycle-accurate simulation details.
3.3 Our approach
Our approach to design space exploration is based on an arithmetic-level co-simulation technique. In the following paragraphs, we first present the arithmetic-level co-simulation technique. Design space exploration for the multi-processor platform can then be performed by analyzing the arithmetic-level co-simulation results.
• Arithmetic-level modeling and co-simulation: The arithmetic-level modeling and co-simulation technique for configurable multi-processor platforms is shown in Figure 3.2.

Figure 3.2: Our approach for high-level hardware-software co-simulation

The configurable multi-processor platform consists of three major components: soft processors for executing software programs; customized hardware peripherals acting as hardware accelerators for the parallel execution of specific computation steps; and communication interfaces for data exchange between the various hardware components. The communication interfaces include bus interfaces for exchanging data between the processors and their customized hardware peripherals, and communication networks for coordinating the computations and communication among the soft processors. When employing our arithmetic-level co-simulation technique, arithmetic-level ("high-level") abstractions are created to model each of the three major components. The arithmetic-level abstractions can greatly speed up the co-simulation process while allowing the application designer to explore the optimization opportunities provided by configurable multi-processor platforms. By "arithmetic level", we mean that only the arithmetic aspects of the hardware and software execution are modeled by these abstractions. Taking multiplication as an example, its low-level implementation on Xilinx Virtex-II/Virtex-II Pro FPGAs can be realized using either slice-based multipliers or embedded multipliers. Its arithmetic-level abstraction only captures the arithmetic property, i.e., the multiplication of the values presented at its input ports. Taking the communication between different hardware components as another example, its low-level implementations can use registers (flip-flops), slices, or embedded memory blocks to realize data buffering. Its arithmetic-level abstraction only captures the arithmetic-level data movements on the communication channels (e.g., the handshaking protocols and the access priorities for communication through shared communication channels). Co-simulation based on the arithmetic-level abstractions of the hardware components does not involve the low-level implementation details. Using the arithmetic-level abstraction, the user can specify that the movement of a piece of data be delayed by one clock cycle before it becomes available to the destination hardware components. He/she does not need to provide low-level implementation details (such as the number of flip-flop registers and the connections of the registers between the related hardware components) for realizing this arithmetic operation. Such low-level implementations can be generated automatically once the high-level design description is finished. Thus, these arithmetic-level abstractions can significantly reduce the time required to simulate the hardware and software arithmetic behavior of the multi-processor platform. The implementation of the arithmetic-level co-simulation technique presented in Section 3.4 demonstrates a simulation speed-up of more than 800x compared with behavioral simulation based on low-level implementations of the hardware platforms.
The configurable multi-processor platform is described using the arithmetic-level abstractions. The arithmetic-level abstractions allow the application designer to specify the various ways of constructing the data paths through which the input data is processed. Co-simulation based on the arithmetic-level abstractions gives the status of the data paths during the execution of the application. For example, the development of a JPEG2000 application on a state-of-the-art configurable multi-processor platform using the arithmetic-level abstractions is shown in Section 3.5.2. The arithmetic-level abstraction of the JPEG2000 application specifies the data paths through which the input image data is processed among the multiple processors. The co-simulation based on the arithmetic-level abstraction gives the intermediate processing results at the various hardware components that constitute the multi-processor platform. Using these intermediate processing results, the application designer can analyze performance bottlenecks and identify the candidate designs of the multi-processor platform. Application development using the arithmetic-level abstractions focuses on the description of data paths. Thus, the proposed arithmetic-level co-simulation technique is especially suitable for the development of data-intensive applications such as signal and image processing applications.
The arithmetic-level abstractions of the configurable multi-processor platform are simulated using their corresponding hardware and software simulators. These hardware and software simulators are tightly integrated into our co-simulation environment and concurrently simulate the arithmetic behavior of the complete multi-processor platform. Most importantly, the simulations performed within the integrated simulators are synchronized with each other at each clock cycle and provide cycle-accurate simulation results for the complete multi-processor platform. By "cycle-accurate", we mean that for each clock cycle during the co-simulation process, the arithmetic behavior of the multi-processor platform simulated by the proposed co-simulation environment matches the arithmetic behavior of the corresponding low-level implementations. For example, when simulating the execution of software programs, the cycle-accurate co-simulation takes into account the number of clock cycles required by the soft processors to complete a specific instruction (e.g., the multiplication instruction of the MicroBlaze processor takes three clock cycles to complete). When simulating the hardware execution on customized hardware peripherals, the cycle-accurate co-simulation takes into account the number of clock cycles required by the pipelined customized hardware peripherals to process the input data. The cycle-accurate co-simulation process also takes into account the delays caused by the communication channels between the various hardware components. One approach to achieving such cycle accuracy between the different simulators in an actual implementation is to maintain a global simulation clock in the co-simulation environment. This global simulation clock can be used to specify the time at which a piece of data is available for processing by the next hardware or software component, based on the delays specified by the user. Maintaining this cycle-accurate property in the arithmetic-level co-simulation ensures that the results of the arithmetic-level co-simulation are consistent with the arithmetic behavior of the corresponding low-level implementations. The cycle-accurate simulation results allow the application designer to observe the instantaneous interactions between hardware and software executions. This interaction information is used in the design space exploration process to identify performance bottlenecks in the designs.
• Design space exploration: Design space exploration is performed based on the arithmetic-level abstractions. The functional behavior of the candidate designs can be verified using the proposed arithmetic-level co-simulation technique. As shown in Section 3.5, the application designer can use the simulation results gathered during the cycle-accurate arithmetic-level co-simulation process to identify candidate designs and diagnose performance bottlenecks when performing design space exploration. Considering the development of a JPEG2000 application on a multi-processor platform discussed in Section 3.5.2, the results from the arithmetic-level simulation identify that the bus connecting the multiple processors limits the performance of the complete system when more processors are employed. In addition, these detailed simulation results facilitate the automatic generation of the low-level implementations. By specifying the low-level hardware bindings for the arithmetic operations (e.g., binding the embedded multipliers for the realization of the multiplication arithmetic operation), the application designer can also rapidly obtain the hardware resource usage for a specific realization of the application [67]. The cycle-accurate arithmetic-level simulation results and the rapidly estimated hardware resource usage information can help the application designer efficiently explore the various optimization opportunities and identify "good" candidate designs. For example, in the development of a block matrix multiplication algorithm shown in Section 3.5.1.1, the application designer can explore the impact of the size of the matrix blocks on the performance of the complete algorithm. Finally, for the designs identified through the arithmetic-level co-simulation process, low-level implementations with the corresponding arithmetic behavior are automatically generated from the arithmetic-level abstractions.
3.4 An implementation based on MATLAB/Simulink
For illustrative purposes, we provide an implementation of our arithmetic-level co-simulation approach based on MATLAB/Simulink for application development using configurable multi-processor platforms. The software architecture of the implementation is shown in Figure 3.4 and Figure 3.5. Arithmetic-level abstractions of the customized hardware peripherals and the communication interfaces (including the bus interfaces and the communication network) are created within the MATLAB/Simulink modeling environment. Thus, the hardware execution platform is described and simulated within MATLAB/Simulink. Additionally, we create soft processor Simulink blocks for integrating cycle-accurate instruction set simulators targeting the soft processors. The execution of the software programs distributed among the soft processors is simulated using these soft processor Simulink blocks.
3.4.1 High-level design description
Our hardware-software development framework provides a new block, called the Software Co-Simulation block, in the Simulink modeling environment. This new Software Co-Simulation block unifies the application development capabilities offered by both the System Generator tool and the EDK (Embedded Development Kit) tool within the MATLAB/Simulink modeling environment. More specifically, by offering the Software Co-Simulation block, the proposed framework presents the end user with the following two high-level abstractions for the rapid construction of hardware-software execution systems.
1. Arithmetic-level abstractions for hardware platform development: The end user can describe the arithmetic-level behavior of the hardware platform using the Simulink blocks provided by the System Generator tool. For example, during the high-level arithmetic-level modeling, the user can specify that the presentation of a value to its destination Simulink blocks be delayed by one clock cycle. The corresponding low-level implementation is composed of a specific number of flip-flop registers, determined by the data type of the high-level signal, and the wires that connect them properly to the low-level implementations of the other components. The application framework automatically generates the corresponding low-level implementation.
2. Interfacing-level abstractions for the configuration of the processor systems: EDK provides a simple script based MHS (Microprocessor Hardware Specification) file for describing the connections between the hardware peripherals of the processors. The end user can also specify the settings of the processor (e.g., cache size and memory mapping) and of the hardware peripherals of the processor within the MHS file.
3.4.2 Arithmetic-level co-simulation
Our arithmetic-level co-simulation environment consists of four major components: simulation of the software execution on the soft processors, simulation of the customized hardware peripherals, simulation of the communication interfaces, and the exchange of simulation data and synchronization between the simulators. They are discussed in detail in the following subsections.
3.4.2.1 Simulation of software execution on soft processors
Soft processor Simulink blocks (e.g., MicroBlaze Simulink blocks targeting MicroBlaze processors) are created for simulating the software programs running on the processors. Each soft processor Simulink block simulates the software programs executed on one processor. Multiple soft processor Simulink blocks are employed in order to properly simulate the configurable multi-processor platform.
Figure 3.3: An implementation of the proposed arithmetic co-simulation environment based on
MATLAB/Simulink
The software architecture of a soft processor Simulink block is shown in Figure 3.5. The input C programs are compiled using the compiler for the specific processor (e.g., the GNU C compiler mb-gcc for MicroBlaze) and translated into binary executable files (e.g., .ELF files for MicroBlaze). These binary executable files are then simulated using a cycle-accurate instruction set simulator for the specific processor. Taking the MicroBlaze processor as an example, the executable .ELF files are loaded into mb-gdb, the GNU C debugger for MicroBlaze. A cycle-accurate instruction set simulator for the MicroBlaze processor is provided by Xilinx. mb-gdb sends instructions of the loaded executable files to the MicroBlaze instruction set simulator and performs cycle-accurate simulation of the execution of the software programs. mb-gdb also sends/receives commands and data to/from MATLAB/Simulink through the soft processor Simulink block and interactively simulates the execution of the software programs concurrently with the simulation of the hardware designs within MATLAB/Simulink.

Figure 3.4: An implementation of the proposed arithmetic co-simulation environment based on MATLAB/Simulink
3.4.2.2 Simulation of customized hardware peripherals
The customized hardware peripherals are described using the MATLAB/Simulink based FPGA design tools. For example, System Generator supplies a set of dedicated Simulink blocks for describing parallel hardware designs on FPGAs. These Simulink blocks provide arithmetic-level abstractions of the low-level hardware components. There are blocks that represent basic hardware resources (e.g., flip-flop based registers and multiplexers), blocks that represent control logic, mathematical functions, and memory, and blocks that represent proprietary IP (Intellectual Property) cores (e.g., the IP cores for the Fast Fourier Transform and finite impulse response filters). Considering the Mult Simulink block for multiplication provided by System Generator, it captures the arithmetic behavior of multiplication by presenting at its output port the product of the values presented at its two input ports. Whether the low-level implementation of the Mult Simulink block is realized using embedded or slice-based multipliers is ignored in its arithmetic-level abstraction. The application designer assembles the customized hardware peripherals by dragging and dropping the blocks from the block set into his/her designs and connecting them via the Simulink graphical interface. Simulation of the customized hardware peripherals is performed within MATLAB/Simulink. MATLAB/Simulink maintains a simulation timer for keeping track of the simulation process. Each unit of simulation time counted by the simulation timer corresponds to one clock cycle experienced by the corresponding low-level implementations.

Figure 3.5: Architecture of the soft processor Simulink block
3.4.2.3 Simulation of communication interfaces
As shown in Figure 3.4, the simulation of the communication interfaces is composed of two parts: simulation of the bus interfaces between the processors and their corresponding peripherals, and simulation of the communication network between the processors. These are described in the following paragraphs.
• Simulation of the dedicated bus interfaces: Simulation of the dedicated bus interfaces between the processors and their hardware peripherals is performed by the soft processor Simulink block. Basically, the soft processor Simulink blocks need to simulate the input/output communication protocols and the data buffering operations of the dedicated bus interfaces.

Figure 3.6: Communication between MicroBlaze and customized hardware designs through Fast Simplex Links
We use MicroBlaze processors and the dedicated Fast Simplex Link (FSL) bus interfaces to illustrate the co-simulation process. FSLs are unidirectional FIFO (First-In-First-Out) channels. Data can move in and out of the FSL channels in two clock cycles. As shown in Figure 3.6, a MicroBlaze processor provides eight FSL channels for data input and another eight channels for data output. Using these FSL channels, the application designer can attach customized hardware peripherals as hardware accelerators to the MicroBlaze processor. Both synchronous (blocking) and asynchronous (non-blocking) read/write operations are supported by MicroBlaze. For blocking read/write operations, the MicroBlaze processor is stalled until the read/write operations finish. For non-blocking read/write operations, MicroBlaze resumes its normal execution immediately, regardless of the outcome of the read/write operations.
The MicroBlaze Simulink blocks simulate the FSL FIFO buffers and the bus interfaces between the customized hardware peripherals and the FSL channels. The data width of the FSL channels is 32 bits, while their FIFO depths are configurable from 1 to 8192 depending on the application requirements and the available hardware resources on the FPGA device. Let "#" denote the ID of the FSL input/output channel for accessing the MicroBlaze processor; "#" is an integer between 0 and 7. When the In# write input port of the MicroBlaze Simulink block becomes high, it indicates that there is data from the customized hardware peripherals simulated in MATLAB/Simulink to be written into the FSL FIFO buffer stored in the internal data structure of the MicroBlaze Simulink block. The MicroBlaze Simulink block then stores the MATLAB/Simulink data presented at the In# write input port into its internal data structure and raises the Out# exists flag stored in its internal data structure to indicate the availability of the data. Similarly, when the FSL FIFO buffer is full, the MicroBlaze Simulink block raises the In# full output port in MATLAB/Simulink to prevent further data from entering the FSL FIFO buffer.
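Conceptually, the FSL FIFO buffer kept inside the MicroBlaze Simulink block behaves like the following C sketch; the depth and the field names are illustrative assumptions, not the actual implementation.

    /* Conceptual sketch of the FSL FIFO model kept in the MicroBlaze
     * Simulink block's internal data structure. Depth and field names are
     * illustrative assumptions. */
    #include <stdint.h>
    #include <stdbool.h>

    #define FSL_DEPTH 16                /* configurable FIFO depth (assumed) */

    typedef struct {
        uint32_t data[FSL_DEPTH];
        int head, tail, count;
        bool exists;                    /* mirrors the Out# exists flag */
        bool full;                      /* mirrors the In# full flag    */
    } fsl_fifo_t;

    static bool fsl_write(fsl_fifo_t *f, uint32_t value)
    {
        if (f->count == FSL_DEPTH) { f->full = true; return false; }
        f->data[f->tail] = value;
        f->tail = (f->tail + 1) % FSL_DEPTH;
        f->count++;
        f->exists = true;
        f->full = (f->count == FSL_DEPTH);
        return true;
    }

    static bool fsl_read(fsl_fifo_t *f, uint32_t *value)
    {
        if (f->count == 0) { f->exists = false; return false; }
        *value = f->data[f->head];
        f->head = (f->head + 1) % FSL_DEPTH;
        f->count--;
        f->exists = (f->count > 0);
        f->full = false;
        return true;
    }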
The MicroBlaze Simulink blocks also simulate the bus interfaces between the MicroBlaze processor and the FSL channels. A set of dedicated C functions is provided for MicroBlaze to control the communication through the FSLs. After compilation using mb-gcc, these C functions are translated into the corresponding dedicated assembly instructions. For example, the C function microblaze_nbread_datafsl(val, id) is used for non-blocking reading of data from the id-th FSL channel. This C function is translated into the dedicated assembly instruction nget(*val*, rfsl#id) during the compilation of the software program. By observing the execution of these dedicated C functions during co-simulation, we can control the hardware and software processes to correctly simulate the interactions between the processors and their hardware peripherals.
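For illustration, the software side of such an FSL-based accelerator interface might be structured as follows. The blocking read/write function names are assumed to follow the mb_interface.h naming convention referenced above and should be checked against the header shipped with the specific EDK version.

    /* Sketch of a MicroBlaze software routine that streams data to a
     * hardware accelerator over FSL channel 0 and reads results back.
     * Function names follow the mb_interface.h conventions mentioned in the
     * text; verify the exact names/arguments for the tool version in use. */
    #include "mb_interface.h"

    void filter_block(const int *input, int *output, int n)
    {
        int i, result;

        for (i = 0; i < n; i++) {
            /* Blocking write: stalls the processor if the FSL FIFO is full. */
            microblaze_bwrite_datafsl(input[i], 0);

            /* Blocking read: stalls until the accelerator produces a result. */
            microblaze_bread_datafsl(result, 0);
            output[i] = result;
        }
    }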
During the simulation of the software programs, the MicroBlaze Simulink block keeps track of the status of the MicroBlaze processor by communicating with mb-gdb. As soon as the dedicated assembly instruction described above for writing data to the customized hardware peripherals through the FSL channels is encountered by mb-gdb, it informs the MicroBlaze Simulink block. The MicroBlaze Simulink block then stalls the simulation of the software programs in mb-gdb, extracts the data from mb-gdb, and tries to write the data into the FSL FIFO buffer stored in its internal data structure. For communication in the blocking mode, the MicroBlaze Simulink block stalls the simulation of the software programs in mb-gdb until the write operation is completed. That is, the simulation of the software programs is stalled until the In# full flag bit stored in the MicroBlaze Simulink block's internal data structure becomes low, which indicates that the FSL FIFO buffer is ready to accept more data. Otherwise, for communication in the non-blocking mode, the MicroBlaze Simulink block resumes the simulation of the software programs immediately after writing the data to the FSL FIFO buffer, regardless of the outcome of the write operation. Data exchange for the read operation is handled in a similar manner.
• Simulation of the communication network: Each soft processor and its customized hardware peripherals are developed as a soft processor subsystem in MATLAB/Simulink. In order to provide the flexibility for exploring different topologies and communication protocols for connecting the processors, the communication network that connects the soft processor subsystems and coordinates the computations and communication between these subsystems is described and simulated within MATLAB/Simulink. As shown in the design example discussed in Section 3.5.2, an OPB (On-chip Peripheral Bus) Simulink block is created to realize the OPB shared-bus interface. The multiple MicroBlaze subsystems are connected to the OPB Simulink block to form a bus topology. A hardware semaphore (i.e., a mutually exclusive access controller) is used to coordinate the hardware-software execution among the multiple MicroBlaze processors. The hardware semaphore is described and simulated within MATLAB/Simulink.
3.4.2.4 Exchange of simulation data and synchronization between the simulators
The soft processor Simulink blocks are responsible for exchanging simulation data between the software and hardware simulators. The input and output ports of the soft processor Simulink blocks are used to separate the simulation of the software programs running on the soft processor from that of the other Simulink blocks, e.g., the hardware peripherals of the processor as well as the other soft processors employed in the design. The input and output ports of the soft processor Simulink blocks correspond to the input and output ports of the low-level hardware implementations. Low-level ports that are both input and output ports are represented as separate input and output blocks, suffixed with the port names "_in" and "_out" respectively, on the Simulink blocks. The MicroBlaze Simulink blocks send the values of the FSL registers of the MicroBlaze instruction set simulator to the input ports of the soft processor Simulink blocks as input data for the hardware peripherals. Vice versa, the MicroBlaze Simulink blocks collect the simulation output of the hardware peripherals from the output ports of the soft processor Simulink blocks and use the output data to update the values of the FSL registers stored in their internal data structures.
When exchanging the simulation data between the simulators, the soft processor Simulink
blocks take into account the number of clock cycles required by the processors and the
customized hardware peripherals to process the input data. They also take into account the delays
caused by transmitting the data through the dedicated bus interfaces and the communication
network. By doing so, the hardware and software simulations are synchronized on a cycle-accurate
basis.
Moreover, a global simulation timer is used to keep track of the simulation time of the
complete multi-processor platform. All hardware and software simulations are synchronized
with this global simulation timer. For the MATLAB/Simulink based implementation of the
co-simulation environment, one unit of simulation time counted by the global simulation timer
equals one unit of simulation time within MATLAB/Simulink and one clock cycle simulation
time of the MicroBlaze instruction set simulator. It is ensured that one unit of the simulation
time counted by the global simulation timer also corresponds to one clock cycle experienced by
the corresponding low-level implementations of the multi-processor platform.
3.4.3 Rapid hardware resource estimation
Being able to rapidly obtain the hardware resources occupied by various configurations of the
multi-processor platform is required for design space exploration. For Xilinx FPGAs, we focus
on the number of slices, the number of BRAM memory blocks, and the number of embedded 18-bit-by-18-bit
multipliers used for constructing the multi-processor platform.
For the multi-processor platform based on MicroBlaze processors, the hardware resources
are used by the following four types of hardware components: the MicroBlaze processors, the
customized hardware peripherals, the communication interfaces (the dedicated bus interfaces
and the communication network), and the storage of the software programs. Resource usage
of the MicroBlaze processors, the two LMB (Local Memory Bus) interface controllers, and
the dedicated bus interfaces is estimated from the Xilinx data sheet. Resource usage of the
customized hardware designs and the communication network is estimated using the resource
estimation technique provided by Shi et al. [67]. Since the software programs are stored in
BRAMs, we obtain the size of the software programs using the mb-objdump tool and then
calculate the numbers of BRAMs required to store these software programs. The resource
usage of the multi-processor platform is obtained by summing up the hardware resources used
by the above four types of hardware components.
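As a concrete illustration of this bookkeeping, the sketch below sums the four contributions; all figures in it are placeholders that would be filled in from the Xilinx data sheet, the estimator of Shi et al. [67], and the mb-objdump output, and 2048 data bytes per BRAM is an assumption.

/* Sketch of the resource-summation step.  Every number below is a
 * placeholder: processor and bus-interface figures come from the data
 * sheet, peripheral and network figures from the estimator of [67], and
 * the program size from mb-objdump. */
#include <stdio.h>

#define BRAM_DATA_BYTES 2048          /* assumed usable bytes per BRAM     */

typedef struct { int slices, brams, mults; } resources_t;

static resources_t add(resources_t a, resources_t b)
{
    resources_t r = { a.slices + b.slices, a.brams + b.brams, a.mults + b.mults };
    return r;
}

int main(void)
{
    resources_t processor  = { 450, 0, 3 };  /* placeholder data-sheet figures */
    resources_t peripheral = { 300, 0, 0 };  /* placeholder estimate per [67]  */
    resources_t comm       = { 150, 0, 0 };  /* LMB/FSL/OPB interfaces         */
    int program_bytes      = 6000;           /* placeholder mb-objdump size    */

    resources_t program = { 0,
        (program_bytes + BRAM_DATA_BYTES - 1) / BRAM_DATA_BYTES, 0 };

    resources_t total = add(add(processor, peripheral), add(comm, program));
    printf("estimate: %d slices, %d BRAMs, %d multipliers\n",
           total.slices, total.brams, total.mults);
    return 0;
}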
3.5 Illustrative examples
To demonstrate the effectiveness of our approach, we show in this section the development of
two widely used numerical computation applications (i.e. the CORDIC algorithm for division and
block matrix multiplication) and one image processing application (i.e. JPEG2000 encoding)
on a popular configurable multi-processor platform. The two numerical computation applications
are widely deployed in systems such as radar systems and software defined radio systems
[43]. Implementing these applications using soft processors provides the capability of handling
different problem sizes depending on the specific application requirements.
We first demonstrate the co-simulation process of an individual soft processor and its customized
hardware peripherals. We then show the co-simulation process of the complete multi-processor
platform, which is constructed by connecting the individual soft processors with
customized hardware peripherals through a communication network. Our illustrative examples
focus on the MicroBlaze processors and the FPGA design tools from Xilinx due to their wide
availability. Our co-simulation approach is also applicable to other soft processors and FPGA
design tools. Virtex-II Pro FPGAs [79] are chosen as our target devices. Arithmetic level
abstractions of the hardware execution platform are provided by System Generator 8.1EA2.
Automatic generation of the low-level implementations of the multi-processor platform is realized
using both System Generator 8.1EA2 and EDK (Embedded Development Kit) 7.1. ISE
(Integrated Software Environment) 7.1 [79] is used for synthesizing and implementing
(including placing-and-routing) the complete multi-processor platform. Finally, the functional
correctness of the multi-processor platform is verified using an ML300 Virtex-II Pro prototyping
board from Xilinx [79].
3.5.1 Co-simulation of the processor and the hardware peripherals
3.5.1.1 Adaptive CORDIC algorithm for division
The CORDIC (COordinate Rotation DIgital Computer) iterative algorithm for dividing b by a
[5] is described as follows. Initially, we set X_{-1} = a, Y_{-1} = b, Z_{-1} = 0, and C_{-1} = 1. Let N
denote the number of iterations performed by the CORDIC algorithm. During each iteration
i (i = 0, 1, ..., N-1), the following computation is performed.
\[
\begin{cases}
X_i = X_{i-1} \\
Y_i = Y_{i-1} + d_i \cdot X_{i-1} \cdot C_{i-1} \\
Z_i = Z_{i-1} - d_i \cdot C_{i-1} \\
C_i = C_{i-1} \cdot 2^{-1}
\end{cases}
\tag{3.1}
\]
where d_i = +1 if Y_i < 0 and d_i = -1 otherwise. After N iterations of processing, we have
Z_N ≈ -b/a. Implementing this CORDIC algorithm using soft processors not only leads to
compact designs but also offers dynamic adaptivity for practical application development. For
example, many telecommunication systems have a wide dynamic data range, so it is desirable
that the number of iterations can be dynamically adapted to the environment where the
telecommunication systems are deployed. Also, for some CORDIC algorithms, the effective precision
of the output data cannot be computed analytically. One example is the family of hyperbolic CORDIC
algorithms. The effective output bit precision of these algorithms depends on the angular value
Z_i during iteration i and needs to be determined dynamically.
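As a plain-C software reference for the recurrence in Equation (3.1) (the hardware PEs realize the same iteration on 32-bit fixed-point data), one might write the sketch below; it is an illustration, not the implementation used in our designs.

/* Floating-point reference for the CORDIC division recurrence of
 * Equation (3.1); the hardware pipeline implements one such iteration per
 * PE on fixed-point data. */
#include <stdio.h>

double cordic_div(double a, double b, int N)
{
    double x = a, y = b, z = 0.0, c = 1.0;   /* X_{-1}, Y_{-1}, Z_{-1}, C_{-1}      */
    for (int i = 0; i < N; i++) {
        double d = (y < 0.0) ? 1.0 : -1.0;   /* d_i = +1 if Y < 0, -1 otherwise     */
        y = y + d * x * c;                   /* Y_i = Y_{i-1} + d_i*X_{i-1}*C_{i-1} */
        z = z - d * c;                       /* Z_i = Z_{i-1} - d_i*C_{i-1}         */
        c = c * 0.5;                         /* C_i = C_{i-1} * 2^{-1}              */
    }
    return z;   /* |Z_N| approximates |b/a|; the sign follows the d_i convention */
}

int main(void)
{
    printf("%f\n", cordic_div(3.0, 1.0, 24));   /* magnitude close to 1/3 */
    return 0;
}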
• Implementation: The hardware architecture of our CORDIC algorithm for division based
on MicroBlaze is shown in Figure 3.7. The customized hardware peripheral is configured with
P processing elements (PEs). Each PE performs one iteration of the computation described in
Equation (3.1). All the PEs form a linear pipeline. We consider 32-bit data precision in our
designs.

Figure 3.7: CORDIC algorithm for division with P = 4
Table 3.1: Resource usage of the CORDIC based division and the block matrix multiplication
applications as well as the simulation times using different simulation techniques

Design                                  Estimated/actual resource usage           Simulation time
                                        Slices       BRAMs   Multipliers          Our environment   ModelSim (Behavioral)
24 iteration CORDIC div. with P = 2     729 / 721    1 / 1   3 / 3                0.041 sec         35.5 sec
24 iteration CORDIC div. with P = 4     801 / 793    1 / 1   3 / 3                0.040 sec         34.0 sec
24 iteration CORDIC div. with P = 6     873 / 865    1 / 1   3 / 3                0.040 sec         33.5 sec
24 iteration CORDIC div. with P = 8     975 / 937    1 / 1   3 / 3                0.040 sec         33.0 sec
12×12 matrix mult. with 2×2 blocks      851 / 713    1 / 1   5 / 5                1.724 sec         1501 sec
12×12 matrix mult. with 4×4 blocks      1043 / 867   1 / 1   7 / 7                0.787 sec         678 sec
Since software programs are executed in a serial manner in the processor, only one FSL
channel is used for sending the data from MicroBlaze to the customized hardware peripheral.
The software program controls the number of iterations for each set of data based on the specific
application requirement. To support more than 4 iterations for the configuration shown in
Figure 3.7, the software program sends X_out, Y_out and Z_out generated by PE_3 back to PE_0 for
further processing until the desired number of iterations is reached.
For the processing elements shown in Figure 3.7, C_0 is provided by the software program
based on the number of times that the input data has passed through the linear pipeline. C_0
is sent out from the MicroBlaze processor to the FSL as a control word. That is, when there is
data available in the corresponding FSL FIFO buffer and Out#_control is high, PE_0 updates
its local copy of C_0 and then continues to propagate it to the following PEs along the linear
pipeline. For the other PEs, C_i is updated as C_i = C_{i-1} · 2^{-1}, i = 1, 2, ..., P-1, and is obtained
by right shifting C_{i-1} from the previous PE.
When performing division on a large set of data, the input data is divided into several sets.
These sets are processed one by one. Within each set of data, the data samples are fed into the
customized hardware peripheral consecutively in a back-to-back manner. The output data of
the hardware peripheral is stored at the FIFO buffers of the data output FSLs and is further
sent back to the processor. The application designer needs to select a proper size for each set
of data so that the results generated would not overflow the FIFO buffers of the data output
FSL channels.
Figure 3.8: Time performance of the CORDIC algorithm for division (P = 0 denotes "pure" software implementations). [Plot of time (µsec) versus P, the number of PEs in the customized hardware, for 20 and 24 iterations.]
• Design space exploration: We consider different implementations of the CORDIC algorithm
with different P, the number of processing elements used for implementing the linear pipeline.
When more processing elements are employed in the design, the execution of the CORDIC
division algorithm can be accelerated. However, the configuration of the MicroBlaze processor
would also consume more hardware resources.
The time performance of various configurations of the CORDIC algorithm for division is
shown in Figure 3.8, while its resource usage is shown in Table 3.1. The resource usage estimated
using our design tool is calculated as shown in Section 3.4.3. The actual resource usage is obtained
from the place-and-route reports (.par files) generated by ISE. For CORDIC algorithms
with 24 iterations, attaching a customized linear pipeline of 4 PEs to the soft processor results
in a 5.6x improvement in time performance compared with the "pure" software implementation,
while it requires 280 (30%) more slices.
• Simulation speed-ups: The simulation time of the CORDIC algorithm for division using our
high-level co-simulation environment is shown in Table 3.1. For comparison purposes, we also show
the simulation time of the low-level behavioral simulation using ModelSim. For the ModelSim
simulation, the time for generating the low-level implementations is not accounted for. We
only consider the time for compiling the VHDL simulation models and performing the low-level
simulation within ModelSim. Compared with the low-level simulation in ModelSim, our
simulation environment achieves speed-ups in simulation time ranging from 825x to 866x, with
an average of 845x, for the four designs shown in Table 3.1.
3.5.1.2 Block matrix multiplication
In our design of block matrix multiplication, we first decompose the original matrices into a
number of smaller matrix blocks. Then, the multiplication of these smaller matrix blocks is
performed within the customized hardware peripheral. The software program is responsible for
controlling data to and from the customized hardware peripheral, combining the multiplication
results of these matrix blocks, and generating the resulting matrix. As is shown in Equation 3.2,
to multiply two 4×4 matrices, A and B, we decompose them into four 2×2 matrix blocks
respectively (i.e. A_{i,j} and B_{i,j}, 1 ≤ i, j ≤ 2). To minimize the required data transmission
between the processor and the hardware peripheral, the matrix blocks of matrix A are loaded
into the hardware peripheral column by column so that each block of matrix B only needs to
be loaded once into the hardware peripheral.
Figure 3.9: Matrix multiplication with customized hardware peripheral for matrix block multiplication with 2×2 blocks
\[
A \cdot B =
\begin{pmatrix}
A_{11} & A_{12} \\
A_{21} & A_{22}
\end{pmatrix}
\cdot
\begin{pmatrix}
B_{11} & B_{12} \\
B_{21} & B_{22}
\end{pmatrix}
=
\begin{pmatrix}
A_{11}B_{11}+A_{12}B_{21} & A_{11}B_{12}+A_{12}B_{22} \\
A_{21}B_{11}+A_{22}B_{21} & A_{21}B_{12}+A_{22}B_{22}
\end{pmatrix}
\tag{3.2}
\]
• Implementation: The architecture of our block matrix multiplication based on 2×2 matrix
blocks is shown in Figure 3.9. Similar to the design of the CORDIC algorithm, the data
elements of matrix blocks from matrix B (e.g., b_11, b_21, b_12 and b_22 in Figure 3.9) are fed
into the hardware peripheral as control words. That is, when data elements of matrix blocks
from matrix B are available in the FSL FIFO, Out#_control becomes high and the hardware
peripheral puts these data elements into the corresponding registers. Thus, when the data
elements of matrix blocks from matrix A come in as normal data words, the multiplication and
accumulation are performed accordingly to generate the output results.
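To make the data movement concrete, the sketch below reproduces the block schedule in plain C, with ordinary memory accesses standing in for the FSL control and data words: each block of B is loaded once and reused against the corresponding column of A blocks. The matrix size, block size, and array layout are illustrative.

/* Software model of the block schedule: for each B block (loaded once, as
 * control words in the actual design) the A blocks of the matching column
 * are streamed through, and the products are accumulated into C. */
#define N  4                  /* matrix size (N x N), illustrative  */
#define NB 2                  /* block size (NB x NB), illustrative */

static void block_mac(const double *A, const double *B, double *C,
                      int bi, int bj, int bk)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                C[(bi*NB + i)*N + (bj*NB + j)] +=
                    A[(bi*NB + i)*N + (bk*NB + k)] *
                    B[(bk*NB + k)*N + (bj*NB + j)];
}

void block_matmul(const double *A, const double *B, double *C)
{
    for (int i = 0; i < N*N; i++) C[i] = 0.0;
    for (int bk = 0; bk < N/NB; bk++)         /* column index of A blocks       */
        for (int bj = 0; bj < N/NB; bj++)     /* B block B[bk][bj], loaded once */
            for (int bi = 0; bi < N/NB; bi++) /* stream A blocks A[bi][bk]      */
                block_mac(A, B, C, bi, bj, bk);
}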
• Design space exploration: We consider different implementations of the block matrix multiplication
algorithm with different values of N, the size of the matrix blocks used by the
customized hardware peripherals. For larger N employed by the customized hardware peripherals,
a shorter execution time can potentially be achieved by the block matrix multiplication
application.
Figure 3.10: Time performance of our design of block matrix multiplication. [Plot of time (µsec) versus N, the size of the N×N input matrices, for "pure" software, 2×2 matrix blocks, and 4×4 matrix blocks.]
At the same time, more hardware resources would be used by the configuration of
the MicroBlaze processor.
The time performance of various implementations of block matrix multiplication is shown in
Figure 3.10, while their resource usage is shown in Table 3.1. For multiplication of two 12×12
matrices, the MicroBlaze processor with a customized hardware peripheral for performing 4×4
matrix block multiplication results in a 2.2x speed-up compared with the "pure" software implementation.
Also, attaching the customized hardware peripheral to the MicroBlaze processor
requires 767 (17%) more slices.
Note that attaching a customized hardware peripheral for computing 2×2 matrix blocks
to the MicroBlaze processor results in worse performance for all the performance metrics considered.
It uses 8.8% more execution time, 56 (8.6%) more slices, and 2 (67%) more embedded
multipliers compared with the corresponding "pure" software implementation. This is because
in this configuration, the communication overhead for sending data to and back from
the customized hardware peripheral is greater than the time saved by the parallel execution of
multiplying the matrix blocks.
• Simulation speed-ups: Similar to Section 3.5.1.1, we compare the simulation time in the
proposed cycle-accurate arithmetic level co-simulation environment with that of low-level behavioral
simulation in ModelSim. Speed-ups in simulation time of 871x and 862x (866x on
average) are achieved for the two different designs of the matrix multiplication application, as
shown in Table 3.1.
• Analysis of simulation performance: For both the CORDIC division application and the
block matrix multiplication application, our co-simulation environment consistently achieves
simulation speed-ups of more than 800x compared with the low-level behavioral (functional)
simulation using ModelSim.
By utilizing the public C++ APIs (Application Program Interfaces) provided by the System
Generator tool, we are able to tightly integrate the instruction set simulator for the MicroBlaze
processor with the simulation models for the other Simulink blocks. The simulators integrated
into our co-simulation environment run in lock-step with each other. That is, the synchronization
of the simulation processes within the hardware and software simulators and the exchange
of simulation data between them occur at each Simulink simulation cycle. Thus, the
hardware-software partitioning of the target application and the amount of data that needs to
be exchanged between the hardware and software portions of the application have little
impact on the simulation speed-ups that can be achieved. Therefore, for the various settings of
the two applications considered in this section, the variance of the simulation speed-ups is relatively
small and we are able to obtain consistent speed-ups for all the design cases considered
in our experiments.
3.5.2 Co-simulation of a complete multi-processor platform
The co-simulation of the complete multi-processor platform is illustrated through the design of
the 2-D DWT (Discrete Wavelet Transform) processing task of a motion JPEG2000 encoding
application. Motion JPEG2000 encoding is a widely used image processing application. Performing
JPEG2000 encoding on a 1024-pixel-by-768-pixel 24-bit color image takes around 8
seconds on a general purpose processor [24]. Encoding a 1 minute video clip with 50 frames
per second would take over 6 hours. 2-D DWT is one of the most time-consuming processing
tasks of the motion JPEG2000 encoding application. In the motion JPEG2000 application, the
original input image is decomposed into a set of separate small image blocks. The 2-D DWT
processing is performed on each of the small image blocks to generate output for each of them.
The 2-D DWT processing within the motion JPEG2000 application exhibits a large degree of
parallelism, which can be used to accelerate its execution. Employing a configurable multi-processor
platform for 2-D DWT processing allows for rapid development while potentially
leading to a significant speed-up in execution time.
The design of the configurable multi-processor platform for performing 2-D DWT processing
is shown in Figure 3.11. Different numbers of MicroBlaze processors are used to concurrently
process the input image data. Each of the processors has its local memory accessible through
the instruction-side and data-side LMB buses. The local memory is used to store a copy of the
software programs and data for the MicroBlaze processors. By utilizing the dual-port BRAM
blocks available on Xilinx FPGAs, two processors share one BRAM block as their local memory.
Customized hardware peripherals can be attached to the processors as hardware accelerators
through the dedicated FSL interfaces. The multiple MicroBlaze processors are connected together
using an OPB (On-chip Peripheral Bus) shared bus interface and get access to the global
hardware resources (e.g., the off-chip global input data memory and the hardware semaphore).
A Single-Program-Multiple-Data (SPMD) programming model is employed for the development
of software programs on the multi-processor platform. The input image data is divided
into multiple small image blocks. Coordinated by an OPB based hardware semaphore, the multiple
processors fetch the image blocks from the off-chip global input data memory through the
OPB bus.
Figure 3.11: The configurable multi-processor platform with four MicroBlaze processors for the JPEG2000 encoding application
The MicroBlaze processors store the input image blocks in their local memory. Then,
the processors run the 2-D DWT software programs stored in their local memory to process the
local copies of the image blocks. Once the 2-D DWT processing of an image block is finished,
the processors send the output image data to the global input data memory, coordinated by the
hardware semaphore. See [37] for more details on the design of the multi-processing platform.
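A per-processor sketch of this SPMD loop is shown below; sem_lock(), fetch_block(), store_block() and dwt2d() are illustrative placeholders for the OPB semaphore accesses, the OPB data transfers, and the (possibly FSL-accelerated) local 2-D DWT kernel, not actual EDK driver calls.

/* Per-MicroBlaze SPMD loop for the 2-D DWT task (illustrative only). */
#include <stdbool.h>

extern void sem_lock(void);                       /* acquire the OPB hardware semaphore */
extern void sem_unlock(void);                     /* release the semaphore              */
extern bool fetch_block(int *id, int *buf);       /* copy next image block; false: done */
extern void store_block(int id, const int *buf);  /* write results back over the OPB    */
extern void dwt2d(int *buf);                      /* local 2-D DWT on one image block   */

void spmd_worker(int *local_buf)
{
    for (;;) {
        int id;

        sem_lock();                          /* serialize access to the OPB bus    */
        bool got = fetch_block(&id, local_buf);
        sem_unlock();
        if (!got)
            break;                           /* all image blocks processed         */

        dwt2d(local_buf);                    /* process the block in local memory  */

        sem_lock();
        store_block(id, local_buf);          /* return the output image data       */
        sem_unlock();
    }
}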
To simulate the multi-processor platform, multiple MicroBlaze Simulink blocks are used.
Each of the MicroBlaze Simulink blocks is responsible for simulating one MicroBlaze proces-
sor. The 2-D DWT software programs are compiled and provided to the MicroBlaze Simulink
blocks for simulation using the MicroBlaze instruction set simulator. The hardware accelera-
tors are described within MATLAB/Simulink. Each processor and its hardware accelerators
are placed in a MATLAB/Simulink subsystem. The arithmetic behavior of these MicroBlaze
based subsystems can be verified separately by following the co-simulation procedure described
in Section 3.5.1 before they are integrated into the complete multi-processor platform. The
simulation of the OPB shared bus interface is performed by an OPB Simulink block. The
global input/output data memory and the hardware semaphore are described within the MATLAB/Simulink modeling environment.
Figure 3.12: Execution time speed-ups of the 2-D DWT task. [Plot of execution speed-up versus the number of processors, with and without hardware accelerators.]
Figure 3.13: Utilization of the OPB bus interface when processing the 2-D DWT task. [Plot of utilization (percentage) versus the number of processors, with and without hardware accelerators.]
• Design space exploration: We consider different configurations of the multi-processor platform
as described above, varying the number of MicroBlaze processors used for processing the input
image data.
Figure 3.14: Simulation speed-ups achieved by the arithmetic level co-simulation environment. [Plot of simulation speed-up versus the number of processors, with and without hardware accelerators.]
The time performance of the multi-processor platform under different configurations for 2-D
DWT processing is shown in Figure 3.12. For the cases in which no hardware accelerators are employed,
the execution time speed-ups obtained from our arithmetic level co-simulation environment are
consistent with those of the low-level implementations reported in [37]. For both the cases with
and without hardware accelerators, the maximum execution time speed-up achieved by the
multi-processor platform is 3.18x. When no hardware accelerators are employed, the time
performance of the multi-processor system fails to improve on increasing the number of MicroBlaze
processors beyond 4. When hardware accelerators are employed, the time performance
of the multi-processor system fails to improve on increasing the number of MicroBlaze processors
beyond 2. Therefore, when no hardware accelerators are employed, the optimal configuration of
the multi-processor platform is the one that uses 4 MicroBlaze processors. When hardware
accelerators are employed, the optimal configuration is the one that uses two MicroBlaze processors.
These optimal configurations of the multi-processor platform are identified within our arithmetic
level co-simulation environment without the generation of low-level implementations and
low-level simulation. This is in contrast to the development approach taken by James-Roxby
et al. [37]. In [37], the optimal configurations of the multi-processor platform are identified
after the low-level implementations of the multi-processor platform are generated and
time-consuming low-level simulation is performed based on these low-level implementations.
Besides, the application designer can further identify the performance bottlenecks using the
simulation information gathered during the arithmetic level co-simulation process. Considering
the 2-D DWT application, the utilization of the OPB bus interface can be obtained from the
arithmetic level co-simulation process, as shown in Figure 3.13. The OPB bus provides
a shared channel for communication between the multiple processors. For the processing of
one image block, each MicroBlaze processor needs to fetch the input data from the global data
input memory through the OPB bus. After the 2-D DWT processing, each processor needs
to send the result data to the global data memory through the OPB bus. The OPB bus thus
acts as a processing bottleneck that limits the maximum speed-ups that can be achieved by the
multi-processor platform.
• Simulation speed-ups: The simulation speed-ups achieved using our arithmetic level co-simulation
approach as compared to low-level simulation using ModelSim are shown in
Figure 3.14. Similar to Section 3.5.1, the time for generating the low-level simulation models
that can be simulated in ModelSim is not accounted for in the experimental results shown
in Figure 3.14. For the various configurations of the multi-processor platform, we are able to
achieve simulation speed-ups ranging from 830x to 967x, with an average of 889x, when simulating
using our arithmetic level co-simulation environment as compared with the low-level behavioral
simulation using ModelSim.
• Analysis of simulation performance: For the 2-D DWT processing task, the proposed arithmetic
level co-simulation environment is able to achieve simulation speed-ups very close to those
of the two numerical processing applications discussed in Section 3.5.1.
Higher simulation speed-ups are achieved as more processors are employed in the designs.
The MicroBlaze cycle-accurate instruction set simulator is a manually optimized C simulation
model. Simulating the execution of software programs on MicroBlaze is much more efficient using
the instruction set simulator than using the behavioral simulation model within ModelSim.
As more processors are employed in the design, a larger portion of the complete system is
simulated using the instruction set simulators, which leads to increased simulation speed-ups
of the complete system.
In addition, with the same number of MicroBlaze processors employed in the systems,
simulation of systems without hardware accelerators consistently achieves slightly higher simulation
speed-ups compared with that of systems with hardware accelerators. This is mainly due to two reasons.
One reason is the high simulation speed offered by the MicroBlaze instruction set simulator as
discussed above. Also, fewer Simulink blocks are used for describing the arithmetic-level behavior
of the systems when no hardware accelerators are used. This reduces the communication
overhead between the Simulink simulation models and the instruction-set simulation and thus
contributes to the increase in simulation speed-ups.
3.6 Summary
In this chapter, we propose a design space exploration technique based on arithmetic level
cycle-accurate hardware-software co-simulation for application development using FPGA based
configurable multi-processor platforms. An implementation of the proposed technique based on
MATLAB/Simulink is provided to illustrate the construction of the proposed arithmetic level
co-simulation environment. The designs of several numerical computation and image processing
applications are provided to demonstrate the effectiveness of the proposed design space
exploration technique based on arithmetic level co-simulation.
Chapter 4
Energy performance modeling and energy efficient
mapping for a class of applications
4.1 Introduction
Reconfigurable hardware has evolved to become Reconfigurable System-on-Chips (RSoCs).
Many modern reconfigurable hardware devices integrate general-purpose processor cores, reconfigurable
logic, memory, etc., on a single chip. This is driven by the advantages of programmable
design solutions over application specific integrated circuits and a recent trend in integrating
configurable logic, e.g., FPGAs, and programmable processors, offering the "best of both worlds"
on a single chip.
In recent years, energy efficiency has become increasingly important in the design of various
computation and communication systems. It is especially critical in battery operated embedded
and wireless systems. RSoC architectures offer high efficiency with respect to time and energy
performance. They have been shown to achieve energy reduction and increase in computational
performance of at least one order of magnitude compared with traditional processors [9].
One important application of RSoCs is software defined radio (SDR). In SDR, dissimilar and
complex wireless standards (e.g. GSM, IS-95, wideband CDMA) are processed in a single base
station, where a large amount of data from the mobile terminals results in high computational
requirements. The state-of-the-art RISC processors and DSPs are unable to meet the signal
processing requirements of these base stations [49, 20]. Minimizing the power consumption has
also become a key issue for these base stations due to their high computation requirements,
which dissipate a lot of energy, as well as the inaccessible and distributed locations of the base
stations. RSoCs stand out as an attractive option for implementing various functions of SDR
due to their high performance, high energy efficiency, and reconfigurability.
In the systems discussed above, the application is decomposed into a number of tasks. Each
task is mapped onto different components of the RSoC device for execution. By synthesis, we
mean finding a mapping that determines an implementation for the tasks. We can map a task
to a hardware implementation on the reconfigurable logic, or to a software implementation using the
embedded processor core. Besides, RSoCs offer many control knobs (see Section 4.2 for details),
which can be used to improve energy efficiency. In order to better exploit these control knobs,
a performance model of the RSoC architectures and algorithms for mapping applications onto
these architectures are required. The RSoC model should allow for a systematic abstraction
of the available control knobs and enable system-level optimization. The mapping algorithms
should capture the parameters from the RSoC model, the communication costs for moving
data between different components on RSoCs, and the configuration costs for changing the
configuration of the reconfigurable logic. These communication and configuration costs cannot
be ignored compared with the costs required for computation. We show that a simple greedy
mapping algorithm that maps each task onto either hardware or software, depending upon which
dissipates the least amount of energy, does not always guarantee minimum energy dissipation
in executing the application.
We propose a three-step design process to achieve energy efficient hardware/software co-synthesis
on RSoCs. First, we develop a performance model that represents a general class of
RSoC architectures. The model abstracts the various knobs that can be exploited for energy
minimization during the synthesis process. Then, based on the RSoC model, we formulate a
mapping problem for a class of applications that can be modeled as linear pipelines. Many
embedded signal processing applications, such as the ones considered in this chapter, are composed
of such a linear pipeline of processing tasks. Finally, a dynamic programming algorithm
is proposed for solving the above mapping problem. The algorithm is shown to be able to find
a mapping that achieves minimum energy dissipation in polynomial time.
We synthesize two beamforming applications onto Virtex-II Pro to demonstrate the effectiveness
of our design methodology. Virtex-II Pro is a state-of-the-art RSoC device from Xilinx.
In this device, PowerPC 405 processor(s), reconfigurable logic, and on-chip memory are tightly
coupled through on-chip routing resources [85]. The beamforming applications considered can
be used in embedded sonar systems to detect the direction of arrival (DOA) of nearby objects.
They can also be deployed at base stations using the SDR technique to better exploit the
limited radio spectrum [63].
The organization of this chapter is as follows. Section 4.2 identifies the knobs for energy-efficient
designs on RSoC devices. Section 4.3 discusses related work. Section 4.4 describes
the proposed RSoC model. Section 4.5 describes the class of linear pipeline applications we
are targeting and formulates the energy-efficient mapping problem. Section 4.6 presents our
dynamic programming algorithm. Section 4.7 illustrates the algorithm using two state-of-the-art
beamforming applications. The modeling process and the energy dissipation results
of implementing the two applications onto Virtex-II Pro are also given in this section. We
conclude in Section 4.8.
4.2 Knobs for energy-efficient designs
Various hardware and system level design knobs are available in RSoC architectures to optimize
the energy efficiency of designs. For embedded processor cores, dynamic voltage scaling and
dynamic frequency scaling can be used to lower the power consumption. The processor cores
can be put into idle or sleep mode if desired to further reduce their power dissipation. For
memory, the memory (SDRAM) on Triscend A7 CSoC devices can be placed in an active,
stand-by, disabled, or power-down state. Memory banking, which can be applied to the block-wise
memory (BRAMs) in Virtex-II Pro, is another technique for low power designs. In this
technique, the memory is split into banks which are selectively activated based on their use.
For reconfigurable logic, there are knobs at two levels that can be used to improve the energy
efficiency of the designs: the low level and the algorithm level.
Low-level knobs refer to knobs at the register-transfer or gate level. For example, Xilinx
exposes the various features on their devices to designers through the unisim library [82]. One
low-level knob is clock gating, which is employed to disable the clock to blocks to save power
when the output of these blocks is not needed. In Virtex-II Pro, it can be realized by using
primitives such as BUFGCE to dynamically drive a clock tree only when the corresponding block
is used [85]. Choosing hardware bindings is another low-level knob. A binding is a mapping of a
computation to a specific component on the RSoC. Alternative realizations of a functionality using
different components on the RSoC result in different amounts of energy dissipation for the same
computation. For example, there are three possible bindings for storage elements in Virtex-II
Pro: registers, slice based RAMs, and embedded Block RAMs (BRAMs). The
experiments by Choi et al. [64] show that registers and slice based RAMs have better energy
efficiency for implementing small amounts of storage while BRAMs have better energy efficiency
for implementing large amounts of storage.
Algorithm-level knobs refer to knobs that can be used during algorithm development
to reduce energy dissipation. It has been shown that energy performance can be improved
significantly by optimizing a design at the algorithm level [61]. One algorithm-level knob is
architecture selection. It plays a major role in determining the amount of interconnect and
logic to be used in the design and thus affects the energy dissipation. For example, matrix
multiplication can be implemented using a linear array or a 2-D array. A 2-D array uses more
interconnects and can result in more energy dissipation compared with a 1-D array. Another
algorithm-level knob is algorithm selection. An application can be mapped onto reconfigurable
logic in several ways by selecting different algorithms. For example, when implementing the
FFT, a radix-4 based algorithm significantly reduces the number of complex multiplications
that would otherwise be needed if a radix-2 algorithm were used. Other algorithm-level
knobs are parallel processing and pipelining.
Reconfigurable architectures are becoming domain-specific and integrate reconfigurable
logic with a mix of resources, such as in the ASMBL (Application Specific Modular Block) architecture
proposed by Xilinx [70]. More control knobs for application development are therefore expected
to become available on RSoCs in the future.
4.3 Related work
Gupta et al. [27] and Xie et al. [78] have considered the hardware/software co-design problem
in the context of reconfigurable architectures. They use techniques such as configuration pre-fetching
to minimize the execution time. Energy efficiency is not addressed by their research.
Experiments on re-mapping critical software loops from a microprocessor to hardware
implementations using configurable logic are carried out by Villarreal et al. [77]. Significant
energy savings are achieved for a class of applications. However, a systematic technique that finds
the optimal hardware and software implementations of these applications is not addressed. Such
a systematic technique is a focus of this chapter.

Table 4.1: Maximum operating frequencies of different implementations of an 18×18-bit multiplication on Virtex-II Pro

Design    VHDL (inferred)   VHDL (unisim)   IP cores
F_max     ~120 MHz          ~207 MHz        ~354 MHz
A hardware-software bi-partitioning algorithm based on network flow techniques for dynamically
reconfigurable systems has been proposed by Rakhmatov et al. [62]. While their algorithm
can be used to minimize the energy dissipation, designs on RSoCs are more complicated than
a hardware-software bi-partitioning problem due to the many control knobs discussed in the
previous section.
A C-to-VHDL high-level synthesis framework is proposed by Gupta et al. [27]. The input
to their design flow is C code and they employ a set of compiler transformations to optimize
the resulting designs. However, a generic HDL description is usually not enough to achieve the best
performance, as recent FPGAs integrate many heterogeneous components. The use of device
specific design constraint files and vendor IP cores, as in the MATLAB/Simulink based
design flow, plays an important role in achieving good performance. For example, Virtex-II Pro
has embedded multipliers. We consider three implementations of an 18×18-bit multiplication
using these multipliers. In the first implementation, we use VHDL for the functional description.
The use of embedded multipliers for implementing the functions is inferred by the synthesis
tool. In the second implementation, this is accomplished by directly controlling the related
low-level knobs through the unisim library. In the third implementation, we use the IP core
for multiplication from Xilinx. Low-level knobs and the device specific design constraints are
already applied for performance optimization during the generation of these IP cores. The
maximum operating frequencies of these implementations are shown in Table 4.1. The implementation
using the IP core has by far the highest maximum operating frequency. The reason
for such performance improvement is that the specific locations of the embedded multipliers
require appropriate connections between the multipliers and the registers around them. Use
of appropriate location and timing constraints, as in the generation of the IP cores, leads to
improved performance when using these multipliers [2]. It is expected that such constraint
files and vendor IP cores will also have a significant impact on the energy efficiency of the designs.
Therefore, compared with Gupta et al.'s approach, we consider task graphs as input to our
design flow. In our approach, the energy efficiency of the designs is improved by making use of
the various control knobs on the target device and parameterized implementations of the tasks.
System level tools are becoming available to synthesize applications onto architectures composed
of both hardware and software components. Xilinx offers the Embedded Development Kit
(EDK), which integrates hardware and software development tools for Xilinx Virtex-II Pro [85].
In this design environment, the portion of the application to be synthesized onto software is
described using C/C++ and is compiled using GNU gcc. The portion of the application to
be executed in hardware is described using VHDL/Verilog and is compiled using Xilinx ISE.
In the Celoxica DK2 tool [13], Handel-C (C with additional hardware description) is used for
both hardware and software designs. Then, the Handel-C compiler synthesizes the hardware
and software onto the device. While these system level tools provide high level languages to
describe applications and map them onto processors and configurable hardware, none of them
addresses the synthesis of energy efficient designs.
4.4 Performance modeling of RSoC architectures
An abstraction of RSoC devices is proposed in this section. A model for Virtex-II Pro is
developed to illustrate the modeling process.
4.4.1 RSoC model
In Figure 4.1, the RSoC model consists of four components: a processor, reconfigurable logic
(RL) such as an FPGA, a memory, and an interconnect. There are various implementations of
the interconnect. For example, in Triscend CSoC [74], the interconnect between the ARM7
processor and the SDRAM is a local bus, while the interconnect between the SDRAM and the
configurable system logic is a dedicated data bus and a dedicated address bus. In Virtex-II
Pro [85], the interconnect between the PowerPC processor and the RL is implemented using
the on-chip routing resources. We abstract all these buses and connections as an interconnect
with (possibly) different communication time and energy costs between different components.
We assume that the memory is shared by the processor and the RL.
Figure 4.1: The RSoC model
Since the operating state of the interconnect depends on the operating states of the other
components, an operating state of the RSoC device, denoted as a system state, is thus determined
only by the operating states of the processor, the RL, and the memory. Let S denote the
set of all possible system states. Let PS(s), RS(s) and MS(s) be functions of a system state
s, s ∈ S. The outputs of these functions are integers that represent the operating states of the
processor, the RL and the memory, respectively. An operating state of the processor corresponds
to the state in which the processor is idle or is operating with a specific power consumption.
Suppose that an idle mode and dynamic voltage scaling with v−1 voltage settings are available
on the processor. The processor is assumed to operate at a specific frequency for each of the
voltage settings. Then, the processor has v operating states, 0 ≤ PS(s) ≤ v−1, with PS(s) = 0
being the state in which the processor is in the idle mode. The RL is idle when there is no input
data and it is clock gated without switching activity on it. Thus, when the RL is loaded with a
specific configuration, it can be in two different operating states depending on whether it is idle
or processing the input data. Suppose that there are c configurations for the RL; then the RL
has 2c operating states, 0 ≤ RS(s) ≤ 2c−1. We number the operating states of the RL such that
(a) for 0 ≤ RS(s) ≤ c−1, RS(s) is the state in which the RL is idle, loaded with configuration
RS(s); (b) for c ≤ RS(s) ≤ 2c−1, RS(s) is the state in which the RL is operating, loaded
with configuration RS(s)−c. Each power state of the memory corresponds to an operating
state. For example, when memory banking is used to selectively activate the memory banks,
each combination of the activation states of the memory banks represents an operating state
of the memory. Suppose that the memory has m operating states; then 0 ≤ MS(s) ≤ m−1.
The operating state of the interconnect is related to the operating states of the other three
components. Considering the above, the total number of distinct system states is 2vcm.
The application is modeled as a collection of tasks with dependencies (see Section 4.5.1 for
details). Suppose that task i' is to be executed immediately preceding task i. Also, suppose
that task i' is executed in system state s' and task i is executed in system state s. If s' ≠ s,
a system state transition is required. The transition between different system states incurs a
certain amount of energy. Our model consists of the following parameters:
• ΔEV_{PS(s'),PS(s)}: state transition energy dissipation in the processor from PS(s') to PS(s)
• ΔEC_{RS(s'),RS(s)}: state transition energy dissipation in the RL from RS(s') to RS(s)
• ΔEM_{MS(s'),MS(s)}: memory state transition energy dissipation from MS(s') to MS(s)
• IP: processor power consumption in the idle state
• IR: RL power consumption in the idle state
• PM_{MS(s)}: memory power consumption in state MS(s)
• MP_{MS(s)}: average energy dissipation for transferring one bit of data between the memory
and the processor when the memory is in state MS(s)
• MR_{MS(s)}: average energy dissipation for transferring one bit of data between the memory
and the RL when the memory is in state MS(s)
The system state transition costs depend not only on the source and destination system
states of the transition but also on the requirements of the application. Let Δ_{i',i,s',s} be the
energy dissipation for such a system state transition. Δ_{i',i,s',s} can be calculated as

\[
\Delta_{i',i,s',s} = \Delta EV_{PS(s'),PS(s)} + \Delta EC_{RS(s'),RS(s)} + \Delta EM_{MS(s'),MS(s)} + \Delta A_{i'i}
\tag{4.1}
\]

where ΔA_{i'i} is the additional cost for transferring data from task i' to task i in a given mapping.
For this given mapping, ΔA_{i'i} can be calculated based on the application model discussed in
Section 4.5.1 and the communication costs MP_{MS(s)} and MR_{MS(s)}.
4.4.2 A model for Virtex-II Pro
There are four components to be modeled in Virtex-II Pro. One is the embedded PowerPC core.
Due to the limitations in measuring the effects of frequency scaling, we assume that the processor
has only two operating states, On and Off, and is operating at a specific frequency when it
is On. Thus, v = 2. We ignore IP since the PowerPC processor does not draw any power if it
is not used in a design. ΔEV_{0,1} and ΔEV_{1,0} are also ignored since changing the processor states
dissipates a negligible amount of energy compared with that dissipated when it performs computation. Two
partial reconfiguration methods, which are module based and small bit manipulation based [86],
are available on the RL of Virtex-II Pro. For the small bit manipulation based partial reconfiguration,
switching the configuration of one module on the device to another configuration
requires downloading the difference between the configuration files for this module. This is different
from the module based approach, which requires downloading the entire configuration file
for the module. Thus, the small bit manipulation based partial reconfiguration has relatively
low latency and energy dissipation compared with the module based one. Therefore, we use the
small bit manipulation based method. We estimate the reconfiguration cost as the product of
the number of slices used by the implementation and the average cost for downloading data for
configuring one FPGA slice. According to [82], the energy for reconfiguring the entire Virtex-II
Pro XC2VP20 device is calculated as follows. Let ICC_Int denote the current for powering the
core of the device. We assume that the current for configuring the device is mainly drawn
from ICC_Int. From the data sheet [85], ICC_Int = 500 mA@1.5V during configuration and
ICC_Int = 300 mA@1.5V during normal operation, so the reconfiguration power is estimated as
(500 − 300) mA × 1.5 V = 300 mW. The time for reconfiguring the entire device using SelectMAP (50
MHz) is 20.54 ms. Thus, the energy for reconfiguring the entire device is 6162 µJ. There are
9280 slices on the device. Together with the slice usage from the post place-and-route report
generated by the Xilinx ISE tool [82], we estimate the energy dissipation of reconfiguration as
ΔEC_{RS(s'),RS(s)} = 6162 × (total number of slices used by the RL in operating state RS(s)) /
9280 µJ. The quiescent power is the static power dissipated by the RL when it is on. This power
cannot be optimized at the system level if we do not power the RL on and off. Thus, it is not
considered in this chapter. Since IR represents the quiescent power, it is set to zero. We also
ignore the energy dissipation for enabling/disabling clocks to the design blocks on the RL in
the calculation of ΔEC_{RS(s'),RS(s)} since it is negligible compared with the other energy costs.
For memory modeling, we use BRAM. It has only one available operating state, so m = 1 and
MS(s) = 0. Since the memory does not change its state, ΔEM_{MS(s'),MS(s)} = 0. The BRAM
dissipates a negligible amount of energy when there is no memory access. We ignore this value so
that PM_0 = 0. Using the power model from [64] and [84], the energy dissipation MR_0 is estimated
as 42.9 nJ/Kbyte.
Figure 4.2: A linear pipeline of tasks (T_0 → T_1 → T_2 → ... → T_{n-1})
The communication between the processor and the memory follows certain protocols
on the bus. This is specified by the vendor. Its energy efficiency differs depending on the
bus protocols used. The energy cost MP_0 is measured through low-level simulation.
Table 4.2: Energy dissipation E_{i,s} for executing task T_i in state s

      0 ≤ s ≤ (v−1)cm−1         (v−1)cm ≤ s ≤ vcm−1        vcm ≤ s ≤ (2v−1)cm−1
1     EP_{i,s}                  ER_{i,s}                   EP_{i,s} + ER_{i,s}
2     TP_{i,s} · IR             TR_{i,s} · IP              0
3     TP_{i,s} · PM_{MS(s)}     TR_{i,s} · PM_{MS(s)}      max(TP_{i,s}, TR_{i,s}) · PM_{MS(s)}
4.5 Problem formulation
A model for a class of applications with linear dependency constraints is described in this
section. Then, a mapping problem is formulated based on both the RSoC model and the
application model.
4.5.1 Application model
The application consists of a set of tasks, T_0, T_1, T_2, ..., T_{n−1}, with linear precedence constraints.
T_i must be executed before initiating T_{i+1}, i = 0, ..., n−2. Due to the precedence
constraints, only one task is executed at any time. The execution can be on the processor, on
the RL, or on both. There is data transfer between adjacent tasks. The transfer can occur
between the processor and the memory or between the RL and the memory, depending on
where the tasks are executed.
The application model consists of the following parameters:
• D^i_in and D^i_out: amount of data input from the memory to task T_i and amount of data output from task
T_i to the memory.
• EP_{i,s} and TP_{i,s}: processor energy and time cost for executing task T_i in system state s.
EP_{i,s} = TP_{i,s} = ∞ if task T_i cannot be executed in system state s.
• ER_{i,s} and TR_{i,s}: RL energy and time cost for executing task T_i in system state s.
ER_{i,s} = TR_{i,s} = ∞ if task T_i cannot be executed in system state s.
4.5.2 Problem definition
We now formulate the problem based on the parameters of the RSoC model and the application
model. In this chapter, an energy efficient mapping is defined as the mapping that minimizes
the overall energy dissipation for executing the application over all possible mappings.
During the execution of the application, a task can begin execution as soon as its predecessor
task finishes execution. Thus, for any possible system state s, the processor and the
RL cannot be in the idle state at the same time. The total number of possible system states is
|S| = (2v−1)cm. Let the system states be numbered from 0 to (2v−1)cm−1. Then, depending
on the sources of energy dissipation, we divide the system states into three categories (a short
decoding sketch is given after the list):
• For 0 ≤ s ≤ (v−1)cm−1, s denotes the system state in which the processor is in state
PS(s) (1 ≤ PS(s) ≤ v−1), the RL is in the idle state loaded with configuration RS(s), and
the memory is in state MS(s). PS(s), RS(s) and MS(s) are determined by solving the equation
s = (PS(s)−1)cm + RS(s)m + MS(s).
• For (v−1)cm ≤ s ≤ vcm−1, s denotes the system state in which the processor is in the idle
state (PS(s) = 0), the RL is operating with configuration RS(s)−c (c ≤ RS(s) ≤ 2c−1), and
the memory is in state MS(s). PS(s), RS(s) and MS(s) are determined by solving the equation
s = (RS(s)−c)m + MS(s) + (v−1)cm.
• For vcm ≤ s ≤ (2v−1)cm−1, s denotes the system state in which the processor is in state
PS(s) (1 ≤ PS(s) ≤ v−1), the RL is operating with configuration RS(s)−c (c ≤ RS(s) ≤ 2c−1), and
the memory is in state MS(s). PS(s), RS(s) and MS(s) are determined by solving the equation
s = (PS(s)−1)cm + (RS(s)−c)m + MS(s) + vcm.
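A small decoding routine makes this numbering concrete; it simply mirrors the three equations above and is given only as a sketch.

/* Decode a system state index s, 0 <= s <= (2v-1)cm - 1, into the component
 * states PS(s), RS(s) and MS(s), following the three cases above. */
void decode_state(int s, int v, int c, int m, int *PS, int *RS, int *MS)
{
    if (s <= (v - 1) * c * m - 1) {          /* processor operating, RL idle      */
        *PS = s / (c * m) + 1;
        *RS = (s / m) % c;
        *MS = s % m;
    } else if (s <= v * c * m - 1) {         /* processor idle, RL operating      */
        int t = s - (v - 1) * c * m;
        *PS = 0;
        *RS = t / m + c;
        *MS = t % m;
    } else {                                 /* both processor and RL operating   */
        int t = s - v * c * m;
        *PS = t / (c * m) + 1;
        *RS = (t / m) % c + c;
        *MS = t % m;
    }
}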
Let E_{i,s} denote the energy dissipation for executing T_i in state s. Then, E_{i,s} is calculated
as the sum of the following:
• the energy dissipated by the processor and/or the RL executing T_i;
• if the processor or the RL is in the idle state, the idle energy dissipation of that component;
• the energy dissipated by the memory during the execution of T_i.
The above three sources of energy dissipation are calculated as in Table 4.2.
We calculate the system state transition costs using Equation (4.1). Since a linear
pipeline of tasks is considered, i' = i−1. The energy dissipation for the state transition
between the execution of two consecutive tasks T_{i−1} and T_i, namely Δ_{i−1,i,s',s},
is calculated as ΔEV_{PS(s'),PS(s)} + ΔEC_{RS(s'),RS(s)} + ΔEM_{MS(s'),MS(s)} + D^i_out · MP_{MS(s')} +
D^i_in · MP_{MS(s)}.
Let s_i denote the system state while executing T_i under a specific mapping, 0 ≤ i ≤ n−1.
The overall system energy dissipation is given by

\[
E_{total} = E_{0,s_0} + \sum_{i=1}^{n-1} \left( E_{i,s_i} + \Delta_{i-1,i,s_{i-1},s_i} \right)
\tag{4.2}
\]

Now, the problem can be stated as: find a mapping of tasks to system states, that is, a
sequence s_0, s_1, ..., s_{n−1}, such that the overall system energy dissipation given by Equation
(4.2) is minimized.
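Given per-task energies E_{i,s} and transition costs Δ, a candidate mapping can be scored directly from Equation (4.2); the helper below is a sketch under the callback conventions stated in its comment.

/* Evaluate Equation (4.2) for a candidate mapping s[0..n-1].
 * E(i, state) returns E_{i,state}; Delta(i, s_prev, s_cur) returns
 * Delta_{i-1,i,s_prev,s_cur}.  Both are assumed to come from the models. */
double total_energy(int n, const int *s,
                    double (*E)(int i, int state),
                    double (*Delta)(int i, int s_prev, int s_cur))
{
    double e = E(0, s[0]);                          /* E_{0,s_0}                     */
    for (int i = 1; i < n; i++)
        e += E(i, s[i]) + Delta(i, s[i - 1], s[i]); /* E_{i,s_i} + Delta_{i-1,i,...} */
    return e;
}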
4.6 Algorithm for energy minimization
We create a trellis according to the RSoC model and the application model. Based on the
trellis, a dynamic programming algorithm is presented in Section 4.6.2.
4.6.1 Trellis creation
A trellis is created as illustrated in Figure 4.3. It consists of n+2 steps, ranging from −1 to n.
Each step corresponds to one column of nodes shown in the figure. Step −1 and step n each consist
of only one node, node 0, which represents the initial state and the final state of the system, respectively. Step i,
0 ≤ i ≤ n−1, consists of |S| nodes, numbered 0, 1, ..., |S|−1, each of which represents
a system state for executing task T_i. The weight of node N_s in step i is the energy cost E_{i,s}
for executing task T_i in system state s. If task T_i cannot be executed in system state s, then
E_{i,s} = ∞. Since node N_0 in step −1 and in step n does not contain any task, E_{−1,0} = E_{n,0} = 0.
There are directed edges (1) from node N_0 in step −1 to node N_j in step 0, 0 ≤ j ≤ |S|−1; (2)
from node N_j in step i−1 to node N_k in step i for i = 1, ..., n−1, 0 ≤ j, k ≤ |S|−1; and (3)
from node N_j in step n−1 to node N_0 in step n, 0 ≤ j ≤ |S|−1. The weight of the edge from
node N_{s'} in step i−1 to node N_s in step i is the system state transition energy cost Δ_{i−1,i,s',s},
0 ≤ s', s ≤ |S|−1. Note that all the weights are non-negative.
4.6.2 A dynamic programming algorithm
Based on the trellis, our dynamic programming algorithm is described below. We associate each
node with a path cost P_{i,s}. Define P_{i,s} as the minimum energy cost for executing T_0, T_1, ..., T_i
with T_i executed in node N_s in step i. Initially, P_{−1,0} = 0. Then, for each successive step i,
0 ≤ i ≤ n, we calculate the path cost for all the nodes in the step. The path cost P_{i,s} for node
N_s in step i is calculated as
Figure 4.3: The trellis. [Steps −1 through n shown as columns of nodes 0, ..., |S|−1 (a single node 0 in steps −1 and n), with edges weighted by the transition costs Δ_{−1,0,0,s}, Δ_{i−1,i,s',s}, and Δ_{n−1,n,s',0}.]
• For i = 0:

\[
P_{i,s} = \Delta_{-1,0,0,s} + E_{i,s}, \quad \text{for } 0 \le s \le |S|-1
\tag{4.3}
\]

• For 1 ≤ i ≤ n−1:

\[
P_{i,s} = \min_{0 \le s' \le |S|-1} \{ P_{i-1,s'} + \Delta_{i-1,i,s',s} + E_{i,s} \}, \quad \text{for } 0 \le s \le |S|-1
\tag{4.4}
\]

• For i = n:

\[
P_{i,s} = \min_{0 \le s' \le |S|-1} \{ P_{i-1,s'} + \Delta_{i-1,i,s',s} + E_{i,s} \}, \quad \text{for } s = 0
\tag{4.5}
\]
Only one path cost is associated with node N_0 in step n. A path that achieves this path cost is
defined as a surviving path. Using this path, we identify a sequence s_0, s_1, ..., s_{n−1}, which
specifies how each task is mapped onto the RSoC device. From the above discussion, we have:

Theorem 1 The mapping identified by a surviving path achieves the minimum energy dissipation
among all the mappings.
Since we need to consider O((2v−1)cm) possible paths for each node and there are
O((2v−1)cm · n) nodes in the trellis, the time complexity of the algorithm is O(v^2 c^2 m^2 n).
The configurations and the hardware resources are not reused between tasks in most cases,
which means that the trellis constructed in Figure 4.3 is usually sparsely connected. Therefore,
the following pre-processing can be applied to reduce the running time of the algorithm: (1)
nodes with ∞ weight and the edges incident on these nodes are deleted from the trellis; (2) the
remaining nodes within each step are renumbered. After this two-step pre-processing, we form
a reduced trellis and the dynamic programming algorithm is run on the reduced trellis.
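A compact sketch of the recurrence in Equations (4.3)-(4.5), with predecessor bookkeeping to recover the surviving path, is given below. The node and edge weights are supplied by callbacks; states removed by the pre-processing can simply be given infinite node weight.

/* Dynamic programming over the trellis (Equations (4.3)-(4.5)).
 * E(i, s) returns E_{i,s} (INFINITY if task i cannot run in state s) and
 * Delta(i, s_prev, s) returns Delta_{i-1,i,s_prev,s}; Delta(0, 0, s) is the
 * edge from the single node of step -1 and Delta(n, s_prev, 0) the edge to
 * the final node.  map[] receives the surviving path s_0..s_{n-1}. */
#include <math.h>
#include <stdlib.h>

double min_energy_mapping(int n, int S,
                          double (*E)(int i, int s),
                          double (*Delta)(int i, int s_prev, int s),
                          int *map)
{
    double *P    = malloc(S * sizeof *P);        /* path costs of step i-1       */
    double *Pn   = malloc(S * sizeof *Pn);       /* path costs of step i         */
    int    *pred = malloc(n * S * sizeof *pred); /* surviving predecessors       */

    for (int s = 0; s < S; s++)                  /* step 0: Equation (4.3)       */
        P[s] = Delta(0, 0, s) + E(0, s);

    for (int i = 1; i < n; i++) {                /* steps 1..n-1: Equation (4.4) */
        for (int s = 0; s < S; s++) {
            double best = INFINITY; int arg = 0;
            for (int sp = 0; sp < S; sp++) {
                double cand = P[sp] + Delta(i, sp, s) + E(i, s);
                if (cand < best) { best = cand; arg = sp; }
            }
            Pn[s] = best; pred[i * S + s] = arg;
        }
        for (int s = 0; s < S; s++) P[s] = Pn[s];
    }

    double best = INFINITY; int last = 0;        /* step n: Equation (4.5), E_{n,0} = 0 */
    for (int sp = 0; sp < S; sp++) {
        double cand = P[sp] + Delta(n, sp, 0);
        if (cand < best) { best = cand; last = sp; }
    }

    for (int i = n - 1; i >= 0; i--) {           /* backtrack the surviving path */
        map[i] = last;
        if (i > 0) last = pred[i * S + last];
    }

    free(P); free(Pn); free(pred);
    return best;
}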
4.7 Illustrative examples
To demonstrate the effectiveness of our approach, we implement a broadband delay-and-sum
beamforming application and an MVDR (minimum-variance distortionless response) beamforming
application on Virtex-II Pro, a state-of-the-art reconfigurable SoC device. These applications
are widely used in many embedded signal processing systems and in SDR [20].
4.7.1 Delay-and-sum beamforming
Using the model for Virtex-II Pro discussed in Section 4.4.2, implementing the delay-and-sum
beamforming application is formulated as a mapping problem. This problem is then solved
using the proposed dynamic programming algorithm.
4.7.1.1 Problem formulation
The task graph of the broadband delay-and-sum beamforming application [32] is illustrated in
Figure 4.4. A cluster of seven sensors samples data. Each set of the sensor data is processed by
an FFT unit and then all the data is fed into the beamforming application. The output is the
spatial spectrum response, which can be used to determine the directions of nearby objects.
The application calculates twelve beams and is composed of three tasks with linear dependences:
calculation of the relative delay for different beams according to the positions of the sensors (T_0),
computation of the frequency responses (T_1), and calculation of the amplitude for each output
frequency (T_2). The data in and data out are performed via the I/O pads on Virtex-II Pro.
The number of FFT points in the input data depends on the frequency resolution requirements.
The number of output frequency points is determined by the spectrum of interest. The three
tasks can be executed either on the PowerPC processor or on the RL. The amount of data
input (D^i_in) and output (D^i_out) varies with the tasks. For example, when both the number of
FFT points and the number of output frequency points are 1024, D^1_in and D^1_out for task T_1 are 14 bytes
and 84 Kbytes, respectively.
Figure 4.4: Task graph of the delay-and-sum beamforming application
We employ the algorithm-level control knobs discussed in Section 4.2 to develop various designs on the RL. There are many possible designs. For the sake of illustration, we implement two designs for each task. One of the main differences among these designs is the degree of parallelism, which affects the number of resources, such as I/O ports and sine/cosine look-up tables, used by the tasks. For example, one configuration of task T_0 handles two input data per clock cycle and requires more I/O ports than the other configuration that handles only one input per clock cycle. While the first configuration would dissipate more power and more reconfiguration energy than the second one, it reduces the latency to complete the computation. Similarly, one configuration for task T_2 uses two sine/cosine tables and thus can generate the output in one clock cycle, while the other configuration uses only one sine/cosine table and thus requires two clock cycles in order to generate the output.
Each task is mapped on the RL to obtain TR_{i,s} and ER_{i,s} values. The designs for the RL were coded in VHDL and synthesized using XST (Xilinx Synthesis Tool) provided by Xilinx ISE 5.2.03i [82]. The VHDL code for each task is parameterized according to the application requirements, such as the number of FFT points, and the architectural control knobs, such as precision of input data and hardware binding for storing intermediate data. The utilization of the device resources is obtained from the place-and-route report files (.par files). To obtain the power consumption of our designs, the VHDL code was synthesized using XST for XC2VP20 and the place-and-route design files (.ncd files) were obtained. Mentor Graphics ModelSim 5.7 was used to generate the simulation results (.vcd files). The .ncd and .vcd files were then provided to Xilinx XPower [87] to obtain the average power consumption. TR_{i,s} is calculated based on our designs running at 50 MHz with 16-bit precision. ER_{i,s} is calculated based on both TR_{i,s} and the power measurements from XPower.
Table 4.3: Energy dissipation of the tasks in the delay-and-sum beamforming application (μJ)

    E_{0,0}  9.17     E_{1,0}  38.47      E_{2,0}  2.31
    E_{0,1}  4.65     E_{1,1}  26.31      E_{2,1}  2.29
    E_{0,2}  48.31    E_{1,2}  3039.52    E_{2,2}  123.24
For the PowerPC core on Virtex-II Pro, we developed C code for each task, compiled it using the gcc compiler for PowerPC, and generated the bitstream using the tools provided by the Xilinx Embedded Development Kit (EDK). We used the SMART model from Synopsys [73], which is a cycle-accurate behavioral simulation model for PowerPC, to simulate the execution of the C code. The data to be computed is stored in the BRAMs of Virtex-II Pro. The latencies for executing the C code are obtained directly by simulating the designs using ModelSim 5.7. The energy dissipation is obtained assuming a clock frequency of 300 MHz and the analytical expression for processor power dissipation provided by Xilinx [85]: 0.9 mW/MHz × 300 MHz = 270 mW. Then, we estimate the TP_{i,s} and EP_{i,s} values. Note that the quiescent power is ignored in our experiments, as discussed in Section 4.4.2.
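As a simple illustration of this model (with a made-up latency), a task whose simulated latency on the PowerPC core is TP_{i,s} = 1 ms would be assigned an energy of

    EP_{i,s} = 270 mW × 1 ms = 270 μJ.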
Considering both the PowerPC and the FPGA, we have three system states for each of the three tasks on the reduced trellis after the pre-processing discussed in Section 4.6.2. Thus, 0 ≤ s ≤ 2. Table 4.3 shows the E_{i,s} values for the three tasks when the number of input FFT points and the number of output frequency points are 1024.
[Figure: grouped bar chart of energy dissipation (μJ) versus the number of output frequency points (256, 512, 1024, 2048), comparing the dynamic programming (DP) and greedy (GR) mappings; each bar is broken down into execution, communication, and configuration energy.]

Figure 4.5: Energy dissipation of different implementations of the broadband delay-and-sum beamforming application (the input data is after 2048-point FFT processing)
For simple designs, the values of the parameters discussed above can be obtained through low-level simulations. However, for complex designs with many possible parameterizations, such low-level simulation can be time consuming. This is especially the case for designs on RL. However, using the domain-specific modeling technique proposed in [14] and the power estimation tool proposed by us in [53], it is possible to obtain rapid and fairly accurate system-wide energy estimates of data paths on RL without the time consuming low-level simulation.
4.7.1.2 Energy minimization
We create a trellis with five steps to represent this beamforming application. After the pre-processing discussed in Section 4.6.2, step -1 and step 3 contain one node each while step 0,
[Figure: grouped bar chart of energy dissipation (μJ) versus the number of FFT points (256, 512, 1024), comparing the dynamic programming (DP) and greedy (GR) mappings; each bar is broken down into execution, communication, and configuration energy.]

Figure 4.6: Energy dissipation of different implementations of the broadband delay-and-sum beamforming application (the number of output frequency points is 256)
Figure 4.7: Task graph of the MVDR beamforming application
1 and 2 contain three nodes each on the reduced trellis. By using the values described above, we obtain the weights of all the nodes and the edges in the trellis. Based on this, our dynamic programming based mapping algorithm is used to find the mapping that minimizes the overall energy dissipation.
For the purpose of comparison, we consider a greedy algorithm that always maps each task
to the system state in which executing the task dissipates the least amount of energy. The
results are shown in Figure 4.5 and Figure 4.6. For all the considered problem sizes, energy
reduction ranging from 41% to 54% can be achieved by our dynamic programming algorithm
over the greedy algorithm.
Considering the case where both the number of FFT points of the input data and the number of output frequency points are 2048, the greedy algorithm maps task T_0 on the RL. However,
[Figure: block diagrams of the complex MAC architectures, showing the real and imaginary input ports (IN_kR, IN_kI), the accumulator registers (A_R, A_I), the reset signal, and the output ports (OUT_R, OUT_I) for (a) the 2-input, (b) the 4-input, and (c) the 8-input designs.]
Figure 4.8: MAC architectures with various input sizes
the dynamic programming algorithm maps this task on the processor, and a 54% reduction of overall energy dissipation is achieved by doing so. The reason for the energy reduction is analyzed as follows. Task T_0 is executed efficiently on the RL for both of the configuration files employed (ranging from 4.15 to 9.17 μJ). But the configuration costs for these two files are high (ranging from 272.49 to 343.03 μJ) since task T_0 needs sine/cosine functions. The Xilinx FPGA provides the CORE Generator lookup table [85] to implement the sine/cosine functions. For 16-bit input and 16-bit output sine/cosine lookup tables, the single output design (sine or cosine) needs 50 slices and the double output (both sine and cosine) design needs 99 slices. Two and three sine/cosine look-up tables are used in the two designs employed for T_0, which increases the reconfiguration costs for this task. The amount of computation energy dissipation of task T_0 is relatively small in this case and thus the configuration energy cost impacts the overall energy dissipation significantly. Therefore, executing the task on the processor dissipates less energy than executing it on the RL.
4.7.2 MVDR beamforming
Using a similar approach as in Section 4.7.1, we implemented an MVDR (Minimum Variance Distortionless Response) beamforming application on Virtex-II Pro. Details of the design process are discussed as follows.
4.7.2.1 Problem formulation
The task graph of the MVDR beamforming application is illustrated in Figure 4.7. It can be decomposed into five tasks with linear constraints. In T_0, T_1 and T_2, we implemented a fast algorithm described in [32] for MVDR spectrum calculation. It consists of: Levinson-Durbin recursion to calculate the coefficients of a prediction-error filter (T_0), correlation of the predictor coefficients (T_1), and the MVDR spectrum computation using FFT (T_2). This fast algorithm eliminates much of the computation required by the direct calculation. We employ an LMS (Least Mean Square) algorithm (T_3) to update the weight coefficients of the filter due to its simplicity and numerical stability. A spatial filter (T_4) is used to filter the input data. The coefficients of the filter are determined by the previous tasks.
[Figure: bar chart of energy dissipation (μJ) versus M (8, 16, 32) for designs based on complex MACs with 2, 4, and 8 inputs.]

Figure 4.9: Energy dissipation of task T_1 implemented using various MAC architectures
[Figure: grouped bar chart of energy dissipation (×10^5 μJ) versus the number of output frequency points (256, 512, 1024, 2048), comparing the dynamic programming (DP) and greedy (GR) mappings; each bar is broken down into execution, communication, and configuration energy.]

Figure 4.10: Energy dissipation of various implementations of the MVDR beamforming application (M = 64)
We considered the low-level and algorithm-level control knobs discussed in Section 4.2 and developed various designs for the tasks, which are listed in Table 4.4. Different degrees of parallelism are employed in the designs for tasks T_0 and T_1. Task T_2 uses FFT to calculate the MVDR spectrum. We employed the various FFT designs discussed in [64], which are based on the radix-4 algorithm, as well as the design from the Xilinx CORE Generator. Clock gating, various degrees of parallelism, and memory bindings are used in these FFT designs to improve energy efficiency. V_p and H_p are the vertical and horizontal parallelism employed by the designs (see [64] for more details). Two different bindings, one using slice based RAMs and the other using BRAMs, were used to store the intermediate values. The number of dedicated multipliers used in the designs for tasks T_3 and T_4 is varied. Using the approach described in Section 4.7.1, we developed parameterized VHDL code for T_0, T_1, and T_2. Parameterized designs for T_3 and T_4 are realized using a MATLAB/Simulink based design tool developed by us in [53]. All the designs for the RL and the processor core were mapped on the corresponding components
[Figure: grouped bar chart of energy dissipation (×10^5 μJ) versus the number of output frequency points (256, 512, 1024, 2048), comparing the dynamic programming (DP) and greedy (GR) mappings; each bar is broken down into execution, communication, and configuration energy.]

Figure 4.11: Energy dissipation of various implementations of the MVDR beamforming application (the number of points of FFT is 256)
for execution. Values of the parameters of the RSoC model and the application model were obtained through low-level simulation. The synthesized designs for the RL run at a clock rate of 50 MHz. The data precision is 10 bits. Table 4.5 shows the E_{i,s} values when M = 8 and the number of FFT points is 16, after the pre-processing discussed in Section 4.6.2.
Let M denote the number of antenna elements. For task T_0 and task T_1, we need to perform a complex multiply-and-accumulate (MAC) for problem sizes from 1 to M. Typically, M = 8, 16 are used in the area of software defined radio while M = 32, 64 are used in embedded sonar systems. There are several trade-offs which affect the energy efficiency when selecting the number of inputs to the complex MAC when implementing tasks T_0 and T_1. Architectures for complex MACs with 2, 4, and 8 inputs are shown in Figure 4.8. For a fixed M, using a complex MAC architecture that handles more input data at the same time reduces the execution latency. However, it dissipates more power. It also occupies more FPGA slices, which increases the configuration cost. The energy dissipation when using MACs with different input sizes for task T_1 is analyzed in Figure 4.9. While a MAC with an input size of 4 is most energy efficient for task T_1 when M = 8, a MAC with an input size of 2 is most energy efficient when M = 64. Also, the numbers of slices required for a complex MAC that can handle 2, 4, and 8 inputs are 100, 164, and 378, respectively. This incurs different configuration costs between the tasks.
4.7.2.2 Energy minimization
A trellis with seven steps was created to represent the MVDR application. After applying the pre-processing technique discussed in Section 4.6.2, step -1 and step 5 contain one node each, steps 0 and 1 contain four nodes each, step 2 contains seven nodes, step 3 contains three nodes, and step 4 contains five nodes on the reduced trellis. Using the values described in the previous section, we obtain the weights of all the nodes and the edges on the reduced trellis. The proposed dynamic programming algorithm is then used to find a mapping that minimizes the overall energy dissipation.

The results are shown in Figure 4.11. For all the considered problem sizes, energy reductions from 41% to 46% are achieved by our dynamic programming algorithm over a greedy algorithm that maps each task onto either hardware or software, depending upon which dissipates the least amount of energy.
For both the dynamic programming algorithm and the greedy algorithm, tasks T_0 and T_1 are mapped to the RL. However, the dynamic programming algorithm maps them to the design using the 2-input complex MAC while the greedy algorithm maps them to the designs based on the 4-input complex MAC in cases such as when M = 16. Designs based on the 2-input complex MAC are not the ones that dissipate the least amount of execution energy for all the cases considered. However, the designs based on the MAC with 2 inputs occupy less area than those based on the MACs with 4 and 8 inputs. By doing so, the reconfiguration energy during the execution of the MVDR beamforming application is reduced by 34% to 66% in our experiments. For task T_2, the dynamic programming algorithm maps it onto the
Table 4.4: Various implementations of the tasks on RL for the MVDR beamforming application

    T_0 and T_1                    T_2                               T_3
    Design  No. of inputs          Design  V_p  H_p  Binding         Design  No. of embedded
            to the MAC                                                       multipliers used
    1       2                      1       1    2    SRAM            1       11
    2       4                      2       1    3    SRAM            2       7
    3       8                      3       1    3    BRAM            3       3
                                   4       1    4    BRAM
                                   5       1    5    BRAM
                                   6       Xilinx design

    T_4
    Design  No. of embedded multipliers used
    1       32
    2       28
    3       24
    4       20
    5       16
Table 4.5: Energy dissipation of the tasks in the MVDR beamforming application (μJ)

    E_{0,0}  145.2      E_{1,0}  61.0       E_{2,0}  67.2        E_{3,0}  21.4    E_{4,0}  5.0
    E_{0,1}  161.9      E_{1,1}  47.3       E_{2,1}  432.3       E_{3,1}  69.9    E_{4,1}  10.1
    E_{0,2}  274.4      E_{1,2}  59.8       E_{2,2}  363.2       E_{3,2}  82.1    E_{4,2}  17.7
    E_{0,3}  24590.2    E_{1,3}  5397.8     E_{2,3}  2223.1                       E_{4,3}  23.3
                                            E_{2,4}  12687.4                      E_{4,4}  27.6
                                            E_{2,5}  167.7
                                            E_{2,6}  16038.0
PowerPC processor core while the greedy algorithm maps it on the RL. While the parameterized FFT designs by Choi et al. [64] minimize the execution energy dissipation of T_2 through the employment of parallelism, radix, and choices of storage types, such energy minimization is achieved by using more area on the RL. This increases the reconfiguration energy costs and thus is not an energy efficient option when synthesizing the beamforming application.
4.8 Summary
A three-step design process for energy efficient application synthesis using RSoCs is proposed in this chapter. The design of two beamforming applications is presented to illustrate the effectiveness of our techniques.

While our work focused on energy minimization, with minor modifications, the performance model can be extended to capture throughput and area, and the mapping algorithm can be used to minimize the end-to-end latency. We used a linear task graph to model the application. When the task graph is generalized to be a DAG (Directed Acyclic Graph), it can be shown that the resulting optimization problem becomes NP-hard.
Chapter 5
High-level rapid energy estimation and design space
exploration
5.1 Introduction
Soft processors provide exceptional design flexibility to reconfigure hardware and have become popular in the development of many embedded systems. Examples of such soft processors include Nios from Altera [3], LEON3 from Gaisler [23], and MicroBlaze and PicoBlaze from Xilinx [79]. As shown in Figure 5.1, for FPGA-based application development, the application designer can map portions of the application to be executed either on soft processors as software programs or on customized hardware peripherals attached to the processors. On the one hand, customized hardware peripherals are efficient for executing many data intensive computations. On the other hand, processors are efficient for executing many control and management functionalities as well as computations with tight data dependency between computation steps (e.g., recursive algorithms). Designs using processors can be more compact and require a much smaller amount of resources than customized hardware peripherals. Compact designs that fit into a small FPGA device can effectively reduce quiescent energy dissipation [75].
Energy efficiency is an important performance metric in the design of many embedded systems (e.g., software defined radio systems [20]). Many hardware-software mappings and implementations of the applications are possible. They can cause significant variation in the energy dissipation of the FPGA hardware platform. Being able to rapidly obtain the energy dissipation of these different mappings and implementations is crucial for executing the applications on FPGAs in an energy efficient manner. We address the following design problem.

Problem definition: The FPGA device is configured with a soft processor and several customized hardware peripherals. The processor and the hardware peripherals communicate with each other through some specific bus protocols. The target application is decomposed into a set of tasks. Each task can be mapped onto the soft processor (software) or a specific customized hardware peripheral (hardware) for execution. We are interested in a specific mapping and execution scheduling of the tasks. For tasks executed on customized hardware peripherals, their implementations are described using arithmetic modeling environments (e.g., MATLAB/Simulink [44], MILAN [7]). For tasks executed on the soft processor, the software programs are described as C code and compiled using the C compiler for the specific platform. One or more sets of sample input data are also given. Under these assumptions, our objective is to rapidly and (fairly) accurately obtain the energy dissipation of the complete application.
There are two major challenges for achieving such rapid and (fairly) accurate energy estimation for hardware-software co-designs on FPGA platforms. One challenge is that state-of-the-art design tools rely on time-consuming low-level estimation techniques based on RTL (Register Transfer Level) and gate level simulation models to obtain energy dissipation. While these low-level energy estimation techniques can be accurate, they are too time-consuming and would be intractable when used to evaluate the energy performance of the different implementations on FPGAs, especially for software programs running on soft processors. Taking the designs shown in
Figure 5.1: FPGA-based hardware-software co-design
Section ?? as an example, simulating ~2.78 msec of actual execution time of a matrix multiplication application based on post place-and-route simulation models takes ~3 hours in ModelSim [46]. Using XPower [79] to analyze the simulation record file and calculate energy dissipation requires an additional ~1 hour. Thus, low-level energy estimation techniques are impractical for evaluating the energy performance of different implementations of the applications. Another challenge is that high-level energy performance modeling, which can provide rapid energy estimation, is difficult for designs using FPGAs. FPGAs have look-up tables as their basic elements. They lack a single high-level model, like that of general purpose processors, which can capture the energy dissipation behavior of all the possible implementations on them.
We propose in this chapter a two-step rapid energy estimation technique for hardware-software co-design using FPGAs. In the first step, arithmetic level high-level abstractions are
created for the hardware and software execution platforms. Based on these high-level abstractions, cycle-accurate arithmetic level hardware-software co-simulation is performed to obtain the arithmetic behavior of the complete system. Activity information of the corresponding low-level implementation of the system is estimated from the cycle-accurate arithmetic level co-simulation process. In the second step, by utilizing the estimated low-level activity information, an instruction-level energy estimation technique is employed to estimate the energy dissipation of software execution. Also, a domain-specific modeling technique is employed to estimate the energy dissipation of hardware execution. The energy dissipation of the complete system is obtained by summing up the energy dissipation of the hardware and software execution.

For illustrative purposes, we provide one implementation of the proposed energy estimation technique based on MATLAB/Simulink. To demonstrate the effectiveness of our approach, we show the design of two widely used numerical computation applications. For these two applications, our estimation technique achieves up to 6534x speed-up compared with estimation based on low-level simulations. Compared with the results from actual measurement, the energy estimates obtained using our approach achieve an average estimation error of 11.6% for various implementations of the two applications. The implementations of the two applications identified using our energy estimation technique achieve energy reductions of up to 52.2% compared with other implementations considered in our experiments.

The organization of this chapter is as follows. Section 5.2 discusses related work. Section 5.4 describes our rapid energy estimation technique. A MATLAB/Simulink implementation of our technique is also provided in this section. The design of two numerical computation applications is shown in Section 5.7 to demonstrate the effectiveness of our approach. Finally, we conclude in Section 5.8.
5.2 Related work
There are several energy estimation techniques for designs using FPGAs. One technique is based on low-level simulation, which is taken by commercial tools such as Quartus II [3] and XPower [79] as well as academic tools such as the one developed by Poon et al. [58]. Taking XPower as an example, the end user generates the low-level implementation of the complete design and creates a post place-and-route simulation model based on the low-level implementation. Then, sample input data is provided to perform post place-and-route simulation of the design using tools such as ModelSim [46]. A simulation file which records the switching activity of each logic component and interconnect on the FPGA device is generated during the simulation process. XPower calculates the power consumption of each logic component and interconnect using the recorded activity information as well as the pre-measured power characteristics (e.g., capacitance) of the logic components and interconnects. The energy dissipation of a design is obtained by summing up the energy dissipation of the logic components and interconnects. While such low-level simulation based energy estimation techniques can be accurate, they are inefficient for estimating the energy dissipation of hardware-software co-designs using FPGAs. This is because such low-level post place-and-route simulations are very time consuming, especially when simulating the execution of software programs. In another energy estimation technique, the parameters that have significant impact on energy dissipation are pre-defined or provided by the application designer. These parameters are then used by the energy models for energy estimation. This technique is used by tools such as the RHinO tool [76] and the web power analysis tools from Xilinx [83]. While energy estimation using this technique can be fast, as it avoids the time-consuming low-level simulation process, its estimation accuracy varies among applications and application designers. One reason is that different applications demonstrate different energy dissipation behaviors. We show in [54] that using pre-defined parameters for energy estimation can result in energy estimation errors as high as 32% for input data with different characteristics. Another reason is that requiring the application designer to provide these important parameters demands a deep understanding of the energy behavior of the target devices and the target applications, which can prove very difficult for many practical designs. In particular, this approach is not suitable for estimating the energy of software execution, since many different instructions with different energy dissipation are executed on soft processors for different software programs.
Previous research has focused on estimating the energy dissipation of a few popular commercial and academic processors. For example, JouleTrack estimates the energy dissipation of software programs on StrongARM SA-1100 and Hitachi SH-4 processors [69]. Wattch [11] and SimplePower [88] estimate the energy dissipation of an academic processor. We proposed an instruction-level energy estimation technique in [55], which can provide rapid and fairly accurate energy estimation for soft processors. However, since these energy estimation frameworks and tools target a relatively fixed processor execution platform, they do not address the energy dissipated by the customized hardware peripherals and the communication interfaces. Thus, they are not suitable for hardware-software co-designs on FPGA platforms.
5.3 Domain-specific modeling

Domain-specific modeling is a hybrid (top-down followed by bottom-up) modeling approach. It starts with a top-down analysis of the algorithms and the architectures for implementing the kernel. Through the top-down analysis, the various possible low-level implementations of the kernel are grouped into domains depending on the architectures and algorithms used. By doing so, we enforce a high-level architecture for the implementations that belong to the same domain. With such enforcement, high-level modeling within the domains becomes possible. An analytical formulation of energy functions is derived within each domain to capture the energy behavior of the implementations belonging to the domain. Then, a bottom-up approach is followed to estimate the constants in these analytical energy functions for the identified domains through low-level sample implementations. This includes profiling individual system components through low-level simulations, real experiments, etc. These domain-specific energy functions are platform-specific. That is, the constants in the energy functions have different values for different FPGA platforms. During the application development process, these energy functions are used for rapid energy estimation of hardware implementations belonging to their corresponding domains.

The domain-specific models can be hierarchical. The energy function of a kernel can contain the energy functions of the sub-kernels that constitute the kernel. Besides, characteristics of the input data (e.g., switching activities) can have considerable impact on energy dissipation and are also inputs to the energy functions. This characteristic information is obtained through low-level simulation or through the arithmetic level co-simulation described in Section 5.4.1. See [14] for more details regarding the domain-specific modeling technique.
Figure 5.2: Domain-specific modeling
5.4 Our approach
Figure 5.3: The two-step energy estimation approach
Our two-step approach for rapid energy estimation of hardware-software co-design using FPGAs is illustrated in Figure 5.3. In the first step, we build a cycle-accurate arithmetic level hardware-software co-simulation environment to simulate the applications running on FPGAs. By "arithmetic level", we denote that only the arithmetic aspects of the hardware-software execution are captured by the co-simulation environment. For example, low-level implementations of multiplication on Xilinx Virtex-II FPGAs can be realized using either slice-based multipliers or embedded multipliers. The arithmetic level co-simulation environment only captures the multiplication arithmetic aspect during the simulation process. By "cycle-accurate", we denote that, for each of the clock cycles under simulation, the arithmetic level behavior of the complete FPGA hardware platform predicted by the cycle-accurate co-simulation environment should match the arithmetic level behavior of the corresponding low-level implementations. When simulating the execution of software programs on soft processors, the cycle-accurate co-simulation should take into account the number of clock cycles required for completing a specific
instruction (e.g., the multiplication instruction of the MicroBlaze processor takes three clock cycles to finish) and the processing pipeline of the processor. Also, when simulating the execution on customized hardware peripherals, the cycle-accurate co-simulation should take into account the delay, in number of clock cycles, caused by the processing pipelines within the customized hardware peripherals. Our arithmetic level simulation environment ignores the low-level implementation details and focuses on only the arithmetic behavior of the designs. This greatly speeds up the hardware-software co-simulation process. In addition, the cycle-accurate property is maintained among the hardware and software simulators during the co-simulation process. Thus, the activity information of the corresponding low-level implementations of the hardware-software execution platform, which is used in the second step for energy estimation, can be accurately estimated from the high-level co-simulation process.
Figure 5.4: Software architecture of our hardware-software co-simulation environment
In the second step, we utilize the information gathered during the arithmetic level co-simulation process for rapid energy estimation. The types and the number of each type of instruction executed on the soft processor are obtained from the cycle-accurate instruction simulation process. By utilizing an instruction-level energy estimation technique, the instruction execution information is used to estimate the energy dissipation of the software programs running on the soft processor. For customized hardware implementations, the switching activities of the low-level implementations are estimated by analyzing the switching activities of the arithmetic level simulation results. Then, with the estimated switching activity information,
energy dissipation of the hardware peripherals is estimated by utilizing a domain-specific energy performance modeling technique proposed by us [14]. The energy dissipation of the complete system is obtained by summing up the energy dissipation of the software programs and the hardware peripherals.

For illustrative purposes, an implementation of our rapid energy estimation technique based on MATLAB/Simulink is described in the following subsections.
5.4.1 Step 1: Cycle-accurate arithmetic level co-simulation
The software architecture of an arithmetic level hardware-software co-simulation environment for designs using FPGAs is illustrated in Figure 5.4. The low-level implementation of the FPGA based hardware-software execution platform consists of three major components: the soft processor for executing software programs; customized hardware peripherals acting as hardware accelerators for parallel execution of some specific computations; and communication interfaces for exchanging data and control signals between the processor and the customized hardware components. An arithmetic level ("high-level") abstraction is created for each of the three major components. These high-level abstractions are simulated using their corresponding simulators. These hardware and software simulators are tightly integrated into our co-simulation environment and concurrently simulate the arithmetic behavior of the hardware-software execution platform. Most importantly, the simulations among the integrated simulators are synchronized at each clock cycle and provide cycle-accurate simulation results for the complete hardware-software execution platform. Once the design process using the arithmetic level abstraction is completed, the application designer specifies the required low-level hardware bindings for the arithmetic operations (e.g., binding the embedded multipliers for realization of the multiplication arithmetic operation). Finally, low-level implementations of the complete platform with
corresponding arithmetic behavior can be automatically generated based on the arithmetic level abstractions of the hardware-software execution platform.

We provide an implementation of the arithmetic level co-simulation approach based on MATLAB/Simulink, the software architecture of which is shown in Figure 5.5. The four major functionalities of our MATLAB/Simulink based co-simulation environment are described in the following paragraphs.
Figure 5.5: An implementation of the hardware-software co-simulation environment based on
MATLAB/Simulink
• Cycle-accurate simulation of the software programs: The input C programs are compiled using the compiler for the specific processor (e.g., the GNU C compiler mb-gcc for MicroBlaze) and translated into binary executable files (e.g., .ELF files for MicroBlaze). These binary executable files are then simulated using a cycle-accurate instruction set simulator for the specific processor. Taking the MicroBlaze processor as an example, the executable .ELF files are loaded into mb-gdb, the GNU C debugger for MicroBlaze. A cycle-accurate instruction set simulator for the MicroBlaze processor is provided by Xilinx. mb-gdb sends instructions of the loaded executable files to the MicroBlaze instruction set simulator and performs cycle-accurate simulation of the execution of the software programs. mb-gdb also sends/receives commands and data to/from MATLAB/Simulink through the Simulink block for the soft processor and
interactively simulates the execution of the software programs concurrently with the simulation of the hardware designs within MATLAB/Simulink.
• Simulation of customized hardware peripherals: The customized hardware peripherals are described using the MATLAB/Simulink based FPGA design tools. For example, System Generator supplies a set of dedicated Simulink blocks for describing parallel hardware designs using FPGAs. These Simulink blocks provide arithmetic level abstractions of the low-level hardware components. There are blocks that represent the basic hardware resources (e.g., flip-flop based registers, multiplexers), blocks that represent control logic, mathematical functions, and memory, and blocks that represent proprietary IP (Intellectual Property) cores (e.g., the IP cores for Fast Fourier Transform and finite impulse response filters). Considering the Mult Simulink block for multiplication provided by System Generator, it captures the arithmetic behavior of multiplication by presenting at its output port the product of the values presented at its two input ports. The low-level design trade-off that the Mult Simulink block can be realized using either embedded or slice-based multipliers is not captured in its arithmetic level abstraction. The application designer assembles the customized hardware peripherals by dragging and dropping the blocks from the block set into his/her designs and connecting them via the Simulink graphic interface. Simulation of the customized hardware peripherals is performed within MATLAB/Simulink. MATLAB/Simulink maintains a simulation timer for keeping track of the simulation process. Each unit of simulation time counted by the simulation timer equals one clock cycle experienced by the corresponding low-level implementations. Finally, once the design process completes, the low-level implementations of the customized hardware peripherals are automatically generated by the MATLAB/Simulink based design tools.
• Data exchange and synchronization among the simulators: The Simulink block for the soft processor is responsible for exchanging simulation data between the software and hardware simulators during the co-simulation process. MATLAB/Simulink provides Gateway In and
Gateway Out Simulink blocks for separating the simulation of the hardware designs described using System Generator from the simulation of other Simulink blocks (including the MicroBlaze Simulink blocks). These Gateway In and Gateway Out blocks identify the input/output communication interfaces of the customized hardware peripherals. For the MicroBlaze processor, the MicroBlaze Simulink block sends the values of the processor registers stored in the MicroBlaze instruction set simulator to the Gateway In blocks as input data to the hardware peripherals. Vice versa, the MicroBlaze Simulink block collects the simulation output of the hardware peripherals from the Gateway Out blocks and uses the output data to update the values of the processor registers stored in the MicroBlaze instruction set simulator. The Simulink block for the soft processor also simulates the communication interfaces between the soft processor and the customized hardware peripherals described in MATLAB/Simulink. For example, the MicroBlaze Simulink block simulates the communication protocol and the FIFO buffers for communication through the Xilinx dedicated FSL (Fast Simplex Link) interfaces [79].
Besides, the Simulink block for the soft processor maintains a global simulation timer which keeps track of the simulation time experienced by the hardware and software simulators. When exchanging the simulation data among the simulators, the Simulink block for the soft processor takes into account the number of clock cycles required by the processor and the customized hardware peripherals to process the input data as well as the delays caused by transmitting the data between them. Then, the Simulink block increases the global simulation timer accordingly. By doing so, the hardware and software simulations are synchronized on a cycle-accurate basis.
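The cycle-level synchronization described above can be pictured with the following Python sketch; the processor and peripheral objects, their step() and write_input() methods, and the fixed transfer_delay are hypothetical stand-ins for the instruction set simulator, the Simulink hardware model, and the FSL-style interface latency, and are not the actual tool interfaces.

    class CoSimulator:
        """Lock-step, cycle-accurate co-simulation of a processor model and a
        customized hardware peripheral model (illustrative sketch only)."""

        def __init__(self, processor, peripheral, transfer_delay):
            self.processor = processor            # wraps the instruction set simulator
            self.peripheral = peripheral          # wraps the hardware (Simulink) model
            self.transfer_delay = transfer_delay  # communication latency in cycles
            self.cycle = 0                        # global simulation timer
            self.in_flight = []                   # (delivery_cycle, data) pairs

        def run(self, num_cycles):
            for _ in range(num_cycles):
                # Advance the software simulator by one clock cycle; any value it
                # writes to the communication interface is delayed by the modeled
                # transfer latency before the peripheral can see it.
                data = self.processor.step(self.cycle)
                if data is not None:
                    self.in_flight.append((self.cycle + self.transfer_delay, data))
                ready = [d for c, d in self.in_flight if c <= self.cycle]
                self.in_flight = [(c, d) for c, d in self.in_flight if c > self.cycle]
                for d in ready:
                    self.peripheral.write_input(d)
                # Advance the hardware simulator by the same clock cycle so that
                # both models stay aligned on a cycle-by-cycle basis.
                self.peripheral.step(self.cycle)
                self.cycle += 1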
5.4.2 Step 2: Energy estimation
The energy dissipation of the complete system is obtained by summing up the energy dissipation of software execution and that of hardware execution, which are estimated separately by utilizing the activity information gathered during the arithmetic level co-simulation process.
5.4.2.1 Instruction-level energy estimation for software execution
An instruction-level energy estimation technique is employed to estimate the energy dissipation of the software execution on the soft processor. An instruction energy look-up table is created which stores the energy dissipation of each type of instruction for the specific soft processor. The types and the number of each type of instruction executed when the software program is running on the soft processor are obtained during the arithmetic level hardware-software co-simulation process. By querying the instruction energy look-up table, the energy dissipation of these instructions is obtained. The energy dissipation for executing the software program is calculated as the sum of the energy dissipation of the instructions.
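A minimal sketch of this estimate is given below; the instruction names and the per-instruction energy values in the look-up table are made up for illustration, and in our flow they would come from the characterization procedure described next.

    from collections import Counter

    # Hypothetical instruction energy look-up table (uJ per executed instruction).
    INSTRUCTION_ENERGY = {
        "add": 0.0021, "mul": 0.0058, "lw": 0.0034, "sw": 0.0031, "br": 0.0019,
    }

    def software_energy(instruction_trace):
        """Sum the per-instruction energies over the trace of executed
        instructions gathered during the cycle-accurate instruction simulation."""
        counts = Counter(instruction_trace)
        return sum(INSTRUCTION_ENERGY[op] * n for op, n in counts.items())

    # Example: a trace with 1000 adds, 200 multiplies, and 300 loads.
    trace = ["add"] * 1000 + ["mul"] * 200 + ["lw"] * 300
    print(software_energy(trace))   # estimated software energy in uJ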
We use the MicroBlaze processor to illustrate the creation of the instruction energy look-up table. The overall flow for generating the look-up table is illustrated in Figure 5.6. We developed sample software programs that target each instruction in the MicroBlaze processor instruction set by embedding assembly code into the sample C programs. In the embedded assembly code, we repeatedly execute the instruction of interest for a certain amount of time with more than 100 different sets of input data and under various execution contexts. ModelSim was used to perform low-level simulation of the execution of the sample software programs. The gate-level switching activities of the device during the execution of the sample software programs are recorded by ModelSim as simulation record files (.vcd files). Finally, a low-level energy estimator such as XPower was used to analyze these simulation record files and estimate the energy dissipation of the instructions of interest. See [55] for more details on the construction of instruction-level energy estimators for FPGA configured soft processors.
5.4.2.2 Domain-specific modeling based energy estimation for hardware execution

Energy dissipation of the customized hardware peripherals is estimated through the integration of a domain-specific energy performance modeling technique proposed by us [14]. As is
Figure 5.6: Flow of generating the instruction energy look-up table
shown in Figure 5.2, a kernel (a specific functionality performed by the customized hardware peripherals) can be implemented on FPGAs using different architectures and algorithms. For example, matrix multiplication on FPGAs can employ a single processor or a systolic architecture. FFT on FPGAs can adopt a radix-2 based or a radix-4 based algorithm. These different architectures and algorithms use different amounts of logic components and interconnect, which prevents modeling their energy performance through a single high-level model.
Figure 5.7: Python classes organized as domains
In order to support the domain-specific energy performance modeling technique, the application designer must be able to group different designs of the kernels into domains and associate the performance models identified through domain-specific modeling with the domains. Since the organization of the MATLAB/Simulink block set is inflexible and is difficult for further re-organization and extension, we map the blocks in the Simulink block set into classes in the object-oriented Python scripting language [60]. As shown in Figure 5.8, the mapping follows some naming rules. For example, block xbsBasic_r3/Mux, which represents hardware multiplexers, is mapped to a Python class CxlMul. All the design parameters of this block, such as inputs (number of inputs) and precision (precision), are mapped to the data attributes of its corresponding class and are accessible as CxlMul.inputs and CxlMul.precision. Information on the input and output ports of the blocks is stored in the data attributes ips and ops. By doing so, hardware implementations are described using the Python language and are automatically translated into corresponding designs in MATLAB/Simulink. For example, for two Python objects A and B, A.ips[0:2] = B.ops[2:4] has the same effect as connecting the third and fourth output ports of the Simulink block represented by B to the first two input ports of the Simulink block represented by A.
Figure 5.8: Python class library
After mapping the block set to the flexible class library in Python, re-organization of the class hierarchy according to the architectures and algorithms represented by the classes becomes possible. Considering the example shown in Figure 5.7, Python class A represents various implementations of a kernel. It contains a number of subclasses A(1), A(2), ..., A(N). Each of the subclasses represents one implementation of the kernel that belongs to the same domain. Energy performance models identified through domain-specific modeling (i.e., the energy functions shown in Figure 5.2) are associated with these classes. The input to these energy functions is determined by the attributes of the Python classes when they are instantiated. When invoked, the estimate() method associated with the Python classes returns the energy dissipation of the Simulink blocks calculated using the energy functions.
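The sketch below shows what such a class might look like; the class name, the linear form of the energy function, and the constant values are invented for illustration, and in practice the constants are fitted from low-level simulations of sample implementations within the domain.

    class MatMulLinearArrayDomain:
        """One domain of matrix multiplication designs (e.g. a linear array of PEs).

        The energy function used here, E = n^3 * (A * sw + B) + C * n, is a made-up
        example of the analytical forms derived during domain-specific modeling.
        """
        A, B, C = 0.8e-3, 1.5e-3, 2.0e-3   # hypothetical fitted constants (uJ)

        def __init__(self, n, switching_activity):
            self.n = n                       # problem size (n x n matrices)
            self.sw = switching_activity     # estimated from the co-simulation

        def estimate(self):
            """Return the estimated energy dissipation (uJ) of this instance."""
            return self.n ** 3 * (self.A * self.sw + self.B) + self.C * self.n

    # A hypothetical 16 x 16 instance with an average switching activity of 0.25.
    print(MatMulLinearArrayDomain(16, 0.25).estimate())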
Besides, as a key factor that affects energy dissipation, switching activity information is required before these energy functions can accurately estimate the energy dissipation of a design. The switching activity of the low-level implementations is estimated using the information obtained from the high-level co-simulation described in Section 5.4.1. For example, the switching activity of the Simulink block for addition is estimated as the average switching activity of the two input and the output data. The switching activity of the processing elements (PEs) of the CORDIC design shown in Figure 5.27 is calculated as the average switching activity of all the wires that connect the Simulink blocks contained by the PEs. As shown in Figure 5.9, the high-level switching activities of the processing elements (PEs) shown in Figure 5.27, obtained within MATLAB/Simulink, coincide with their power consumption obtained through low-level simulation. Therefore, using such high-level switching activity estimates can greatly improve the accuracy of our energy estimates. Note that for some Simulink blocks, their high-level switching activities may not coincide with their power consumption under some circumstances. For example, Figure 5.10 illustrates the power consumption of slice-based multipliers for input data sets with different switching activities. These multipliers demonstrate "ceiling effects" when the switching activities of the input data are larger than 0.23. Such "ceiling effects" are captured when deriving the energy functions for these Simulink blocks in order to ensure the accuracy of our rapid energy estimates.
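One simple way to obtain such high-level switching activities from the arithmetic level simulation is sketched below; the fixed word length, the Hamming-distance definition of activity, and the averaging over an adder's ports mirror the description above but are simplified assumptions.

    def switching_activity(values, width=16):
        """Average fraction of bits toggling between consecutive samples of a
        signal observed during the arithmetic level co-simulation."""
        if len(values) < 2:
            return 0.0
        mask = (1 << width) - 1
        toggles = sum(bin((a ^ b) & mask).count("1")
                      for a, b in zip(values, values[1:]))
        return toggles / (width * (len(values) - 1))

    def adder_activity(in_a, in_b, out):
        """Activity of an addition block, approximated as the average activity
        of its two input signals and its output signal."""
        return (switching_activity(in_a) + switching_activity(in_b)
                + switching_activity(out)) / 3.0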
[Figure: plot of high-level switching activity and low-level power consumption (mW) for the four processing elements of the CORDIC divider.]

Figure 5.9: High-level switching activities and power consumption of the PEs that constitute the CORDIC divider shown in Figure 5.27
5.5 Energy estimation for customized hardware components
5.5.1 Software architecture
PyGen is written in Python, which is an object-oriented scripting language with concise syntax, flexible data types, and dynamic typing [60]. It is widely used in many software systems. There have also been attempts to use Python for hardware design [28].

The software architecture of PyGen is shown in Figure 5.11. It contains four major modules. The architecture and the function of these modules are described in the following.
5.5.1.1 PyGen module
The PyGen module is a Python module. It is responsible for creating the communication between PyGen and MATLAB/Simulink and for mapping the basic building blocks in System Generator to Python classes.

MATLAB provides three ways for creating such communication: the MATLAB COM (Component Object Model) server, the MATLAB engine, and a Java interface [44]. We build the communication interface through the MATLAB COM server by using the Python Win32 extensions
[Figure: plot of high-level switching activity and power consumption (mW) of slice-based multipliers for 15 input data sets.]

Figure 5.10: High-level switching activities and power consumption of slice-based multipliers
from [30]. Through this interface, PyGen and System Generator can obtain the relevant information from each other and control each other's behavior. For example, moving a design block in System Generator can change the placement properties of the corresponding Python object and vice versa. After a design is described in Python, the PyGen module communicates with MATLAB/Simulink and creates a corresponding design in Simulink. Since the PyGen module is a basic module, application designers are required to import it first using the script import PyGen every time they describe their designs in Python.

Using a specific naming convention, the PyGen module maps the basic block set provided by System Generator to the corresponding classes (basic classes) in Python, as shown in Figure 5.12. For example, block xbsBasic_r3/Mux, which is a System Generator block representing hardware multiplexers, is mapped to a Python class CxlMul. All the design parameters of this block, such as inputs (number of inputs) and precision (precision), are mapped to the data attributes of its corresponding class and are accessible as CxlMul.inputs and CxlMul.precision. The information on the input and output ports of the blocks is stored in the data attributes ips and ops. Therefore, for two Python objects A and B, A.ips[0:2] = B.ops[2:4]
Figure 5.11: Architecture of PyGen
has the same effect as connecting the third and fourth output ports of block B to the first two input ports of A.
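The following sketch mimics this mapping convention with plain Python objects; Port, Block, the simplified CxlMul, and the connect() helper are illustrative stand-ins, not the actual PyGen classes.

    class Port:
        """A single input or output port of a block; recorded connections can
        later be replayed as wires in the generated Simulink design."""
        def __init__(self, owner, index):
            self.owner, self.index, self.source = owner, index, None

    class Block:
        def __init__(self, name, num_inputs, num_outputs):
            self.name = name
            self.ips = [Port(self, i) for i in range(num_inputs)]
            self.ops = [Port(self, i) for i in range(num_outputs)]

    class CxlMul(Block):
        """Simplified stand-in for the class generated from the Mux block."""
        def __init__(self, name, inputs=2, precision=16):
            super().__init__(name, inputs, 1)
            self.inputs, self.precision = inputs, precision

    def connect(dst_ports, src_ports):
        for dst, src in zip(dst_ports, src_ports):
            dst.source = src          # equivalent to drawing a wire in Simulink

    # The PyGen statement A.ips[0:2] = B.ops[2:4] then corresponds to:
    B = Block("B", 0, 4)
    A = CxlMul("A")
    connect(A.ips[0:2], B.ops[2:4])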
Using the PyGen module, application designers describe their designs by instantiating classes from the Python class library, which is equivalent to dragging and dropping blocks from the System Generator block set into their designs. By leveraging object-oriented class inheritance in Python, application designers can extend the class library by creating their own classes (extended classes, represented by the shaded blocks in Figure 5.12) and derive parameterized designs. This is further discussed in Section 5.5.3.1.
Figure 5.12: Python class library within PyGen
5.5.1.2 Performance estimator
After a PyGen class is instantiated, a performance model is associated with the generated object. The performance model captures the performance of this object, such as resource utilization, latency, and energy dissipation. The resource utilization can be obtained by invoking the Resource Estimator block provided by System Generator and parsing its output. We are currently interested in the number of slices, the amount of Block RAM, and the number of embedded multipliers used in a design. Regarding latency, it can be obtained directly from the latency data attribute if the object is instantiated from the basic classes, or it can be calculated based on the construction and the data attributes of the object if the object is instantiated from the extended classes. To obtain the energy performance, we integrate a domain-specific modeling technique for rapid and accurate energy estimation proposed in [14]. This is further discussed in Section 5.5.3.2.
5.5.1.3 Energy profiler
The energy profiler can analyze the energy dissipation of a given component and interconnect of a design. The design flow using the profiler is shown in Figure 5.13. After the design is created, the application designer follows the standard FPGA design flow to synthesize and implement the design. Design files (.ncd files) that represent the FPGA netlist are generated. Then, the design is simulated using ModelSim to generate simulation files (.vcd files). These files record the switching activity of the various hardware components on the device. The design files (.ncd files) and the simulation files (.vcd files) are then fed back to the profiler within PyGen. The profiler has an interface to XPower [79] and can obtain the average power consumption of the clock network, nets, logic, and I/O pads by querying XPower through this interface. Since the VHDL code generated by System Generator maintains the naming hierarchy of the original design, the profiler sums up these power values according to this naming hierarchy
and outputs the power consumption of System Generator blocks or PyGen objects. Combined with appropriate timing information, the power values can be further translated into values of energy dissipation.

The energy profiler can identify the energy hot spots in designs. More importantly, as discussed in Section 5.5.4.3, it can be used to generate feedback information and to improve the accuracy of the performance estimator.
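The aggregation performed by the profiler can be pictured as follows; the per-net report format, the hierarchy separator, and the numbers in the example are assumptions made for illustration rather than the actual XPower interface.

    from collections import defaultdict

    def profile_by_block(net_power_mw, level=1, clock_period_ns=20.0, cycles=100000):
        """Aggregate per-net average power (mW) into per-block energy (uJ).

        net_power_mw maps hierarchical net names such as 'fft/stage0/mult/p'
        to average power in mW; 'level' selects the depth of the naming
        hierarchy at which the results are grouped.
        """
        block_power = defaultdict(float)
        for name, p in net_power_mw.items():
            prefix = "/".join(name.split("/")[:level])
            block_power[prefix] += p
        runtime_s = clock_period_ns * 1e-9 * cycles
        # E (uJ) = P (mW) * t (s) * 1e3
        return {blk: p * runtime_s * 1e3 for blk, p in block_power.items()}

    # Example with made-up values parsed from a power report.
    report = {"fft/stage0/mult/p": 1.2, "fft/stage0/add/s": 0.4, "ctrl/counter/q": 0.1}
    print(profile_by_block(report))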
Figure 5.13: Design flow using the energy profiler
5.5.1.4 Optimization module
The optimization module provides two functions: description of the design constraints and optimization of the design with respect to the constraints. Since parameterized designs are developed as Python classes, application designers realize the two functions by writing Python code and manipulating the PyGen classes. This gives the designers complete flexibility to incorporate a variety of optimization algorithms and provides a way to quickly traverse the MATLAB/Simulink design space.
5.5.2 Overall design flow
Based on the architecture of PyGen discussed above, the design flow is illustrated in Figure 5.14. The shaded boxes represent the four major functionalities offered by PyGen in addition to the original MATLAB/Simulink design flow.
• Parameterized design development. Parameterized designs are described in Python. Design parameters such as data precision, degree of parallelism, hardware binding, etc., can be captured by the Python designs. After the designs are completed, PyGen is invoked to translate the designs in Python into the corresponding designs in MATLAB/Simulink. Changes to the MATLAB/Simulink designs, such as adjustments to the placement of the blocks, are also reflected in the PyGen environment through the communication channel between them.
• Performance estimation. Using the modeling environment of MATLAB/Simulink, application designers can perform arithmetic level simulation to verify the correctness of their designs. Then, by providing the simulation results to the performance estimator within PyGen and invoking it, application designers can quickly estimate the performance of their designs, such as energy dissipation and resource utilization.
• Optimization for energy efficiency. Application designers provide design constraints, such as end-to-end latency, throughput, number of available slices and embedded multipliers, etc., to the optimization module. After optimization is completed, PyGen outputs the designs which have the maximum energy efficiency according to the performance metrics used while satisfying the design requirements.
• Profile and feedback. The design process can be iterative. Using the energy profiler, PyGen can break down the results from low-level simulation and profile the energy dissipation of various components of the candidate designs. The application designers can use this profiling to adjust the architectures and algorithms used in their designs. Such energy profiling information can also be used to refine the energy estimates from the performance estimator.
Figure 5.14: Design flow of PyGen
Finally, using System Generator to generate the corresponding VHDL code, application designers can follow the standard FPGA design flow to synthesize and implement these designs and download them to the target devices.

The input to our design tool is a task graph. That is, the target application is decomposed into a set of tasks with communication between them. The development using PyGen is then divided into two levels: kernel level and application level. The objectives of kernel level development are to develop parameterized designs for each task and to provide support for rapid energy estimation. The objectives of application level development are to describe the application using the available kernels and to optimize its energy performance with respect to the design constraints.
5.5.3 Kernel level development
The kernel level development consists of two design steps, which are discussed below.
5.5.3.1 Parameterized kernel development
As shown in [14], different implementations of a task (e.g., a kernel) provide different design trade-offs for application development. Taking matrix multiplication as an example, designs with a lower degree of parallelism require fewer hardware resources than those with a higher degree of parallelism while introducing a larger latency. Also, at the implementation level, several trade-offs are available. For example, in the realization of storage, registers, slice-based RAMs, and Block RAMs can be used. These implementations offer different energy efficiency depending on the size of the data that needs to be stored. The objective of parameterized kernel design is to capture these design and implementation trade-offs and make them available for application development.
While System Generator offers limited support for developing parameterized kernels, PyGen has a systematic mechanism for this purpose by way of Python classes. Application designers expand the Python class library and create extended classes. Each extended class is constructed as a tree, which contains a hierarchy of subclasses. The leaf nodes of the tree are basic classes while the other nodes are extended classes. An example of such an extended class is shown in Figure 5.15. This example illustrates some extended classes in the construction of a parameterized FFT kernel in PyGen. Once an extended class is instantiated, its subclasses also get instantiated. When this is translated to the MATLAB/Simulink environment by the PyGen module, it has the same effect as generating subsystems in MATLAB/Simulink, dragging and dropping a number of blocks into these subsystems, and connecting the blocks and the subsystems according to the relationships between the classes.
Application designers are interested in some design parameters while generating the kernels. These parameters can be the architecture used, the hardware binding of a specific function, data precision, degree of parallelism, etc. We use the data attributes of the Python classes to capture these design parameters. Each design parameter of interest has a corresponding data attribute
Figure 5.15: Tree structure of the Python extended classes for parameterized FFT kernel
development
in the Python class. These data attributes control the behavior of the Python class when the
class is instantiated to generate System Generator designs. They determine the blocks used
in a MATLAB/Simulink design and the connections between the blocks. Besides, by properly
packaging the classes, the application designers can choose to expose only the data attributes
of interest for application level development.
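As a concrete illustration of this mechanism, the following minimal Python sketch shows how a parameterized kernel could be captured as a class whose data attributes are the design parameters. The class and attribute names (ParameterizedFFT, degree_of_parallelism, storage_binding, generate) are hypothetical and do not correspond to the actual PyGen class library.

    # Minimal sketch of a parameterized kernel class. All names are hypothetical;
    # the real PyGen classes generate System Generator blocks instead of
    # returning a textual summary.
    class ParameterizedFFT:
        def __init__(self, num_points=8, data_width=16,
                     degree_of_parallelism=1, storage_binding="BRAM"):
            # Data attributes correspond to the design parameters of interest.
            self.num_points = num_points
            self.data_width = data_width
            self.degree_of_parallelism = degree_of_parallelism
            self.storage_binding = storage_binding  # "register", "slice RAM" or "BRAM"

        def generate(self):
            # In PyGen, instantiating a class (and its subclasses) has the same
            # effect as creating MATLAB/Simulink subsystems, dropping blocks into
            # them, and wiring them up; here we only return a summary string.
            return ("FFT kernel: %d points, %d-bit data, parallelism %d, "
                    "twiddle factors stored in %s" %
                    (self.num_points, self.data_width,
                     self.degree_of_parallelism, self.storage_binding))

    # Only the attributes of interest need to be exposed for application level use.
    print(ParameterizedFFT(num_points=8, degree_of_parallelism=2).generate())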
5.5.3.2 Support of rapid and accurate energy estimation
While the parameterized kernel development can potentially offer a large design space, being able to quickly and accurately obtain the performance of a given kernel is crucial for identifying the appropriate parameters of this kernel and optimizing the performance of the application using it. To address this issue, we integrate into PyGen a domain-specific modeling based rapid energy estimation technique proposed in [14].
The use of domain-specific energy modeling for FPGAs is shown in Figure ??. In general, a kernel can be implemented using different architectures. For example, implementing matrix multiplication on FPGAs can employ a single processor or a systolic architecture. Implementations using a particular architecture are grouped into a domain. Analysis of energy dissipation of the kernel is performed within each domain. Because each domain corresponds to an architecture, energy functions can be derived for each domain. These functions are used for rapid
energy estimation for implementations in the corresponding domain. See [14] for more details regarding domain-specific modeling.
Figure 5.16: Class tree organized as domains
In order to support this domain-specific modeling technique, the kernel developers must be able to group different kernel designs into the corresponding domain. Such support is not available in System Generator as the organization of the block set is fixed. However, after mapping the block set to the flexible class library in PyGen, re-organization of the class hierarchy according to the architectures represented by the classes becomes possible. Taking the case shown in Figure 5.16 as an example, Python class A represents various implementations of a kernel. It contains a number of subclasses A(1), A(2), ..., A(N). Each of the subclasses represents the implementations of the kernel that belong to the same domain.
The process of energy estimation in PyGen is hierarchical. Energy functions are associated with the Python basic classes and are obtained through low-level simulation. They capture the energy performance of these basic classes under various possible parameter settings. For the extended classes, depending on whether domain-specific energy modeling is performed for the classes or not, there may be no energy functions associated with them for energy estimation. If an energy function is not available, the energy estimate of the class needs to be obtained
from the classes contained in it. While this way of estimation is fast because it skips the derivation of energy functions, it has lower estimation accuracy, as shown in Table 5.2 in Section 5.7.
To support such a hierarchical estimation process, a method estimate() is associated with each Python object. When this method is invoked, it checks whether an energy function is associated with the Python object. If so, it calculates the energy dissipation of this object according to the energy function and the parameter settings for this object. Otherwise, PyGen iteratively searches the tree shown in Figure 5.15 within this Python object until enough information is obtained to calculate the energy performance of the object. In the worst case, it traces all the way back to the leaf nodes of the tree. Then, the estimate() method computes the energy performance of the Python object using the energy functions obtained as described above.
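A minimal Python sketch of this hierarchical traversal is given below. The class layout and attribute names (KernelNode, energy_function, subclasses) are assumptions made for illustration and are not the actual PyGen implementation.

    # Sketch of the hierarchical estimate() mechanism, under the assumptions
    # stated above.
    class KernelNode:
        def __init__(self, energy_function=None, parameters=None, subclasses=None):
            self.energy_function = energy_function  # callable(parameters) -> energy
            self.parameters = parameters or {}      # design parameter settings
            self.subclasses = subclasses or []      # contained objects (class tree)

        def estimate(self):
            # If an energy function is attached (a basic class, or an extended
            # class covered by a domain-specific model), use it directly.
            if self.energy_function is not None:
                return self.energy_function(self.parameters)
            # Otherwise, recurse into the contained classes, in the worst case
            # all the way down to the leaf (basic) classes.
            return sum(child.estimate() for child in self.subclasses)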
Switching activities within a design are a key factor that affects energy dissipation. By utilizing the data from MATLAB/Simulink simulation, PyGen obtains the actual switching activity of various blocks in the high level designs and uses it for energy estimation. Compared with the approach in [14], which assumes default switching activities, this helps increase the accuracy of the estimates. To show the benefits offered by PyGen, we consider an 8-point FFT using the unfolded architecture discussed in Section ??. It contains twelve butterflies, each based on the same architecture. In Figure 5.17, the bars show the power consumption of these butterflies while the upper curve shows the average switching activity of the System Generator basic building blocks used by each butterfly. Such switching activity information can be quickly obtained from the MATLAB/Simulink arithmetic level simulation. As shown in the figure, the switching activity information obtained from MATLAB/Simulink is able to capture the variation of the power consumption of these butterflies. The average estimation error based on such switching activity information is 2.9%. For the sake of comparison, we perform energy estimation by assuming a default switching activity as in [14]. The results are shown in Figure 5.18. For default switching activities ranging from 20% to 40%, which are typical of the designs of many signal processing applications, the average estimation errors can go up to as much as 36.5%. Thus, by utilizing the MATLAB/Simulink simulation results, PyGen improves the estimation accuracy.
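As an illustration of how such switching activity can be derived, the sketch below counts the bits that toggle between consecutive fixed-point samples of a signal. It is a simplified stand-in for the data PyGen extracts from the MATLAB/Simulink simulation, and the sample values used are made up.

    # Sketch: average switching activity of a signal from simulated sample values.
    # Samples are assumed to be integers holding the fixed-point bit patterns
    # observed on a port in consecutive clock cycles.
    def average_switching_activity(samples, bit_width):
        if len(samples) < 2:
            return 0.0
        mask = (1 << bit_width) - 1
        toggles = sum(bin((prev ^ curr) & mask).count("1")
                      for prev, curr in zip(samples, samples[1:]))
        # Fraction of bits toggling per clock cycle, in [0, 1].
        return toggles / (bit_width * (len(samples) - 1))

    # Example with a hypothetical 16-bit port trace; the result is 0.5.
    print(average_switching_activity([0x00FF, 0x0F0F, 0x00FF, 0x0F0F], 16))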
[Figure omitted: bar chart over butterflies 1-12, left y-axis power (mW, measured vs. estimated), right y-axis average switching activity (percent).]
Figure 5.17: Power consumption and average switching activities of input/output data of the butterflies in an unfolded architecture for 8-point FFT computation
[Figure omitted: estimation error (percent) versus default switching activity (percent, 20-40).]
Figure 5.18: Estimation error of the butterflies when default switching activity is used
5.5.4 Application level development
The application level development begins after the parameterized designs for the tasks are made available by going through the kernel level development. It consists of three design steps, which are discussed below.
5.5.4.1 Application description
Based on the input task graph, the application designers construct the application using the parameterized kernels as discussed in the previous section. This is accomplished by manipulating the Python classes created as described in Section 5.5.3.1. Besides, application designers need to create interfacing classes for describing the communication between tasks. These classes capture: (1) the data buffering requirement between the tasks, which is determined by the application requirements and the data transmission patterns of the implementations of the tasks; (2) the hardware binding of the buffering.
5.5.4.2 Support of performance optimization
Application designers have complete flexibility in implementing the optimization module by handling the Python objects. For example, if the task graph of the application is a linear pipeline, the application designer can create a trellis as shown in Figure 5.19. Many signal processing applications, including the beamforming application discussed in the next section, can be described as linear pipelines. The parameterized kernel classes capture the various possible implementations of the tasks (the shaded circles on the trellis) while the interfacing classes capture the various possible ways of communication between the tasks (the connections between the shaded circles on the trellis). Then, the dynamic programming algorithm proposed in [52] can be applied to find the design parameters for the tasks so that the energy dissipation of executing one data sample is minimized.
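The following Python sketch illustrates this trellis-based optimization for a linear pipeline. The data structures (per-implementation kernel energies and per-edge interfacing energies) are assumed inputs, and the code illustrates the idea rather than reproducing the exact algorithm of [52].

    # Sketch of dynamic programming over the trellis of a linear pipeline.
    # node_energy[t][k]:      energy of implementation k of task t (per sample).
    # edge_energy[t][j][k]:   energy of the interface between implementation j of
    #                         task t and implementation k of task t+1.
    def minimize_pipeline_energy(node_energy, edge_energy):
        num_tasks = len(node_energy)
        best = list(node_energy[0])            # best cost of reaching each node
        choice = [[None] * len(node_energy[0])]
        for t in range(1, num_tasks):
            new_best, new_choice = [], []
            for k, e_k in enumerate(node_energy[t]):
                costs = [best[j] + edge_energy[t - 1][j][k] for j in range(len(best))]
                j_min = min(range(len(costs)), key=costs.__getitem__)
                new_best.append(costs[j_min] + e_k)
                new_choice.append(j_min)
            best, choice = new_best, choice + [new_choice]
        # Backtrack to recover the selected implementation of each task.
        k = min(range(len(best)), key=best.__getitem__)
        total, selection = best[k], [k]
        for t in range(num_tasks - 1, 0, -1):
            k = choice[t][k]
            selection.append(k)
        selection.reverse()
        return total, selection

    # Example: three tasks with two candidate implementations each.
    node_e = [[5.0, 3.0], [4.0, 6.0], [2.0, 2.5]]
    edge_e = [[[0.5, 1.0], [0.2, 0.8]], [[0.3, 0.4], [0.6, 0.1]]]
    print(minimize_pipeline_energy(node_e, edge_e))   # (9.5, [1, 0, 0])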
Figure 5.19: Trellis for describing linear pipeline applications
5.5.4.3 Energy profiling
By using the energy profiler in PyGen, application designers can write Python code to obtain the power or energy dissipation for a specific Python object or a specific kind of objects. For example, the power consumption of the butterflies used in an FFT design is shown in Figure 5.17. Based on the profiling, the application designers can identify the energy hot spots and change the designs of the kernels or the task graph of the applications to further increase the energy efficiency of their designs. They can also use the profiling to refine the energy estimates from the energy estimator. One major reason that necessitates such refinement is that the energy estimation using the energy functions (discussed in Section 5.5.3.2) captures the energy dissipation of the Python objects; it cannot capture the energy dissipated by the interconnect that provides communication between these objects.
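One way such profiling code could be organized is sketched below: it walks a design's object tree and aggregates the estimated energy per object class so that hot spots can be ranked. The tree layout reuses the hypothetical KernelNode sketch from Section 5.5.3.2 and is not the actual PyGen profiler interface.

    # Sketch of energy profiling over a design tree built from the hypothetical
    # KernelNode objects sketched earlier. Energy is attributed to the first
    # object on each branch that carries its own energy function, so the same
    # energy is not counted twice.
    from collections import defaultdict

    def profile_energy(root):
        totals = defaultdict(float)
        stack = [root]
        while stack:
            node = stack.pop()
            if node.energy_function is not None:
                totals[type(node).__name__] += node.estimate()
            else:
                stack.extend(node.subclasses)
        # Rank object classes by total estimated energy to expose hot spots.
        return sorted(totals.items(), key=lambda item: item[1], reverse=True)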
5.6 Instruction-level energy estimation for software programs
5.6.1 Arithmetic-level instruction based energy estimation
In this chapter, we propose an arithmetic-level instruction based energy estimation technique which can rapidly and fairly accurately estimate the energy dissipation of computations on FPGA based soft processors. We use instruction based to denote that the assembly instructions that constitute the input software programs are the basic units based on which the proposed technique performs energy estimation. The energy dissipation of the input software program is obtained as the sum of the energy dissipation of the assembly instructions executed when the software program is running on the soft processor. The architecture of many soft processors consists of multiple pipeline stages in order to improve their time performance. For example, the MicroBlaze processor contains three pipeline stages. While most instructions take one clock cycle to complete, the memory access load/store instructions take two clock cycles to complete and the multiplication instruction takes three clock cycles to complete. Thus, the execution of instructions in these soft processors may get stalled due to the unavailability of the required data. We assume that the various hardware components composing the soft processor are "properly" clock gated between the execution of the assembly instructions (e.g. the MicroBlaze processor) and dissipate a negligible amount of energy when they are not in use. Under this assumption, the stalling of execution in the soft processor has a negligible effect on the energy dissipation of the instructions. Thus, the impact of multiple pipeline stages on the energy dissipation of the instructions is ignored in our energy estimation technique.
We use arithmetic-level to denote that we capture the changes in the arithmetic status of the processor (i.e. the arithmetic contents of the registers and the local memory space) as a result of executing the instructions. We use this arithmetic-level information to improve the accuracy of our energy estimates. We do not capture how the instructions are actually carried out in the functional units of the soft processor. Instead, the impact of different functional units on the energy dissipation is captured indirectly when analyzing the energy performance of the instructions. For example, when analyzing the energy dissipation of the addition instructions, we compare the values of the destination register before and after the addition operation and calculate the switching activity of the destination register as a result of the addition operation. Sample software programs which cause different switching activity on the destination register are executed on the soft processor. The energy dissipation of these sample software programs is obtained through low-level energy estimation and/or actual measurement. Then, the energy performance of the addition instructions is expressed as an energy function of the switching activity of the destination register of the addition instruction. As we show in Section 5.6.3, this arithmetic-level information can be used to improve the accuracy when estimating the energy dissipation of the software programs. One major advantage of such arithmetic-level energy estimation is that it avoids the time-consuming low-level simulation process and greatly speeds up the energy estimation process. Note that since most soft processors do not support out-of-order execution, we only consider the change in arithmetic status of the processor between two immediately executed instructions when analyzing the energy dissipation of the instructions.
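To make this construction concrete, the sketch below fits a simple linear energy model E(activity) = c0 + c1 * activity to (switching activity, measured energy) pairs collected from the sample programs of one instruction. The linear form and the sample values are assumptions chosen purely for illustration.

    # Sketch: derive an energy function for one instruction from sample-program
    # measurements. Each sample pairs the switching activity of the destination
    # register with the measured energy per executed instruction (nJ). A linear
    # least-squares fit is assumed here for illustration.
    def fit_energy_function(samples):
        n = len(samples)
        sum_a = sum(a for a, _ in samples)
        sum_e = sum(e for _, e in samples)
        sum_aa = sum(a * a for a, _ in samples)
        sum_ae = sum(a * e for a, e in samples)
        c1 = (n * sum_ae - sum_a * sum_e) / (n * sum_aa - sum_a ** 2)
        c0 = (sum_e - c1 * sum_a) / n
        return lambda activity: c0 + c1 * activity

    # Hypothetical measurements for the add instruction (activity, energy in nJ).
    add_energy = fit_energy_function([(0.0, 14.0), (0.25, 18.5), (0.5, 23.0)])
    print(add_energy(0.3))   # estimated energy at 30% switching activity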
Based on the above discussion, our rapid energy estimation technique for soft processors
consists of two steps. In the first step, by analyzing the energy performance of the sample
programs that represent the corresponding instructions, we derive energy functions for all the
instructions in the instruction set of the soft processor. An arithmetic-level instruction energy
look-up table is created by integrating the energy functions for the instructions. The look-up
table contains the energy dissipation of the instructions of the soft processor executed under
various arithmetic status of the soft processor.
In the second step, an instruction set simulator is used to obtain information about the instructions executed during the execution of the input software program. This information includes the order in which the instructions are executed and the arithmetic status of the soft processor (e.g. the values stored in the registers) during the execution of the instructions. For each instruction executed, its impact on the arithmetic status of the soft processor is obtained by comparing the arithmetic contents of the registers and/or the local memory space of the soft processor before and after executing the instruction. Then, based on this arithmetic-level information, the energy dissipation of the executed instructions is obtained using the instruction energy look-up table. Finally, the energy dissipation of the input software program is obtained by summing up the total energy dissipation of all the instructions executed in the software program.
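The second step can be summarized by the sketch below, which walks an instruction trace produced by the instruction set simulator, computes the switching activity of each destination register, and accumulates the corresponding entries of the look-up table. The trace format and the table interface are assumptions made for illustration.

    # Sketch of the second estimation step. Each trace entry is assumed to be
    # (opcode, destination register value before, value after); the look-up
    # table maps an opcode to an energy function of switching activity (nJ).
    def estimate_program_energy(trace, energy_table, register_width=32):
        mask = (1 << register_width) - 1
        total_nj = 0.0
        for opcode, before, after in trace:
            activity = bin((before ^ after) & mask).count("1") / register_width
            total_nj += energy_table[opcode](activity)
        return total_nj

    # Hypothetical usage with a linear model for the add instruction.
    table = {"add": lambda activity: 14.0 + 18.0 * activity}
    trace = [("add", 0x0000000F, 0x000000F0), ("add", 0x000000F0, 0x000000F0)]
    print(estimate_program_energy(trace, table))   # 18.5 + 14.0 = 32.5 nJ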
5.6.2 An implementation
An implementation of the proposed arithmetic-level instruction set based energy estimation technique, shown in Figure 5.21, is provided to illustrate our approach. While our technique can be applied to other soft processors, Xilinx MicroBlaze is used to demonstrate the energy estimation process since this processor is widely used.
5.6.2.1 Step 1: creation of the arithmetic-level instruction energy look-up table
The flow for creating the instruction energy look-up table is shown in Figure 5.20. We created sample software programs that represent each instruction in the MicroBlaze instruction set by embedding assembly code into sample C programs. In the embedded assembly code, the instruction of interest is executed repeatedly for a certain amount of time with more than 100 different sets of input data and under various execution contexts (e.g. various instruction execution orders). More specifically, we use mb-gcc, the GNU gcc compiler for MicroBlaze, to compile the input software program. Then, mb-gdb (the GNU gdb debugger for MicroBlaze), the Xilinx Microprocessor Debugger (XMD), and the MicroBlaze cycle-accurate instruction set simulator provided by Xilinx EDK (Embedded Development Kit) [79] are used to obtain the arithmetic-level instruction execution information of the software program. Following the design flow discussed in Section 5.6.1, the instruction energy look-up table that stores the energy dissipation of each instruction in the instruction set of the MicroBlaze processor is created.
Figure 5.20: Flow of instruction energy profiling
We use a low-level simulation based technique to analyze the energy performance of the instructions. The MicroBlaze processor has a PC_EX bus which displays the value of the program counter and a VALID_INSTR signal which is high when the value on the PC_EX bus is valid. We first perform timing based gate-level simulation of the sample software programs using ModelSim [46]. We check the status of PC_EX and VALID_INSTR during the timing based simulation. Using this information, we determine the intervals over which the instructions of interest are executed. Then, we rerun the timing based gate-level simulation in ModelSim. Simulation files (.vcd files) that record the switching activities of each logic element and wire on the FPGA device during these intervals are generated. In the last step, XPower is used to analyze the design files (.ncd files) and the simulation files. Finally, the energy dissipation of the instructions is obtained based on the output from XPower.
Note that the MicroBlaze soft processor is shipped as a register-transfer level IP (Intellectual Property) core. Its time and energy performance are independent of its placement on the FPGA device. Moreover, the instructions are stored in the pre-compiled memory blocks (i.e. BRAMs). Since the BRAMs are evenly distributed over the FPGA devices, the physical locations of the
Figure 5.21: Software architecture of an implementation of the proposed energy estimation
technique
memory blocks and the soft processor have negligible impact on the energy dissipation for the
soft processor to access the instructions and data stored in the BRAMs. Therefore, as long as
the MicroBlaze system is configured as in Figure 5.22, energy estimation using the instruction energy look-up table can lead to fairly accurate energy estimates, regardless of the actual placement of the system on the FPGA device. There is no need to incur the time-consuming low-level simulation each time the instruction energy look-up table is used for energy estimation in a new configuration of the hardware platform.
5.6.2.2 Step 2: creation of the energy estimator
The energy estimator created based on the two-step methodology discussed in Section 5.6.1 is shown in Figure 5.21. It is created by integrating several software tools. Our main contributions are the arithmetic-level instruction energy look-up table and the instruction energy profiler, which are highlighted as shaded boxes in Figure 5.21. The creation of the instruction energy look-up table is discussed in Section 5.6.2.1. The instruction energy profiler queries the instruction energy look-up table based on the execution of the instructions and calculates the energy dissipation of the input software program. Note that, as a requirement of the MicroBlaze cycle-accurate instruction set simulator, the configurations of the soft processor, its memory controller interfaces, and memory blocks (e.g. BRAMs) are pre-determined and are shown in Figure 5.22. Since MicroBlaze supports split bus transactions, the instructions and the data of the user software programs are stored in the dual port BRAMs and are made available to the MicroBlaze processor through two separate LMB (Local Memory Bus) interface controllers. The MicroBlaze processor and the two LMB interface controllers are required to operate at the same frequency. Under this setting, a fixed latency of one clock cycle is guaranteed for the soft processor to access the program instructions and data through these two memory interface controllers.
Figure 5.22: Configuration of the MicroBlaze processor system
The GNU compiler for MicroBlaze, mb-gcc, is used to compile the software program and generate an ELF (Executable and Linking Format) file, which can be downloaded to and executed on the MicroBlaze processor. This ELF file is then provided to the GNU debugger for MicroBlaze, mb-gdb.
The Xilinx Microprocessor Debugger (XMD) is a tool from Xilinx [79] for debugging programs and verifying systems using the PowerPC (Virtex-II Pro) or MicroBlaze microprocessors. It contains a GDB interface. Through this interface, XMD can communicate with mb-gdb using the TCP/IP protocol and get access to the executable ELF file that resides in it. XMD also integrates a built-in cycle-accurate simulator that can simulate the execution of instructions within the MicroBlaze processor. The simulator assumes that all the instructions and program data are fetched through two separate local memory bus (LMB) interface controllers. Currently, the simulation of the MicroBlaze processor with other configurations, e.g. different bus protocols and other peripherals, is not supported by this instruction set simulator.
XMD has a TCL (Tool Command Language) [21] scripting interface. By communicating with XMD through the TCL interface, the instruction profiler obtains the numbers and the types of instructions executed on the MicroBlaze processor. Then, the profiler queries the instruction energy look-up table to find out the energy dissipation of each instruction executed. By summing up all the energy values, the energy dissipation of the input software program is obtained.
5.6.3 Illustrative examples
The MicroBlaze based system shown in Figure 5.22 is configured on a Xilinx Spartan-3 XC3S400 FPGA device, which integrates dedicated 18-bit × 18-bit multipliers and BRAMs. We configured MicroBlaze to use three dedicated multipliers for the multiplication instructions (e.g. mul and muli) and to use the BRAMs to store the instructions and data of the software programs. According to the requirement of the cycle-accurate simulator, the operating frequencies of the MicroBlaze processor and the two LMB interface controllers are set at 50 MHz.
For the experiments discussed in the chapter, we use EDK 6.2.03 for description and automatic generation of the hardware platforms. The GNU tool chain for MicroBlaze (e.g. mb-gcc, mb-gdb, etc.) from Xilinx [79] is used for compiling the software programs. Besides, we use the Xilinx ISE (Integrated Software Environment) tool 6.2.03 [79] for synthesis and implementation of the hardware platform, and ModelSim 6.0 [46] for low-level simulation and recording of the simulation results. The functional correctness and the actual power consumption of the designs considered in our experiments are verified on a Spartan-3 FPGA prototyping board from Nu Horizons [50].
Figure 5.23: Energy profiling of the MicroBlaze instruction set
5.6.3.1 Creation of the arithmetic-level instruction energy look-up table
Using the technique discussed in Section 5.6.2.1, we perform low-level energy estimation to
create the arithmetic-level instruction energy look-up table for the MicroBlaze processor.
• Variances in energy dissipation of the instructions
Figure 5.23 shows the energy profiling of various MicroBlaze instructions obtained using the methodology described in Section 5.6.2.1. We consider the energy dissipated by the complete system shown in Figure 5.22, which includes the MicroBlaze processor, two memory controllers and the BRAMs. Instructions for addition and subtraction have the highest energy dissipation while memory access instructions (both load and store) and branch instructions have the lowest energy dissipation. A variance of 423% in energy dissipation is observed for the entire instruction set, which is much higher than the 38% variance for both the StrongARM and Hitachi SH-4 processors reported in [69]. These data justify our motivation for taking an approach based on instruction-level energy profiling to build the energy estimator for soft processors, rather than using the first order and the second order models proposed in [69].
Note that the addition and subtraction instructions (e.g. add, addi, sub, rsubik, etc.) dissipate much more energy than the multiplication instructions. This is because, in our configuration of the soft processor, addition and subtraction are implemented using "pure" FPGA resources (i.e. slices on Xilinx FPGAs) while multiplication instructions are realized using the embedded multipliers available on the Spartan-3 FPGAs. Our work in [15] shows that the use of these embedded multipliers can substantially reduce the energy dissipation of the designs compared with those implemented using configurable logic components.

[Figure omitted: energy dissipation (nJ, 0-25) of the add, xor, and mul instructions versus the switching activity (0-0.5) of the destination register.]
Figure 5.24: Impact of input data causing different arithmetic behavior of MicroBlaze
• Impact of input data causing different arithmetic behavior of the soft processor
Figure 5.24 shows the energy dissipation of the add addition instruction, the xor instruction, and the mul multiplication instruction. They are executed with different input data which cause different switching activity on the destination registers. We can see that input data causing different switching activity of the soft processor have a significant impact on the energy dissipation of these instructions. Thus, by utilizing the arithmetic-level information gathered during the instruction set simulation process, our rapid energy estimation technique can greatly improve the accuracy of energy estimation.
• Impact of various addressing modes
We have also analyzed the impact of the various addressing modes of the instructions on their energy dissipation. For example, consider adding the content of register r1 and the number 23 and storing the result in register r3. This can be accomplished using the immediate addressing mode of the addition instruction as addi r3, r1, 23 or using the indirect addressing mode of the addition instruction as add r3, r1, r16 with the content of register r16 being 23 before executing the addition instruction. The energy dissipation of the addition instruction in different addressing modes is shown in Figure 5.25. Compared with instruction decoding and the actual computation, different addressing modes contribute little to the overall energy dissipation of the instructions. Thus, the different addressing modes of the instructions can be ignored in the energy estimation process in order to reduce the time for energy estimation.

Figure 5.25: Impact of different instruction addressing modes
5.6.3.2 Energy estimation of the sample software programs
• Sample software programs
In order to demonstrate the effectiveness of our approach, we analyze the energy performance of two FFT software programs and two matrix multiplication software programs running on the MicroBlaze processor system configured as shown in Figure 5.22. Details of these four software programs can be found in [59].
We choose FFT and matrix multiplication as our illustrative examples because they are widely used in many embedded signal processing systems, such as software defined radio [48]. For floating-point computation, there are efforts to support the basic operations on FPGAs, e.g. addition/subtraction, multiplication, division, etc. [26] [49]. These implementations require a significant amount of FPGA resources and involve considerable effort to realize floating-point FFT and matrix multiplication on FPGAs. In contrast, floating-point FFT and matrix multiplication can be easily implemented on soft processors through software emulation, requiring only a moderate amount of memory space.
• The MicroBlaze soft processor platform
By time sharing the resources, implementations of FFT and matrix multiplication using soft processors usually demand a smaller amount of resources than the corresponding designs with parallel architectures. Thus, they can fit into smaller devices. The complete MicroBlaze system occupies 736 (21%) of the 3584 available slices and 3 (18.8%) of the 16 available dedicated multipliers on the target Spartan-3 FPGA device. The floating-point FFT program occupies 23992 bits while the floating-point matrix multiplication program occupies 3976 bits. Each of the four software programs can fit into 2 (12.5%) of the 16 available 18-Kbit BRAMs. While parallel architectures reduce the latency for floating-point matrix multiplication, they require 33 to 58 times more slices (19045 and 33589 slices) than the MicroBlaze based software implementation [90]. These customized designs with parallel architectures cannot fit in the small Spartan-3 FPGA device targeted in this chapter. As quiescent power accounts for an increasingly larger percentage of the overall power consumption on modern FPGAs [75], choosing a smaller device can effectively reduce the quiescent energy. For example, according to the Xilinx web power tools [83], the quiescent power of the Spartan-3 target device used in our experiments is 92 mW. It is significantly smaller than the 545 mW quiescent power of the Virtex-II Pro XC2VP70 device targeted by Zhuo et al.'s designs [90].
Table 5.1: Energy dissipation of the FFT and matrix multiplication software programs

Program                  Comment                                 Clock cycles   Measured   Low-level estimation   Our technique
FFT                      8-point, complex floating-point data    97766          1783 µJ    1662 µJ (6.8%)         1544 µJ (13.4%)
FFT                      8-point, complex integer data           16447          300 µJ     273 µJ (3.2%)          264 µJ (12.0%)
Matrix multiplication    3×3 matrix, real floating-point data    9877           180 µJ     162 µJ (8.5%)          153 µJ (15.0%)
Matrix multiplication    3×3 matrix, real integer data           9344           170 µJ     157 µJ (7.8%)          145 µJ (14.7%)
Figure 5.26: Instant average power consumption of the FFT software program
• Energy performance
The energy dissipation of the FFT programs and the matrix multiplication programs estimated using our technique is shown in Table 5.1. All these estimates were obtained within five minutes. Our arithmetic-level instruction based energy estimation technique eliminates the ~6 hours required for register-transfer level low-level simulation of the soft processor and the additional ~3 hours required for analyzing the low-level simulation results to obtain the energy estimation results as discussed in Section 6.1. Thus, our technique achieves a significant speed-up compared with the low-level simulation based energy estimation techniques.
We have performed low-level simulation based energy estimation using XPower to estimate the energy dissipation of these software programs. In addition, to verify the energy dissipation results obtained using our arithmetic-level rapid estimation technique, we have performed actual power consumption measurements of the MicroBlaze soft processor platform using a Spartan-3 prototyping board from Nu Horizons [50] and a SourceMeter 2400 from Keithley [40]. During our measurement, we ensure that, except for the Spartan-3 FPGA chip, all the other components on the prototyping board (e.g. the power supply indicator, the SRAM chip) are kept in the same operating state when the MicroBlaze processor is executing the software programs. Under these settings, we consider that the changes in power consumption of the FPGA prototyping board are mainly caused by the FPGA chip. Using the Keithley SourceMeter, we fix the input voltage to the FPGA prototyping board at 6 Volts and measure the changes of the input current to it. The dynamic power consumption of the MicroBlaze soft processor system is then calculated based on the changes of the input current. Compared with the measured data, our arithmetic-level energy estimation technique achieves estimation errors ranging from 12.0% to 15.0%, with 13.8% on average. Compared with the data obtained through low-level simulation based energy estimation, our technique introduces an average estimation error of 7.2% while significantly speeding up the energy estimation process.
In addition, we have further analyzed the variation of the power consumption of the soft processor during the execution of the floating-point FFT program. First, we use ModelSim to generate a simulation record file every 13.3293 µsec (1/200 of the total simulation time). Then, XPower is used to measure the average power consumption of each period represented by these simulation record files. The results are shown in Figure 5.26. The regions between the lines represent the energy dissipation of the MicroBlaze processor, the instruction access cost and the program data access cost (both through the LMB bus).
For the above floating-point FFT software program, the MicroBlaze processor dissipates around 3 times as much energy as that for accessing the instructions and data stored in the BRAMs. Most importantly, this result shows significant fluctuations in the power consumption of the entire system. This is consistent with the large variations in the energy profiling of the MicroBlaze instruction set shown in Figure 5.23. As different instructions are executed in each sampling period, the large differences in energy dissipation among the instructions result in significant variation in the average power consumption in these sampling periods.
5.7 Illustrative examples
To demonstrate the effectiveness of our approach, we show in this section the design of a CORDIC processor for division and a block matrix multiplication algorithm. These designs are widely used in systems such as software defined radio, where energy is an important performance metric [20]. We focus on MicroBlaze and System Generator in our illustrative examples due to their wide availability, though our approach is also applicable to other soft processors and other design tools.
• CORDIC processor for division: The CORDIC (COordinate Rotation DIgital Computer) iterative algorithm for dividing b by a [6] is described as follows. Initially, we set X_{-1} = a, Y_{-1} = b, Z_{-1} = 0 and C_{-1} = 1. There are N iterations and during each iteration i (i = 0, 1, ..., N-1), the following computation is performed:

    X_i = X_{i-1}
    Y_i = Y_{i-1} + d_i * X_{i-1} * C_{i-1}
    Z_i = Z_{i-1} - d_i * C_{i-1}
    C_i = C_{i-1} * 2^{-1}                                          (5.1)
Table 5.2: Arithmetic level/low-level simulation time and measured/estimated energy performance of the CORDIC based division application and the block matrix multiplication application

Simulation time
Designs                              Arithmetic level   Low-level *
CORDIC with N = 24, P = 2            6.3 sec            35.5 sec
CORDIC with N = 24, P = 4            3.1 sec            34.0 sec
CORDIC with N = 24, P = 6            2.2 sec            33.5 sec
CORDIC with N = 24, P = 8            1.7 sec            33.0 sec
12×12 matrix mult. (2×2 blocks)      99.4 sec           8803 sec
12×12 matrix mult. (4×4 blocks)      51.0 sec           3603 sec

Energy performance
Designs                              High-level           Low-level         Measured
CORDIC with N = 24, P = 2            1.15 µJ (9.7%)       1.19 µJ (6.8%)    1.28 µJ
CORDIC with N = 24, P = 4            µJ (9.5%)            0.71 µJ (6.8%)    0.76 µJ
CORDIC with N = 24, P = 6            0.55 µJ (10.1%)      0.57 µJ (7.0%)    0.61 µJ
CORDIC with N = 24, P = 8            0.48 µJ (9.8%)       0.50 µJ (6.5%)    0.53 µJ
12×12 matrix mult. (2×2 blocks)      595.9 µJ (18.2%)     675.3 µJ (7.3%)   728.5 µJ
12×12 matrix mult. (4×4 blocks)      327.5 µJ (12.2%)     349.5 µJ (6.3%)   373.0 µJ

Note: * Timing based post place-and-route simulation. The times for placing-and-routing and generating simulation models are not included.
where d_i = +1 if Y_i < 0 and -1 otherwise. The customized hardware peripheral is described in MATLAB/Simulink as a linear pipeline of P processing elements (PEs). Each of the PEs performs one iteration of the computation described in Equation 5.1. The software program controls the data flowing through the PEs and ensures that the data get processed repeatedly by them until the required number of iterations is completed. Communication between the processor and the hardware implementation is through the FSL interfaces. It is simulated using our MicroBlaze Simulink block. Besides, we consider 32-bit data precision.
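A small Python reference model of the iteration in Equation 5.1, useful for checking the fixed-point hardware pipeline against a software result, is sketched below; it follows the recurrence as written, with d_i chosen from the sign of Y, and assumes |b| <= |a|.

    # Software reference model of the CORDIC division iteration of Equation 5.1.
    # Z_N converges to b / a for |b| <= |a|.
    def cordic_divide(a, b, num_iterations=24):
        x, y, z, c = float(a), float(b), 0.0, 1.0
        for _ in range(num_iterations):
            d = 1 if y < 0 else -1        # d_i = +1 if Y < 0, -1 otherwise
            y = y + d * x * c
            z = z - d * c
            c = c * 0.5                   # C_i = C_{i-1} * 2^{-1}
        return z

    print(cordic_divide(4.0, 3.0))        # approximately 0.75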
Figure 5.27: CORDIC processor for division (P = 4)
Table 5.3: Simulation speeds of the hardware-software simulators considered in this chapter

                                      Instruction set simulator   Simulink (1)   ModelSim (2)
Simulated clock cycles per second     >1000                       254.0          8.7

Note: (1) Only considers simulation of the customized hardware peripherals; (2) Timing based post place-and-route simulation. The time for generating the simulation models of the low-level implementations is not accounted for.
• Block matrix multiplication: Input matrices A and B are decomposed into a number of smaller matrix blocks. Multiplication of these smaller matrix blocks is performed using a customized hardware peripheral described using MATLAB/Simulink. As shown in Figure 5.28, the data elements of a matrix block from matrix B (e.g. b_11, b_21, b_12 and b_22) are fed into the hardware peripheral and get stored. When the data elements of a matrix block from matrix A come in, multiplication and accumulation are performed accordingly to generate output results. The software program running on MicroBlaze is responsible for preparing the data sent to the customized hardware peripheral, accumulating the multiplication results received back from the hardware peripheral, and generating the result matrix.
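The Python sketch below mirrors this hardware/software split entirely in software so that the data movement can be checked: block_multiply() stands in for the customized peripheral that multiplies small blocks, and the surrounding loops play the role of the MicroBlaze program. The names and the 2×2 block size are illustrative.

    # Software model of the block matrix multiplication split described above.
    # block_multiply() stands in for the customized block-multiplication
    # peripheral; the outer loops model the MicroBlaze program that feeds the
    # blocks and accumulates the partial results. Matrix sizes are assumed to
    # be multiples of the block size.
    def block_multiply(a_blk, b_blk):
        n = len(a_blk)
        return [[sum(a_blk[i][k] * b_blk[k][j] for k in range(n))
                 for j in range(n)] for i in range(n)]

    def block_matmul(a, b, blk=2):
        n = len(a)
        c = [[0] * n for _ in range(n)]
        for bi in range(0, n, blk):
            for bj in range(0, n, blk):
                for bk in range(0, n, blk):
                    a_blk = [row[bk:bk + blk] for row in a[bi:bi + blk]]
                    b_blk = [row[bj:bj + blk] for row in b[bk:bk + blk]]
                    partial = block_multiply(a_blk, b_blk)  # "hardware" call
                    for i in range(blk):                    # software accumulation
                        for j in range(blk):
                            c[bi + i][bj + j] += partial[i][j]
        return c

    a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(block_matmul(a, a))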
Figure 5.28: Architecture of matrix multiplication with customized hardware for multiplying 2×2 matrix blocks
For the experiments discussed in this chapter, the MicroBlaze processor is configured on a Xilinx Spartan-3 XC3S400 FPGA [79]. The operating frequencies of the processor, the two LMB (Local Memory Bus) interface controllers and the customized hardware peripherals shown in Figure 5.1 are set at 50 MHz. We use EDK 6.3.02 for describing the software execution platform and for compiling the software programs. System Generator 6.3 is used for description of
the customized hardware peripherals and automatic generation of low-level implementations.
Finally, ISE 6.3.02 [79] is used for synthesizing and implementing (placing and routing) the
complete applications.
We have also measured the actual power consumption of the two applications using a Spartan-3 prototyping board from Nu Horizons [50] and a SourceMeter 2400 instrument (a programmable power source with the measurement functions of a digital multimeter) from Keithley [40]. Except for the Spartan-3 FPGA device, all the other components on the prototyping board (e.g. the power supply indicator, the SRAM chip) are kept in the same operating state when the FPGA device is executing the applications. Under these settings, we consider that the changes in power consumption of the board are mainly caused by the FPGA device. Then, we fix the input voltage and measure the changes of the input current to the prototyping board. The dynamic power consumption of the designs is calculated based on the changes of the input current. Note that quiescent power (the power consumption of the device when there is no switching activity on it) is ignored in our experimental results since it cannot be optimized for a specific FPGA device.
The simulation time and energy performance for various implementations of the two numerical computation applications are shown in Table 5.2. For these two applications, our arithmetic level co-simulation environment based on MATLAB/Simulink achieves simulation speed-ups ranging from 5.6x to 88.5x compared with low-level timing simulation using ModelSim. The low-level timing simulation is required for low-level energy estimation using XPower. The simulation speed of our arithmetic level co-simulation approach is the major factor that determines the energy estimation time required by the proposed energy estimation technique. It varies depending on the hardware-software mapping and scheduling of the tasks that constitute the application. This is due to two main reasons. One reason is the difference in simulation speeds of the hardware simulator and the software simulator. Table 5.3
shows the simulation speeds of the cycle-accurate MicroBlaze instruction set simulator, the MATLAB/Simulink simulation environment for simulating the customized hardware peripherals, and ModelSim for timing based low-level simulation. Cycle-accurate simulation of software execution is much faster (more than 4 times faster) than cycle-accurate arithmetic level simulation of hardware execution using MATLAB/Simulink. Thus, if more tasks are mapped to execute on the customized hardware peripherals, the overall simulation speed that can be achieved by the proposed arithmetic level co-simulation approach decreases. Compared with the low-level simulation using ModelSim, our MATLAB/Simulink based implementation of the co-simulation approach can potentially achieve simulation speed-ups from 29.0x to more than 114x. Another reason is the frequency of data exchanges between the software program and the hardware peripherals. Every time that simulation data is exchanged between the hardware simulator and the software simulator, the simulation processes performed within the simulators are stalled and resumed. This adds considerable extra overhead to the co-simulation process. There are close interactions between the hardware and software execution for the two numerical computation applications considered in this chapter. Thus, the speed-ups achieved for the two applications are smaller than the maximum speed-ups that can be achieved in principle.
If we further consider the time for implementing (including synthesizing and placing-and-routing) the complete system and generating the post place-and-route simulation models, which is required by the low-level energy estimation approaches, our arithmetic level co-simulation approach leads to even greater simulation speed-ups. For the two numerical computation applications, the time for implementing the complete system and generating the post place-and-route simulation models takes around 3 hours. Thus, our arithmetic level simulation based energy estimation technique can be 197x to 6534x faster than that based on low-level simulation for these two numerical computation applications.
For the hardware peripheral of the CORDIC division application, our energy estimation is based on the energy functions for the processing elements shown in Figure 5.27. For the hardware peripheral of the matrix multiplication application, our energy estimation is based on the energy functions for the multipliers and the accumulators. As one input to these energy functions, we calculate the average switching activity of all the input/output ports of the Simulink blocks during the arithmetic level simulation process. Table 5.2 shows the energy estimates obtained using our arithmetic level simulation based energy estimation technique. Energy estimation errors ranging from 9.5% to 18.2%, with 11.6% on average, are achieved for these two numerical computation applications compared with actual measurements. Low-level simulation based energy estimation using XPower achieves an average estimation error of 6.8% compared with actual measurements.
To sum up, for the two numerical computation applications, our arithmetic level co-simulation based energy estimation technique sacrifices an average of 4.8% estimation accuracy while achieving estimation speed-ups of up to 6534x compared with low-level energy estimation techniques. Besides, the implementations of the two applications identified using our energy estimation technique achieve energy reductions of up to 52.2% compared with other implementations considered in our experiments.
5.8 Summary
A two-step rapid energy estimation technique for hardware-software co-design using FPGAs
was proposed in this chapter. An implementation of the proposed energy estimation technique
based on MATLAB/Simulink and the design of two numerical computation applications were
provided to demonstrate the effectiveness of our approach.
One extension of our rapid energy estimation approach is to provide confidence interval information for the energy estimates obtained using the techniques proposed in this chapter. Providing such information is desired in the development of many practical systems. Besides, as the use of multiple clocks with different operating frequencies is becoming popular in the development of configurable multi-processor platforms for improving time and energy performance, another extension of our work is to support energy estimation for hardware-software co-designs using multiple clocks. This requires that we extend the co-simulation environment to maintain and synchronize multiple global simulation timers used to keep track of the hardware and software simulation processes driven by different clock signals.
Chapter 6
Hardware-software co-design for energy efficient
implementations of operating systems
6.1 Introduction
There is a strong trend towards integrating FPGAs with various heterogeneous hardware components, such as RISC processors, embedded multipliers, memory blocks (e.g. Block RAMs in Xilinx FPGAs), etc. Such integration leads to a reconfigurable System-on-Chip (SoC) platform, an attractive choice for implementing many embedded systems. FPGA based soft processors, which are RISC processors realized using the configurable resources available on FPGA devices, are also becoming popular. Examples of such processors include Nios from Altera [3], and MicroBlaze and PicoBlaze from Xilinx [79]. One advantage of designing using soft processors is that they provide new design trade-offs by time sharing the limited hardware resources (e.g. configurable logic blocks on Xilinx FPGAs) available on the devices. Many control and management functionalities as well as computations with tightly coupled data dependencies between computation steps (e.g. many recursive algorithms such as the Levinson-Durbin algorithm [32]) are inherently more suitable for software implementations on processors than the corresponding customized (parallel) hardware implementations. Their software implementations are more
compact and require a much smaller amount of hardware resources. Such compact designs using soft processors can effectively reduce the static energy dissipation of the complete system by fitting into smaller FPGA devices [75]. Most importantly, soft processors are "configurable", allowing the customization of the instruction set and/or the attachment of customized hardware peripherals in order to speed up the computation of algorithms with a large degree of parallelism and/or to efficiently perform some auxiliary management functionalities as described in this chapter. The Nios processor allows users to customize up to five instructions. The MicroBlaze processor supports various dedicated communication interfaces for attaching customized hardware peripherals to it.
Real-time operating systems (RTOSs) simplify the application development process by decomposing the application code into a set of separate tasks. They provide many effective mechanisms for task and interrupt management. New functions can be added to the RTOS without requiring major changes to the software. Most importantly, for preemptive real-time operating systems, when low priority tasks are added to the system, their impact on the responsiveness of the system is minimized. By doing so, time-critical tasks and interrupts can be handled quickly and efficiently. Therefore, real-time operating systems are widely adopted in the development of many embedded applications using soft processors.
Energy efficiency is a key performance metric in the design of many embedded systems. Modern FPGAs offer many on-chip energy management mechanisms to improve the energy efficiency of designs using them. One important mechanism is "clock gating": the user can dynamically stop the distribution of clock signals to some specific components when the operations of these components are not required. Another important mechanism is the dynamic switching of clock sources. Many FPGA devices provide multiple clock sources with different operating frequencies and clock phases. When communicating with low-speed off-chip peripherals, portions of the FPGA device can dynamically switch to operate with a slow clock frequency and thus reduce the energy dissipation of the system. For example, Xilinx FPGAs provide BUFGCEs to realize "clock gating" and BUFGMUXs to realize dynamic switching of clock sources [79]. These BUFGCEs and BUFGMUXs automatically avoid glitches in the clock distribution network when changing the status of the network. Customization of the FPGA based soft processor to incorporate these on-chip energy management mechanisms is crucial for improving energy efficiency when implementing real-time operating systems on it.
In this chapter, we focus on the following key design problem. The FPGA device is configured with one soft processor and several on-chip customized hardware peripherals. The processor and the hardware peripherals communicate with each other through some specific bus interfaces. There is a set of tasks to be run on the soft processor based system. Each of the tasks is given as a C program. Software drivers for controlling the hardware peripherals are provided. Tasks can invoke the corresponding hardware peripherals through these software drivers in order to perform some I/O operations or to speed up the execution of some specific computations. The execution of a task may involve only a portion of the on-chip hardware resources. For example, the execution of a task may require the processor, a specific hardware peripheral, and the bus interface between them, while the execution of another task may only require the processor, the memory interface controllers and the memory blocks that store the program instructions and data for this task. Tasks are either executed periodically or invoked by some external interrupts. Tasks have different execution priorities and are scheduled in a preemptive manner. Tasks with higher priorities are always selected for execution and can preempt the execution of tasks with lower priorities. Besides, the rates at which the tasks are executed are considered to be "infrequent". That is, there are intervals during which no tasks are under execution. Based on these assumptions, our objective is to perform customization of the soft processor and adapt it to the real-time operating system running on it so that the energy dissipation of the complete system is minimized. We focus on the utilization of on-chip programmable resources and energy management mechanisms to achieve our objective. Also, it is desired that the required changes to the software source code of the operating systems are kept minimal when applying the energy management techniques.
To address the design problem described above, we propose a technique based on hardware-software co-design for energy efficient implementation of real-time operating systems on soft processors. The basic ideas of our approach are the integration of on-chip energy management mechanisms and the employment of a customized "hardware" assisted task management component by utilizing the configurability offered by soft processors. More specifically, we tightly couple several customized hardware components to the soft processor, which perform the following functionalities for energy management: (1) manage the clock sources for driving the soft processor, the hardware peripherals and the bus interfaces between them; (2) perform the task and interrupt management responsibilities of the operating system cooperatively with the processor; (3) selectively wake up the processor and the corresponding hardware components for task execution based on the hardware resource requirements of the tasks. Since the attached hardware peripherals can work concurrently with the soft processor, the proposed hardware-software co-design techniques incur only an extra clock cycle of task and interrupt management overhead, which is negligible for the development of many embedded systems compared with the software management overhead. Most importantly, due to the parallel processing capability of the hardware peripherals, we show that for some real-time operating systems, our techniques can actually lead to lower management overhead compared with the corresponding "pure" software implementations. In addition, we implement a real-time operating system on a state-of-the-art soft processor to illustrate our approach. The development of several embedded applications is provided in both Section 6.6 and Section 6.7 to demonstrate the effectiveness of our approach. Actual measurement on an FPGA prototyping board shows that the systems implemented using our power management techniques achieve energy reductions ranging from 73.3% to 89.9%, with 86.8% on average, for the different execution scenarios of the applications considered in our experiments.
This chapter is organized as follows. Section 6.2 discusses the background information about real-time operating systems as well as the related work on customization of FPGA based soft processors. Section 6.5 presents our hardware-software co-design technique for energy efficient implementation of real-time operating systems on FPGA based soft processors. The implementations of two popular operating systems through customization of a state-of-the-art soft processor are described in Section 6.6 and Section 6.7 in order to illustrate our energy management techniques. The development of two embedded FFT applications, the execution of which is typical in many embedded systems, is provided in both of these two sections to demonstrate the effectiveness of our approach. Finally, we conclude in Section 6.8.
6.2 Real-time operating systems
6.2.1 Background
Real-time operating systems may encompass a wide range of different characteristics. We focus on the typical characteristics listed below in this chapter. We retain these important features of real-time operating systems when applying our energy management techniques to improve their energy efficiency.
• Multitasking: The operating system is able to run multiple tasks "simultaneously" through context switching. Each task is assigned a priority number. Task scheduling is based on the priority numbers of the tasks ready to run. When there are multiple tasks ready for execution, the operating system always picks up the task with the highest priority for execution.
• Preemptive: The task under execution may be intercepted by a task with a higher priority or by an interrupt. A context switch is performed during the interception so that the intercepted task can resume execution later.
• Deterministic: Execution times for most of the functions and services provided by the operating system are deterministic and do not depend on the number of tasks running in the user application. Thus, the user should be able to calculate the amount of time the real-time OS takes to execute a function or a service.
• Interrupt management: When interrupts occur, the corresponding interrupt service routines (ISRs) are executed. If a higher priority task is awakened as a result of an interrupt, this highest priority task runs as soon as all nested interrupts are completed.
6.2.2 Off-the-shelf operating systems
Several commercial real-time operating systems have been ported to soft processors. ThreadX [22], a real-time operating system from Express Logic, Inc., has been ported to both the MicroBlaze and Nios processors. In ThreadX, only the system services used by the application are brought into the run-time image of the operating system. Thus, the actual size of the operating system is determined by the application running on it.
6.2.2.1 MicroC/OS-II
MicroC/OS-II is a real-time operating system that supports all the important characteristics discussed above [42]. Each task within MicroC/OS-II has a unique priority number assigned to it. The operating system can manage up to 64 tasks. There is a port of MicroC/OS-II for the MicroBlaze processor. In this implementation, the operating system runs a dummy idle task when no useful task waits for execution and no interrupt is present. Xilinx also provides a real-time operating system for application development using MicroBlaze. However, aside from the various benefits offered by these real-time operating systems, none of these implementations of real-time operating systems addresses the energy management issue in order to improve their energy efficiency when running on FPGA based soft processors.
6.2.2.2 TinyOS
TinyOS [33] is an operating system designed for wireless embedded sensor networks. It adopts a component-based architecture, which leads to a tighter integration between user application code and the OS kernel than traditional operating systems. Not all the features of real-time operating systems discussed in Section 6.2.1 are supported in TinyOS. For example, it may delay the processing of tasks till the arrival of the next interrupt. The unpredictable and fluctuating wireless transmission makes such full support unnecessary. TinyOS puts the processor into a low-power sleep mode when no processing is required and wakes it up when interrupts occur. Lacking the customized hardware peripherals proposed in this chapter to take over the management responsibilities when the processor is in sleep mode, TinyOS would result in unnecessary wake-ups of the processor when processing tasks such as the periodic management of OS clock ticks.
Operating systems with a component based architecture provide a set of reusable system components. Each component performs some specific functionalities (e.g. OS services or FFT computation). Implementations of the components can be pure software programs or software wrappers around hardware components. An application designer builds an application by connecting the components using some "wiring specifications", which are independent of the implementations of the components. Decomposing the different functionalities provided by the operating system into separate components allows unused functionalities to be excluded from the application.
nesC is a system programming language based on the ANSI C language that supports the development of component based operating systems. It has been used to build several popular operating systems such as TinyOS [33]. It is also used to implement networked embedded
systems such as Motes [19]. There are two types of components in nesC: modules and configurations. Modules provide the actual implementations of the functionalities as well as the interfaces with other components. Interfaces describe the interactions between different components and are the only access point to the components. Configurations are used to wire components together, connecting interfaces used by a component to interfaces of another component. An application described in nesC contains a top-level configuration that wires together the components used by it. In addition, nesC provides the following features to improve the time performance of component based operating systems.
• Task and event based concurrency: nesC provides two kinds of concurrency: tasks and events. Tasks are a deferred execution mechanism. Components can post tasks and the post operation immediately returns. The actual execution of the tasks is deferred till the task scheduler releases them later. Tasks run to completion and do not preempt each other. To ensure low task execution latency, individual tasks must be short. Tasks with long execution times should be decomposed into multiple tasks in order to keep the system reactive. Events also run to completion, but may preempt the execution of a task or another event. Events signify either the completion of a split-phase operation (discussed below) or the occurrence of interrupts (e.g., message reception or timer time-out). The simple concurrency model of nesC allows high concurrency with low overhead. This is in contrast to the thread-based concurrency model used by the operating systems discussed in Section 6.2.2, where the thread stacks consume precious memory space while blocking incoming processing requests.
• Split-phase operations: Since tasks in nesC run non-preemptively, the operations of the tasks with long latency are executed in a split-phase manner. Interfaces in nesC are bi-directional. One component can request another component to perform an operation by issuing a command through an interface. For a split-phase operation, the command returns immediately. The component can be notified of the completion of the operation by implementing the corresponding event for the operation through the same interface (a C-style sketch of this pattern is given at the end of this subsection).
• Whole-program optimization: According to the wiring of the components that constitute the application, nesC generates a single C program. Program analysis (e.g., detection of data races) and performance optimization are performed on this single C program. Such whole-program optimization leads to more compact code size and more efficient interactions between components. For example, the kernel of TinyOS requires a total of 400 bytes of program and data memory, a significant reduction of kernel size compared with the ~15 Kbytes storage requirement of the MicroC/OS-II kernel.

See [25] and [33] for more discussions about nesC and TinyOS.
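To make the split-phase idea concrete, the following C sketch illustrates the pairing of a command that returns immediately with a completion event that is signaled later. The names (fft_start, fft_complete, on_fft_done) and the callback-registration style are illustrative assumptions, not nesC syntax; nesC expresses the same pattern through bi-directional interfaces and wiring.

```c
#include <stdio.h>

/* Split-phase sketch: the "command" starts a long-running operation and
 * returns at once; a separately registered "event" handler is invoked
 * when the operation completes.  All names are illustrative. */

typedef void (*done_event_t)(const int *results, int n);

static done_event_t fft_done_handler = NULL;   /* set up by the "wiring" */
static int fft_busy = 0;

/* Command (phase 1): request the operation and return immediately. */
int fft_start(const int *samples, int n)
{
    (void)samples; (void)n;
    if (fft_busy) return -1;        /* caller must retry later */
    fft_busy = 1;
    /* In a real system the hardware (or a posted task) does the work here. */
    return 0;
}

/* Called by the lower layer, e.g. from an interrupt, when the work is done
 * (phase 2); it signals the completion event back to the requester. */
void fft_complete(const int *results, int n)
{
    fft_busy = 0;
    if (fft_done_handler)
        fft_done_handler(results, n);
}

/* The requesting component "implements the event" by registering a handler. */
static void on_fft_done(const int *results, int n)
{
    (void)results;
    printf("FFT finished, %d points\n", n);
}

int main(void)
{
    int data[16] = {0};
    fft_done_handler = on_fft_done;   /* wiring */
    fft_start(data, 16);              /* returns immediately */
    fft_complete(data, 16);           /* simulated completion event */
    return 0;
}
```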
6.3 On-chip energy management mechanisms
For the commercial and academic real-time operating systems discussed above, all the task and interrupt management responsibilities of the operating systems are implemented in "pure" software. In these implementations, the operating systems have little or no control of the status of the processor and its peripherals. As is discussed in Section 6.2.2, even though several real-time operating systems have already been ported to soft processors, to the best of our knowledge, there have been no attempts to perform customized "configuration" of the soft processors and provide energy management capability to the operating systems in order to improve their energy efficiency. The dynamic voltage and frequency scaling (DVFS) technique has proved to be an effective method of achieving low power consumption while meeting the performance requirements [34]. DVFS is employed by many real-time operating systems. However, DVFS cannot be realized directly on current reconfigurable SoC platforms and requires additional and relatively sophisticated off-chip hardware support. Thus, application of DVFS is out of scope for the design problem discussed in this chapter. Another effective energy management technique is to power off the processors when no tasks are ready for execution. This technique has been used in the design of operating systems such as TinyOS [33]. In these designs, the hardware component that controls the powering on/off of the processors is loosely coupled with the operating systems and shares little or no responsibilities of the operating system with the processor. For example, the OS clock ticks can only be managed by the processor in the design of MicroC/OS-II shown in [42]. This would result in unnecessary waking up of the processor and undesired energy dissipation in many cases. For TinyOS, such powering on and off may also cause delays for executing tasks with high priority. Besides, selective waking up of the hardware components depending on the resource requirements of the tasks is not realized in these designs.
6.4 Related work
There have been quite a few efforts on customizing soft processors in order to optimize their performance for a set of target applications. Cong et al. propose a technique based on shadow registers so as to get a better utilization of the limited data bandwidth between the soft processor and the tightly coupled hardware peripherals [16]. In their technique, the core register file of the soft processor is augmented by an extra set of shadow registers which are conditionally written by the soft processor in the write-back stage and are read only by the hardware peripherals attached to the processor.

Shannon et al. propose a programmable controller with a SIMPPL (Systems Integrating Modules with Predefined Physical Links) system computing model as a flexible interface for integrating various on-chip computing elements [66]. The programmable controller allows the addition of customized hardware peripherals as computing elements. All the computing elements within the controller can communicate with each other through a fixed physical interface. Their approach enables users to easily adapt the controller to new computing requirements without necessitating the redesign of other elements in the system.

Besides, Hauck et al. propose an optimization technique based on run-time reconfiguration units to adapt the processor to ever-changing computing requirements [89]. Sun et al. propose a scalable synthesis methodology for customizing the instruction set of the processor based on a specific set of applications of interest [72]. Hubner et al. have a design based on multiple soft processors for automotive applications. In their design, each of the multiple soft processors is optimized to perform some specific management responsibilities.

In spite of all the previous work on soft processors, to the best of our knowledge, the hardware-software co-design technique proposed in this chapter is the first attempt to customize a soft processor to optimize the energy performance of the real-time operating system running on it.
6.5 Our approach
The basic idea of our hardware-software co-design based technique for energy efficient implementation of real-time operating systems is to tightly couple several dedicated hardware peripherals to the soft processor. These hardware peripherals cooperatively perform the task and interrupt management responsibilities of the operating system together with the software portions of the operating system running on the soft processor. These hardware peripherals also control the activation states of the various on-chip hardware components, including the processor, by utilizing the on-chip energy management mechanisms. They take over the task and interrupt management responsibilities of the operating system running on the soft processor when there is no task ready for execution. In this case, except for these energy management components, all the other hardware components including the soft processor are completely shut off when they are not processing any "useful" tasks. There are two major reasons for the improvement of energy efficiency using our hardware-software co-design technique. One reason is that the energy dissipation of these attached dedicated hardware components for performing task and interrupt management is much smaller than that of the soft processor. Another reason is that we selectively wake up the hardware peripherals attached to the processor and put them into proper activation states based on the hardware resource requirements of the tasks under execution. Thus, the undesired energy dissipation of the hardware peripherals that are not required for execution can be eliminated and the energy efficiency of the complete system can be further improved.

Figure 6.1: Overall hardware architecture
The overall hardware architecture of our hardware-software co-design approach is shown in Figure 6.1. Three hardware components, namely the clock management unit, the auxiliary task and interrupt management unit, and the selective component wake-up unit, are added to the soft processor to perform energy management.
• Clock management unit: Many FPGA devices provide different clock sources on a single chip. The FPGA device can then be divided into different clock domains. Each of the clock domains can be driven by a different clock source. Dynamic switching between clock sources with different operating frequencies within a few clock cycles is also possible. For example, when the processor is communicating with a low-speed peripheral, instead of running the processor in some dummy software loops while waiting for the response from the slow peripheral, the user can choose to switch the processor and the related hardware peripherals to a clock source with a slower operating frequency. This can effectively reduce the energy dissipated by the processor. Clock gating is another important technique used in FPGA designs to reduce energy dissipation. The user can dynamically change the distribution of clock signals and disable the transmission of clock signals to the hardware components that are not in use.

The clock management unit provides explicit control access to the clock distribution network of the FPGA device. It accepts control signals from other components and changes the clock sources that drive the various hardware components using the two techniques discussed above.
• Auxiliary task and interrupt management unit: We attach an auxiliary task and interrupt management (ATIM) hardware peripheral to the soft processor. We let the internal data of the operating system for task scheduling and interrupt management be shared between the ATIM unit and the soft processor. Hence, when the operating system determines that no task is ready for execution and no interrupt is present, the ATIM unit sends out signals to the clock management unit and disables the transmission of clock signals to the soft processor and the other hardware peripherals attached to it. The ATIM unit takes over the task and interrupt management responsibilities of the operating system. When the ATIM unit determines that a task is ready for execution or any external interrupt arrives, it wakes up the processor and the related hardware peripherals and hands the task and interrupt management responsibilities back to the processor. Note that to retain the deterministic feature of the real-time operating system, dedicated bus interfaces with deterministic delays are used for communication between the processor and the other hardware components.
• Selective component wake-up and activation state management unit: Since the execution of a task uses only a portion of the device, we provide a selective wake-up state management mechanism within the OS kernel. Using the clock management unit, the FPGA device is divided into several clock domains. Each of the clock domains can be in either an "active" or an "inactive" (clock gated) state. We denote a combination of the activation states of these different clock domains as an activation state of the device. It is the user's responsibility to assign a task to an appropriate activation state of the device in which the hardware components required by the task are all active. Thus, when a task is selected by the operating system for execution, only the specific components used by the task are driven by the clock sources. The unused components are kept in the clock gated state in order to reduce power consumption.
In the following sections, we implement two state-of-the-art real-time operating systems on the MicroBlaze soft processor. We perform customization of the MicroBlaze processor and enhance the energy performance of the two real-time operating systems by applying the hardware-software co-design energy management technique proposed in this chapter.

Regarding the management overhead caused by our hardware-software co-design techniques, since the attached hardware peripherals can work concurrently with the soft processor, the proposed techniques incur only an extra clock cycle of task and interrupt management overhead, which is negligible for the development of many embedded systems compared with the software management overhead. Most importantly, due to the parallel processing capability of the hardware peripherals, we show that for some real-time operating systems (e.g., h-TinyOS discussed in Section 6.7), our techniques can actually lead to less management overhead compared with the corresponding "pure" software implementations.
6.6 An implementation based on MicroC/OS-II
The hardware architecture of the complete MicroC/OS-II operating system is shown in Figure 6.2. Except for the "priority aliasing" technique discussed in Section 6.6.4, our energy management techniques are transparent to the software portions of the operating system and do not require changes in software.
6.6.1 Customization of MicroBlaze soft processor
The instructions and program data of the MicroC/OS-II operating system are stored in the BRAMs. The MicroBlaze processor gets access to the instructions and the program data through two LMB (Local Memory Bus) interface controllers, one for instruction-side and the other for data-side access.

There are two different kinds of bus interfaces to attach customized hardware peripherals to MicroBlaze. One kind of bus interface is Fast Simplex Links (FSLs), which are dedicated interfaces for tightly coupling high speed peripherals to MicroBlaze. A 32-bit data word can be sent between the processor and its hardware peripherals through FSLs in a fixed time of two clock cycles. To retain the real-time responsiveness of the operating system, FSL interfaces are used for attaching the energy management hardware peripherals to the MicroBlaze soft processor. The other interface is the On-chip Peripheral Bus (OPB) interface, a shared bus interface for attaching hardware peripherals that operate at a relatively low frequency compared to the MicroBlaze processor. For example, peripherals for accessing general purpose input/output (GPIO) interfaces and those for managing communication through serial ports can be attached to MicroBlaze through the OPB bus interface.
6.6.2 Clock management unit
The hardware architecture of the clock management unit is illustrated in Figure 6.3. Xilinx Spartan-3/Virtex-II/Virtex-II Pro/Virtex FPGAs integrate on-chip digital clock management (DCM) modules. Each DCM module can provide different clock sources (CLK0, CLK2X, CLKDV, CLKFX, etc.). Each of these clock sources can be used to form a clock domain and drive the hardware components within its own domain. Thus, different on-chip hardware components can operate at different operating frequencies. For example, on Spartan-3 FPGA devices, CLKDV of the DCM module can divide the input clock by up to 16. For the design examples shown in Section 6.6 and Section 6.7, when the input clock frequency is 50 MHz, the output clock frequency of CLKDV can be as low as 3.125 MHz.

Figure 6.2: Configuration of the MicroBlaze soft processor with the COMA scheme

Figure 6.3: An implementation of the clock management unit
There are multiplexers (i.e., BUFGMUXs) within the clock distribution network. BUFGMUXs can be used to dynamically switch the clock sources with different operating frequencies for driving the soft processor, other hardware peripherals, and the communication interfaces between them. Besides, Xilinx FPGAs integrate buffers with an enable port (BUFGCEs) within their clock distribution network, which can be used to realize "clock gating". That is, these BUFGCEs can be used to drive the corresponding hardware components only when these components are required for execution.

Figure 6.4: Linked list of task control blocks

The clock management unit accepts control signals from the selective component wake-up and activation state management unit discussed in Section 6.6.4 and changes the clock sources for driving the soft processor, the memory controllers, the OPB bus and the FSLs connected to the soft processor accordingly. Using BUFGMUXs and BUFGCEs, the clock management unit can change the activation state of the FPGA device within one or two clock cycles upon receiving a request from the other management hardware components. As is analyzed in Section 6.6.5, compared with the software overhead for context switching between tasks, the addition of the clock management unit introduces negligible overhead to the operating system.
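As a sketch of how software might drive such a unit, the following C fragment writes a hypothetical memory-mapped control register of the clock management unit to select a clock source (BUFGMUX) and gate individual clock domains (BUFGCE). The base address, bit fields, and register name are illustrative assumptions and are not part of the actual design.

```c
#include <stdint.h>

/* Hypothetical register map of the clock management unit (illustrative only). */
#define CLK_MGMT_BASE   0x81000000u                               /* assumed address */
#define CLK_MGMT_CTRL   (*(volatile uint32_t *)(CLK_MGMT_BASE))

/* Assumed bit fields: bit 0 selects CLK0 (0) or CLKDV (1) through a BUFGMUX;
 * bits 4..7 are BUFGCE enables for the OPB bus, GPIO, UART and FFT domains. */
#define CLK_SEL_CLKDV   (1u << 0)
#define CLK_EN_OPB      (1u << 4)
#define CLK_EN_GPIO     (1u << 5)
#define CLK_EN_UART     (1u << 6)
#define CLK_EN_FFT      (1u << 7)

/* Switch to the divided clock and keep only the OPB/UART domain running,
 * e.g. while the processor services a slow serial transfer. */
void enter_slow_uart_state(void)
{
    CLK_MGMT_CTRL = CLK_SEL_CLKDV | CLK_EN_OPB | CLK_EN_UART;
}

/* Return to the full-speed clock with all domains enabled. */
void enter_full_speed_state(void)
{
    CLK_MGMT_CTRL = CLK_EN_OPB | CLK_EN_GPIO | CLK_EN_UART | CLK_EN_FFT;
}
```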
6.6.3 Auxiliary task and interrupt management unit
In order to take over the responsibilities of the operating system when the processor is turned off, the auxiliary task and interrupt management unit performs three major functionalities: ready task list management, OS clock tick management, and interrupt management. They are described in detail as follows.
• Ready task list management: MicroC/OS-II maintains a ready list consisting of two variables, OSRdyGrp and OSRdyTbl, to keep track of the task status. Each task is assigned a unique priority number between 0 and 63. As shown in Figure 6.5, tasks are grouped (eight tasks per group) and each group is represented by one bit in OSRdyGrp. OSRdyGrp and OSRdyTbl actually form a table. Each slot in the table represents a task with a specific priority. If the value of a slot equals zero, the task represented by this slot is not ready for execution. Otherwise, if the value of the slot is one, the task represented by this slot is ready for execution. Whenever there are tasks ready for execution, the operating system searches the two variables in the ready list table to find the task with the highest priority that is ready to run. A context switch is performed if the selected task has higher priority than the task under execution. As mentioned in Section 6.2.2, MicroC/OS-II always runs an idle task (represented by slot 63) with the lowest priority when no other task is ready for execution.

We use compiler constructs to explicitly specify the storage location of the ready task list (i.e., OSRdyGrp and OSRdyTbl) on the dual-port BRAMs. The MicroBlaze processor gets access to OSRdyGrp and OSRdyTbl through the data side LMB bus controller attached to port A of the dual-port BRAMs, while the ATIM component can also get access to these two variables through port B of the BRAMs. The ATIM component keeps track of the two variables of the ready list with a user-defined period. When it detects that only the idle task is ready for execution, it signals the clock management unit to disable the clocks sent to the processor, its hardware peripherals and the bus interfaces between them. These "clock gated" components resume their normal states when useful tasks become ready for execution and/or external interrupts are presented.

Figure 6.5: Ready task list
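For illustration, a simplified C sketch of the ready-list structure and the highest-priority lookup is shown below. MicroC/OS-II itself uses a table-driven bit scan rather than the linear scan used here, and the section attribute used to pin the variables to a BRAM-backed memory region is an assumed compiler construct; the actual placement is done in the linker script of the project.

```c
#include <stdint.h>

/* Ready list shared with the ATIM unit through the dual-port BRAM.
 * The section name ".shared_bram" is an assumption for illustration. */
volatile uint8_t OSRdyGrp    __attribute__((section(".shared_bram")));
volatile uint8_t OSRdyTbl[8] __attribute__((section(".shared_bram")));

/* Return the priority (0 = highest) of the highest-priority ready task,
 * or -1 if no task is ready.  A plain bit scan is used here for clarity;
 * the real kernel uses a 256-entry unmap table instead. */
int highest_ready_priority(void)
{
    for (int grp = 0; grp < 8; grp++) {
        if ((OSRdyGrp & (1u << grp)) == 0)
            continue;                        /* no ready task in this group */
        for (int bit = 0; bit < 8; bit++) {
            if (OSRdyTbl[grp] & (1u << bit))
                return (grp << 3) | bit;     /* eight tasks per group */
        }
    }
    return -1;                               /* ready list is empty */
}
```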
• OS clock tick management: OS clock ticks are a special kind of interrupt that is used by MicroC/OS-II to keep track of the time experienced by the system. A dedicated timer is attached to the MicroBlaze processor and repeatedly generates time-out interrupts with a pre-defined interval. Each interrupt generated by this timer corresponds to one clock tick. MicroC/OS-II maintains an 8-bit counter for counting the number of clock ticks experienced by the operating system. Whenever the MicroBlaze processor receives a time-out interrupt from the timer, MicroC/OS-II increases the clock tick counter by one. The counting of clock ticks is used to keep track of time delays and timeouts. For example, the period at which a task is repeatedly executed is counted in clock ticks. Also, a task can stop waiting for an interrupt if the interrupt of interest fails to occur within a certain number of clock ticks. We use a dedicated hardware component to perform the OS clock tick management and separate it from the management of interrupts for other components through the OPB bus. By doing so, frequent powering on and off of the OPB bus controller to notify the processor of the clock tick interrupt, which is required by other interrupts and would result in unnecessary energy dissipation, can be avoided.
When a task is created, it is assigned a task control block, OS_TCB. A task control block is a data structure used by MicroC/OS-II to maintain the state of a task when it is preempted. The task control blocks are organized as a linked list, which is shown in Figure 6.4.

When the task regains control of the CPU, the task control block allows the task to resume execution from the previous state where it stopped. In particular, the task control block contains a field OSTCBDly, which is used when a task needs to be delayed for a certain number of clock ticks or a task needs to pend for an interrupt to occur within a timeout. In this case, the OSTCBDly field contains the number of clock ticks the task is allowed to wait for the interrupt to occur. When this variable is 0, the task is not delayed or has no timeout when waiting for an interrupt. MicroC/OS-II requires a periodic time source. Whenever the time source times out, the operating system decreases OSTCBDly by 1 till OSTCBDly = 0, which indicates that the corresponding task is ready for execution. The bits in OSRdyGrp and OSRdyTbl representing this task will then be set accordingly.
In order to apply the proposed energy management technique, the OSTCBDly field for each task is stored at a specific location of the dual-port BRAMs. The MicroBlaze processor gets access to OSTCBDly through the data side LMB bus controller attached to port A of the dual-port BRAMs. On the other hand, the ATIM component can also get access to OSTCBDly through port B of the BRAMs.

When the MicroBlaze processor is turned off by the clock management unit, the ATIM unit takes over the management responsibilities of MicroC/OS-II from the processor. The ATIM unit gets access to the OSTCBDly fields of the task control blocks through the other port of the dual-port BRAMs and decreases their values when an OS clock tick interrupt occurs. Whenever the OSTCBDly field of some task reaches zero, which means that this task is ready for execution, the ATIM unit signals the clock management unit, which brings up the processor together with the related hardware components to process the tasks that are woken up and ready for execution.
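The clock-tick bookkeeping that the ATIM unit mirrors in hardware can be summarized by the C sketch below. The fixed-size delay array standing in for the per-task OSTCBDly fields and the shared-BRAM section name are simplifying assumptions; the real kernel walks the linked list of OS_TCBs shown in Figure 6.4.

```c
#include <stdint.h>

#define MAX_TASKS 64

/* Ready list and per-task delay counters, shared with the ATIM unit through
 * the dual-port BRAM (the section name is an assumed compiler construct). */
volatile uint8_t  OSRdyGrp            __attribute__((section(".shared_bram")));
volatile uint8_t  OSRdyTbl[8]         __attribute__((section(".shared_bram")));
volatile uint16_t task_delay[MAX_TASKS] __attribute__((section(".shared_bram")));

/* Mark the task with the given priority as ready (group bit + table bit). */
static void mark_task_ready(int prio)
{
    OSRdyGrp            |= (uint8_t)(1u << (prio >> 3));
    OSRdyTbl[prio >> 3] |= (uint8_t)(1u << (prio & 0x07));
}

/* Executed on every OS clock tick.  When the processor is clock gated, the
 * ATIM unit performs the equivalent operation through port B of the BRAM,
 * so the processor is woken up only when a counter actually reaches zero. */
void on_clock_tick(void)
{
    for (int prio = 0; prio < MAX_TASKS; prio++) {
        if (task_delay[prio] == 0)
            continue;                      /* task not delayed or not pending */
        if (--task_delay[prio] == 0)
            mark_task_ready(prio);         /* delay expired: task becomes ready */
    }
}
```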
• Interrupt management: The architecture of the interrupt management unit is shown in Figure 6.6. When external interrupts arrive, the interrupt management unit checks the status of the MicroBlaze processor and the OPB bus controller. If neither the processor nor the OPB bus controller is active, the interrupt management unit performs the following operations: (1) send out control signals to the selective component wake-up unit to enable the processor and the OPB controller; (2) notify the processor of the external interrupts; (3) when the processor responds and starts querying, send out the interrupt vector through the OPB bus and notify the soft processor of the sources of the incoming external interrupts. If both the processor and the OPB bus controller are already in the active state, the interrupt management unit skips the wake-up operation and goes on directly with operations (2) and (3) described above.

Figure 6.6: Interrupt management unit
6.6.4 Selective wake-up and activation state management unit
To minimize the required changes to the software portion of the MicroC/OS-II operating system, we employ a technique called "priority aliasing" to realize the selective component wake-up and activation state management unit. As shown in Figure 6.5, MicroC/OS-II uses an 8-bit unsigned integer to store the priority of a task. Since MicroC/OS-II supports a maximum of 64 tasks, only the last 6 bits of the unsigned integer are used to denote the priority of a task. When applying the "priority aliasing" technique, we use the first 2 bits of a task's priority to denote the four different combinations of activation states of the different hardware components (i.e., an activation state of the FPGA device). Thus, the user can assign a task with a specific priority using four different task priority numbers. Each of these four numbers corresponds to one activation state of the FPGA device. It is up to the user to assign an appropriate activation state of the device to a task. One possible setting of the two-bit information for the design example discussed in Section 6.6.6 is shown in Figure 6.7. The user can choose to support other activation states of the FPGA device by making appropriate changes to the clock management unit and the selective component wake-up management unit.

Figure 6.7: Priority aliasing for selective component wake-up and activation state management

Note that when applying the "priority aliasing" technique, the processor can also change the activation state of the device during its normal operation depending on the tasks that are under execution. As shown in Figure 6.4, the OSTCBPrio fields in the task control blocks are stored at specific locations on the dual-port BRAMs and are accessible to both the auxiliary task and interrupt management unit and the MicroBlaze processor through the dual-port memory blocks on the FPGA device. When MicroBlaze is active, it can send the first two bits of the task priority numbers, which determine the activation states of the device, to the selective wake-up management unit through an FSL bus interface before it starts executing a task.
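A minimal C sketch of how an aliased priority could be packed and unpacked is shown below. The macro names and the specific activation-state encoding (which only loosely follows the example of Figure 6.7) are illustrative assumptions, not part of the MicroC/OS-II API.

```c
#include <stdint.h>

/* Priority aliasing: bits 7..6 carry the activation state of the FPGA
 * device, bits 5..0 carry the MicroC/OS-II task priority (0..63). */
#define PRIO_BITS        6u
#define PRIO_MASK        0x3Fu

/* Example activation states (encoding assumed for illustration):
 * 00 = processor + LMB only, 01 = + OPB/GPIO, 10 = + UART @ CLKDV, 11 = all on. */
#define STATE_CPU_ONLY   0x0u
#define STATE_GPIO       0x1u
#define STATE_UART_SLOW  0x2u
#define STATE_ALL_ON     0x3u

#define MAKE_ALIASED_PRIO(state, prio)  ((uint8_t)(((state) << PRIO_BITS) | ((prio) & PRIO_MASK)))
#define ALIASED_STATE(p)                ((uint8_t)((p) >> PRIO_BITS))
#define ALIASED_PRIO(p)                 ((uint8_t)((p) & PRIO_MASK))

/* Example: the FFT computation task runs at OS priority 10 in state 00,
 * so the same task could also be created as 74, 138 or 202 to select
 * one of the other three activation states. */
enum { FFT_TASK_PRIO = MAKE_ALIASED_PRIO(STATE_CPU_ONLY, 10) };
```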
6.6.5 Analysis of management overhead
Figure 6.8 shows the typical context switch overhead of the MicroC/OS-II operating system. Task A finishes its processing and calls OSTimeDly() to stop its execution, which will resume at a later time. Then, a context switch is performed by MicroC/OS-II and task B is selected for execution as a result of the context switch. During the function call OSCtxSw() for the actual context switch operations (which include saving the context of task A and then restoring the context of task B), MicroC/OS-II uses one clock cycle to issue a request to the selective component wake-up management unit. The request is processed by both the selective component wake-up management unit and the clock management unit to change the activation state of the FPGA device to the state that is specified by task B. Note that the change of activation states performed by the selective component wake-up management unit and the clock management unit is in parallel with the other required context switching operations performed by MicroC/OS-II on the MicroBlaze processor. Therefore, while the typical context switch takes between 250 and 313 clock cycles, the proposed hardware-software co-design technique incurs only one extra management clock cycle of overhead.

Figure 6.8: Typical context switch overhead

Figure 6.9: Typical interrupt overhead
Figure 6.9 illustrates the typical interrupt overhead of the MicroC/OS-II operating system. Task A is under execution when an external interrupt occurs. MicroC/OS-II responds to the incoming interrupt and executes the corresponding interrupt service routine (ISR). Task B, which has a higher priority than task A, is made ready for execution as a result of executing the ISR. Since task B runs at a different activation state from that of task A, a change in the activation state of the FPGA device is required when MicroC/OS-II is restoring the execution context for task B. Similar to the context switch case discussed above, MicroC/OS-II spends one extra clock cycle during the restoration of context in order to issue the request for changing the activation state to the selective component wake-up management unit and the clock management unit. The next 3 clock cycles of actual change of activation states are performed in the selective component wake-up management unit and the clock management unit in parallel with MicroC/OS-II, which is still in the process of restoring the context for executing task B.

To summarize, based on the analysis of the above two cases, we can see that the management overhead incurred by the proposed hardware-software co-design technique is only one clock cycle and is negligible in many practical application developments.
6.6.6 Illustrative application development
In this section, we show the development of an FFT computation application using the MicroC/OS-II operating system running on a MicroBlaze processor enhanced with the proposed hardware-software co-design technique. FFT is a widely deployed application. The scenarios of task execution and interrupt arrivals considered in our examples are typical in the design of many embedded systems such as radar, sensor network, and software defined radio systems, where energy efficiency is a critical performance metric. Actual measurements on a commercial FPGA prototyping board are performed to demonstrate the effectiveness of our hardware-software co-design approach.
6.6.6.1 Customization of the MicroBlaze soft processor
The configuration of the MicroBlaze soft processor is as shown in Figure 6.2. Two customized hardware peripherals are attached to MicroBlaze through the OPB bus interface. One peripheral is an opb_gpio hardware peripheral that accepts data coming from an 8-bit GPIO interface. The other one is an opb_uartlite hardware peripheral that communicates with an external computer through the serial UART (Universal Asynchronous Receiver-Transmitter) protocol.

Based on the configuration discussed above, the settings of the selective component wake-up and activation state management unit are shown in Figure 6.7. The first two bits of a task priority denote four different activation states of the device. "00" represents the activation state of the device in which only the processor and the two LMB bus controllers are active. Under this activation state, the processor can execute tasks whose instructions and data are stored in the BRAMs accessible through the LMB bus controllers. Access to other peripherals is not allowed in the "00" activation state. The experimental results shown in Table 6.1 indicate that the "00" state reduces the power consumption of the device by 23.6% compared with activation state "11", in which all the hardware components are active. The other activation states of the device are: "01" stands for the state in which the processor, the two memory controllers, the OPB bus controller, and the general purpose I/O hardware peripherals are active; "10" stands for the state in which the processor, the two memory controllers, the OPB bus controller, and the hardware peripheral for communication through the serial port are active. Besides, as shown in Figure 6.3, in states "00", "01" and "11", the FPGA device is driven by CLK0 while in state "10", it is driven by CLKDV, the operating frequency of which is 8 times lower than that of CLK0.
6.6.6.2 A case study
To demonstrate the effectiveness of our hardware-software co-design technique, we develop an FFT computation application using the MicroC/OS-II real-time operating system running on the MicroBlaze soft processor.
• Implementation: The FFT computation application consists of three tasks: the data-input task, the FFT computation task, and the data-output task. The data-input task is responsible for storing the input data coming from the 8-bit opb_gpio hardware peripheral to the on-chip BRAM memory blocks. The data-input task is woken up as a result of the external interrupt from the general purpose input/output ports (GPIOs) when the data become available. One task is for FFT computation while the other task is for sending out data to the external computer for displaying through the opb_uartlite peripheral. The FFT computation task performs a 16-point complex number FFT computation as described in [59]. We consider the int data type. The cos() and sin() functions are realized through table look-up. When input data is present, the MicroBlaze processor receives an interrupt from the 8-bit GPIO. MicroC/OS-II running on MicroBlaze executes the interrupt service routine (ISR), which stores the data coming from the 8-bit GPIO at the BRAMs and marks the FFT computation task as ready in the task ready list. Then, after the data input ISR completes, MicroC/OS-II begins processing the FFT computation task. The data output task is executed repeatedly with a user-defined interval. Through an opb_uartlite hardware peripheral which controls the RS-232 serial port, the data output task sends out the results of the FFT computation task to a computer where the results are displayed. Each 8-bit data word is sent out in a fixed 0.05 msec interval. MicroBlaze runs an empty for loop between the transmission intervals.
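As an illustration, the three tasks could be created as in the C sketch below, with the activation state encoded in the upper two bits of each priority as described in Section 6.6.4. The task function names, stack sizes, placeholder task bodies, and the particular state/priority pairings are illustrative assumptions; only OSTaskCreate() and OSTimeDly() are part of the MicroC/OS-II API, and the aliased priorities assume the modified kernel of this chapter, which masks off the upper two bits internally.

```c
#include "includes.h"            /* MicroC/OS-II master header (assumed project setup) */

#define STACK_SIZE 256

static OS_STK input_stack[STACK_SIZE], fft_stack[STACK_SIZE], output_stack[STACK_SIZE];

/* Aliased priority: upper two bits select the activation state (Figure 6.7),
 * lower six bits are the MicroC/OS-II priority. */
#define ALIAS(state, prio)   ((INT8U)(((state) << 6) | (prio)))

/* Placeholder bodies; the real task logic is described in the text above. */
static void data_input_task(void *pdata)  { (void)pdata; for (;;) OSTimeDly(1); }
static void fft_task(void *pdata)         { (void)pdata; for (;;) OSTimeDly(1); }
static void data_output_task(void *pdata) { (void)pdata; for (;;) OSTimeDly(1); }

void create_application_tasks(void)
{
    /* Priorities 4, 5 and 6 are illustrative; states follow Figure 6.7:
     * 01 = GPIO path active, 00 = processor + BRAM only, 10 = UART at CLKDV. */
    OSTaskCreate(data_input_task,  (void *)0, &input_stack[STACK_SIZE - 1],  ALIAS(0x1, 4));
    OSTaskCreate(fft_task,         (void *)0, &fft_stack[STACK_SIZE - 1],    ALIAS(0x0, 5));
    OSTaskCreate(data_output_task, (void *)0, &output_stack[STACK_SIZE - 1], ALIAS(0x2, 6));
}
```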
To better demonstrate the effectiveness of our energy management techniques, we generate the input data periodically and vary the input and output data rates in our experiments to show the average power reductions achieved when applying our energy management techniques.
For the experiments discussed in this chapter, the MicroBlaze processor is configured on a Xilinx Spartan-3 xc3s400 FPGA [79]. The input clock frequency to the FPGA device is 50 MHz. The MicroBlaze processor, the two LMB interface controllers, as well as the other hardware components shown in Figure 6.2, operate at the same clock frequency. An on-chip digital clock management (DCM) module is used to generate clock sources with different operating frequencies ranging from 6.25 MHz to 50 MHz for driving these hardware components. We use EDK 6.3.02 for describing the software execution platform and for compiling the software programs. ISE 6.3.02 [79] is used for synthesis and implementation of the complete applications. Actual power consumption measurement of the applications is performed using a Spartan-3 prototyping board from Nu Horizons [50] and a SourceMeter 2400 from Keithley [40]. We compare the differences in power consumption of the FPGA device when MicroC/OS-II is operated in different activation states. We ensure that except for the Spartan-3 FPGA chip, all the other components on the prototyping board (e.g., the power supply indicator, the SRAM chip) are kept in the same operating state when the FPGA device is executing under different activation states. Under these settings, we consider that the changes in power consumption of the FPGA prototyping board are mainly caused by the FPGA chip. Using the Keithley SourceMeter, we fix the input voltage to the FPGA prototyping board at 6 Volts and measure the changes of input current to it. The dynamic power consumption of the MicroC/OS-II operating system is then calculated based on the changes of input current.
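The calculation behind this measurement procedure amounts to the simple relation sketched below; the function and variable names are illustrative, and the baseline-subtraction approach reflects our reading of the setup described above.

```c
/* Dynamic power attributed to the FPGA design, computed from the change in
 * board input current at a fixed supply voltage (6 V in our measurements). */
double dynamic_power_watts(double v_supply, double i_active_amps, double i_baseline_amps)
{
    return v_supply * (i_active_amps - i_baseline_amps);
}

/* Energy for one task instance follows from average power and execution time. */
double energy_joules(double p_avg_watts, double exec_time_seconds)
{
    return p_avg_watts * exec_time_seconds;
}
```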
• Experimental results: Table 6.1 shows the power consumption of the FPGA device when MicroC/OS-II is processing the FFT computation task and the FPGA device is assigned to different activation states. Note that the measurement results shown in Table 6.1 only account for the differences in dynamic power consumption caused by these different activation states. Quiescent power, which is the power consumed by the FPGA device when there is no switching activity on it, is ignored. This is because quiescent power is fixed for a specific FPGA device and cannot be optimized using the techniques proposed in this chapter. Power reductions ranging from 6.1% to 94.7%, and 52.6% on average, are achieved by selectively waking up the various hardware components through the clock management unit.

Table 6.1: Dynamic power consumption of the FPGA device in different activation states

  State    Power (W)    Reduction*    Note
  00       0.212        57.1%         @ 50 MHz
  01       0.464        6.1%          @ 50 MHz
  10       0.026        94.7%         @ 6.25 MHz
  11       0.494        --            @ 50 MHz

  *: Power reduction is compared against that of state 11.
Figure 6.10 shows the instantaneous power consumption of the FPGA device when MicroC/OS-II is processing the data-input interrupt and the FFT computation task. At time interval "a", only the ATIM unit is active. All the other hardware components are inactive. At time interval "b", input data is presented. The GPIO peripheral raises an interrupt to notify the ATIM unit of the incoming data. Upon receiving the interrupt, the ATIM unit turns the FPGA device into activation state 01 and wakes up the MicroBlaze processor and other hardware peripherals. MicroC/OS-II begins processing the data input interrupt and stores the input data at the BRAMs through the data-side LMB bus. MicroC/OS-II also changes the ready task list and marks the FFT computation task ready for execution. Note that in order to better observe the changes of power consumption during time interval "b", some dummy loops are added to the interrupt service routine to extend its execution time. At time interval "c", the MicroBlaze processor sends commands to the selective wake-up management unit through an FSL channel and changes the activation state of the device to state 00. Once the FPGA device enters the desired activation state, MicroC/OS-II starts executing the FFT computation task. Finally, at time interval "d", MicroC/OS-II has already finished processing the data-input interrupt and the FFT computation task. By checking the ready task list and the task control blocks, the ATIM unit detects that there is no task ready for execution. It then automatically disables the transmission of clock signals to all the hardware components including the MicroBlaze processor and takes over the management responsibilities originally performed by MicroC/OS-II.

Figure 6.10: Instantaneous power consumption when processing the data-input interrupt and the FFT computation task

Figure 6.11: Instantaneous power consumption when processing the data-output task
In Figure 6.11, we show the instantaneous power consumption of the FPGA device when MicroC/OS-II is processing the data-output task. At time interval "e", the ATIM unit is active and is managing the status of the tasks. The ATIM unit decreases the OSTCBDly fields of the task control blocks. All the other hardware components including MicroBlaze are inactive and are put into "clock gated" states. At time interval "f", the ATIM unit finds that the OSTCBDly field for the data-output task reaches zero, which denotes that this task is ready for execution. The ATIM unit then sets the corresponding bit in the ready task list that represents this task to 1. It also changes the FPGA device to activation state 10 according to the first two bits of the data-output task's priority number. Under activation state 10, the MicroBlaze soft processor is woken up to execute the data-output task and send out the results of the FFT computation task through the opb_uartlite hardware peripheral. Time interval "g" is similar to time interval "d" shown in Figure 6.10. As no tasks are ready for execution during this interval, the ATIM unit disables the other hardware components including the MicroBlaze soft processor and resumes the management responsibilities of MicroC/OS-II on behalf of the MicroBlaze soft processor.

Figure 6.12: Energy dissipation of the FPGA device for processing one instance of the data-output task with different operating frequencies
Besides, the arrows shown in Figure 6.10 and Figure 6.11 identify the spikes in power consumption. These spikes are caused by the ATIM unit when it is processing the interrupts for managing OS clock ticks. Limited by the maximum sampling rate that can be supported by the Keithley SourceMeter, we are unable to observe all the spikes caused by the OS clock tick management.

Figure 6.13: Average power consumption of the FPGA device with different data input/output rates
Figure 6.12 shows the energy dissipation of the FPGA device when MicroC/OS-II is processing the data-output task and the MicroBlaze processor is operating at different operating frequencies. Energy reductions ranging from 18.4% to 36.3%, and 29.0% on average, can be achieved by lowering the clock frequency of the soft processor when communicating with the low-speed hardware peripherals.

Figure 6.13 shows the average power consumption of the FPGA device with different data input and output rates. Our energy management scheme leads to energy reductions ranging from 73.3% to 89.9%, and 86.8% on average, for the different execution scenarios considered in our experiments. Lower data input and output rates, which imply a longer system idle time between task executions, lead to more energy savings when our COMA energy management technique is applied.
6.7 An implementation based on TinyOS
In this section, we show the development of h-TinyOS, an energy efficient implementation of the popular component based operating system TinyOS on soft processors. h-TinyOS is written in nesC [25], an extension of the ANSI C language. The architecture of h-TinyOS is similar to that of the popular TinyOS [33] operating system. Thus, h-TinyOS bears all the features offered by component based operating systems discussed in Section 6.2.2.2. Through the customization of the target soft processor, h-TinyOS has a much higher energy efficiency than the original "pure" software based TinyOS operating system. To the best of our knowledge, this is the first attempt to port a component based operating system to soft processors and improve its energy efficiency by customizing the soft processors.
6.7.1 Hardware architecture
One major advantage of implementing a component based operating system on a soft processor is that the interactions between the soft processor and the customized hardware peripherals can be clearly described and efficiently controlled through the wiring specifications provided by the component based architecture of the operating system. In particular, the "h" in the name of h-TinyOS denotes that we tightly couple several customized energy management hardware peripherals to improve the energy efficiency of the operating system. The overall hardware architecture of h-TinyOS is shown in Figure 6.14. The functionalities of the customized hardware peripherals for energy management are discussed in detail in the following paragraphs.
6.7.1.1 Hardware based task and event management (TEM) unit
A customized hardware peripheral is designed to perform task and event management in h-TinyOS. The management hardware peripheral maintains a task list. When a component posts a task, the information of the task is put into the task list. To better utilize the hardware resources, each task in the task list is associated with a priority number (i.e., the pri variable of the FFT computation task shown in Figure 6.16). The management hardware peripheral always selects the task in the task list with the highest priority for execution. The management hardware peripheral also accepts incoming events and invokes the corresponding hardware peripherals to process them. One advantage of the hardware based management approach is efficient task scheduling. The management hardware peripheral can identify the next task for execution within a few clock cycles, while the corresponding software implementation usually takes tens of clock cycles. Another advantage is that the soft processor can be turned off when the management hardware peripheral is in operation, thus reducing the energy dissipation of the system. The soft processor is only woken up to perform useful computations. This is in contrast to the state-of-the-art operating systems including TinyOS discussed in Section 6.2.1, where the soft processor has to be active to perform the management functionalities. Therefore, the hardware based management approach reduces the energy dissipation caused by the task and event management.

Figure 6.14: Hardware architecture of h-TinyOS
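A sketch of how the software side of h-TinyOS might hand a posted task to the TEM unit over an FSL channel is shown below. The putfsl macro is the blocking FSL write macro provided by the MicroBlaze tool chain; the 32-bit word format (task id, priority, and activation state packed together) and the use of FSL channel 0 are assumptions made for illustration.

```c
#include <stdint.h>
#include "mb_interface.h"     /* MicroBlaze FSL access macros (putfsl, getfsl) */

/* Assumed 32-bit word format understood by the TEM unit:
 * bits 31..24 task id, bits 23..16 priority, bits 1..0 activation state. */
static uint32_t pack_task(uint8_t task_id, uint8_t priority, uint8_t state)
{
    return ((uint32_t)task_id << 24) | ((uint32_t)priority << 16) | (state & 0x3u);
}

/* Post a task to the hardware task list; this returns immediately, and the
 * TEM unit decides when the task runs and which clock domains to wake for it. */
void tem_post_task(uint8_t task_id, uint8_t priority, uint8_t state)
{
    putfsl(pack_task(task_id, priority, state), 0);   /* FSL channel 0 (assumed) */
}
```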
6.7.1.2 "Explicit" power management unit
The power management unit is a customized hardware peripheral attached to the soft processor for managing the activation states of the various hardware components on the FPGA device. The power management unit provides a software interface, through which h-TinyOS can send out commands and "explicitly" control the activation status of the FPGA device. The power management unit is realized through the clock management unit discussed in Section 6.6.2. Similar to the selective component wake-up unit discussed in Section 6.6.4, each combination of the activation states of the on-chip hardware components of interest is represented by a unique state number. By employing the technique discussed in [71], we allow the application designer to associate a state variable with a task. The state variable specifies the combination of the activation states of the hardware components (including the soft processor) when the task is under execution. The state variable is pushed into the list maintained by the TEM unit discussed above when a task is posted by a specific nesC component. Before a selected task begins execution, the TEM unit sends out the value of the state variable to the power management unit, which then turns the FPGA device into the desired activation state specified by the state variable. Only after that does the selected task actually start execution.
6.7.1.3 Split-phase operations for computations using hardware peripherals
We utilize the concept of split-phase operations supported by component based operating systems to reduce the power consumption of the system when performing computations in customized hardware peripherals with long latencies. A component can post a task, which will issue a command requesting a split-phase operation in the customized hardware peripherals when selected for execution by the management hardware peripheral. The management hardware peripheral turns off the soft processor and turns on only the hardware peripherals required for the computation while the split-phase operation is under execution. For a component that needs to be notified of the completion of the split-phase operation, the application designer can wire it to the component that requests the split-phase operation and implement the corresponding event in the wiring interface connecting the two components.

Figure 6.15: An implementation of h-TinyOS on MicroBlaze
6.7.2 Illustrative application development
6.7.2.1 Customization of the MicroBlaze soft processor
The customization of the MicroBlaze soft processor for running h-TinyOS is shown in Figure 6.15. Similar to Figure 6.2, to maintain the real-time property of the operating system, the management hardware components (i.e., the power management unit and the task and event management unit) are attached to the MicroBlaze soft processor through dedicated FSL bus interfaces. The power management unit controls the clock distribution network on the target FPGA device using the on-chip clock management resources (that is, the BUFGCEs and BUFGMUXs shown in Figure 6.3).
6.7.2.2 A case study
In this section, we develop an FFT embedded application within h-TinyOS to demonstrate the effectiveness of our hardware-software co-design technique for improving the energy performance of h-TinyOS.
• Implementation: We first develop an FFT module for performing FFT (Fast Fourier Transform) computation, the implementation of which is shown in Figure 6.16. The FFT module provides a software wrapper over a customized hardware peripheral, which is realized using an FFT IP (Intellectual Property) core from Xilinx [79] that performs the actual computation. The FFT module provides a StdControl interface and an FFT interface for interaction with other modules of the operating system. The StdControl interface is used to perform initialization during system start-up. The FFT interface provides one command, processData(). Invoking the processData() command will put ("post") an FFT computation task into the TEM unit. The application designer associates a priority (i.e., pri) and an execution status (i.e., status) with the FFT computation task. The TEM unit determines when the FFT computation task is ready for execution based on its priority. When the FFT computation task starts execution, it invokes the wrapped hardware peripheral in the FFT module to process the data stored at the BRAMs. Moreover, before the FFT computation task starts execution, the TEM unit puts the FPGA device into the power state specified by the status variable associated with the task.

The FFT interface also provides a dataReady() event. For modules that want to be notified of the completion of the FFT computation task, the application designer can "wire" these modules to the FFT module through the FFT interface and specify the desired operations within the body of the dataReady() event. Once the FFT computation task finishes and notifies the TEM unit of the completion of the computation, the operations specified in the dataReady() event will be performed.
Figure 6.16: Implementation of an FFT module
Since the actual computation of the FFT module is performed using a customized hardware peripheral, split-phase operations are employed to improve the energy efficiency of the FFT computation. More specifically, during the execution of the FFT computation task, the MicroBlaze processor is turned off by the clock management unit. Only the FFT hardware peripheral is active and processing the input data. As shown in Table 6.2, this reduces the energy dissipation of the complete system by 81.0%. When the FFT computation task finishes, if there is any module that needs to be notified of the completion of the FFT computation task and implements the dataReady() event of the FFT interface, the TEM unit will automatically wake up the MicroBlaze processor upon receiving the notification from the FFT module. Then, the MicroBlaze soft processor can continue to process the requests from other components.
While h-TinyOS can be used for a wide variety of embedded applications, we develop an FFT application based on the FFT module described above in order to demonstrate the effectiveness of h-TinyOS. The top-level configuration of the FFT application is realized as a ProcessC configuration, which is shown in Figure 6.17. It consists of five major modules: a Main module for performing initialization on system start-up; a GPIOM module for accepting the data from a 32-bit general purpose input/output (GPIO) port (i.e., the data input task); an FFTM module for wrapping a hardware peripheral for FFT computation (i.e., the FFT computation task); a UARTM module for sending out the result data to a personal computer through a serial port (i.e., the data output task); and a ProcessM module for wiring the other four modules together to form a complete FFT computation application.

Figure 6.17: Top-level configuration ProcessC of the FFT computation application

Table 6.2: Various activation states for the FFT application shown in Figure 6.17 and their dynamic power consumption

  Activation state    Combination of activation states       Dynamic power (W)    Reduction
  0                   MicroBlaze + GPIO32                    1.158                2.5%*
  1                   MicroBlaze + UART (@ 6.25 MHz)         0.716                38.7%*
  2                   All on (for comparison)                1.188                --
  3                   FFT                                    0.202                81%**
  4                   MicroBlaze + FFT (for comparison)      1.062                --

  *: compared against state 2; **: compared against state 4.
The experimental environment for h-TinyOS is similar to that discussed in Section 6.6.6.2. We configure the complete MicroBlaze soft processor on a Xilinx Spartan-3 FPGA prototyping board and measure the power/energy performance of h-TinyOS under different execution scenarios (see Section 6.6.6.2 for details regarding the techniques for measuring the power/energy dissipation of the operating system).
• Experimental results: The dynamic power consumption of h-TinyOS when executing different tasks is shown in Table 6.2. Activation states 0 and 1 correspond to the states for executing the data input and the data output tasks, respectively. With the explicit energy management functionalities provided by h-TinyOS, reductions in power consumption of 2.5% and 38.7% are achieved for these two tasks. In particular, for activation state 2, the power management unit shown in Figure 6.15 can set the operating frequencies of the MicroBlaze processor and the UART hardware peripheral from 6.25 MHz to 50 MHz. The power consumption of the FPGA device when operating at these different operating frequencies is shown in Figure 6.12. By dynamically slowing down the operating frequency of the FPGA device when executing the data output task, power reductions of up to 33.2% can be achieved. When the application requires a fixed data output rate determined by the personal computer, the power reduction directly leads to energy reduction for the data output task.
Activation state 3 denotes the state in which the FFT computation task is executed using split-phase operations. In this state, the MicroBlaze processor is turned off while the FFT computation is performed in the customized hardware peripheral. Activation state 2 represents the state that would be set by TinyOS, the "pure" software implementation, for executing the FFT computation task. Thus, for the FFT computation task, compared with the "pure" software implementation of TinyOS, the hardware-software co-design technique employed by h-TinyOS reduces the dynamic power consumption of the FPGA device by 83.0% by explicitly controlling the activation state of the device. Besides, compared with state 4, which represents the activation state that would be set by the MicroC/OS-II implementation discussed in Section 6.6.6.2, h-TinyOS reduces the dynamic power consumption of the FPGA device by 81.0% by allowing the MicroBlaze processor to be turned off during the computation.
Similar to Section 6.6.6.2, we vary the data input and output rates for the data input and data output tasks. We then measure the energy dissipation of the FPGA device using the technique discussed in Section 6.6.6.2. The measurement results are shown in Figure 6.18. h-TinyOS achieves reductions in average power consumption ranging from 74.1% to 90.1%, and 85.0% on average, for these different data rates. Lower data input/output rates, which imply a higher percentage of idle time for the operating system during operation, lead to more energy reductions.

Figure 6.18: Average power consumption for different data input/output rates for h-TinyOS
6.7.3 Analysis of management overhead
The software kernel of h-TinyOS is less than 100 bytes, as compared against the ~400 bytes required by the corresponding "pure" software implementation of TinyOS. The reduction of the software kernel size is mainly due to the fact that we let the prioritized task and event queue be maintained by the hardware based task and event management unit tightly attached to the MicroBlaze soft processor. More importantly, with such a customized task and event management hardware peripheral, h-TinyOS can respond to and begin processing arising tasks and incoming events within 5 to 10 clock cycles. Also, the time for context switching between different tasks is within 3 clock cycles, which includes one clock cycle for notifying the completion of a task, one clock cycle for identifying the next task with the highest priority for execution, and one possible clock cycle for changing the activation state of the FPGA device. This is compared against the 20 to 40 clock cycles required by the "pure" software version of TinyOS enhanced with priority scheduling, and the typical hundreds of clock cycles required by MicroC/OS-II (see Section 6.6.5 for details on the analysis of the management overhead for MicroC/OS-II). Similar to the management overhead analysis discussed in Section 6.6.5, there can be one extra clock cycle for changing the activation states of the FPGA device. However, due to the parallel processing capability offered by customized hardware designs, the task and event management unit is able to find the task in the task queue with the highest priority within one clock cycle, a considerable improvement in time compared with the for loops in the corresponding "pure" software implementation. Therefore, considering the above factors, the overall overhead for h-TinyOS to manage tasks and events is less than that of TinyOS. With such a short event response and task switching time, h-TinyOS is very efficient in handling the interactions between the processor and its hardware peripherals as well as coordinating the cooperative processing between them.
6.8 Summary
We proposed a hardware-software co-design based cooperative management technique for energy efficient implementation of real-time operating systems on FPGAs in this chapter. The implementations of two popular real-time operating systems using the proposed hardware-software co-design technique based on a state-of-the-art soft processor, as well as the development of several embedded applications, are shown to demonstrate the effectiveness of our approach.
Chapter 7
Concluding remarks and future directions
7.1 Concluding remarks
Four major contributions toward energy efficient hardware-software application synthesis using reconfigurable hardware have been presented in this thesis. Implementations of the techniques proposed in the thesis based on state-of-the-art application development tools are provided. Through the development of practical reconfigurable systems, the thesis demonstrates that the proposed techniques can lead to significant energy reductions for these systems.
7.2 Future work
As reconfigurable hardware devices become more widely used in embedded system development,
including many portable devices (e.g., cell phones and personal digital assistants), energy
efficient application development will continue to be an active research topic in both academia
and industry. Promising directions for future work, which can extend the research presented in
this thesis, are summarized in the following.
• Co-simulation: For the high-level co-simulation framework proposed in Chapter 3, we are working
to integrate it with hardware-in-the-loop co-simulation. Users can then choose to migrate any part
of the system to execute on an actual FPGA device while it is co-simulated with the other parts of
the system. In addition to potential simulation speed-ups, this can also help accelerate the
design verification process.
• Energy estimation: For the energy estimation technique proposed in Chapter 5, it is desirable
that the energy estimates be guaranteed to over-estimate the energy dissipation of the low-level
implementations. In addition, providing confidence interval information for the energy estimates
would be very useful in practical system development (one possible formulation is sketched after
this list). Another direction we are working on is rapid energy estimation based on the
synthesized netlist. This is motivated by the difficulty of obtaining the cycle-accurate
simulation models required by the rapid energy estimation technique proposed in Chapter 5.
• Energy performance optimization: For the dynamic programming based energy performance
optimization proposed in Chapter 4, various design constraints need to be considered when the end
user is targeting a specific configurable hardware device. Examples of such design constraints
include execution time and the usage of various hardware resources (e.g., the number of
configurable logic blocks, embedded multipliers, etc.). When these constraints are considered, the
energy performance optimization problem becomes NP-complete, so no polynomial-time solution is
known for this extended version of the problem. However, efficient approximation algorithms and
heuristics can still be applied to identify designs with minimum energy dissipation while
satisfying the various constraints (a simple constrained formulation is sketched after this list).
• Implementation of operating systems: For the COoperative MAnagement (COMA) technique proposed in
Chapter 6, we are working to apply the technique to embedded hard processor cores. We are
currently focusing on the IBM PowerPC 405 hard processor core integrated in the Xilinx Virtex-4 FX
series FPGAs. As discussed in Chapter 2, the embedded PowerPC has a tight integration with the
surrounding configurable logic; this integration even allows the end user to customize the
decoding process of the PowerPC processor. It is expected that the COMA technique can also
significantly reduce the energy dissipation of real-time operating systems running on the PowerPC
hard processor core.
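For the energy estimation direction above, one way to attach confidence interval information to an estimate, sketched here only as an illustration and not as part of the thesis, is to calibrate the estimator against n benchmark designs whose actual energy dissipation is measured, record the relative estimation errors, and report an interval around a new estimate based on those errors. Assuming the relative errors are approximately normally distributed (an assumption that would have to be validated experimentally):

e_i = \frac{E_i^{\mathrm{meas}} - \hat{E}_i}{\hat{E}_i}, \qquad
\bar{e} = \frac{1}{n}\sum_{i=1}^{n} e_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(e_i - \bar{e}\right)^2},

E^{\mathrm{actual}} \in \hat{E}\left(1 + \bar{e} \pm t_{\alpha/2,\,n-1}\; s\,\sqrt{1 + \tfrac{1}{n}}\right),

where t_{\alpha/2, n-1} is the Student-t quantile for the chosen confidence level. The same calibration data could also be used to bias the estimator upward so that it over-estimates the measured energy with the chosen confidence.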
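For the constrained energy performance optimization direction above, the following is a minimal C sketch of one possible formulation, not the thesis's algorithm: each task in a linear chain has several candidate implementations, each with an energy cost and an execution time, and a pseudo-polynomial dynamic program over a discretized time budget picks one candidate per task so that total energy is minimized subject to a deadline. The task set, the number of candidates, and the DEADLINE value are made-up illustrative data; a practical heuristic or approximation algorithm would replace the exact inner loops when the problem size grows.

#include <stdio.h>
#include <float.h>

#define N_TASKS   3
#define N_IMPLS   2           /* candidate implementations per task  */
#define DEADLINE  20          /* time budget, in discrete time units */

/* energy[t][k] and time_[t][k]: cost of candidate k of task t (illustrative) */
static const double energy[N_TASKS][N_IMPLS] = {{5.0, 9.0}, {4.0, 7.0}, {6.0, 10.0}};
static const int    time_[N_TASKS][N_IMPLS]  = {{10, 4},    {8, 3},     {12, 5}};

int main(void)
{
    /* best[b] = minimum energy of the tasks processed so far using at
     * most b time units; DBL_MAX marks an infeasible budget.          */
    double best[DEADLINE + 1], next[DEADLINE + 1];
    for (int b = 0; b <= DEADLINE; b++) best[b] = 0.0;

    for (int t = 0; t < N_TASKS; t++) {
        for (int b = 0; b <= DEADLINE; b++) {
            next[b] = DBL_MAX;
            for (int k = 0; k < N_IMPLS; k++) {
                int rem = b - time_[t][k];
                if (rem >= 0 && best[rem] < DBL_MAX) {
                    double e = best[rem] + energy[t][k];
                    if (e < next[b]) next[b] = e;
                }
            }
        }
        for (int b = 0; b <= DEADLINE; b++) best[b] = next[b];
    }

    if (best[DEADLINE] < DBL_MAX)
        printf("minimum energy within deadline: %.1f\n", best[DEADLINE]);
    else
        printf("no implementation meets the deadline\n");
    return 0;
}

On the illustrative data this prints a minimum energy of 22.0, obtained by mixing faster, higher-energy candidates with slower, lower-energy ones to just meet the deadline.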