USC Digital Library / University of Southern California Dissertations and Theses
A technical survey of embedded processors (USC Thesis)
A TECHNICAL SURVEY OF EMBEDDED PROCESSORS
Copyright 2002
by
Ashwin Sethuram
A Dissertation Presented to the
SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
MASTER OF SCIENCE
(ELECTRICAL ENGINEERING)
December 2002
Ashwin Sethuram
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
UMI Number: 1414853
UMI
UMI Microform 1414853
Copyright 2003 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
This thesis, written by
ASHWIN SETHURAM
under the guidance of his/her Faculty Committee and
approved by all its members, has been presented to and
accepted by the School of Engineering in partial
fulfillment of the requirements for the degree of
MASTER OF SCIENCE
Date: December 18, 2002
Faculty Committee
Chairman
ACKNOWLEDGMENTS
I would like to thank Dr. Viktor Prasanna, my advisor at the University of
Southern California, for his guidance, encouragement, support, and vision
throughout my Master’s program. In addition to academic guidance, he has been a
motivating force in the development of my professional skills and perspective of
research. In the long road to my dissertation, he provided me with excellent
direction and moral support when I faltered in my approach. I understood the
process of abstracting the underlying principles and applying problem-solving
techniques to multiple domains wholly due to his guidance. I really appreciate the
flexibility and understanding Dr. Prasanna has shown towards helping me balance
my professional and personal responsibilities. His contribution towards my
professional and career growth goes far beyond technical and academic advisement.
I have been extremely fortunate to have him as my advisor and to have known him
as a person over the last few years.
I also thank the members of my thesis committee, Dr. Monte Ung and
Dr. Cauligi S. Raghavendra for their valuable suggestions.
CONTENTS
ACKNOWLEDGMENTS......................................................................................... ii
LIST OF TABLES..................................................................................................... vi
LIST OF FIGURES................................................................................................... vii
LIST OF ABBREVIATIONS AND ACRONYMS............................................... viii
ABSTRACT............................................................................................................... xiv
Chapter
1. AN INTRODUCTION TO EMBEDDED SYSTEMS.......................... 1
History, Importance, and Basics.......................................................... 2
A Brief Insight into Embedded Systems Computing......................... 3
Components and Market Potential....................................................... 4
Embedded System Design Methodologies......................................... 7
Desktop versus Embedded Processors................................................. 13
Target Applications............................................................................... 16
Performance Benchmarks and Design Flow for
Embedded Processors.................................................................... 17
2. OVERVIEW OF SOC ARCHITECTURES........................................... 23
Introduction............................................................................................ 23
Reconfigurable Logic for SOC Architectures..................................... 24
3. SPACE AND CLASSIFICATION OF EMBEDDED
PROCESSORS........................................................................................... 26
Selection Criteria................................................................................... 26
Architectural Details Available............................................................ 26
Examples of Popular Architectures...................................................... 26
Chapter Page
4. CONFIGURABLE PLATFORMS: THE ASIC REVOLUTION.............. 46
Platform-Based Designs........................................................................... 47
Benefits of Platform-Based Designs....................................................... 48
SOC Alternative........................................................................................ 49
Platform FPGA (Virtex™ II and CoolRunner™).................................. 51
5. ENERGY-EFFICIENT RECONFIGURABLE
ARCHITECTURES....................................................................................... 54
Introduction................................................................................................ 54
Energy Management in Dynamically Reconfigurable
Mobile Systems................................................................................ 56
Power Estimation Tools and Flow.......................................................... 59
Power Estimation Flow............................................................................. 60
Architectural Adaptation for Power........................................................ 62
Energy-Efficient Reconfigurable Architectures for
DSP Applications.............................................................................. 63
Low-Energy FPGA Design...................................................................... 67
Role of Compilers in Energy Crisis of Embedded/
Reconfigurable Processors............................................................... 68
Power Savings in Embedded Processors through
Decode Filter Cache.......................................................................... 70
6. NETWORK PROCESSORS......................................................................... 72
Introduction................................................................................................ 72
Market Trends and Survey........................................................................ 73
Market Challenges..................................................................................... 75
Rise of Network Processors..................................................................... 76
Parallel Network Processors.................................................................... 77
Network Processors: Building Block for Enhanced
Network Performance...................................................................... 77
Functions.................................................................................................... 78
Network Processor Designs..................................................................... 79
Future of Network Processors.................................................................. 81
Chapter Page
7. IP ROUTING FOR THE NEXT GENERATION
NETWORK PROCESSORS..................................................................... 82
Overview of IP Routing Architectures................................................ 83
Requirements of Network Processor Architecture............................ 85
Optimizations for Network Processor Cache...................................... 88
Network Processors: Challenges/Achievements................................ 89
8. COMMERCIAL EMBEDDED PROCESSORS: AN
INDUSTRY PERSPECTIVE.................................................................... 90
Intel Corp.’s Strong ARM* SA-110 Processor................................... 90
International Business Machines Corp.’s
PowerPC NP NPe 405™ Processor............................................. 91
Transmeta Corp.’s Crusoe™ Reconfigurable Processor................... 91
9. CONCLUSIONS AND FUTURE OF EMBEDDED/
RECONFIGURABLE PROCESSORS.................................................. 98
BIBLIOGRAPHY...................................................................................................... 100
LIST OF TABLES
Table Page
1. Analysis of Blackfin™ Architecture Characteristics.............................. 30
2. Key SOC Characteristics........................................................................... 50
3. Platform FPGA Performance Metrics...................................................... 52
LIST OF FIGURES
Figure Page
1. Typical Embedded/Reconfigurable Processor......................................... 5
2. Embedded Systems Market Segment....................................................... 6
3. A Programmable Logic Function.............................................................. 10
4. Desktop Processor versus Embedded Processor..................................... 15
5. Design Flow................................................................................................ 18
6. Architecture Core and Peripheral blocks.................................................. 28
7. Memory Hierarchy Architecture............................................................... 29
8. Texas Instruments, Inc.’s OMAP™ Core............................................... 33
9. Configuration Bus...................................................................................... 36
10. Generalized Mesh Interconnect Structure............................................... 38
11. UMS Architecture...................................................................................... 42
12. Platform Mapping...................................................................................... 48
13. Platform FPGA Architecture.................................................................... 51
14. FPFA Architecture..................................................................................... 65
15. Simple Network Processor Architecture.................................................. 81
16. IP Routing Architecture Overview.......................................................... 84
17. Network Processor Architecture for IP Routing..................................... 87
LIST OF ABBREVIATIONS AND ACRONYMS
ACM Adaptable computing machine.
ACS Add-compare-select.
ALU Arithmetic and logic unit.
AMPS Advanced mobile phone service.
ASIC Application specific integrated circuits.
ASSP Application specific standard product.
ATM Asynchronous transfer mode.
BIOS Basic input output system.
CDMA Code division multiple access.
CISC Complex instruction set computer.
CLB Configurable logic block.
CLK Clock.
CLKIN Clock in.
CPU Central processing unit.
D-Cache Data cache.
DCT Discrete cosine transform.
Demux Demultiplexer.
DFC Decode filter cache.
DFT Design for testing.
DMA Direct memory access.
DRAM Dynamic random access memory.
DRC Design rule checker.
DSL Digital subscriber line.
DSP Digital signal processing.
DVD Digital video disc; digital versatile disc.
EDA Electronic design automation.
FIR Finite impulse response.
FFT Fast Fourier transform.
FPFA Field programmable functional arrays.
FPGA Field programmable gate arrays.
FSM Finite state machine.
FT Fourier transform.
GB Gigabytes.
Gbps Gigabits per second.
GFlop Gigaflop.
GHz GigaHertz.
GPA Global positioning arrays.
GPR Global positioning reference.
GPS Global positioning system.
GSM Global system for mobile communications.
HDL Hardware description language.
HPC Handheld personal computer.
IBM International Business Machines Corporation.
IC Integrated circuit.
I-Cache Instruction cache.
IDCT Inverse discrete cosine transform.
IDE Integrated Development Enterprise®.
IFC Instruction fetch cache.
I/O Input/output.
IP Internet protocol; intellectual property.
ISDN Integrated Services Digital Network.
ISO International Organization for Standardization.
KB Kilobyte.
Kwords Kilowords.
Kbps Kilo bits per second.
L1, L2 Level 1, level 2.
LAN Local area network.
LED Light emitting diode.
LCD Liquid crystal display.
MAC Multiply-accumulate.
MB Megabyte.
Mbps Megabits per second.
Mem Memory.
MHz MegaHertz.
MIPS Million instructions per second.
MMU Memory management unit.
MP3 MPEG-1 Audio Layer 3.
MPEG Motion Picture Expert Group.
MSP Multistream processor.
MSPS Mega samples per second.
MUX Multiplexer.
MW Megawatt/megawatts.
mW Milliwatt/milliwatts.
NP Network processor.
NPU Network processor unit.
OC3 Optical carrier level 3.
OC48 Optical carrier level 48.
OEM Original equipment manufacturer.
OMAP Open multimedia applications platform.
OS Operating system.
PC Personal computer.
PCI Peripheral component interconnect.
PDA Personal digital assistant; power-delay area.
PFPGA Programmable field programmable gate array.
PGA Programmable gate array.
PLA Programmable logic array.
PLD Programmable logic device.
PLL Phase locked loop.
PSA Programmable system architecture.
QOS Quality of service.
RAM Random access memory.
RC Resistor capacitor.
RIS Random instruction sequence.
RISC Reduced instruction set computer.
ROM Read-only memory.
RTL Register transfer logic.
RTOS Real-time operating system.
SDRAM Synchronous dynamic random access memory.
Sec Second (time).
SIMD Single instruction multiple data.
SOC System-on-chip.
SRAM Static random access memory.
TI Texas Instruments, Inc.
TDMA Time-division multiple access.
UMS Universal Microsystems.
VLIW Very long instruction word.
VLSI Very large scale integration.
VoIP Voice over IP.
VHDL Very High-Speed Integrated Circuits Hardware Description
Language.
WAN Wide area network.
2D, 3D 2-dimensional, 3-dimensional.
ABSTRACT
Embedded processing and reconfigurable computing form a new paradigm based
on dynamically adapting the hardware to reconfigure the computation and
communication structures on the chip. Embedded circuits and systems have evolved
from application-specific accelerators to a general-purpose computing paradigm.
Researchers and industry have developed various embedded devices. These
devices promise a high degree of flexibility and superior performance. But the
architectural techniques and design models are also heavily based on the hardware
paradigm from which they have evolved.
This thesis addresses the fundamental concepts used in performing embedded
computing. It also addresses the challenges in achieving high performance and
energy efficiency using embedded/reconfigurable architectures. A formal
classification methodology and the performance characteristics for these target
architectures, based on the dynamic knobs, are discussed in this thesis.
CHAPTER 1
INTRODUCTION
The term embedded processor covers a wide range of processors and their
applications. The term embedded implies that the user is not aware of the presence
of the processor. It is a processor that is embedded into a general system without
being really visible. That is, the user does not directly interact with, or program, the
processor, but, rather, the processor provides a convenient method of implementing
the control function. Microwave controllers, electronic games, and computer
peripherals are typical examples of systems using embedded processors. With the
falling cost and the increasing power of microprocessor systems, the range and
application capabilities of embedded processors are expanding. With advancements
in process technology, embedded control has expanded considerably. As many
applications needed, and could afford, more processing power, CPUs with wider word widths came
into use for embedded applications.
With the growing size and complexity of applications, the need for high-
level language support has increased. However, for speed-critical portions of an
application, the need for strong assembly language support for embedded control
systems still exists. Thus, a processor that can be programmed easily at both levels
is beneficial.
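A minimal sketch of this two-level style, assuming nothing about any particular processor: control flow stays in C, while the speed-critical kernel is isolated behind one small function so a port could replace it with hand-tuned assembly. The filter, tap values, and function names are illustrative, not from the thesis.

```c
/* Sketch of two-level embedded programming: C for control, with the
 * speed-critical portion isolated so assembly can replace it later.
 * All names and values here are illustrative assumptions. */
#include <stdint.h>
#include <stddef.h>

/* Speed-critical portion: a multiply-accumulate (MAC) kernel of the kind
 * DSP-style embedded processors provide as a single instruction. This C
 * version is the portable fallback; a port would swap in assembly here. */
static int32_t mac_kernel(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Non-critical control code stays in high-level C: apply a 4-tap filter
 * to a sample window and report whether a threshold is crossed. */
int filter_exceeds(const int16_t *window, int32_t threshold)
{
    static const int16_t taps[4] = { 1, 2, 2, 1 };
    return mac_kernel(window, taps, 4) > threshold;
}
```

The key design choice is the narrow kernel interface: swapping in an assembly version changes one function, not the surrounding control code.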
An embedded computing system uses microprocessors to implement parts of
the functionality of non-general-purpose computers. Early microprocessor designs
emphasized more I/O, but modern embedded processors are capable of a great deal
of computation in addition to the I/O tasks. Microprocessors that were once prized
centerpieces of desktop computers are now being used in automobiles, televisions,
and telephones. This huge increase in computational power can only be harnessed
by applying structured design methodologies to the design of embedded computing
systems. To understand and take advantage of this capability, a thorough knowledge
of the fundamental concepts in the analysis and design of embedded systems
is needed.
As with VLSI system design, embedded system designers must have
knowledge of the complete design process so that they can make global
rather than local decisions. The biggest difference between VLSI design and
embedded system design is the level of complexity of components.
History, Importance, and Basics
Embedded microprocessors grew out of a need in the late 1960s for more
advanced calculators. As the variety of applications grew, the need for
programmability rather than fixed operations was realized. In the late 1970s and
early 1980s, designers started studying the processing needs of more advanced
applications (such as DSP and graphics). The comprehensive nature of these
applications led designers to the conclusion that even application-specific
processors should be generally programmable. So a general-purpose instruction set
on an application-specific device was employed to make the device compatible
with new applications.
A Brief Insight into Embedded Systems Computing
Microprocessors are everywhere; an average person comes across a number
of microprocessors every day, and this number grows rapidly with time. Though
consumers have been using microprocessor-enhanced equipment for the last
25 years, the role of embedded microprocessors has fundamentally changed as
advancements in VLSI have allowed us to produce high-performance
microprocessors cheaply. Early microprocessors were used to control analog
subsystems, but today most of the system’s work is done in the digital domain.
Early microprocessors placed more emphasis on I/O, and the external analog
devices did most of the work. The job of the microprocessors was to sequence and
control the operation of these external devices to provide a higher level of overall
functionality.
The earlier microprocessors worked on small words, had limited address
space, and ran at relatively low clock rates. These architectural limitations made it
difficult to compile efficient code, so most programs were written in assembly
language. All these limitations conspired to require considerable handcrafting of
microprocessor-based systems. Now, however, the scene has changed. There
are probably more embedded processor architectures today than ever, but this only
helps us to see the commonality in approaches. Higher levels of integration provide
larger word widths, larger address spaces, and larger memories. Higher clock rates
allow complex programs to run faster. As VLSI implementations
and compiler technology have improved, high-level language programming
has become popular. All these changes have transformed embedded computing into
a systems discipline in which components can be characterized and assembled to
meet design goals.
High-performance RISC CPUs and signal processors (DSP processors) are
now commonly used in embedded systems. An application’s computation is split
across multiple CPUs even if a single CPU can perform it, for one of two reasons.
The application may have physically distributed I/O, in which case either the
expense of wiring or the delay of transmitting values back and forth demands local
processing of data; or it may be cheaper to use multiple CPUs, each of which runs
one or a few non-conflicting jobs.
Components and Market Potential
Components of Embedded Processors
Embedded processors are made up of the following three main components:
1. Microprocessor
2. RAM
3. Non-volatile storage
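As a hedged illustration of how these three components appear to software, the sketch below models a memory map with a non-volatile ROM/flash region holding the program and a RAM region holding data. The base addresses and sizes are invented for illustration; real parts document their own maps.

```c
/* Sketch: the three components above modeled as a minimal memory map.
 * Addresses and sizes are hypothetical, chosen only for illustration. */
#include <stdint.h>

typedef struct {
    uint32_t base;
    uint32_t size;
    int      writable;   /* RAM yes; ROM/flash no */
} mem_region;

/* Hypothetical layout: 64 KB of non-volatile storage for the program
 * image, 16 KB of RAM for data and stack. */
static const mem_region rom = { 0x00000000u, 64u * 1024u, 0 };
static const mem_region ram = { 0x20000000u, 16u * 1024u, 1 };

/* Decode an address against a region, as bus logic would. */
static int in_region(const mem_region *r, uint32_t addr)
{
    return addr >= r->base && addr - r->base < r->size;
}
```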
Figure 1 is an illustration of a typical embedded/reconfigurable processor.
[Figure omitted: block diagram showing a CPU and on-chip memory connected over a bus and interconnection network to configurable logic units, each with a configuration cache.]
Figure 1. Typical embedded/reconfigurable processor [47]
Trends and Market Potential of
Embedded Processors
The embedded systems market [43], [48], which comprises applications for
embedded microprocessors, includes the high-volume consumer applications that
drive the electronics industry. Analysis of the embedded systems market [43], which
is the market for all but 2 percent of microprocessors, will tell us where the
electronics industry is headed. The embedded systems market consists of four
overlapping segments, defined by their design requirements. Figure 2 illustrates the
dominant-characteristic taxonomy of the embedded systems market.
The zero-cost segment [48], which to a first approximation represents almost
all of the embedded systems market [43], is the segment for which low cost is the
overriding consideration. Most microprocessors go into consumer appliances
(microwave ovens, electric razors, blenders, toasters, and washing machines) that
[Figure omitted: overlapping segments labeled zero cost, zero power, zero delay, and zero volume, with the leading-edge wedge at their intersection.]
Figure 2. Embedded systems market segment [43]
generally have minimal processing needs. These are commodity markets, which
means they sell in high volumes (millions of units to tens of millions of units).
These markets are characterized by intense price competition, so substantial effort
goes into reducing production cost. The ideal would be zero cost to implement.
The zero-power segment [43], which to a first approximation represents a
few percent of the embedded systems market, is the segment for which zero power
dissipation [43] represents the ideal. These applications are consumer items,
such as smoke detectors, cellular phones, pagers, pacemakers, hearing aids, MP3
players, and pocket calculators. Consumers want them to run forever on a single
button-size battery or on weak ambient light. As with all consumer applications,
minimum product cost remains a concern.
The zero-delay segment [43], which to a first approximation represents a
little more than zero percent of the embedded systems market, is the segment for
which zero delay from data-in to result-out represents the ideal. These applications
are also consumer items, such as high-end printers, scanners, copiers, and fax
machines, for which processing power and throughput are important—at minimum
product cost, of course.
The zero-volume segment, which to more than a first approximation
represents zero percent of the embedded systems market, is the segment for which
the application potential is nearly zero. If the application volume is going to be very
close to zero, then production volumes and profits will also be close to zero. There
must be some other reason to attempt to capture the application. The embedded
systems market thus has four segments: zero-cost, zero-power, zero-delay, and
zero-volume [43]. Most applications fall within the zero-cost segment, which is by far
the largest segment. Virtually all consumer applications fall within the zero-cost
segment. Because consumer markets are competitive, cost is always a concern for
these products. The zero-power and zero-delay segments overlap substantially with
the zero-cost segment. The zero-volume segment overlaps with the zero-delay
segment but is completely disjoint from the zero-cost segment.
Embedded System Design Methodologies
Before the introduction of the computer, the
engineer was responsible for selecting the hardware resources and for mapping the
algorithm into the hardware to solve the problem. After the introduction of the
computer, the engineer programmed algorithms onto the computer’s fixed resources
to solve problems. Problem solving became programming [43]. Reverse
engineering a microprocessor-based circuit might not tell you much about what the
circuit does [43]. (You would have to analyze the programs together with the
hardware and not just the hardware.)
Dynamic Logic
The taxonomy of embedded applications is important because it says where
the electronics industry is headed. Continued improvement in semiconductor
fabrication, continued proliferation of cellular telephones, and the growing
popularity of handheld devices (digital cameras, GPS receivers, PDAs, etc.) drive
more computing into portable devices. Because they are consumer devices, they fall
into the zero-cost segment. Because they have high computing requirements, they
fall into the zero-delay segment [43]. Because they are portable devices, they fall
into the zero-power segment. We all want cheap, highly capable devices that give us
instant answers and that work on weak ambient light. The overlap of the zero-cost,
zero-delay, and zero-power segments is the leading-edge wedge [43].
As figure 2 shows, the leading-edge wedge is a tiny percentage [43] of the
embedded systems market, but it is growing rapidly and it can have better margins
than most embedded applications. The leading-edge wedge is growing rapidly as the
popularity of mobile consumer devices rises. Applications in the leading-edge
wedge will drive the direction of the electronics industry, and they will take the
designers with them. The leading-edge wedge [43] is the future for mobile
embedded-systems applications. Tomorrow’s engineers will have to design
applications to meet the requirements of this wedge [43].
An embedded-system application can be implemented more directly using
programmable logic than using a microprocessor or DSP. The algorithm
representing the application can be compiled directly into gates implementing the
design. If the application must implement an assortment of algorithms, as in the
multiprotocol cellular phone for example, the implementation of a particular
protocol can be “paged” into the programmable logic on demand. Paging functions
into programmable logic can be efficient. Idle functions are not paged into the
programmable logic and do not use power [43].
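The paging idea can be sketched in software, under the simplifying assumptions of a single reconfigurable slot and a unit-power model; the protocol names are only examples, not from the thesis:

```c
/* Sketch: demand-paging of protocol logic, modeled in software. One
 * "slot" of programmable logic holds whichever protocol is in use;
 * idle protocols consume no (modeled) power. The single-slot model and
 * protocol names are illustrative assumptions. */
enum protocol { PROTO_NONE, PROTO_GSM, PROTO_CDMA, PROTO_TDMA };

static enum protocol loaded = PROTO_NONE;  /* what the fabric holds now */
static int reconfig_count = 0;             /* each page-in costs time/energy */

/* Ensure the requested protocol is resident, paging it in if needed. */
void use_protocol(enum protocol p)
{
    if (loaded != p) {
        loaded = p;          /* in hardware: load a configuration bitstream */
        reconfig_count++;
    }
}

/* Modeled power: only the resident function draws (unit) power. */
int active_power_units(void)
{
    return loaded == PROTO_NONE ? 0 : 1;
}
```

In real hardware the page-in step would write a partial configuration bitstream, and each reconfiguration costs time and energy, which is why the count is tracked.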
Figure 3 [43] shows how much more directly programmable logic
implements an application. The compiler translates the algorithm into resources that
execute the application directly. Processing flexibility is one advantage that
programmable logic implementations have over microprocessor- or DSP-based
[Figure omitted: an algorithm is translated by a compiler into a processor configuration that turns application data into results.]
Figure 3. A programmable logic function [43]
implementations. Programmable logic resources can be configured to implement
parallel functions or a function of any (reasonable) number of bits. Unlike the
microprocessor or DSP, which implement fixed resources and rely on dynamic
algorithms, dynamic logic implements dynamic algorithms and dynamic resources.
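What "a function of any (reasonable) number of bits" means can be illustrated by modeling a W-bit datapath on a fixed-width CPU: the CPU must mask results down to W bits in software, whereas programmable logic would simply instantiate exactly W bits of adder. The function below is an illustrative model, not hardware code.

```c
/* Sketch: a W-bit adder modeled on a fixed 32-bit CPU. The result wraps
 * at W bits, exactly as a W-bit adder built in programmable logic would.
 * The masking model is an illustrative assumption. */
#include <stdint.h>

uint32_t add_wbits(uint32_t a, uint32_t b, unsigned w)
{
    /* Build a mask of w ones (1 <= w <= 32). */
    uint32_t mask = (w >= 32) ? 0xFFFFFFFFu : ((1u << w) - 1u);
    return (a + b) & mask;
}
```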
Adaptive Computing Machine
An adaptive computing machine is basically a next-generation technology
using dynamic logic. Logic for each of the device protocols and functions can be
“mapped” into the chip’s programmable logic, eliminating the need for a digital
signal processor, ASICs, and possibly even the usual microprocessor. Functions that
are not “mapped” into the chip’s gates do not use power. Efficiency improves
because the implementation is more direct for each function than it is in a DSP-based
implementation. The DSP-based implementation runs a variety of functions on a
fixed set of resources, giving up efficiency for the sake of simplifying the
programming and the hardware resources. The dynamic logic solution gives up efficiency in
“mapping” functions into the programmable logic.
The hope is that mapping the logic into the chip will cost less power
than having logic that is always resident but mostly idle. The ACM vendor will
circumvent the inefficiencies of the commercial prototyping market by designing its
own devices to suit just the anticipated range of applications. The ACM [43] will still be a PLD, but
it will be designed for rapid partial reconfiguration to accommodate “mapping” of its
functions. The ACM [43] will also allow background reconfiguration while the
device is operating. It may even cache high-use logic functions for more efficient
mapping (though this would cost valuable power). In addition, since it is not being
designed for general-purpose prototyping, it can be designed with much less overhead than a commodity PLD, which typically has about 20 transistors of overhead
for each transistor in the implemented function. In addition, the peripheral circuitry
can be designed to suit a single system rather than being the general-purpose,
universally configurable I/O pad ring required by PLDs for prototyping applications.
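The map-in/overwrite behavior described above can be sketched as a small model. This is a hypothetical illustration, not an ACM implementation: the class name, gate counts, eviction policy, and per-gate power figure are all invented; the only property it demonstrates is that functions not currently “mapped” into the fabric draw no power.

```python
# Hypothetical sketch: an ACM-style fabric that only powers the functions
# currently "mapped" into its programmable logic. All names and numbers
# here are illustrative, not taken from any real device.

class AdaptiveFabric:
    def __init__(self, capacity_gates):
        self.capacity = capacity_gates
        self.mapped = {}  # function name -> gate count (insertion-ordered)

    def map_function(self, name, gates):
        # Evict the oldest mapped functions until the new one fits,
        # mirroring how a paged-in function overwrites the previous one.
        while sum(self.mapped.values()) + gates > self.capacity:
            oldest = next(iter(self.mapped))
            del self.mapped[oldest]
        self.mapped[name] = gates

    def active_power(self, mw_per_gate=0.01):
        # Unmapped functions dissipate nothing; only resident logic draws power.
        return sum(self.mapped.values()) * mw_per_gate

fabric = AdaptiveFabric(capacity_gates=10_000)
fabric.map_function("gsm_codec", 6_000)
fabric.map_function("mp3_decode", 3_000)
power_two = fabric.active_power()           # 9,000 gates resident
fabric.map_function("email_filter", 5_000)  # evicts gsm_codec to fit
resident = set(fabric.mapped)
```

A high-use function could be kept resident (cached) by exempting it from eviction, at the power cost the text notes.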
One application is a mobile device that could be a multimode, multiprotocol
cellular phone. This phone might allow roaming among protocols, such as AMPS,
CDMA, TDMA, GSM, and even among frequency bands. Since it is “adaptable,” it
could be updated to the third-generation standards even as they evolve. In addition
to the cellular phone functions, the device could also accommodate a variety of other
functions, such as calendar, calculator, e-mail, GPS, and MP3. It might ship with a
standard set of functions that could be “paged” from ROM and a bank of flash
memory to accommodate changes and installation of new functions. These changes
and new functions might be loaded over the air interface, keeping the device functional and current in the field much longer than devices based on ASICs and ROM
programs.
In leading-edge applications, the embedded system must meet the
conflicting demands of compute-intensive algorithms and of long battery life. The
processing requirements of these applications can be demanding across a range of
subtasks. The cellular telephone, for example, must do call setup, call teardown,
encoding, decoding, and a variety of protocol processing subtasks. These applications typically require several application-specific integrated circuits (ASICs), a
digital signal processor (DSP), and a microprocessor. The world has not converged
on a single cellular standard and is not likely to any time soon, so there is demand for
multiprotocol cellular phones. A separate ASIC typically supports each protocol.
The increasing popularity of e-mail and the demand for wireless connection to the
Internet are driving demand for these functions into the cellular phone as well. Each
of these added functions adds processing complexity and makes extending battery
life more difficult. The programmable logic device (PLD) may offer a way out of
this difficult situation. The ACM [43] uses dynamic logic; that is, ACMs implement
dynamic algorithms and dynamic resources.
Resources on the chip can be allocated to the limit of availability for parallel
computation, since the resources are not dedicated to particular functions as they
would be in a microprocessor, DSP, or an application-specific integrated circuit
(ASIC). A large fraction of the fixed resources in a microprocessor or DSP may be
idle at any particular time. DSPs generally work on data in multiples of a byte.
Dynamic logic implementations can work on any data width (the width can even
vary with time to suit the needs of the problem).
As semiconductor processes improve, DSPs and microprocessors are built
with ever-more fixed resources and at ever-higher clock speeds, so they are capable
of tackling ever more complicated functions. But, while adding resources and
increasing the clock rate improve computational capability, they do not improve the
computational efficiency of these processors. Each of the functions “paged” into a
dynamic logic implementation makes efficient use of the resources it needs and is
then overwritten as the next function is “paged” into the chip [43]. Computational
efficiency is high, so power dissipation is low. By using a dynamic logic implementation, we can lower power dissipation to between one half and one tenth of that of a comparable DSP-based implementation while at the same time improving performance
by a factor of 10 to 100.
Desktop versus Embedded Processors
We tend to distinguish between microcontrollers and microprocessors, though
there is no sharp difference [49]. Similarly, we associate the microcontroller with
the embedded domain and the microprocessor with the desktop arena [49]. A more
refined approach is to define microcontrollers as having RAM and ROM instead of
cache, plus many peripherals. In contrast, we define the microprocessor as having a
memory management unit and a large cache. We also classify them based on their
performance. Until recently, it was sufficient to partition the microprocessor
world into desktop and embedded domains; this traditional view is illustrated by
figure 4. But these distinctions are no longer sufficient. Some embedded platforms arose from architectures designed primarily for the desktop environment, so the difference cannot
lie merely in register organization, pipelining, or the basic instruction set. Instead, issues
such as power consumption, cost, and integrated peripherals differentiate desktop
from embedded processors. Other important features are interrupt response
time, the amount of on-chip RAM or ROM, and the number of parallel ports. Most
importantly, while the desktop world values raw processing power [1], the
embedded world focuses on completing a task for a particular application at the
lowest possible cost. Economically, meeting the needs of these new applications is
the main driving force behind the new classes of embedded processors, devices that
narrow the gap between the embedded and desktop worlds [49].
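The attribute-based distinction above can be expressed as a small heuristic. This is an illustrative sketch only: the function, field names, and thresholds are invented here, and real parts blur these categories exactly as the text argues.

```python
# Illustrative sketch of the classification heuristic described above:
# microcontrollers carry on-chip RAM/ROM and many peripherals, while
# desktop-class microprocessors carry an MMU and large caches.
# Field names and thresholds are made up for this example.

def classify(part):
    if part.get("mmu") and part.get("cache_kb", 0) >= 16:
        return "desktop-class microprocessor"
    if part.get("on_chip_rom") and part.get("peripheral_count", 0) >= 4:
        return "microcontroller"
    return "embedded processor (in between)"

mcu = {"on_chip_rom": True, "peripheral_count": 8, "cache_kb": 0}
cpu = {"mmu": True, "cache_kb": 256}
label_mcu = classify(mcu)
label_cpu = classify(cpu)
```

The fall-through case captures the point of this section: many modern embedded processors fit neither bucket cleanly.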
Applications like handheld, palmtop, or network PCs, video game consoles,
and car-information systems require a display, a powerful processor, storage media,
and interfaces to communicate with the outside world. For these reasons, embedded
processors have in some ways begun to incorporate capabilities traditionally associated with conventional CPUs, but with a twist: they are subject to challenging cost,
power consumption, and application-imposed constraints.
Figure 4. Desktop processor versus embedded processor [49]. The desktop panel shows a processor with a separate controller, DSP, program memory, analog I/O, control panel, and display functions; the embedded panel shows a single chip integrating a CPU with register bank and ALU and a DSP with MAC.
Target Applications
Target applications include the following:
1. Video game consoles, which require excellent graphics performance
2. Handheld, palmtop, automobile, and network PCs, which require virtual memory
management and standard peripherals
3. Cellular phones and mobile personal communicators, which require ultra-low
power consumption while offering high performance and digital signal processing (DSP) capabilities
4. Modems, fax machines, and printers, which require low-cost components
5. Set-top boxes and digital versatile disk (DVD) equipment, which require a high
level of integration
6. Digital cameras, which require both general-purpose and image-processing
capabilities
Manufacturers sell most current 32-bit embedded processors to the consumer-
application market, followed by the communication and office-automation markets.
The requirements of these markets force embedded microprocessor designers to
reduce manufacturing costs while simultaneously increasing the level of integration
and performance.
Performance Benchmarks and Design Flow for
Embedded Processors
The benchmark and design flow criteria for embedded processors include:
1. Power consumption: for mobile applications, the benchmark is the MIPS/watt
ratio [43].
2. Code density [43]: the goal here is to avoid the complexity of CISC and the poor
code density of 32-bit, fixed-length RISC architectures.
3. Peripheral integration [43] and chipsets: serial communication interfaces, for
example, are mandatory for embedded processors and reduce overall cost
for dedicated applications.
4. Multimedia acceleration and acceleration of special application software: this is
done via enhanced instruction-set functionality.
5. Price/performance ratio: after all, cost—measured in MIPS/dollar—is the main
issue.
These are the evaluation parameters for embedded processors that define the milestones against which microprocessor development must be measured.
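The two quantitative figures of merit above, MIPS/watt and MIPS/dollar, can be computed directly. A minimal sketch follows; the candidate processors and their numbers are invented for illustration, not vendor data.

```python
# A minimal sketch of the two benchmark ratios named above.
# The candidate parts and all figures are invented for illustration.

def mips_per_watt(mips, watts):
    return mips / watts

def mips_per_dollar(mips, dollars):
    return mips / dollars

candidates = {
    "chip_a": {"mips": 300, "watts": 0.5, "dollars": 10.0},
    "chip_b": {"mips": 600, "watts": 2.0, "dollars": 15.0},
}

# For a mobile design, rank candidates by the MIPS/watt benchmark.
best_mobile = max(candidates, key=lambda c: mips_per_watt(
    candidates[c]["mips"], candidates[c]["watts"]))
```

Note that the ranking inverts for a cost-driven design: chip_b wins on MIPS/dollar while chip_a wins on MIPS/watt, which is exactly why the two metrics are listed separately.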
Functional Design
Figure 5 shows a simplified flowchart of the design methodology [43].
Product development begins with a specification based on customer and designer
input. Once the design is drafted and the specification approved, various applications are implemented. In addition, an efficient design phase increases the quality
measurement capability and reduces test cost.
Figure 5. Design flow [43]. The flow proceeds from functional design/verification (model entry, model checking, model release) through logic design/verification (synthesis, timing analysis, simulation, formal verification) to physical design and verification (place and route, physical rules check, scan insertion); test design and verification interacts with all three stages.
Model Entry
A synthesizable subset of the Verilog-XL™ design language is used to
describe the designs [43]. Third-party tools are compatible with this subset because
most of them support the same common language subset we use. An internal cycle-based simulator, optimized for this subset, is used during functional verification
[43].
The microprocessor architecture is fully synthesizable. As a result,
designers can retarget the design to new process libraries when they become available. Retargeting requires no special standard cells. All designers use the same set
of building blocks (soft macros). Thus, design readability increases because the
model looks as if one person had designed it, and designers can concisely share
information [43].
Model Release
As we enter a skeleton framework of our design, we begin generating what
we call model releases [50]. A model release is a list of design files under revision
control that represents a product under development. The purpose of releasing a
model is to verify full microprocessor integration and to catch cache-design problems early
in the process. The full integration and release of a microprocessor involves pulling
together all of the submodules and performing model syntax checks and a simulation
regression suite. This ensures that we release a syntactically and functionally correct
design to the remaining design steps.
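The release gate just described — pull the submodules together, run checks and regression, and only then release — can be sketched as follows. This is a hypothetical illustration; the function, module names, and manifest format are invented, and the two lambdas stand in for real tool runs.

```python
# Hedged sketch of the release gate described above: a model release only
# succeeds if every submodule passes syntax checks and the regression
# suite. Module names and the manifest format are invented here.

def release_model(submodules, syntax_ok, regression_ok):
    """Return a release manifest, or None if any gate fails.
    'syntax_ok' and 'regression_ok' stand in for real tool invocations."""
    for module in submodules:
        if not syntax_ok(module) or not regression_ok(module):
            return None  # block the release; fix and retry
    return {"files": sorted(submodules), "status": "released"}

mods = ["ifu", "alu", "lsu"]
good = release_model(mods, syntax_ok=lambda m: True, regression_ok=lambda m: True)
bad = release_model(mods, syntax_ok=lambda m: m != "lsu", regression_ok=lambda m: True)
```

The point of gating here is the one the text makes: only a syntactically and functionally correct design flows to the remaining design steps.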
Model Checking
In the model-checking process [43], we perform Verilog-XL™ syntax
checks, synthesis checks, and other design rule checks prior to model release. This
helps catch errors at the front end of the design flow. Potentially, these errors could
escape detection until much later. For example, model checking may find a
Verilog-XL™ construct that is acceptable for simulation but cannot be synthesized.
Without model checking, this would not be found until synthesis.
Model checks include the following:
1. Verilog-XL™ syntax
2. Synopsys© syntax
3. HDL style for simulation
4. HDL style for synthesis: unsupported constructs, hierarchy consistency,
unconnected inputs
5. Synthesis constraint file validation: a constraint for all ports; a match in the
Verilog-XL™ HDL file for all constraint file ports
6. Model file content against design rules: bus contention during scan shifting,
mutual exclusion, and multiple drivers [43]
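The front-end catch described above — flagging a construct that simulates fine but cannot be synthesized — can be illustrated with a toy lint pass. This is not any real tool; the construct list is a tiny, made-up subset and the checker is deliberately simplistic.

```python
# Illustrative front-end check: flag Verilog constructs that simulate
# fine but are rejected by synthesis, before the model is released.
# The construct list is a tiny, made-up subset for demonstration.

UNSYNTHESIZABLE = ("#10", "initial", "$display")

def model_check(source_lines):
    findings = []
    for lineno, line in enumerate(source_lines, start=1):
        for construct in UNSYNTHESIZABLE:
            if construct in line:
                findings.append((lineno, construct))
    return findings

rtl = [
    "always @(posedge clk) q <= d;",
    "initial q = 0;  // fine in simulation, flagged by this check",
]
issues = model_check(rtl)
```

Catching such a construct here, rather than during synthesis, is the cost saving the section describes.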
Functional Verification
Before releasing a ColdFire™ model, we run a limited set of regression tests
including benchmarks, diagnostics, and random instruction sequences (RISs) to
functionally verify the model [43]. After the model passes the regression suite, we
run RIS tests on the released model 24 hours a day on the entire workstation network
via a batch system primarily running the simulator. We use a smaller set of workstations to run Verilog-XL™ simulations.
The batch RIS tests expose the model to additional random situations,
increasing overall confidence that the design is correct. We generate the expected
results from a “golden model,” which can be either a prior microprocessor in the
product line or an independently written software model [43].
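The golden-model comparison above can be sketched with a toy two-register machine: run the same instruction sequence on both models and compare architectural state. Everything here is invented for illustration — the instruction set, the deliberately injected bug, and the crafted (rather than random) program.

```python
# Sketch of golden-model checking: run the same instruction sequence on
# the design model and a trusted "golden model" and compare final state.
# The machine, program, and injected bug are all invented for this example.

def run(model, program, state):
    for op, reg, val in program:
        state = model(op, reg, val, state)
    return state

def golden(op, reg, val, state):
    new = dict(state)
    new[reg] = new[reg] + val if op == "add" else val  # "mov" overwrites
    return new

def buggy(op, reg, val, state):
    new = dict(state)
    new[reg] = new[reg] + val  # injected bug: treats every op as "add"
    return new

prog = [("add", "r0", 5), ("mov", "r0", 3), ("add", "r1", 2)]
g = run(golden, prog, {"r0": 0, "r1": 0})
b = run(buggy, prog, {"r0": 0, "r1": 0})
mismatch = g != b  # the comparison flags the divergence
```

In the flow described above the programs would be randomly generated (RIS) rather than hand-crafted, which is what gives the batch runs their coverage.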
Simulation
We use a complete system simulation environment to simulate the design [43]. The
system model consists of the microprocessor, external bus, peripherals, external
memory, and an external development system model to stimulate the on-chip
debugging module. We load programs consisting of the real code into the memory
and execute them. We can use this environment to create most application situations
completely. In addition to checking functionality, we also simulate the modeled test
logic to ensure that it operates as expected. We include most of the test logic in the
model to ensure that it does not interfere with the chip’s normal functional mode and
that it meets chip-timing goals. The purpose of rapid prototyping is to make best-case design area estimates, to target hardware emulation, and to perform rapid DFT analysis for early
structural test vector generation. Rapid prototyping generates a logic netlist from a
Verilog-XL™ model very quickly. The synthesis is not constrained for timing, so
the final netlist may be twice as large as a timing-constrained synthesis run, but this
is not a problem. We use the netlist to perform rapid DFT analysis and to flush out
back-end problems during early chip building. The rapid-prototyping script we use
reads in the Verilog-XL™ model using Synopsys©. We compile each module in the
design hierarchy if it contains logic. Wire-only modules need no compilation. The
fastest way to get a netlist is to compile the Verilog-XL™ with a low compile-effort
setting, no design-rule fixing, and absolutely no synthesis timing constraints.
Essentially, rapid prototyping maps the Verilog-XL™ model to a target technology
with all synthesis algorithms turned off. Rapid DFT analysis verifies all the test
architecture and DFT assumptions by modeling most of the test structures in parallel
with the functional design and operating these structures in the simulator. The only
test structures not modeled are the scan chain connections because they require a
specific physical placement. (We optimize the scan chains for nearest-neighbor
routing to reduce scan route or metal overhead and to ensure at-speed operation.)
The rapid-prototype gate-level netlist goes through design-rule checking by static
and dynamic DRC tools. The scan-vector-generation tool is the golden fault
standard. Therefore, we pass the boundary-scan and memory-testing vectors through
the Mentor Flextest© tool for fault measurement against the golden fault database.
We identify architectural problems that reduce fault coverage and add proposed fixes
or workarounds to the next model update [43].
Logic Synthesis and Verification
The synthesis phase of design is nothing new to designers. We try to
write technology-independent code in our models so that we can target many process
technologies. Logic synthesis tools translate the design description into logic using a
standard cell library [43].
Logic Verification: Timing Analysis
Static and dynamic timing analyses provide feedback as to whether the
design is meeting the specifications (clock cycle time and I/O specifications). Static
timing analysis is faster and easier to manage than dynamic analysis, and it is invaluable to
the design flow [43].
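The core of static timing analysis — sum delays along each register-to-register path and compare the worst one against the clock period — can be sketched in a few lines. The paths, cell names, and delay values below are invented for illustration.

```python
# Minimal static-timing sketch: sum gate delays along each path and
# compare the worst path against the clock period. Cell names and
# delays are invented; real STA also models setup/hold and derating.

def worst_path_delay(paths):
    # Each path is a list of (cell, delay_ns) pairs from launch to capture.
    return max(sum(d for _, d in path) for path in paths)

paths = [
    [("ff_q", 0.3), ("adder", 1.8), ("mux", 0.4), ("ff_setup", 0.2)],
    [("ff_q", 0.3), ("and2", 0.5), ("ff_setup", 0.2)],
]
critical = worst_path_delay(paths)   # the critical path, in ns
meets_timing = critical <= 3.0       # against a 3.0-ns clock period
```

Because this check needs no input vectors, it covers every path, which is why the text calls static analysis easier to manage than dynamic simulation.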
Gate-Level Simulation
Because formal verification does not cover dynamic conditions such as bus
contention, the logic verification process includes some gate-level simulation. We
run a diagnostics subset on the final gate-level netlist to verify functionality and
perform dynamic timing checks. This simulation is slow without the aid of a hardware accelerator [43].
CHAPTER 2
OVERVIEW OF SOC ARCHITECTURES
Introduction
SOC technology has opened up new possibilities for advanced ICs. Using
SOC technology it is possible to implement a system or subsystem using a wide
range of implementation architectures. Designers can select from available cores
and macrocells, custom logic options, programmable processors, memory blocks,
real-time operating systems, on-chip buses, and drivers, etc. [21].
Traditional relationships between components can be changed once those
components are combined on chip. For instance, the traditional bottleneck between
memory and logic changes dramatically when they are placed on the same die. With
all these possibilities, however, comes a high price: today’s methodologies for
hardware and software development are no longer sufficient or even appropriate for
SOC design. One critical aspect of today’s SOC systems is that the application [5]
being developed is distinct from the architecture on which it is being implemented.
In order to handle this distinction, a modeling capability for developing applications
needs to be developed. The modeling capability needs to provide an effective application development methodology and enable a new class of mapping tools [5] that
can be used to map the application onto one or more target architectures. Today’s
system-level modeling approaches (dataflow, state charts, synchronous languages)
are effective only in limited domains and cannot be used to effectively express a
complete application for a SOC IC. Improv Systems, Inc. has developed a new
semantic model that is ideally suited for describing applications that will be implemented on complex SOC ICs. It was developed out of a need to provide an intuitive
application environment for Programmable System Architecture (PSA™) ICs. PSAs
are reconfigurable, multiprocessing platforms that are implemented on advanced IC
technology. A PSA IC contains multiple VLIW processing engines with a unique
communication structure between them. The aim was an application
description that would be intuitive for application developers while explicitly
exposing concurrent application tasks and the data communication and control
dependencies between them.
Reconfigurable Logic for SOC Architectures
The electronic systems of the future will be implemented as billion-gate
SOCs, which will require huge design investments. But the pace of technological
change and ever-changing requirements put them in danger of becoming obsolete as
soon as they come into existence, since applications always want to exploit the
newest technology and must meet changed requirements. Hence the need
arises for single-chip systems that are adaptable to a family of applications. The
emerging technology of configurable logic offers the promise of large-scale silicon
systems that are adaptive after manufacture, with little or no compromise in
execution efficiency compared to hard-wired systems. But the application of
configurable logic to SOC requires radical rethinking, with new architectures and design tools to address the issues of complexity and scalability. We
should build adaptive SOCs with the right amount of configurable logic in the right
places.
In order to manage the high complexity, we must predesign using highly
optimized, well-understood logic blocks. Though configurable SOCs have a
higher degree of flexibility and better efficiency than conventional logic, they
occupy more area due to the overhead of routing paths and large program structures.
By mixing fixed conventional logic blocks with the programmable logic, we can
attain better area efficiency. Creating the best balance of this fixed and configurable
logic is a very challenging design problem that can be addressed only through
radically new architectures and design environments to support them. These
environments must support exploration and trade-off of design dimensions, such as
hardware/software partitioning, bus architecture and partitioning, and circuit size/
power versus speed. Such a methodology will enable commonality of chips for
different applications within a given domain, allow post-production debugging, and
support dynamic reconfigurations for different modes of operations.
CHAPTER 3
CLASSIFICATION OF EMBEDDED PROCESSORS
Selection Criteria
The embedded processor:
1. Is meant for a power-constrained environment
2. Is preferably used in handheld devices
3. Has embedded memory on chip
4. Has at least a few architecture parameters that can be exploited for optimization
5. Is an unconventional/embedded processor
Architectural Details Available
This section discusses the structural detail of each architecture, a list of architecture parameters and how they can be exploited for optimization, a list of applications
or application domains that use the architecture, and its possible future potential.
Examples of Popular Architectures
The architectures consist of:
1. Analog Devices, Inc., Blackfin™ DSP
2. Texas Instruments, Inc. OMAP™, Berkeley (Pleiades)
3. Xilinx, Inc. platform FPGA
4. Cradle Technology, Inc. (UMS)
5. Improv Systems, Inc.
Analog Devices, Inc., Blackfin™ DSP
The salient architectural features of the Blackfin™ DSP [37] are:
1. Dynamic power management
2. Highly parallel computational blocks
3. High performance digital signal processing
4. Superior code density
5. Video instructions
6. Hierarchical memory
The current generation of the Blackfin™ DSP [37] architecture features a dual-
MAC, 300-MHz (600 million MACs per second) core operating at 1.5 V to provide
high-performance processing. Figure 6 illustrates a diagram of the Blackfin™ DSP
architecture.
The components consist of:
1. Two 16-bit MACs, two 40-bit ALUs, and four 8-bit video ALUs
2. Support for 8/16/32-bit integer and 16/32-bit fractional data types
3. Concurrent fetch of one instruction and two unique data elements
4. Loop counters that allow for nested zero-overhead looping
5. Arbitrary bit and bit field manipulation, insertion, and extraction
6. Two arithmetic units with circular and bit-reversed addressing
7. Unified 4-GB memory space
Figure 6. Architecture core and peripheral blocks [37]. The diagram shows the processor core and system control blocks connected through a system interface unit to 48-KB instruction SRAM/cache, 52-KB instruction ROM, 32-KB data SRAM/cache, 4-KB scratchpad RAM, a parallel peripheral interface, and an external memory interface with high-speed I/O.
8. Mixed 16/32-bit instruction encoding for best code density
9. Memory protection for support of OS operation
In a single cycle, the Blackfin™ DSP architecture supports the following:
1. Execution of a single instruction operating on both MACs or ALUs
2. Execution of 2 x 32-bit data moves (either two reads or one read/one write)
3. Execution of two pointer updates
4. Execution of hardware loop update
Figure 7 presents a diagram of the memory hierarchy architecture.
Figure 7. Memory hierarchy architecture [37]. The Blackfin™ DSP core connects to L1 instruction SRAM/cache, L1 data SRAM/cache, and scratchpad RAM; DMA links these to L2 instruction and data SRAM.
L1 has separate instruction and data-memory blocks and can be configured as
SRAM, cache, or a combination of both. L2 memory stores larger amounts of
program code and data. The L1 memory is connected directly to the core and runs at
full system clock speed. There is latency in accessing L2, but once it is accessed it
acts like a burst store memory. For large program or data spaces, external memory
can load L2 memory directly.
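The two-level behavior described above — single-cycle L1 at the core clock, an initial L2 latency followed by burst transfers — can be modeled with a toy cycle count. The cycle figures below are invented for illustration, not Blackfin datasheet numbers.

```python
# Toy model of the L1/L2 hierarchy described above: L1 at full clock
# speed, L2 paying an initial latency and then streaming in bursts.
# All cycle counts are illustrative, not datasheet figures.

def access_cycles(level, words=1, l2_latency=8, l2_burst=1):
    if level == "L1":
        return words  # single-cycle per word, at full system clock speed
    # First access pays the L2 latency; subsequent words stream in bursts.
    return l2_latency + (words - 1) * l2_burst

single_l1 = access_cycles("L1")
burst_l2 = access_cycles("L2", words=16)
```

The amortization is the point: a 16-word L2 burst costs far less than 16 isolated L2 accesses, which is why the text says L2 "acts like a burst store memory" once accessed.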
Dynamic power management. Blackfin™ DSP incorporates dynamic power
management, allowing continuous adjustment of the processor’s voltage and frequency to optimize power consumption and processor performance for real-time
applications. Software-selectable low-power operating modes put any combination
of the core and peripherals in a sleep mode while maintaining operation of the
remaining processor resources. In addition, a multilevel memory hierarchy provides
power savings, with most memory references contained to the smallest, on-chip
memory subsystems. See table 1.
Table 1. Analysis of Blackfin™ Architecture Characteristics

Mode      Power Savings   Notes
Full on   Minimal         Maximum performance mode
Active    Low             Core operational; PLL bypassed and core running at CLKIN frequency; PLL enabled
Sleep     High            Core idle, core clock disabled; PLL enabled; wakeup on any wakeup-enabled interrupt
Relaxed   Medium          Bypass enabled; PLL disabled; core wakes up on any wakeup-enabled interrupt

Source: [37].
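The mode table can be rendered as a small lookup with a wakeup rule. The savings levels and wakeup behavior follow Table 1, but this is a conceptual sketch, not driver code for a real Blackfin part; the dictionary layout and field names are invented.

```python
# The mode table above as a small lookup with a wakeup rule.
# Savings levels and wakeup behavior follow Table 1; the data layout
# and field names are invented for this sketch.

MODES = {
    "full_on": {"savings": "minimal", "core_clock": True,  "pll": True},
    "active":  {"savings": "low",     "core_clock": True,  "pll": True},
    "sleep":   {"savings": "high",    "core_clock": False, "pll": True},
    "relaxed": {"savings": "medium",  "core_clock": False, "pll": False},
}

def wake(mode, interrupt_wakeup_enabled):
    # Sleep and relaxed modes resume only on a wakeup-enabled interrupt.
    if mode in ("sleep", "relaxed") and interrupt_wakeup_enabled:
        return "full_on"
    return mode

savings_sleep = MODES["sleep"]["savings"]
resumed = wake("sleep", interrupt_wakeup_enabled=True)
```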
Regarding peripherals, any peripheral that is not enabled automatically has its
clock tree disabled in order to save power.
Regarding memories, the L2 SRAM is partitioned into 32 banks, of
which at most two can be accessed at any given time (one bank by the core,
the other by the DMA), so the remaining 30 are powered down until accessed.
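The banked power rule above reduces to simple arithmetic, sketched below. The function and parameter names are invented; the only facts it encodes are the ones stated in the text: 32 banks, at most one active for the core and one for the DMA.

```python
# Sketch of the banked-L2 power rule above: of 32 banks, at most one
# is active for the core and one for the DMA; the rest stay powered
# down. Function and parameter names are invented for this example.

def powered_banks(core_bank, dma_bank, total_banks=32):
    active = {core_bank, dma_bank}  # a set: the same bank counts once
    return active, total_banks - len(active)

active, powered_down = powered_banks(core_bank=3, dma_bank=17)
```

When the core and the DMA happen to target the same bank, 31 of the 32 banks stay powered down rather than 30.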
All of this is independent of the dynamic power management [37], which
allows the developer to dynamically change the voltage and frequency of operation
for the given tasks at hand. This can be done by utilizing an
RTOS, which the Blackfin™ is capable of running.
Target applications. Target applications for the Blackfin™ include:
1. Cellular terminals: voice, video, and data services, speech processing, and
ciphering tasks are available.
2. Dual MAC DSP architecture provides processing capability.
3. Pipelining of RISC architectures enable better frequency scaling.
4. Reduced complexity lowers the cost.
5. Internet and consumer appliances consist of speech recognition, handwriting
recognition, streaming video, and audio compression.
6. Dynamic power management reduces the voltage and frequency of operation
during non-time-critical operations.
7. Internet and telecommunications infrastructure include voice over IP (VoIP)
gateways.
Roadmap/future directions. The Blackfin™ DSP architecture is claimed to
surpass 1-GHz performance and to operate at as low as 1.0 V. Support for dynamic
power management allows high-performance operation coupled with exceptionally
low power consumption.
The VisualDSP++ software development environment consists of an IDE,
debugger, C/C++ compiler, assembler, linker, and simulator.
Texas Instruments, Inc. OMAP™ 1510
Salient features. The salient features of Texas Instruments, Inc. OMAP™
1510 [38] are as follows, and they are illustrated in figure 8.
1. Dual-core architecture optimized for efficient operating system and multimedia
code execution
2. TMS320C55x™ DSP core for superior multimedia performance
3. TI-enhanced ARM™ 925 core with an added LCD frame buffer to run command and control functions and user interface applications
4. Support for advanced OS
5. Targeted towards 2.5-G and 3-G wireless applications; optimized for power-
efficient execution of multimedia applications including MPEG encoders and
decoders
6. ARM 925 RISC core operates at 175 MHz
7. Includes a memory management unit (MMU) for virtual-to-physical memory
translation and task-to-task memory protection
8. 16-KB I-Cache, 8-KB data cache, and 17-word write buffer
Figure 8. Texas Instruments, Inc.’s OMAP™ core [38]. The diagram shows the enhanced ARM core and the DSP, each with its own MMU and caches, connected through a traffic controller to SDRAM, program memory, DMA, an LCD controller, clocks, power management for an external modem interface, an interface to application peripherals, an MPU interface, and JTAG support for test and real-time software debug.
9. 1.5 MB of internal memory
10. 200-MHz C55x DSP core: increased idle domains, variable-length
instructions, and increased parallelism
11. Multimedia extensions for motion estimation, discrete cosine transform (DCT),
inverse DCT (IDCT), and 1/2-pixel interpolation
12. Core includes 32 Kwords of internal dual-access SRAM, 48 Kwords of internal
single-access SRAM, and a 12-Kword instruction cache
13. Intermemory space data transfer through DMA
14. External memory addressable space of 64 MB, accessed at 100 MHz
15. Parallel data transfers that can occur with microprocessor unit operation
For the air interface for voice and data transfer, a standard multi-channel
buffered serial port (McBSP) allows data transfers at 6 Mbps.
Target applications. Target applications for the OMAP™ 1510 [38] include:
1. Multimedia: streaming audio/video, broadcast, players
2. Games: 2D, 3D
3. Location-based services: GPS, network-assisted solutions
4. Security (user interface): biometrics, user authentication
5. Security (infrastructure): encryption/decryption, firewall, user verification,
anti-virus
6. Business applications: database management, spreadsheet, synchronization, and
application navigation via speech
7. Dynamic vocabulary speech recognizer: a computationally intensive, small-footprint speech recognition engine running on the DSP; computationally non-intensive, larger-footprint grammar, dictionary, and acoustic-model generation
components residing on the ARM® processor
Berkeley Wireless Research Center,
Pleiades Architecture
This section provides details on the Pleiades architecture. Pleiades is a
low-power, domain-specific architecture template for digital signal processing
applications. References [16] and [39] present a research report that covers in detail
a platform-based low-power design of a hybrid reconfigurable fabric for wireless
protocol processing.
The key features of the Pleiades architecture template [16], [39] are:
1. It is a highly concurrent, scalable multiprocessor architecture with a heterogeneous array of optimized satellite processors that can execute the dominant
kernels of a given domain of algorithms with a minimum of energy overhead.
The architecture supports dynamic scaling of the supply voltage.
2. Reconfiguration of hardware resources is used to achieve flexibility while minimizing the overhead of instructions.
3. A reconfigurable communication network can support the interconnection patterns needed to implement the dominant kernels of a given domain of algorithms
efficiently. The communication network uses a hierarchical structure and low-swing circuits to minimize energy consumption.
4. A data-driven distributed control mechanism provides the architecture with the
ability to exploit locality of reference to minimize energy consumption. The
control mechanism provides special support to handle the data structures
commonly used in signal-processing algorithms efficiently. The control mechanism
also provides a framework for minimizing switching activity.
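The data-driven firing rule behind such a control mechanism can be sketched in software (an illustrative sketch only; the node and arc names are invented and are not taken from the Pleiades design):

```python
def fire_ready_nodes(graph, tokens):
    """Data-driven firing rule: a satellite-like node executes only when
    tokens are present on all of its input arcs, so control stays local
    to each processor instead of in one large global controller."""
    fired = []
    for node, inputs in graph.items():
        if all(tokens.get(arc, 0) > 0 for arc in inputs):
            for arc in inputs:          # consume one token per input arc
                tokens[arc] -= 1
            fired.append(node)
    return fired

# Hypothetical two-node graph: "mac" needs arcs a and b, "acs" needs c.
graph = {"mac": ["a", "b"], "acs": ["c"]}
tokens = {"a": 1, "b": 1, "c": 0}
# fire_ready_nodes(graph, tokens) -> ["mac"]  ("acs" waits for arc c)
```

A node with no pending input data consumes no control cycles, which is how this style of control exploits locality of reference and limits switching activity.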
Figure 9 gives a diagram of the configuration bus.
[Figure: the control processor drives the configuration bus.]
Figure 9. Configuration bus
Pleiades architecture template. The Pleiades template [16], [39] consists of:
1. Control processor: a general-purpose microprocessor core
2. Satellite processors: a heterogeneous array of autonomous, special-purpose
processors
3. Reconfigurable communication network
4. Reusability of the template, achieved by two means:
5. Set of predefined control and communication primitives that are fixed across all
domain-specific instances of the template
6. Predefined satellite processors that can be placed in a library and reused in the
design of different types of processors
Control processor. The functions of the control processor [16], [39] are:
1. It executes non-compute intensive and control-oriented tasks of algorithms.
2. It configures the satellite processors and the communication network for
executing a given kernel. A configuration state of a resource is stored in a
suitable storage element, e.g., a register, a register file, or a memory.
3. The configuration states are in the memory map of the control processor and are
accessed by the control processor through the reconfiguration bus.
Satellite processors. The characteristics of the satellite processors are:
1. They form the computational core of the Pleiades [16], [39] architecture.
2. Concurrent execution of kernels is supported. Some examples of satellite
processors are memories that store the data structures processed by the
computational kernels of a given algorithm; parameters include the type, size, and
number of memories used in a domain-specific processor. Other examples
include: address generators, reconfigurable data paths, PGAs to implement logic
functions, MAC processors to compute vector dot products, and ACS processors
to implement the Viterbi algorithm used in communication and storage
applications.
3. DCT processors are included.
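As an illustration of the kind of kernel an ACS satellite processor accelerates, the add-compare-select step of the Viterbi algorithm can be sketched in software (a hedged sketch; the metric values and interface are illustrative and are not taken from the Pleiades design):

```python
def acs(metric_a, metric_b, branch_a, branch_b):
    """Add-compare-select: the core Viterbi kernel that an ACS
    satellite processor would implement in hardware.

    Adds a branch metric to each candidate path metric, compares
    the two sums, and selects the survivor path."""
    cand_a = metric_a + branch_a  # add
    cand_b = metric_b + branch_b
    if cand_a <= cand_b:          # compare (smaller metric wins)
        return cand_a, 0          # select survivor, record decision bit
    return cand_b, 1

# Example: two candidate paths merging into one trellis state.
survivor, decision = acs(metric_a=3, metric_b=5, branch_a=2, branch_b=1)
# survivor == 5, decision == 0  (path A: 3+2=5 beats path B: 5+1=6)
```

In hardware the add, compare, and select happen in a single datapath stage for every trellis state in parallel, which is why a dedicated ACS unit is far more energy-efficient than running this loop on a general-purpose core.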
Communication network. The communication network [16], [39] is configured
by the control processor to implement the arcs of the data flow graph. The scheme
used by the Pleiades architecture is a generalization of the mesh interconnect structure.
Figure 10 presents a diagram of the generalized interconnect structure.
[Figure: satellite processors attached to a wiring channel of buses.]
Figure 10. Generalized mesh interconnect structure
The parameters of the structure include:
1. A number of buses are employed in the wiring channel.
2. A hierarchical structure to improve the performance and efficiency of
implementation: clusters of tightly connected satellite processors are created to
provide a hierarchy in the communication network. Intercluster switch boxes
provide intercluster communication. Satellite processors at the two ends of a
communication channel can run at independent supply voltages.
Reconfiguration. Satellite processors can be reconfigured [16], [39] at run
time to support various kernels of a given domain of algorithms. A distributed
data-driven control mechanism employs small local controllers in place of a large global
controller. The processors communicate via an asynchronous timing scheme. The
Pleiades processor is used for speech-processing applications.
Cradle Technology, Inc.'s UMS
The demands of digital image processing, communications, and multimedia
applications are growing more rapidly than traditional design methods can fulfill
them. Previously, only custom hardware designs could provide the performance
required to meet the demands of these applications. However, hardware design has
reached a crisis point wherein it can no longer deliver a product with the required
performance and cost in a reasonable time for a reasonable risk. Software-based
designs running on conventional processors can deliver working designs in a
reasonable time and with low risk but cannot meet the performance requirements.
Cradle Technology, Inc. [41] offers the Universal Microsystems (UMS) [42] as a
solution to these problems. The UMS is a completely programmable (including I/O)
system on a chip that combines hardware performance with the fast time to market,
low cost, and low risk of software designs. Essentially it is a parallel processing
environment on a single chip. In other words, it illustrates parallel processing with
an SOC concept.
Unfortunately, the ASIC and ASSP [42] approach to high-performance
system design has been unable to keep up with the continual demands of process
improvement. Each round of process improvement provides the capability for more
transistors in a new product for the same cost as the previous product. With more
transistors, the new product requires more design time and effort than the previous
product. Improving design tools can help to reduce the effort. However, even with
improvements in tools, there is a problem. Process capability, measured in transistors per
unit area, has been improving at over 50 percent per year, while design capability, measured in
transistors per designer per year, has been improving at less than 30 percent per year.
If the designer population does not grow rapidly and continually (and it has not),
fewer designs per year can be implemented, resulting in unmet demand. Potentially
successful products are not created because there is not enough talent available to
make them.
Bigger designs bring other problems. It takes more designers to create the
design, the design takes significantly longer to create and debug, and there is an
increased risk of missing the market window. A related problem is that product
cycles are getting shorter. With the design taking longer and product cycles getting
shorter, the market can change and invalidate your product before you finish the
design. The problem with ASIC and ASSP design [42] is that it is hardware design.
Hardware design is relatively difficult, time consuming, inflexible, and specialized
relative to software design. Hardware designs are also not very reusable when
compared to software designs. Significant design effort is often required to transfer
a hardware design from the current process generation to a new generation. What is
needed is a system approach that combines the high performance associated with
hardware with the short design time, flexibility, and reusability of software. The
new architecture called the Universal Microsystems (UMS) [42] fills this need and
fills it in a way that provides major advantages over all other approaches that have
been tried. The UMS provides the performance of hardware designs with the rapid
design time of software designs.
The UMS uses many small, fast processors to provide the performance
normally associated with hardware with the short design time and reusability of
software. Figure 11 shows a block diagram of the UMS.
The UMS is a single chip consisting of clusters of processors connected by a
high-bandwidth bus. These processor clusters, called quads [42], communicate with
external DRAM and I/O interfaces through a DRAM controller and through a fully
programmable I/O system, respectively. Because the UMS [42], [44] has a large
number of small, fast processors (e.g., 75 processors running at 324 MHz on one
chip), the UMS exceeds the performance of most equivalent ASIC designs. In a
UMS design study of a digital camcorder, half of a single UMS chip [42] replaces
11 ASIC chips, using 1/6 the power, providing significantly more function than the
ASICs it replaced and having room for many more features in the future.
Performance. The performance of the UMS [42] is as follows:
1. Processing: ~10^4 MIPS
2. I/O bandwidth: ~500 MB/sec
3. Memory bandwidth: ~1-2 GB/sec
4. Power efficiency: ~200 mW/GFLOP
[Figure: processor quads with local memories, DRAM control, clock, and debug logic, surrounded on all sides by programmable I/O connecting to external NVMEM and DRAM.]
Figure 11. UMS architecture [42]
The salient features of the Cradle UMS architecture [42], [44] are:
1. High, achievable performance: The UMS [42], [44] uses large numbers of small,
simple processors rather than one large, highly optimized fast processor to
achieve high performance. These processors are designed for high speed, small
size, and low power.
2. Simple programming model: the SPMD model is a simple programming model,
particularly compared to other models such as SIMD and VLIW. In the SPMD
model, a single program is written that runs on many processors.
3. Complete programmability: the UMS is a completely programmable system,
including the I/O. All functions implemented in hardware in an ASIC are
implemented in software in the UMS.
4. Other features: it has a short time to market, low design investment and risk,
reuse of design investment, predictable performance and costs, and scalable
design.
Hardware architecture. The UMS [42], [44], as shown in figure 11, is a
multiprocessor system on a chip with programmable I/O for interface to external
devices. The UMS is a single chip consisting of clusters of processors connected by
a high-bandwidth bus. These processor clusters, called quads, communicate with
external DRAM and I/O interfaces through a DRAM controller and through a fully
programmable I/O system, respectively. Each quad [42] consists of four processor
groups, hence the name quad.
Each processor group is called a multistream processor, or MSP. The UMS
uses a single 32-bit (expandable to 40+ bits) address for all register and memory
elements. Each register and memory element in the UMS has a unique address and
is uniquely addressable.
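The flat single-address-space idea can be sketched as follows (an illustrative sketch; the region name and base address are invented and are not the actual UMS memory map):

```python
class AddressSpace:
    """Flat 32-bit address space in which every register and memory
    element is uniquely addressable, as in the UMS. Backed here by a
    dictionary purely for illustration."""
    def __init__(self):
        self.mem = {}

    def write(self, addr, value):
        assert 0 <= addr < 2 ** 32, "single 32-bit address space"
        self.mem[addr] = value

    def read(self, addr):
        # Unwritten locations read as zero in this sketch.
        return self.mem.get(addr, 0)

space = AddressSpace()
space.write(0x4000_0000, 0xCAFE)   # hypothetical quad-local register
value = space.read(0x4000_0000)    # -> 0xCAFE
```

Because registers and memories share one address map, any processor can address any element with ordinary load/store operations, which simplifies inter-processor communication.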
Software architecture. The UMS [42], [44] achieves its high performance by
exploiting parallelism in its applications. One of the tasks of software design is to
find and use parallelism. Parallelism means running computing tasks in parallel (i.e.,
at the same time). To do this most effectively, the tasks must be nonblocking, i.e.,
each must be able to execute independently of the other. If tasks A and B are
running in parallel and A has to wait for data from B (data blocking) or wait for a
resource being used by B (resource blocking) before it can continue, A is being
blocked by B. If tasks are nonblocking, any number of tasks can be run in parallel.
Types of parallelism. The goal of software design is to find and exploit
scalable parallelism. The three types of parallelism exploited are:
1. Functional parallelism: exploits the inherent parallelism among independent
functions
2. Pipeline parallelism: uses the static and dynamic pipeline parallelism
3. Data parallelism: makes use of the parallelism between independent data items
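Data parallelism, the third type above, can be sketched minimally as follows (an illustrative sketch only; the UMS runs a single program on many hardware processors in SPMD style, which a Python thread pool merely approximates):

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(x):
    # Per-item kernel: each data item is independent (nonblocking),
    # so any number of items can be processed in parallel.
    return 2 * x

def data_parallel_map(items, workers=4):
    """Data parallelism in the SPMD style: the same kernel program is
    applied concurrently to independent data items, and results come
    back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, items))
```

Because the items share no data, no task blocks another, which is exactly the nonblocking property described above that lets any number of tasks run in parallel.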
Summary. The UMS [42], [44] solves the design crisis for high performance
applications by providing a solution with the performance of hardware and the
design convenience of software in a completely programmable platform. The UMS
[42], [44] solution provides:
1. High, achievable performance
2. Short time to market
3. Low design investment and risk
4. Increased design productivity
5. Reuse of design investment
6. Program in C, not Verilog-XL™
7. Predictable performance and costs
8. Scalable designs
9. Technology independence
The UMS [42], [44] is a revolutionary approach to the problem and is superior to all
other solutions proposed to date.
CHAPTER 4
CONFIGURABLE PLATFORM: THE ASIC REVOLUTION
Platform-based design was a revolutionary approach. Trends in
system-on-chip (SOC) design [42] led to these platform-based approaches, and the
new requirement for achieving them was embedded processors. Thus the technology
was progressing towards the embedded domain from the desktop regime. To
customize the platforms for a specific application, reconfigurability was very important.
Slowly, but surely, platform-based design is becoming the future of the EDA design
world.
The semiconductor market is undergoing revolutionary changes in design.
SOC is at a crossroads; the mainstream continues to believe that “core-based design”
will prevail, but a different approach, “platform-based design,” is emerging. The key
characteristic of this revolution is the shift from isolated components (processors,
memories, ASSPs, ASICs) [42] to integrated single-chip platforms.
As a result of this platform-based design approach, ASIC development
changes from hardware design to a combination of application software development
and platform configuration. The development of an ASIC can be done by
application specialists (rather than hardware specialists), enabling higher levels of creativity
and innovation in systems; 95 percent of current IC/ASIC EDA tools move into
platform development and implementation tools. Designers will now be using
application development (software), platform mapping, and platform configuration
tools [42].
Platform-Based Design
The SOC IC [42], [44] consists of memories, processors, peripherals, custom
blocks, and multiple processing blocks; its integration architecture [42] consists of:
1. I/O architecture
2. Task model
3. Task execution
4. Task communication: data and control
5. Task concurrency
6. Application to platform methodology
7. Application description
8. Mapping technology and approaches
Platform mapping is illustrated in figure 12.
Platform-Based Designs
The traditional core-based design is faltering, as it is getting too expensive for
semiconductor vendors and systems companies. Also, it only makes sense for
extremely high volume parts (e.g., Sony Playstation®). There has been an increasing
number of IC design project failures in the recent past. The only successful cores are
programmable processors, memories, and peripherals. Application-specific cores
have short lifetimes because of the fast-paced market. Integration is still a major
challenge for these cores. As a result of all these, the overall industry is moving to
[Figure: an application description is mapped, through a mapping methodology, onto an array of processors, memories, and I/O.]
Figure 12. Platform mapping [44]
(and talking about) “platforms.” Most platforms are “boards on chips” and duplicate
the bus connection model. No one has put forward a compelling platform even for
specific application areas.
Benefits of Platform-Based Designs
The benefits of the platform-based design [44] include:
1. It has a rapid time to market.
2. The platform is already defined and ready for manufacturing (including
verification suites), with programmable on-chip resources.
3. Hardware resources do not need to be committed to a single function but can be
reused for multiple functions.
4. It involves true parallel hardware and software development.
5. Software can be developed independently of hardware implementation.
6. A stable platform means that software development tools can ensure that
software will run correctly on the platform, with product life-cycle support and
design reuse.
7. Once the initial product is delivered, the platform can be optimized for cost
reduction, etc. without changes to software.
8. Software can be updated to incorporate new features without changes to the
underlying platform.
9. It has a stable hardware target.
10. Same masks can be used across multiple projects (reprogrammed for different
applications).
11. Optimization and verification benefit all product teams because of its consistent
architecture.
12. The same platforms can be employed across different product configurations
(e.g., different channel densities).
13. It has a common design environment.
14. The software component can be re-used across products.
15. The methodology is consistent between product groups.
SOC Alternative
The key characteristics of the SOC are listed in table 2.
Table 2. Key SOC Characteristics

Characteristic            ASIC   ASSP   Platform   Configurable Platform
Product differentiation    ++     -      +          ++
Performance                ++     ++     +          ++
Silicon efficiency         ++     -      -          +
Time to market             --     ++     ++         +
IC design costs            -      ++     ++         +
Part cost                  ++     -      +          ++
Programmability            -      -      +          ++
Adaptability               -      -      +          ++
Scalability                -      -      +

Source: [44].
The key platform characteristics are:
1. Performance
2. Programmability
3. Scalability/integration
4. Configurability
Platform FPGA (Virtex™ II and CoolRunner™)
Structural Details
The platform is an FPGA with embedded processors (IBM's PowerPC NPe
405™), DSP functions, and gigabit I/Os. The overall structure looks like the
diagram in figure 13.
[Figure: a mix of soft IP and hard IP blocks on the FPGA fabric.]
Figure 13. Platform FPGA architecture [40]
Here, the hard IP is the HDL design that configures the FPGA, and the soft IP
consists of software programs meant for the embedded processing cores.
Architecture Features
There are 12 clock managers each with phase shift and frequency synthesis
capabilities. This allows multiple clock domains within a single FPGA [40].
Frequencies range from 1 MHz to 420 MHz. There are 16 pre-engineered clock
domains. There is up to 3.5 MB of on-chip block RAM. The PowerPC NPe
405™ core has over 6 GB/sec of peak communications bandwidth with the FPGA
fabric.
Each Virtex™-II I/O pin is individually programmable for many I/O standards;
this allows flexibility to suit the application I/O requirements.
It is easy to integrate the different kinds of soft and hard IP blocks. Soft IP
blocks can be integrated by exploiting software parallelism. The hard IP block
integration helps improve design. These help in developing a diverse set of
architecture components on a single chip.
Performance. The performance metrics of the platform FPGA are given in
table 3.
Table 3. Platform FPGA Performance Metrics

Function                                  Fastest DSP          Virtex™
8x8 MAC                                   8.8 billion MACs     600 billion MACs
FIR, 256-tap, linear phase,
16-bit data/coefficient                   17 MSPS (1.1 GHz)    180 MSPS
1024-point FFT (16-bit data)              7.7 µsec (800 MHz)   1 µsec (140 MHz)

Source: [40].
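The 8x8 MAC and FIR entries in table 3 both reduce to the same multiply-accumulate kernel. A hedged software sketch of an N-tap FIR filter (illustrative only; this is not Xilinx's implementation, which unrolls the MACs across parallel fabric resources):

```python
def fir(samples, coeffs):
    """N-tap FIR filter built from multiply-accumulate (MAC)
    operations, the kernel benchmarked in Table 3. Produces one
    output per input sample once the delay line is full."""
    n = len(coeffs)
    out = []
    for i in range(n - 1, len(samples)):
        acc = 0
        for j in range(n):             # one output = N MACs
            acc += coeffs[j] * samples[i - j]
        out.append(acc)
    return out

# 2-tap moving-sum example, coefficients [1, 1]:
# fir([1, 2, 3, 4], [1, 1]) -> [3, 5, 7]
```

An FPGA reaches the throughput in table 3 by instantiating many such MACs in parallel, one per tap, rather than iterating the inner loop sequentially as a DSP does.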
Tools and simulators. Xilinx, Inc. [40] has extensive tools for application
development with the PFPGA, and the system development process appears to be
largely automated. There is no mention of application mapping tools from the
algorithm implementation point of view; that step is probably still manual.
Target application domain. The target application domain is mainly DSP
functionality. Included are: echo cancellation, forward error correction, and image
compression/decompression.
Limitations. The PFPGA is not yet a candidate for handheld devices, but Xilinx,
Inc. has CoolRunner™, a one-time-configurable device meant for handheld
devices. Thermal dissipation is a major concern, however, so energy and power
cannot be neglected. There are also power-constrained environments, such as
solar- or battery-powered systems (in satellites, perhaps), which can be used as an
argument to justify the PFPGA as a device.
Market acceptance. Market acceptance seems reasonable, but Xilinx, Inc.
does not have concrete figures.
In the future, by adding application-specific accelerators, a larger share of
the application market could be captured through the PFPGA initiative. The size
and energy usage can probably come down far enough eventually to make the
PFPGA usable in handheld devices. (There is no concrete argument for this
projection beyond Moore's law, the emergence of SOCs, and the movement
towards billion-gate devices.)
CHAPTER 5
ENERGY-EFFICIENT RECONFIGURABLE ARCHITECTURES
Introduction
It is a widely known fact that as VLSI technology advances, there is a
looming crisis that is an obstacle to the widespread deployment of mobile embedded
devices, namely that of power. This problem can be tackled at various levels,
namely device, logic, operating systems, micro-architecture, and compiler.
The widespread deployment of embedded/reconfigurable processors in
mobile computing promises to open up many new frontiers in applications.
However, an important barrier, which can severely limit this development, is the power
consumption. An energy-efficient device can potentially last longer and requires less
bulky power supply units. In most devices, the computing elements account for the
most power consumption, and this problem is being attacked on various fronts. First
is the design of VLSI devices and logic. These make the most significant
contribution to saving power. Next is the use of novel microarchitecture techniques such as
voltage scaling. The third front is in the software runtime system, like the operating
system, which can utilize power-saving microarchitectural features and schedule
tasks accordingly. Lastly, given an application, a compiler can try and optimize its
power consumption.
Dynamically reconfigurable systems offer the potential to realize efficient as
well as highly adaptable systems. Such systems are of extreme use for many future
applications, having limited battery power (energy-budgeted) and needing to handle
diverse data types, operating in dynamic application and communication
environments, and requiring high performance. While most consumer microprocessors have
used recent technological advancements in computer architecture to improve their
performance, some critical applications like portable computing (mobile/handheld
devices, etc.) still require low power. There are several methods for programs to
dynamically change the performance of a processor, allowing parts of the program
that do not require the maximum performance of the processor to bypass some of the
logic to attain a low-power mode with decreased performance.
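The payoff of such low-power modes follows from the CMOS dynamic-power relation P ≈ α·C·V²·f: halving both supply voltage and frequency cuts dynamic power roughly eightfold. A hedged numeric sketch (the capacitance, voltage, and frequency values are illustrative constants, not measurements from any particular processor):

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """CMOS dynamic power estimate: P = activity * C_eff * Vdd^2 * f.
    All constants passed in below are illustrative, not measured."""
    return activity * c_eff * vdd ** 2 * freq

# Full-performance mode vs. a voltage- and frequency-scaled mode.
full = dynamic_power(c_eff=1e-9, vdd=1.8, freq=200e6)   # watts
slow = dynamic_power(c_eff=1e-9, vdd=0.9, freq=100e6)
# Halving Vdd and f reduces dynamic power by a factor of 8,
# since power scales quadratically with Vdd and linearly with f.
```

This quadratic dependence on supply voltage is why the dynamic voltage scaling mentioned for Pleiades, and the low-power processor modes described here, are such effective energy levers.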
As the SOC concept is rapidly becoming a reality, time-to-market and
product complexity push towards reusing large macromodules. Reconfigurable
processors help achieve this goal. But this calls for methodologies that not only
trade off the power-delay-area (PDA) metrics simultaneously but
also allow designers to evaluate various design choices at an early stage and to
compare different implementation approaches. Several such energy-conscious
methodologies have been developed for reconfigurable DSP applications.
Several power-estimation tools for RTL models have been built. These
feature cache-based on-line model building that allows the reuse of previously
constructed models, at the same time offering automatic characterization capabilities.
They guarantee fast estimation and good accuracy and are suited for architectural
design explorations.
Energy Management in Dynamically Reconfigurable
Mobile Systems
There is currently an explosive increase in the use of mobile handheld
devices such as cell phones, PDAs, digital cameras, GPS units, and so forth.
Technological advancements have led to wireless computing and networking. Personal
mobile computing or ubiquitous computing [9] will play a very important role in
driving technology in the future. In this platform, the basic personal-computing and
the communication environment will be integrated, battery-operated, and portable
[9]. They will incorporate various features of the current mobile devices.
Multimedia tasks will be used for user interfaces. This technology [9] will be a part of a
greater networking infrastructure. But the users of these devices will be very
demanding, which will require them to have many additional resources. The
technological challenges to establishing this paradigm of mobile computing are
nontrivial. This is because of the limited battery resources, need to handle different
data types, and need to operate in environments that are insecure and unplanned.
Since ASICs or high-speed GPRs cannot support this technology due to their
high time-to-market period and inefficiency in resource utilization, a different
approach was used. A heterogeneous reconfigurable architecture with a QOS-driven
operating system is used in which the granularity of reconfiguration is chosen based
on the model of task to be performed. A dynamic architecture-application-matching
approach [9] is used. The main concept here is to perform data operations where
they are most energy-efficient and minimum communication is required. The central
idea here is to design a system where data processing and communication are
integrated and share equal importance [9].
Due to battery life and weight limitations, energy consumption is becoming
the limiting factor in the amount of functionality that can be placed in the portable
devices. Another important challenge in designing such mobile devices is to cope
with the dynamically changing attributes of the environment. A possible solution to
this problem is to have a mobile device with a reconfigurable architecture that will
enable it to cope with these dynamically changing operating conditions.
Adaptability and programmability are the two main requirements in the design of
reconfigurable architectures for mobile systems [9]. Essentially, there is a need to revise
the system architecture of a portable computer if we want to have a machine that can
be used effectively in a wireless environment. A system-level integration of the
mobile’s architecture, operating system, and applications is required [9], This
system should provide a solution with a proper energy-efficient mix balance between
flexibility and efficiency through the use of a hybrid mix of GPRs and ASICs.
Summarizing, the key issue in designing portable multimedia systems is to
find a good balance between flexibility and processing power, on one side, and area
and energy efficiency of implementations, on the other.
Solution Domains
There are several possible solutions, such as:
1. Fixed resources and fixed algorithms: mapping an algorithm directly to
hardware
2. Fixed resources and multiple algorithms: fixed hardware supporting multiple
problems
3. Dynamic resources and dynamic algorithms: a trade-off between the above two
cases, where the resources and algorithm may vary over time to solve the
problem
Energy Management
Adaptability and flexibility are the two main recurring items when we
mention energy efficiency in mobile devices. Energy policy optimization is the
central idea in any energy management system. The policy is the algorithm that
decides what measures are to be taken in order to reduce energy consumption. In
mobile systems, energy management is not a one-time issue, and frequent
adaptations to the system are essential to maintain an energy-efficient system. Finding an
energy management policy that minimizes energy consumption without
compromising performance beyond acceptable levels is a complex issue. In personal mobile
devices, the system must cope with the conflicting demands of compute-intensive
algorithms, communication-intensive applications, and long battery life. In general,
a designer tries to design a system suitable for a particular application and
environment. However, energy efficiency in mobile systems is not a one-time design
problem that needs to be solved during the design phase. In a mobile system, power
management extends the idea of hardware/software co-designs, since we have to deal
with a highly dynamic application and communication environment. This
multidimensional design space offers a large range of design trade-offs.
In a reconfigurable mobile system [9], functions can be dynamically migrated
between function modules such that an efficient configuration is obtained. The
networked operations of reconfigurable systems open up additional opportunities for
decomposition to increase energy efficiency. Essentially, there is a trade-off
between energy spent in computation and communication. Partitioning of functions
is an important architectural decision, which indicates where applications can run,
where data can be stored, the complexity of the terminal, and the cost of the
communication service. The key implication for this architecture is that the run-time
hardware and software environments on the mobile computer and the network should
be able to support such adaptability. A substantial reduction is possible if the
computational complexity is high, the computation is regular and spatially
localized, and the communication between modules is significant. Improving the
energy efficiency by exploiting this locality of reference and using energy-efficient
application-specific modules has a great impact on mobile systems [9].
Power Estimation Tools and Flow
Power consumption has become another dimension in the design space of
integrated circuits [13]. Fast and accurate power estimation tools are needed at each
level of abstraction in order to analyze the energy consumption of a design and to
validate that given design constraints are met. There are several RTL power-
estimation [11] tools that guarantee fast estimation and good accuracy and that are
suited for architectural design exploration. With the advent of the new technologies
that enable very complex systems to be implemented on a single chip, addressing the
problem of power estimation at the early stages of the design now has become
mandatory. Though several RTL-based power estimators [11] have been deeply
investigated in the recent past, only a very few of them have reached the market,
due to the lack of accuracy and the delay caused by these tools. There are several ways
of attacking the problem of estimating the power consumed by an RTL
description [11]; the most promising one is based on the construction of abstract power
macromodels for the various components in the RTL description [11]. The models are
built by exploiting the available information on existing and/or previously designed
instances of the components. Power estimates for a specific workload (i.e., set of
input stimuli) are then obtained by properly summing together the contribution of
each element in the description, as provided by model evaluation [11].
Power Estimation Flow
The estimation procedure consists of the following main steps:
1. VHDL source analysis and elaboration: this yields an internal representation of
the complete design hierarchy.
2. Identification of the individual components of the description: the estimation
tool provides a browsable interface that identifies the various macros, the wires
connecting the various module instances, and a controller, if this is explicitly
instantiated as a finite state machine [11]. In the case where the controller is
specified as a gate-level netlist, the individual gates do not appear in the list of
components, but they are taken into account during the estimation.
3. Simulation of the VHDL description: the simulator traces all the internal signals
that define the boundaries between the various blocks. This is of fundamental
importance for both the construction and the evaluation of the macromodels.
4. Power estimation: the hierarchy is traversed from the inputs of the top-level unit
towards the outputs, and for each unit, the following operations are carried out:
a. Model construction: for each unit, the proper model is built using a cache-
based, on-line strategy. In fact, existing RTL estimators [11] typically follow
one of two approaches: exploitation of ad-hoc, offline pre-built models for
RTL components or usage of quick synthesis to map each component onto a
netlist of gates (or any generic precharacterized primitive). The key feature
of the modeling strategy lies in its ability to construct a good model for a
given component, according to some guidelines (type of model, effort in the
construction) that are supplied by the user. Fixed, off-line models are only
used in the case of the controller (when specified as an FSM), for specific
RTL blocks [11] with well-defined structures (e.g., registers, MUXes) and
for third-party intellectual-property (IP) components for which a model is
provided. The tool supports various types of models. They differ in their
accuracy, complexity, and scope of use.
b. Model evaluation: the proper parameter values are plugged into each model
[11]. For modules with a mixed implementation (the hard macros), the
models include only those parameters related to switching. For soft macros,
the models also include complexity and/or technology parameters. The total
power is then obtained by summing up the contribution of the model of each
module, plus the wire power, so that both a total budget and a power
breakdown are reported to the user. Estimation of the individual components
present in the hierarchical description is also supported. This is possible
since components represent the finest level of granularity in the RTL
description.
c. Power models: a structural RTL implementation [11] is composed of a data
path, specified as a netlist of interconnected functional units, registers and
MUXes, and a controller, usually specified as an abstract finite state machine
(FSM).
Therefore, power estimation tools are required to be fast, accurate, and
particularly suited for design exploration of behavioral-synthesis alternatives,
where changes to the design are normally incremental.
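As a concrete illustration of the macromodel-based approach described above, the sketch below sums per-component power contributions and wire power into a total budget and a per-component breakdown. The function names, the linear model form, and all coefficient values are hypothetical; a real tool would derive its macromodels from characterized instances of the components.

```python
# Hypothetical sketch of macromodel-based RTL power estimation:
# each component's model maps a workload parameter (here, switching
# activity) to a power contribution; totals are summed with wire power.

def evaluate_macromodel(coeffs, activity):
    """Linear macromodel: P = c0 + c1 * switching_activity (illustrative)."""
    c0, c1 = coeffs
    return c0 + c1 * activity

def estimate_total_power(components, wire_power):
    """Sum per-component contributions plus wire power; also return
    a per-component breakdown, as the tool reports to the user."""
    breakdown = {name: evaluate_macromodel(coeffs, act)
                 for name, (coeffs, act) in components.items()}
    return sum(breakdown.values()) + wire_power, breakdown

# Example design: an adder, a register bank, and a MUX (made-up numbers).
design = {
    "adder": ((0.10, 2.0), 0.30),   # ((c0 mW, c1 mW/activity), activity)
    "regs":  ((0.05, 1.5), 0.20),
    "mux":   ((0.02, 0.8), 0.10),
}
total, parts = estimate_total_power(design, wire_power=0.25)
```

Summing independently evaluated macromodels is what makes per-component breakdowns essentially free: each contribution is already computed before the total is formed.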
Architectural Adaptation for Power
Modern computer architectures represent design trade-offs involving a large
number of variables in a very large design space. Choices related to the organization
of major system blocks do not work well across all applications: the power
variation across different applications and changing data sets can easily be an order
of magnitude. Architectural adaptation [18] provides an attractive means to ensure
low power and high performance. Adaptable architectural components support
multiple mechanisms that can be tailored to application needs. Architectural
adaptation [18] has received much attention recently, driven by the drastic increase
in VLSI complexity and transistor count as well as rapid advances in reconfigurable
logic. The following steps are to be performed to make an architecture adaptable [18]:
1. Identify mechanisms that are important for performance/power and can be easily
incorporated into the architecture.
2. Determine the relationship between application behavior and architecture
mechanisms.
Architectural adaptation [18] can be done at compile time or at run time.
It can be used in a number of places, for instance in tailoring the interaction of
processing with I/O or in customizing CPU elements. The idea is not new: adaptive
routing and adaptive traffic throttling have long been used in interconnection
networks, where the network architecture is varied adaptively to reduce congestion
and increase throughput. Thus, architectural adaptation [18] can enable
application-specific customization [18] for high performance and low power.
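As a toy illustration of run-time architectural adaptation, the sketch below switches a hypothetical cache prefetcher on or off based on an observed miss rate, with hysteresis so that the mechanism does not thrash near the threshold. The mechanism, thresholds, and policy are illustrative assumptions, not taken from [18].

```python
# Illustrative run-time adaptation policy: monitor an application
# metric (cache miss rate) and toggle an architectural mechanism
# (a prefetcher) to trade performance against power.

def adapt_prefetcher(miss_rate, prefetcher_on):
    """Enable the prefetcher only when misses are frequent enough to
    repay its extra memory traffic; disable it to save power otherwise.
    The band between LOW and HIGH gives hysteresis."""
    HIGH, LOW = 0.10, 0.02          # assumed thresholds
    if miss_rate > HIGH:
        return True
    if miss_rate < LOW:
        return False
    return prefetcher_on            # inside the band: keep current setting

# A phase-changing workload: idle, miss-heavy, tapering off, idle again.
state = False
for observed in [0.01, 0.15, 0.05, 0.01]:
    state = adapt_prefetcher(observed, state)
```

A compile-time variant of the same idea would bake the decision into the binary after profiling, rather than re-evaluating it during execution.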
Energy-Efficient Reconfigurable Architectures for
DSP Applications
Future handheld multimedia devices require a very high performance on a
very small energy budget [10]. Such devices can be realized only if the entire
system is energy cognizant. Since the functionality of these mobile computers and
devices will be limited by energy consumption, we need to focus on reducing it.
Since mobile computers will require a large amount of specialized circuitry [10] and
many functionalities for various applications, reconfigurability is very important for
them. It would further be desirable for wireless terminals to have architectural
reconfigurability [10], whereby functions can be modified dynamically [10]. Most
DSP algorithms for various applications are mapped to hardware using
field-programmable functional arrays (FPFAs) [10].
One major obstacle in designing huge, billion-transistor circuits is the
physical design complexity, which includes the effort for design, development,
verification, and testing of the circuit. One solution is to work with a highly regular
structure, so that we deal only with the design and replication of a single processor
tile and an interconnection structure.
The design of mobile devices has to be handled at all levels of the system. In
present technologies, interconnect contributes heavily to the energy consumption of
the system. Multimedia applications have a high computational complexity, and
their computations are regular and spatially local. Exploiting such locality of
reference improves the energy efficiency of the system [10]. Data operations should
be performed where they incur the least energy and communication cost, which can
be decided by matching computational and architectural granularity.
FPFAs are reminiscent of FPGAs but have a matrix of ALUs and look-up
tables instead of CLBs. Basically, FPFAs [10] are low-power, reconfigurable
accelerators for an application-specific domain. Low power is achieved mainly by
exploiting the locality of reference. The FPFA [10] consists of a set of processor
tiles. Multiple processes can coexist in parallel, on different tiles. Within a tile,
multiple data streams can be processed in parallel. Each processor tile consists of
multiple reconfigurable ALUs, a local memory, a control unit, and a communication
unit.
Figure 14 illustrates the FPFA architecture.
(The figure shows a processor tile containing ALUs, a local memory, a program
memory, a control unit, and a communication and reconfiguration unit.)

Figure 14. FPFA architecture [10]
The ALUs on a tile are tightly interconnected and execute the inner loops of an
application domain. The ALUs on the same tile share the control and communication
units, and they exploit locality of reference to conserve energy.
The FPFAs [10] have a highly regular organization and, hence, only require
design and replication of a single processor tile. Also, their design, verification, and
testing are simple. Thanks to these advantages, they can perform media-processing
jobs very energy efficiently.
Several techniques [6], [10], [15], [16] can be used to map DSP algorithms
from the signal-processing domain to the FPFA-ALU hardware domain [10]. They
are:
1. Linear interpolation
2. Finite impulse response filter
3. Fast Fourier transform
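Of the kernels listed above, the finite impulse response (FIR) filter is perhaps the clearest example of the regular, spatially local multiply-accumulate computation that maps well onto the tightly coupled ALUs of an FPFA tile. The sketch below is plain Python for illustration, not actual FPFA mapping code.

```python
# A direct-form FIR filter: y[n] = sum_k taps[k] * x[n-k], with
# zero history assumed before x[0]. Each tap contributes one
# multiply-accumulate, the operation an FPFA ALU is built around.

def fir(samples, taps):
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:          # skip samples before the start
                acc += h * samples[n - k]
        out.append(acc)
    return out

# Example: a 3-tap moving-average filter smoothing a short signal.
y = fir([3.0, 6.0, 9.0, 9.0], [1 / 3, 1 / 3, 1 / 3])
```

The inner loop touches only a small sliding window of samples, which is exactly the locality of reference the FPFA exploits to keep data in tile-local memory.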
Several concepts are used in the design of the FPFA [10] for energy-
efficient mapping of DSP algorithms. The central concept is balanced energy
minimization; i.e., in an architecture like the FPFA [10], the energy consumed by
the arithmetic (logic), communication (wiring), and data storage (RAM) must be
balanced. The FPFA [10] architecture is aimed at fine-grained operations in
handheld multimedia computers. It has a low design complexity and good
scalability and can execute various nontrivial algorithms energy efficiently while
maintaining a satisfactory level of performance. In contrast to conventional FPGAs,
which are aimed at bit-level logic functions, FPFAs [10] focus on fine-grain
operations. FPFAs [10] are more economical and efficient than FPGAs. Hence,
FPFAs [10] are compact and energy efficient and offer good performance for most
DSP algorithms.
Low-Power FPGA Design
Significant reduction in the energy consumption of FPGAs can be achieved by
tackling circuit design and architectural optimization concurrently. The energy
consumption of the interconnect is reduced by employing low-swing circuit
techniques [18]. FPGAs are used extensively in embedded computing as
performance accelerators, which has created renewed interest in investigating
various FPGA architectures. The fine-grain reconfigurability of the FPGA
architecture makes it an ideal candidate for SOC environments, which strive to
integrate heterogeneous programmable architectures. Advanced FPGA
implementations must tackle power dissipation and energy efficiency, which
become increasingly important at higher levels of integration [18]. The fine-grain
programmability of FPGAs results in poor energy efficiency. The energy-delay
product has been used as an optimization metric to ensure that low energy is not
obtained by sacrificing speed [18].
The fine-grain programmability of FPGAs places heavy demands on the
interconnect structure, since the speed and energy performance of an FPGA are
controlled by its interconnects. The interconnect is responsible for most of the
energy consumption in FPGAs; the logic consumes only about 5 percent of the energy.
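The energy-delay product mentioned above can be sketched as a simple comparison metric across design points. The candidate names and their energy and delay numbers below are invented for illustration.

```python
# Energy-delay product (EDP): the optimization metric the text cites
# for ensuring low energy is not bought purely with lost speed.

def edp(energy_nj, delay_ns):
    return energy_nj * delay_ns

# Three hypothetical FPGA design points: (energy in nJ, delay in ns).
candidates = {
    "baseline":        (10.0, 5.0),
    "low-swing wires": (6.0, 6.0),    # less energy, slightly slower
    "aggressive Vdd":  (4.0, 12.0),   # lowest energy but much slower
}
best = min(candidates, key=lambda k: edp(*candidates[k]))
```

Note that the lowest-energy point does not win: its delay penalty outweighs the energy saved, which is precisely the behavior the EDP metric is meant to capture.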
In designing low-power FPGAs, we have to take care of circuit and architec
tural innovations and optimizations. Emphasis is placed on interconnect architecture
during the optimization phase. The optimizations are performed at all the different
architectural levels of interconnect. Each of these architectural levels is optimized to
produce a low RC delay. The connectivity of configurable logic blocks [18] is
provided by three levels of interconnect [18]:
1. Nearest-neighbor connections (level 0)
2. Mesh interconnect (level 1)
3. Hierarchical interconnect (level 2)
Clock Distribution
The contribution of the clock to the energy dissipation is around 20 percent.
It can be as high as 50 percent for highly pipelined circuits. By using proper
clocking/triggering mechanisms, we can bring down the energy dissipation due to
clocking.
To complement the optimizations made at the architectural level, proper
circuit-level design is imperative. The main issues [18] are positioning in the
energy-delay design space and low-swing interconnects. The resulting low-energy
FPGA is suitable for embedded and portable applications. Hence, through a careful
combination of architectural redesign and circuit design, we can improve the energy
efficiency by more than an order of magnitude [18].
Role of Compilers in the Energy Crisis of Embedded/
Reconfigurable Processors
There have been extensive studies of compiler optimizations for power, but
no systematic study addresses the phenomena in the interaction of the compiler,
the application, and the microarchitecture of a processor
that give rise to power savings [4]. There are three classes of compiler
optimizations:
1. Class A optimizations: these optimizations [4] benefit energy through an
improvement in performance. For a program, the energy consumed is the product
of the average power dissipation per cycle and the number of cycles taken for
completion, so any reduction in cycle count automatically translates into an
improvement in energy consumption. Reductions in the number of loads and
stores, procedure cloning, loop unrolling, loop transformations, etc. help save
power.
2. Class B optimizations: these benefit power but have no impact on
performance. Innovations in instruction scheduling, register pipelining, and
code selection that replace high-power-dissipating instructions with others
are a few examples [4].
3. Class C optimizations: these are bad for power dissipation and energy
consumption. Typically, optimizations that have a negative impact on
performance also have a negative impact on power consumption. Thus, the
highest impact on power is obtained by proper code selection, and properly
scheduling register accesses also helps a great deal in conserving energy [4]. To
a great extent, compiler optimizations for locality and performance translate to
optimizations for power. However, to obtain substantial gains in energy,
innovating microarchitectural features and exposing them to the compiler are
necessary [4].
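A minimal sketch of a Class A optimization: loop unrolling computes the same result with fewer loop iterations, and hence less loop-control overhead (branch and index-update cycles). Here the iteration count stands in for the saved control cycles; the functions and data are illustrative.

```python
# Rolled loop: one element per iteration, so one branch/index update
# per element of the input.

def sum_rolled(data):
    total, iters = 0, 0
    for x in data:
        total += x
        iters += 1
    return total, iters

# Unrolled by 4: the loop body does four additions per iteration,
# quartering the control overhead (length assumed divisible by 4).

def sum_unrolled4(data):
    total, iters = 0, 0
    for i in range(0, len(data), 4):
        total += data[i] + data[i + 1] + data[i + 2] + data[i + 3]
        iters += 1
    return total, iters

data = list(range(16))
```

Both versions return the same sum, but the unrolled loop executes a quarter of the iterations, which is the cycle reduction that Class A optimizations convert directly into energy savings.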
Power Savings in Embedded Processors through
Decode Filter Cache
In embedded processors, instruction fetch and decode consume around
40 percent of the processor power. An instruction filter cache is placed between the
CPU core and the instruction cache to service the instruction stream; power savings
in instruction fetch result from accesses to the small cache. On a hit in the decode
filter cache, fetching from the instruction cache and the subsequent decoding are
eliminated, which saves power in instruction fetch as well as instruction decode.
With this methodology [17], about 35 percent of processor power can be saved at
the expense of only 1 percent performance degradation. Small auxiliary structures
between the instruction cache and the CPU core can reduce instruction fetch power.
An instruction filter cache (IFC) stores multiple instruction cache lines and exploits
both temporal and spatial locality in the instruction stream.
A decode filter cache (DFC) [17] provides decoded instructions to the CPU
core. A hit in the DFC [17] eliminates one fetch from the I-cache and the subsequent
decode, which results in power savings. There is one key difference between the
DFC [17] and the IFC. On an IFC miss, the missing line can be filled directly into
the IFC, so subsequent accesses to that line need only access the IFC. In contrast,
on a DFC [17] miss, the missing line cannot be filled directly into the DFC [17]
because the decoded instructions in that line are not yet available, and hence the
spatial locality is not exploited. To enable instruction fetch power savings on DFC [17]
misses, we use a line buffer in parallel with the DFC [17] to utilize the spatial
locality of the instructions missing from the DFC [17].
There are several problems with the DFC [17], such as the variable widths of
decoded instructions and the performance degradation due to DFC [17] misses. To
make efficient use of cache space, instructions must be classified as cacheable or
uncacheable; only instructions with small decoded widths are cacheable, and a
sectored cache design is used in the DFC [17]. Lastly, an accurate prediction
mechanism can dynamically select the line buffer, the I-cache, or the decode filter
cache as the source of the next instruction. All the above techniques are simple
and very effective; they involve very low hardware cost and, hence, are suitable for
reconfigurable/embedded processors. Research is currently under way to extend the
DFC [17] design concept to multiple-issue processors.
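The DFC behavior described above can be caricatured in a few lines: a hit costs only a small decoded-cache access, a miss pays for an I-cache fetch plus decode, and only instructions with small decoded widths are filled into the DFC. The energy constants, width limit, and trace are arbitrary illustration values, not figures from [17].

```python
# Toy energy model of the decode filter cache (DFC) path.
E_DFC, E_ICACHE, E_DECODE = 1, 5, 4   # arbitrary illustration units
MAX_CACHEABLE_WIDTH = 2               # assumed decoded-width limit

def run(trace, widths):
    """Replay a PC trace; return total energy. widths maps each PC
    to its decoded-instruction width."""
    dfc, energy = set(), 0
    for pc in trace:
        if pc in dfc:
            energy += E_DFC                  # hit: reuse decoded form
        else:
            energy += E_ICACHE + E_DECODE    # miss: fetch + decode
            if widths[pc] <= MAX_CACHEABLE_WIDTH:
                dfc.add(pc)                  # cacheable: fill the DFC
    return energy

widths = {0: 1, 4: 3, 8: 2}                  # pc -> decoded width
loop = [0, 4, 8] * 3                         # small loop run three times
```

The wide instruction at PC 4 is never cached and pays full cost on every pass, while the narrow ones at PCs 0 and 8 hit the DFC after the first iteration; that asymmetry is why the cacheable/uncacheable split matters.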
CHAPTER 6
NETWORK PROCESSORS
Introduction
Network processors are an emerging class of programmable ICs, based on
system-on-chip (SOC) technology, that perform communications-specific functions
more efficiently than general-purpose processors.
First, the increasing need to support sophisticated network applications
led to increasing demand for programmable network components. At the same
time, continuing advances in the IC industry made it possible to implement
several complete processor subsystems on a single chip.
Second, the establishment of the Internet as a worldwide communication
medium, and its ever-increasing demand for bandwidth across applications ranging
from entertainment to e-commerce, were great motivations for the emergence of
network processors, because the Internet drove the need to develop intelligent and
scalable network features and functions in less time than traditional design methods
can support. In addition, network processors provide the intelligence and processing
power to analyze packet headers, look up routing tables [34], classify packets based
on their destination and source addresses and other control information and rules,
and provide queuing and policing of packets.
Another important consideration is bandwidth. Many leading
telecommunications laboratories have already demonstrated terabit-per-second
speeds, and the prediction is that Internet bandwidth will be easily available at very
low cost by the end of 2002. In other words, communication bandwidth and its
price will be much less significant in the future. This bandwidth availability will
likely mean that intensive computational tasks move toward the consumers on the
"intelligent edge" of the network, where a significant amount of future storage,
processing, and network management is expected to take place. Hence, network
processors, which are at the core of this "intelligent edge," gain immense
significance.
Market Trends and Survey
Currently the network market is being driven by three trends:
1. Rapidly increasing network traffic
2. Converging voice and data
3. Migrating legacy systems to new technologies
The convergence of voice and data will play a major role in defining
tomorrow's environment. Currently, transmission of data over IP networks is free,
and since voice communication will naturally follow the path of lowest cost, voice
will inevitably converge with data. Technologies such as voice over IP, voice over
ATM, and voice over frame relay are cost-effective alternatives in this changing
market. However, to ensure that migration to these technologies is possible, the
industry has to offer QOS. Integrating legacy systems is also a crucial concern for
organizations as new products and capabilities become available. To preserve their
investments in existing equipment and software, organizations demand solutions that
allow them to migrate to new technologies without disrupting their current operations.
Emerging technologies will help network equipment vendors meet the challenges
presented by these market trends. Network processors will offer a building block for
a broad spectrum of technology solutions.
Despite the communications downturn, the network-processor (NPU) market
continues to grow rapidly. NPU vendors have announced more than 500 design wins,
and more than 30 companies are marketing or developing network processors. There
is no doubt that this will be a billion-dollar market in just a few years. But with so
many new vendors and products, the NPU market is difficult to get a handle on.
There are no benchmarks, few standards, and little agreement on the best technical
approach for delivering features and performance. Business plans are changing
weekly as acquisitions and partnerships reshape the landscape. The NPU market is
the hottest processor market in existence today. This market, which did not exist a
few years ago, is projected to be a $7.2 billion industry in 2005, with the growth of
NPUs continuing to outpace most other semiconductor markets. Nowhere else can
we find as many buyouts, spin-offs, and true innovations in technology. While the
NPU market is not quite where it had hoped to be by now, it still has a very bright
future ahead of it. The increasing number of networking protocols, the need for
more service-based revenues, and a resurgence of the importance of flexibility are
all factors that will keep the NPU market growing steadily. Besides the NPU
makers, there are many
companies getting deeply involved in one part of network processing or another,
which affects almost every semiconductor maker in the market. Many companies
hope that an NPU design win will lead directly to a switch-fabric win, a
physical-interface win, and perhaps co-processor wins. Ten-Gigabit Ethernet
switches will dominate both unit shipments and revenue through 2005 for
mid-range and high-end network processors.
The overall future for NPUs remains extremely bright, but since their use
depends heavily on properly written software, there has been a more gradual initial
slope. Vendors, such as Agere Systems, Intel Corp., and Motorola, Inc., have
increasingly large market shares for mid-range network processors.
Market Challenges
Eliminating network bottlenecks continues to be a top priority for service
providers. Though routers cause these problems, they are often attributed to lack of
bandwidth, and providers turn to higher-bandwidth solutions. Today, however,
network-processor technologies are being proposed to manage bandwidth resources
efficiently and to provide advanced data services at the wire speeds commonly
found in routers and network application servers.
For remote-access applications, performance, bandwidth on demand, security,
and authentication rank as top priorities. More sophisticated security solutions will
shape the future of remote-access network-processor designs. Further, these will
have to accommodate an increasing number of physical media, such as ISDN, T1,
E1, OC-3 through OC-48 links, and DSL modems.
The network market will continue to be divided into low-end and high-end
volume opportunities. The low end comprises dynamic commodities, similar to
the PC industry. In contrast, the high end is a low-volume, high-margin business
characterized by complex technology and a limited customer base. Network
processors will play a key role in enabling devices across the entire spectrum.
Rise of Network Processors
According to the Linley Group [46], a research firm providing extensive
coverage of the NPU market, network processors have been designed into over
200 communications systems to date. This adoption of network processors in
mainstream communication-system architectures is important for two reasons. First,
it provides evidence that developers are willing to look at outside technologies
even for relatively high-level functionality in their systems; if system designers are
willing to outsource layer-3 processing, they will outsource layers 4 through 7 too.
Second, it is believed that in choosing network processors, system architects have
accepted a bigger hit to performance, in exchange for time-to-market and flexibility
advantages, than they previously experienced with third-party silicon. This creates
a great opportunity for co-processors to pick up the slack where NPUs have
trouble keeping up with line-rate processing [46]. System developers who find that
the network processor they have selected cannot perform every function at line rate
are unlikely to scrap their NPU-based design. More likely, the developer will ask
whether the silicon house that sold them the NPU has suggestions for improving
performance. Smart vendors will be ready with co-processing chips
that can sit next to their NPU and off-load deep packet inspection, session
termination, security functions, or whatever else the designer is looking to accomplish, so
that the system can operate at the required speed.
Parallel Network Processors
Although parallel processing techniques [50] are employed inside a single NP
(e.g., in the processor complex), no true multi-NP approach, in which a set of NPs
cooperates to give the appearance of a single higher-performance NP, has been
presented yet. Load-balancing mechanisms could become the key to an efficient
multi-NP solution, which would dramatically reduce the memory-interface problems
in the times to come.
Network Processors: Building Block for
Enhanced Network Performance
A network processor [50] is a programmable communications integrated
circuit capable of performing one or more of the following functions:
1. Packet classification
2. Packet modification
3. Queue/policy management
4. Packet forwarding
In addition, network processors can provide speed improvements through
architectures such as parallel distributed processing and pipelined processing
designs. These capabilities can enable efficient search engines, increase throughput,
and provide rapid execution of complex tasks. Network processors are expected to
become the fundamental building block of networking equipment, much as CPUs
are for PCs. Typical capabilities offered by a network processor are real-time
processing, storage, security, forwarding, switch-fabric control, and IP packet
handling and learning. Network processors target ISO layers 2 through 5 and are
designed to perform network-specific tasks [50].
The processor-model network processor incorporates multiple general-purpose
processors and specialized logic. Suppliers are turning to this design to provide
scalable, flexible solutions that can accommodate change in a timely, cost-effective
fashion. The programmability these network processors offer eases migration to
new protocols and technologies without requiring new ASIC designs. With
processor-model network processors, network equipment vendors benefit from
reduced nonrecurring engineering costs and improved time-to-market.
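The packet classification and forwarding capabilities discussed in this chapter can be sketched as a first-match rule table over header fields, the kind of decision an NPU makes per packet at line rate. The rule format, field names, and addresses below are illustrative assumptions.

```python
# First-match packet classification over (source, destination) address
# prefixes. An empty prefix matches anything; rules are checked in
# priority order and the first match determines the action.

RULES = [
    # (src prefix, dst prefix, action)
    ("10.0.",  "",         "high-priority-queue"),
    ("",       "192.168.", "drop"),
    ("",       "",         "forward"),   # default rule: always matches
]

def classify(src, dst):
    for src_pfx, dst_pfx, action in RULES:
        if src.startswith(src_pfx) and dst.startswith(dst_pfx):
            return action
    return "forward"                     # unreachable with a default rule
```

A software scan like this is linear in the number of rules; hardware classifiers in NPUs reach line rate by evaluating many rules in parallel, but the matching semantics are the same.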
Functions
Network processors fit into the role once held by general-purpose CPUs or custom
processors in networking equipment and peripherals [50]. They handle the tasks of
processing packets, data streams, or network objects to accomplish the specific task
for which they are programmed.
The functions that an NPU [50] performs can be categorized as:
1. Physical-layer functions: these handle the actual signaling over the
network media connections.
2. Switching and fabric-control functions: these are responsible for directing
traffic inside the device.
3. Packet-processing functions: these handle the processing of all network
protocols.
4. System-control functions: system-control or host-processing functions exist in
nearly all NPUs [50] to manage all the other components of the hardware unit.
Network Processor Designs
Network processor designs can be divided into three main architecture types
[50]:
1. A general RISC-based architecture
2. An augmented RISC-based architecture
3. Network-specific processors
RISC-Based Networking Processor
Several vendors have tried to design a networking processor by integrating
multiple off-the-shelf RISC processors [50] onto a single chip. By definition, the
RISC architecture reduces chip complexity compared to the CISC architecture, and
in a RISC the microcode overhead is eliminated. The drawbacks of this variety of
network processor include the large number of instructions required, the time it
takes to perform complex tasks, and the inability to modify the data path. Hence
this networking processor cannot deliver the processing performance expected.
Though RISC-based networking processors [50] are often deployed in parallel
to produce high speeds, this architecture is still constrained by RISC throughput.
Also, there is a limit to the number of RISC cores that can be incorporated without
overly increasing the system complexity and the size of the chip [50].
Augmented RISC-Based Processors
Here, the available RISC functions are tailored to suit network applications,
and hardware boosters or accelerators are added to speed processing [50]. These
hardware accelerators can copy frames at wire speed to boost performance, but the
accelerators themselves are neither flexible nor programmable [50], so the
limitations of the RISC-based networking processor [50] persist here too. A
combination of ASIC and RISC is employed to overcome some limitations: the
RISC acts as the core, and certain tasks are handed over to the ASICs.
These hard-wired ASICs provide speed but are severely restricted by their
inherent inflexibility; ASICs are bounded by their silicon functions and cannot
support new features [50].
Network-Specific Accelerators
These are a new wave of network processors developed to deliver the
performance required by next-generation networking applications [50]. They
integrate many small, fast processor cores, each tailored to perform a specific
networking task [50]. By optimizing the individual processors for the various
network-intensive tasks, the limitations posed by the previous RISC
implementations can be overcome [50].
Figure 15 illustrates a simple network processor architecture.
[Figure: blocks labeled NP Core, Code, RAM, Buffer, Port, and I/O]
Figure 15. Simple network processor architecture
Future of Network Processors
The ever-increasing demands of networking applications will keep network-processor
research active and advancing. The future looks bright for the network processor,
as it provides a simple, cost-effective solution; network processors will act as
the “intelligent edge” in times to come. Their utility will spread far beyond
networking applications, and they will be used extensively in other areas as well,
helping to impart a measure of “intelligence” to those applications.
CHAPTER 7
IP ROUTING FOR THE NEXT-GENERATION
NETWORK PROCESSORS
Modern networks require the flexibility to support new protocols and network
services without changes in the underlying hardware. Routers with general-purpose
processors can perform data-path packet processing using software that is
dynamically distributed. However, custom processing of packets at link speeds
requires high computational power. Multiple network processors with cache and
memory on a single application-specific integrated circuit are used to overcome
the limitations of traditional single-processor systems. Vendors are proposing
a number of approaches for next-generation Internet router architectures [31].
Most of the resulting processing speed stems from employing the latest high-performance
network processor as the forwarding engine of the router [31], which enhances the
performance of Internet routers.
Over the last decade, the Internet has grown substantially in terms of
continuously increasing traffic and the number of hosts added to the network.
One of the major functions of IP routers [32] is packet forwarding: performing
a routing-table lookup based on the IP destination field in the IP packet header
[32] and identifying the next hop to which the incoming packet should be sent.
Primarily, three approaches have been used for IP route lookup: hardware,
software, and a combination of the two [32].
As the Internet develops at a rapid pace, companies are investing more and
more money in the research and development of fast IP routers that can sustain
many gigabits per second of aggregate traffic [26]. The problems associated with
designing such routers are:
1. The increased number of ports and peak bandwidth of the incoming links
requires millions of packets to be processed by the routing processor.
2. The new router must support features like multicast, QOS [26], voice, security,
etc., which increase the complexity of IP forwarding and hence the time needed
to process packets. Software processing of packets has become the bottleneck in
router design.
To meet these requirements, routers should use more hardware resources and
better algorithms. Although a lot of work has been done on IP routers, most of it
has considered the different components of a router in isolation. Work in this
area has focused primarily on specialized hardware, switching, caches, and more
efficient lookup algorithms [26]. Caching improves overall performance.
Overview of IP Routing Architectures
A generic router architecture comprises:
1. Line cards, which consist of the input/output ports
2. A routing processor, or forwarding engine (a network processor)
3. Switching fabric
Figure 16 presents an overview of IP routing architecture.
[Figure: blocks labeled I/O Ports, Line Cards, Routing Processor, Operating
System, and Management Configuration and Software]
Figure 16. IP routing architecture overview [26]
Line Cards
Line cards [26] perform the inbound and outbound packet processing. Each
packet is assigned an identifier and is broken up into several subpackets when it
arrives at an inbound line card. Its header is sent to the forwarding engine (network
processor) to get routing information [26]. The updated header replaces the old one,
and its routing information is used to queue the entire packet for transfer to the
appropriate destination. When an outbound line card receives the subpackets of a
packet from a switch, it reassembles them into a complete packet and transmits it
over the output link [26].
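The inbound/outbound flow described above can be sketched as a split-and-reassemble round trip. The identifier layout and the tiny subpacket size below are invented for illustration; real line cards segment packets into fixed-size fabric cells with hardware support.

```python
# Illustrative sketch of line-card segmentation and reassembly (invented format).
CELL = 4  # bytes of payload per subpacket (tiny, for illustration only)

def split(packet_id: int, payload: bytes):
    """Inbound side: break a packet into (id, seq, total, chunk) subpackets."""
    chunks = [payload[i:i + CELL] for i in range(0, len(payload), CELL)]
    return [(packet_id, seq, len(chunks), c) for seq, c in enumerate(chunks)]

def reassemble(subpackets):
    """Outbound side: order subpackets by sequence number and rejoin them."""
    subpackets = sorted(subpackets, key=lambda s: s[1])
    assert len(subpackets) == subpackets[0][2], "missing subpacket"
    return b"".join(chunk for _, _, _, chunk in subpackets)

cells = split(7, b"HELLO ROUTER")
print(len(cells))                          # 12 bytes / 4 -> 3 subpackets
print(reassemble(list(reversed(cells))))   # reassembly is order-independent
```

Because each subpacket carries its packet identifier, sequence number, and total count, the outbound card can reassemble correctly even if the switch delivers subpackets out of order.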
Routing or Forwarding Engine
(Network Processor)
The main function of the router is to perform route lookup [26]; i.e., given a
packet with an IP destination address, the router must determine an appropriate
output port for that packet. Table lookup involves a longest-prefix match between
the variable-length destination network address in the packet header and the
entries in the routing table; the entry selected is the one that matches the
destination address in the most bits. A routing-table lookup [26] can be done in
hardware by building a specialized ASIC, but software solutions are more popular
due to their flexibility. A network processor is used as the routing engine [26]
because it makes it possible to describe complex packet-forwarding operations in
a high-level programming language. Thus, a router can acquire new functions simply
by downloading new program code [26], which helps reduce development cost and time.
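The longest-prefix-match rule can be sketched as a linear scan over the routing table. The prefixes and next-hop names below are invented for illustration, and production forwarding engines use compressed tries or hardware lookups rather than this O(n) loop:

```python
import ipaddress

# Hypothetical routing table: (prefix, next hop) pairs; names are invented.
ROUTES = [
    ("0.0.0.0/0", "gateway0"),   # default route
    ("10.0.0.0/8", "port1"),
    ("10.1.0.0/16", "port2"),
    ("10.1.2.0/24", "port3"),
]

def lookup(dst: str) -> str:
    """Return the next hop of the entry matching dst in the most bits."""
    addr = ipaddress.ip_address(dst)
    best_len, best_hop = -1, None
    for prefix, hop in ROUTES:
        net = ipaddress.ip_network(prefix)
        if addr in net and net.prefixlen > best_len:
            best_len, best_hop = net.prefixlen, hop
    return best_hop

print(lookup("10.1.2.3"))   # all four prefixes match; the /24 wins
print(lookup("10.9.9.9"))   # only /0 and /8 match; the /8 wins
```

Note that the default route `0.0.0.0/0` matches everything with a prefix length of zero, so it is chosen only when no more specific entry matches.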
Switching Fabrics
Switching fabrics [26] interconnect the input ports, the output ports, and the
network processor. The most commonly used switching fabrics [26] are buses,
crossbars, and shared memory.
Requirements of Network Processor Architecture
In order to enable efficient packet forwarding operations [26], the network
processor should satisfy the following requirements:
1. Flexibility in multi-processing: the network processor should be able to combine
pipeline processing and parallel processing so that various packet-forwarding
operations can be implemented effectively.
2. Communication capacity, latency, and scalability: for efficient pipeline
processing and distributed state maintenance, the capacity for communication
between processors and memories should be large, and the communication latency
should be low, even as the number of processors and memories increases.
3. Real-time operation and guaranteed performance: the processor should provide
strict real-time operation to guarantee wire-speed packet forwarding and exact
QOS [26] control.
As shown in figure 17, the network processor architecture for IP routing
integrates, on a chip, multiple memory interfaces, on-chip memories, co-processors,
chip interconnects, a data interface, and a number of processing elements, each
consisting of a register file shared by multiple processor cores. The memory
interfaces and the on-chip memories are used for data and packet storage [27].
The co-processors are used for low-level operations. The chip interconnects [27]
are used for coordination between multiple network processors and the host
processor. This network-processor architecture [27] supports very high-speed line
interfaces and advanced QOS [26] control mechanisms, as well as flexible
header-handling operations [27] implemented as software routines, for backbone
routers and switches [27].
[Figure: processing engines with local bridges, an on-chip memory and
interconnect, a memory interface to off-chip memory, a chip interconnect to other
network processors, and a cell/packet interface to the next stage]
Figure 17. Network processor architecture for IP routing [27]
Optimizations for Network Processor Cache
A central issue in the design of network processor architecture is performing
packet classification at wire speed [33]. The network processor cache maintains
the results of previous packet lookups or classification computations for
subsequent reuse. Routing-table lookup is a special case of packet classification
based solely on a packet’s destination address [33]. Some work has been done on
designing compact routing-table lookups [33] and implementing them in hardware.
Alternatively, caching can be used to reduce the number of times the lookup
algorithms [33] are invoked. A major concern with using caching [33] to optimize
the lookup technique, however, is the lack of sufficient locality in packet-address
streams compared with the instruction/data address streams of program execution.
IP-host address caching helps improve cache performance because a lookup has only
a limited number of possible outcomes irrespective of the IP address. The
effective coverage of the network processor cache can be improved only by reducing
capacity/conflict misses and by handling the cache inconsistencies caused by
frequent routing-table updates [33]. To reduce the number of conflict misses, we
can either reduce the deviation in the number of cacheable entries mapped to each
cache set or allow a different number of cache sets to be associated with
different IP-address partitions. These two techniques [33] reduce the cache-miss
ratio to a large extent, which leads to cache optimization in network processors.
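The conflict-miss problem can be illustrated with a toy route-result cache. The set count, indexing functions, and addresses below are invented, and the proposals in [33] tune the set mapping per IP-address partition; a fixed XOR fold is used here only to show how spreading entries across sets reduces conflicts.

```python
# Toy direct-mapped cache of route-lookup results (all parameters invented).
NUM_SETS = 256

def set_index_low_bits(addr: int) -> int:
    """Naive indexing: use only the low-order address bits."""
    return addr & (NUM_SETS - 1)

def set_index_xor_fold(addr: int) -> int:
    """Fold all four octets together so high-order bits also steer placement."""
    return ((addr >> 24) ^ (addr >> 16) ^ (addr >> 8) ^ addr) & (NUM_SETS - 1)

cache = [None] * NUM_SETS  # each entry: (tag, next_hop) or None

def cached_lookup(addr: int, slow_lookup) -> str:
    idx = set_index_xor_fold(addr)
    entry = cache[idx]
    if entry is not None and entry[0] == addr:   # hit: reuse previous result
        return entry[1]
    hop = slow_lookup(addr)                      # miss: run the full lookup
    cache[idx] = (addr, hop)
    return hop
```

For two addresses that differ only in their first octet, such as 10.0.0.1 and 11.0.0.1, low-bit indexing maps both to the same set (a guaranteed conflict), while the XOR fold places them in different sets.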
Network Processors: Challenges/Achievements
To summarize, the challenges facing the future of network processors are:
1. Providing wire-speed performance
2. Integrating with existing software
3. Validating forwarding engine reliability
And the achievements realized to date are:
1. Wire-speed performance up to 10 Gbps
2. Carrier-grade reliability with software/hardware
CHAPTER 8
COMMERCIAL EMBEDDED PROCESSORS:
INDUSTRY PERSPECTIVE
Intel Corp.’s StrongARM® SA-110 Processor
Intel Corp.’s ARM® architecture-compliant StrongARM® SA-110 processor [35] is a
32-bit microprocessor featuring superior power efficiency, low cost, and high
performance. A member of the Intel StrongARM® processor family, it is available
at five speeds, ranging from 100 MHz to 233 MHz, and at power-dissipation levels
from less than 300 mW to under 1,000 mW. In addition to providing impressive
MIPS-per-dollar and MIPS-per-milliwatt advantages, the SA-110 processor offers
compatibility with existing ARM® [35] development tools and operating systems.
The SA-110 processor [35] is ideally suited for a wide range of embedded
applications, including high-bandwidth network switching, intelligent office
machines, storage systems, and remote access devices. It is also a cost-effective
solution for Internet appliances and smart handheld products, such as handheld
personal computers (HPCs) and mobile phones. The SA-110 processor’s [35]
comprehensive development environment, together with Intel’s full-featured 21285
core logic chip for the SA-110 processor [35], helps manufacturers bring products
to market more quickly and cost-efficiently.
SA-110 processors [35] deliver the high performance that embedded applications
need to perform routing calculations, move data in internetworking applications,
improve I/O throughput, and provide PCI connectivity in intelligent I/O
applications. The SA-110 processor [35] can also help fulfill the increasing
performance requirements of smart handheld products while satisfying the stringent
power constraints of battery-operated devices [35].
International Business Machines Corp.’s
PowerPC NP NPe 405™ Processor
The IBM PowerPC NP NPe 405™ is a family of PowerPC 405™-based embedded
processors [3], [36] created for network control in wired communications
applications, such as network routers and switches, servers, and cellular base
stations. These chips are functionally configurable and application-code
compatible with all PowerPC 405™ family chips.
Transmeta Corp.’s Crusoe™ Reconfigurable Processor
Transmeta Corp.’s premier product is the Crusoe™ processor [45]. It is
specially designed for the handheld and lightweight mobile computing market. The
high-performance Crusoe™ processor [45] consumes 60 to 70 percent less power and
runs much cooler than competing chips by transferring the most complex part of a
processor’s job—determining what instructions to execute and when—to software, in
a process called code morphing [45]. Because it enables a battery charge to last
twice as long, this technology allows all-day computing. The Crusoe™ processor is
ideal for Internet devices and the ultra-light mobile PC category due to the
following features:
1. The Crusoe™ has remarkably low power consumption, allowing the processor to
run cooler than conventional chips. Battery life is extended up to a whole day.
2. Its performance is high, optimized for real-life usage patterns. Crusoe™
delivers, whether the user is browsing the web, watching a DVD, or recalculating
a spreadsheet.
3. The Crusoe™ has legacy compatibility, so the user is free to run the applications
and Internet plug-ins of choice.
The Crusoe™ smart processor is a flexible and efficient hardware-software
hybrid that replaces millions of power-hungry transistors with software.
Ultra-light mobile PCs and Internet devices made with Crusoe™ processors
will be among the lightest, fastest, and coolest on the market.
The Architecture
Transmeta Corp.’s Crusoe™ processor enables a new class of low-power
computing and legacy software compatibility by combining the energy efficiency of
a VLIW engine with the versatility of the innovative Code Morphing™ software.
Crusoe™ hardware architecture requires far fewer transistors than conventional
legacy microprocessors, thereby minimizing power consumption. Modern processors,
including Crusoe™, execute several instructions at once to improve performance.
However, a large fraction of the transistor count in traditional processors is
devoted to rearranging instructions for optimal parallel execution. The Crusoe™
very long instruction word (VLIW) design avoids this power and transistor penalty
by implementing such complexities in the Code Morphing™ software. The transistors
thus eliminated allow Crusoe™ to operate at lower power, in addition to making it
easier to design and cheaper to manufacture [45].
Code Morphing Software
Transmeta Corp.’s innovative Code Morphing™ software layer provides the
Crusoe™ processor with legacy compatibility while empowering the complex
microprocessor with the flexibilities typically enjoyed only by traditional software
products. The Code Morphing™ software is designed to dynamically translate
legacy instructions into VLIW instructions for the underlying Crusoe™ hardware
engine. The Code Morphing™ software resides in flash ROM and is the first
application to launch when the Crusoe™ processor is powered up. Upon completion
of its initialization, other system software components, such as the BIOS and
operating system, are loaded in traditional fashion.
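The translate-once, reuse-often behavior behind dynamic translation can be modeled with a toy translation cache. The "instructions" and the uppercase "translation" below are invented stand-ins, not Transmeta's actual x86-to-VLIW representation:

```python
# Toy model of dynamic binary translation with a translation cache.
translation_cache = {}   # maps a block's address to its translated form
translate_count = 0      # counts how often the (expensive) translator runs

def translate(block):
    """Pretend translation: uppercase each op (stands in for VLIW codegen)."""
    global translate_count
    translate_count += 1
    return [op.upper() for op in block]

def execute(pc, block):
    """Translate a block the first time it is seen; reuse the result after."""
    if pc not in translation_cache:
        translation_cache[pc] = translate(block)
    return translation_cache[pc]

hot_block = ["load", "add", "store"]
for _ in range(1000):               # the block is executed 1000 times...
    out = execute(0x400, hot_block)
print(translate_count)              # ...but translated only once
```

This amortization is why translation overhead matters little for hot code paths: the cost of translating a frequently executed block is paid once and reused on every subsequent execution.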
Advantages of the Crusoe™ Processor
Lighter. Transmeta Corp. has designed the Crusoe™ processor predominantly in
software. With the majority of the microprocessor logic implemented in the Code
Morphing™ software, the Crusoe™ processor is designed with fewer logic transistors
than conventional processors and, therefore, is smaller and lighter. In addition
to the Crusoe™ processor’s inherently simple design, Transmeta also provides an
adaptive power-management technology called LongRun that efficiently manages the
CPU’s thermal environment and, in some cases, replaces the need for a traditional
cooling mechanism (such as a fan).
This combination of LongRun thermal management technology and the
Crusoe™ processor’s silicon design simplicity enables the creation of systems that
are both lighter and thinner than those possible with conventional microprocessors,
thereby creating a new class of mobile devices.
Longer. The fundamental issue with mobile devices is battery life—the longer,
the better. Switching a single logic transistor on or off requires a small amount
of energy. As conventional processors grow in transistor count, more and more
electricity is required to operate these increasingly complex designs, which has
a negative effect on a system’s battery life. Designed as a software-based
microprocessor, the Crusoe™ contains fewer power-hungry transistors than
conventional processors and thus requires less electricity to run. This directly
equates to longer battery life.
In addition to their inherently simple design, Crusoe™ processors utilize
Transmeta Corp.’s LongRun power management technology. LongRun is an
adaptive technology that determines the requirements on the processor and delivers
just enough performance to satisfy the workload at hand, without wasting more
energy than is necessary. This intelligence is implemented into the Crusoe™
processor’s Code Morphing™ software [45] layer and improves battery life by:
1. Delivering high performance when needed
2. Conserving power when demand on the processor is low
With an adaptive power-management technology and a processor in which the bulk
of the microprocessor’s functionality is implemented in software, Transmeta Corp.
is taking a different approach to microprocessor design, building energy-efficient
processors that enable a new class of longer-lasting mobile computing
platforms [45].
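The idea of delivering just enough performance can be sketched as choosing the lowest adequate operating point. The frequency/voltage steps below are hypothetical, not Transmeta's actual LongRun tables, and dynamic power is approximated as proportional to f * V^2:

```python
# Hypothetical frequency/voltage operating points (MHz, volts).
STEPS = [
    (300, 1.2), (400, 1.35), (533, 1.5), (667, 1.6),
]

def choose_step(utilization: float):
    """Pick the slowest step whose capacity covers the recent demand.

    utilization: fraction of the top step's capacity the workload needs.
    """
    top_mhz = STEPS[-1][0]
    for mhz, volts in STEPS:
        if mhz >= utilization * top_mhz:
            return (mhz, volts)
    return STEPS[-1]

def relative_power(mhz: float, volts: float) -> float:
    """Dynamic power relative to the top step, using the f * V^2 approximation."""
    return mhz * volts ** 2 / (STEPS[-1][0] * STEPS[-1][1] ** 2)

print(choose_step(0.4))   # light load -> lowest adequate step
print(choose_step(1.0))   # full demand -> top step
```

Because voltage drops along with frequency, running a light workload at the lowest adequate step costs far less than a proportional share of full power, which is the energy saving the paragraph above describes.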
Cooler. Conventional microprocessors consist of 40 million transistors
squeezed onto a sliver of silicon only a few hundred square millimeters in area.
As these chips get denser and faster, they get hot—hot enough to boil water.
Indeed, many industry experts claim that managing heat in mobile designs is now
one of the industry’s top challenges. Mobile computing platforms are limited in
height, weight, and space; therefore, finding a way to cool the device within
such a confined form factor is a huge challenge. Traditional methods to cool a
mobile platform involve either a heat sink or a fan (or both). Designs
accommodating these additional cooling mechanisms [45] typically increase the
thickness of the system and result in a bulkier design.
Transmeta Corp. [45] circumvents this problem by designing a software-
based microprocessor and embedding it with innovative power-saving intelligence—
the LongRun power management technology. The Crusoe™ processor is smaller
than conventional processors and dissipates less heat, thereby enabling the creation
of cooler mobile devices.
Upgradeable. Early in the design of the first Crusoe™ processor [45], it was
noted that a key advantage of the software-based architecture was the ability to
upgrade the processor. Upgrades can be made not only during the design and
verification phase but also for customers and even end-users.
Software upgrades allow Transmeta Corp. [45] to design and bring processors to
market in roughly half the time it takes for standard hardware processors.
Instead of having to tape out a new chip to fix bugs in silicon, which can take
weeks for the new chip to come back, the software can be fixed, recompiled, and
loaded into a system the same day. Often, late in the development cycle of a new
mobile computer, an OEM will run across a problem that is related not to the
processor but to an I/O device. To proceed through qualification and meet
schedules, a software upgrade may be the only answer to getting a product out on
time.
The ultimate goal of software upgrades is to provide new features and
performance enhancements to end-users. Transmeta Corp. is still early in the
development of the technology and in its interaction with customers; therefore,
most enhancements have been made incrementally in the newly arriving second wave
of Crusoe™ ultra-light mobile PCs. More significant upgrades are due to arrive as
the architecture expands with the next generation of Crusoe™ processors.
Performance. The Crusoe™ processor helps enhance multimedia applications like
DVD or Internet-content playback. In fact, in the category of mobile Internet
computers, Crusoe™ processors deliver the highest performance available. But for
a mobile user, peak performance is only part of the story.
Traditionally, desktop processors and systems have been optimized for peak
performance, which is fine as long as unlimited power and cooling are available.
In mobile systems, the situation is quite different: the issue of battery lifetime
enters the picture, and running the processor any faster than necessary to get
the job done wastes energy and reduces battery lifetime.
Unlike conventional processors, Crusoe™ processors are designed for mobile
applications. Advanced power-management techniques maximize battery life by
dynamically matching performance levels to application demand. In this way,
Crusoe™ processors give high speed and unprecedented battery life.
CHAPTER 9
CONCLUSIONS AND FUTURE OF EMBEDDED/
RECONFIGURABLE PROCESSORS
As part of this survey, a detailed introduction to the embedded computing
concept, design methodologies, a market survey, target applications, performance,
and the design flow was given. SOC architectures and their importance for
embedded applications were overviewed. Embedded processors were classified based
on performance metrics, and some of the popular architectures were explored in
detail. The ASIC revolution and the applicability of configurable platforms to
embedded systems were intuitively explained. The various architectures and
applications of embedded processors and their applicability to future technology
were explored, giving a good insight into the embedded-processor concept and its
design phases. The next section presented an extensive and exhaustive analysis of
energy-efficient reconfigurable processors. Network processors, which give a new
dimension to embedded computing, and their utility in IP routing for
next-generation processors were explored in the subsequent sections. The last
section talked briefly about several commercial embedded and reconfigurable
processors being designed and used. The explosive increase in the use of
handheld/portable devices and wireless interfaces will help integrate the basic
computing world with the communication device. This integrated device will be
battery-operated and small enough to be carried around. It will incorporate
multimedia facilities, a good user interface, and many other additional resources
per user requirements. The technology challenges of establishing this paradigm of
integrated devices are nontrivial. In particular, these devices have limited
battery resources, must handle diverse data types, and must operate in
environments that are insecure and unplanned and that show different
characteristics over time.
Traditionally, demanding embedded applications—those driven by portability,
performance, or cost—have required the development of one or more custom
processors or application-specific integrated circuits (ASICs) to meet the design
objectives. However, the development of ASICs is expensive in time, manpower, and
money. In a world now running on “Internet time,” where product life cycles are
down to months and personalization trends are fragmenting markets, this inertia
is no longer tolerable. Existing design methodologies and integrated-circuit
technologies are finding it increasingly difficult to keep pace with today's
requirements. An ASIC-based solution would require multiple design teams running
simultaneously just to keep up with evolving standards and techniques. The answer
to the problem is embedded processors implemented with reconfigurable and SOC
architectures using the various design strategies explored here. This could be
the hope for the future of the ever-demanding technology market.
BIBLIOGRAPHY
[39]. Arthur Abnous, “Low-Power Domain-Specific Processors for Digital Signal
Processing,” Ph.D. thesis, 2001. Available from: http://bwrc.eecs.berkeley.edu/Publications/2001/Theses/L-pwr_domain-spec_processors/Abnous/thesis.pdf.

[37]. Analog Devices, Inc., “Blackfin Architecture.” Available from:
http://www.analog.com/industry/dsp/blackfin.

[1]. Eric Auzas, “Design Considerations for the Embedded PC,” paper presented at
the Embedded Systems Conference West, San Jose, California, Nov. 1995. Available
from: http://www.intel.com/design/intarch/papers/esc_empc.pdf.

[19]. Manish Bhardwaj, Rex Min, and Anantha P. Chandrakasan, “Quantifying and
Enhancing Power-Awareness of VLSI Systems,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, vol. 9, no. 6, pp. 757-72, 2001. Available from:
www.ieee.org.

[26]. L. Bhuyan and H. Wang, “Execution-Driven Simulation of IP-Router
Architectures,” IEEE International Symposium on Network Computing and
Applications, p. 2/11, fig. 1; pp. 145-55, 2001. Available from: www.ieee.org.

[11]. A. Bogliolo, I. Colonescu, R. Corgnati, E. Macii, and M. Poncino, “An RTL
Power Estimation Tool with On-Line Model Building Capabilities,” Universita di
Ferrara, DI, Ferrara, Italy; Politecnico di Torino, DAUIN, Torino, Italy.
Available from: http://patmos2001.eivd.ch/program.

[29]. Massimo Bombana, Nikolav Fominykh, Giulio Gorla, Alexander Kriajev, Boris
Krivosheyin, and Jury Rytchagov, “IP-Based Design of Custom Field Programmable
Network Processors,” IEEE International Conference on Circuits and Systems,
vol. 1, pp. 467-71, 1998. Available from: www.ieee.org.

[47]. Kiran Kumar Bondalapatti, “Modeling and Mapping for Dynamically
Reconfigurable Hybrid Architectures,” Ph.D. thesis, pp. 78-194, fig. 5.1,
Aug. 2001. Available from: http://halcyon.usc.edu/~kiran/work/thesis.pdf.
[4]. L. N. Chakrapani, P. Korkmaz, V. J. Mooney, III, and K. W. F. Wong, “The
Emerging Power Crisis in Embedded Processors: What Can a (Poor) Compiler Do?”
CASES ’01, Nov. 16-17, 2001. Available from:
http://www.ece.gatech.edu/research/codesign/publications/crest/paper/chakrapani_cases2001.pdf.

[32]. Tzi-cker Chiueh and Prashanth Pradhan, “High-Performance IP-Routing Table
Lookup Using CPU Caching,” paper presented to the State University of New York at
Stony Brook, Computer Science Dept. Available from:
http://citeseer.nj.nec.com/cache/papers/cs/.

[2]. Seonil Choi, Ju-wook Jang, Sumit Mohanty, and Viktor K. Prasanna,
“Domain-Specific Modeling for Rapid System-Level Energy Estimation of
Reconfigurable Architectures,” paper presented at the Engineering of
Reconfigurable Systems and Algorithms International MultiConference in Computer
Science, June 24-27, 2002. Available from: http://milan.usc.edu/.

[41]. Cradle Technology, Inc., “UMS Microprocessor Forum.” Available from:
http://www.cradle.com.

[48]. Al Crouch and Jeff Freeman, “Designing and Verifying Embedded
Microprocessors,” IEEE Design & Test of Computers, vol. 14, no. 4, pp. 87-94,
Oct.-Dec. 1997. Available from: www.ieee.org.

[50]. EZChip Technologies, “Network Processor Designs for Next Generation
Networking Equipment,” white paper, pp. 1-4, Dec. 1999. Available from:
http://www.ezchip.com/images/pdfs/ezchip_white_paper.pdf.

[18]. Varghese George, Hui Zhang, and Jan Rabaey, “The Design of a Low Energy
FPGA,” International Symposium on Low Power Electronics and Design, pp. 188-93,
1999. Available from: www.ieee.org.

[33]. Karthik Gopalan and Tzi-cker Chiueh, “Optimizations for Network Processor
Cache,” paper presented to the State University of New York at Stony Brook,
Computer Science Dept. Available from: http://citeseer.nj.nec.com/cache/papers/cs.

[9]. Paul J. M. Havinga, Lodewijk T. Smit, Gerard J. M. Smit, Martinus Bos, and
Paul M. Heysters, “Energy Management for Dynamically Reconfigurable Heterogeneous
Mobile Systems,” Parallel and Distributed Processing Symposium, 15th
International Proceedings, pp. 840-52, 2001. Available from: www.ieee.org.
[10]. Paul M. Heysters, Jaap Smit, Gerard J. M. Smit, and Paul J. M. Havinga,
“Exploring Energy-Efficient Reconfigurable Architectures for DSP Algorithms,”
research supported by the Program for Research on Embedded Systems & Software
(PROGRESS) of the Dutch Organization for Scientific Research NWO, the Dutch
Ministry of Economic Affairs, and the Technology Foundation STW, p. 3/10, fig. 1.
Available from: http://wwwhome.cs.utwente.nl/~havinga/papers/ExplPROGRESS2000.pdf.

[44]. Improv Systems, Inc. Available from: http://www.improvsys.com.

[35]. Intel Corp. Available from:
http://developer.intel.com/design/intarch/papers/index.html.

[36]. International Business Machines Corp. Available from:
http://www-3.ibm.com/chips.

[3]. International Business Machines Corp., “PowerPC NP NPe 405™ Embedded
Processors,” PowerPC data sheet. Available from:
http://www-3.ibm.com/chips/techlib/techlib.nsf/productfamilies/PowerPC_MicroprocessorsandEmbeddedProcessors.

[34]. H. Michael Ji and Ranga Srinivasan, “Fast IP Routing Lookup with
Configurable Processor and Compressed Routing Table,” GLOBECOM ’01, IEEE, vol. 4,
pp. 2373-77, 2001. Available from: www.ieee.org.

[13]. Gerd Jochens, Lars Kruse, and Wolfgang Nebel, “Application of Toggle-Based
Power Estimation to Module Characterization,” OFFIS, Division 1, Embedded
Systems, Oldenburg, Germany. Available from:
http://www.dice.ucl.ac.be/~anmarie/patmos/papers/.

[20]. Meenakshi Kaul and Ranga Vemuri, “Optimal Temporal Partitioning and
Synthesis for Reconfigurable Architectures,” Design, Automation and Test in
Europe, pp. 389-96, 1998. Available from: www.ieee.org.

[46]. Linley Group, “Network Processors.” Available from:
http://www.linleygroup.com/npu/index.html.

[14]. David Maze, Edwin Olson, and Andrew Menard, “Reconfigurable Issue Logic
for Microprocessor Power/Performance Throttling,” Advanced VLSI Computer
Architecture, Spring 2000. Available from:
http://www.cag.lcs.mit.edu/6.893-f2000/project/maze_checklslides.ppt.
[30]. Onat Menzilcioglu and Steven Schlick, “NECTAR CAB: A High-Speed Network Processor,” Proceedings of the International Conference on Distributed Computing Systems, pp. 508-15, May 1991. Available from: http://citeseer.nj.nec.com/context/.
[21]. W. Shields Neely, “Reconfigurable Logic for Systems on a Chip,”
Design, Automation and Test in Europe, p. 340, 1998. Available
from: www.ieee.org.
[7]. Jan M. Rabaey, “Low Power Reconfigurable Multimedia Processors,” final report for MICRO Project #97-142, 1997-98. Available from: http://bufify.eecs.berkeley.edu/IRO/Summary/98abstracts/hui.1.html.
[15]. Jan Rabaey and Marlene Wan, “An Energy-Conscious Exploration Methodology for Reconfigurable DSPs,” Design, Automation and Test in Europe Conference and Exhibition, pp. 341-42, 1998. Available from: www.ieee.org.
[31]. Vinay S. Rathore, “Network Processors: The Next Generation,” Networld Interop, Sept. 10, 2001. Available from: http://www.interop.com/atlanta2001/online_pre/downloads/nps/np_v_rathore.pdf.
[22]. Tony Rybczynski, “New World Routing - Open IP,” Nortel Networks. Available from: http://www.nortelnetworks.com/solutions/financial/collateral/csm_openip_vl.pdf.
[23]. Sartaj Sahni and Kun Suk Kim, “Data Structures for IP Lookup with Bursty Access Patterns,” paper presented to Univ. of Florida, Dept. of Computer and Information Science. Available from: http://www.cise.ufl.edu/~sahni/papers/burstyc.pdf.
[24]. Sartaj Sahni and Kun Suk Kim, “Efficient Construction of Variable-Stride Multibit Tries for IP Lookup,” Symposium on Applications and the Internet (SAINT 2002), pp. 220-27, 2002. Available from: www.ieee.org.
[49]. Manfred Schlett, “Trends in Embedded Microprocessor Design,” IEEE Computer, no. 8, pp. 44-49, Aug. 1998. Available from: www.ieee.org.
[27]. Hideyuki Shimonishi and Tutomu Murase, “A Network Processor Architecture for Flexible QoS Control in Very High-Speed Line Interfaces,” IEEE Workshop on High Performance Switching and Routing, pp. 402-06, 2001; p. 2/5, fig. 1. Available from: www.ieee.org.
[17]. Weiyu Tang, Rajesh Gupta, and Alexandru Nicolau, “Power-Savings in Embedded Processors through Decode-Filter Cache,” Design, Automation and Test in Europe Conference and Exhibition, pp. 443-48, 2002. Available from: www.ieee.org.
[8]. Weiyu Tang, Alexander V. Veidenbaum, and Rajesh Gupta, “Architectural Adaptation for Power and Performance,” Proceedings of the 4th International Conference on ASIC, pp. 530-34, 2001. Available from: www.ieee.org.
[38]. Texas Instruments, “Open Multimedia Applications Platform (OMAP).” Available from: http://www.ti.com/sc/docs/apps/omap.
[45]. Transmeta Corp., “The Transmeta Crusoe™ Processor.” Available from: http://www.transmeta.com/.
[43]. Nick Tredennick, “The Death of DSP,” technical report, pp. 1/10, 4/10, Aug. 17, 2000. Available from: www.dynamicsilicon.com.
[16]. Tim Tuan, “Platform-Based Design of Low-Energy, Hybrid Reconfigurable Fabric for Wireless Protocol Processing,” master’s thesis, 2001. Available from: http://bwrc.eecs.berkeley.edu/Publications/2001/Theses/pltfrm-base_des_L-e_hybrid_reconfig_fabric/Tuan/Tim_Thesis.pdf.
[25]. Tilman Wolf and Jonathan S. Turner, “Design Issues for High-Performance Active Routers,” IEEE Journal on Selected Areas in Communications, vol. 19, no. 3, Mar. 2001.
[12]. Qing Wu, Qinru Qiu, Massoud Pedram, and Chih-Shun Ding, “Cycle-Accurate Macro-Models for RT-Level Power Analysis,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 4, pp. 520-28, 1998. Available from: www.ieee.org.
[42]. David C. Wyland, “The Universal Micro System: Hardware Performance with Software Convenience,” Cradle Technology, Inc. white paper, p. 3/30, Nov. 1999. Available from: http://www.cradle.com/products/pdfs/white_paper_5.0.pdf.
[28]. Xiaoning Nie and Lajos Gazsi, “A New Network Processor Architecture for High-Speed Communications,” IEEE Workshop on Signal Processing Systems, pp. 548-57, 1999. Available from: www.ieee.org.
[40]. Xilinx, Inc., “Platform FPGA.” Available from: http://www.xilinx.com/
xlnx/xweb/xil_publications/_index.jsp.
[6]. Vojin Zivojnovic, Chris Schlager, and Joachim Fitzner, “System-Level Modeling of DSP and Embedded Processors,” Thirty-Second Asilomar Conference on Signals, Systems & Computers, vol. 2, pp. 1730-34, 1998. Available from: www.ieee.org.
[5]. Yervant Zorian, “System-Chip Test Strategies,” paper presented at the Thirty-Fifth DAC Conference, June 1998. Available from: http://www.sigda.org/Archives/ProceedingArchives/Dac/Dac98/papers/.