EFFICIENT PIM (PROCESSOR-IN-MEMORY) ARCHITECTURES FOR
DATA-INTENSIVE APPLICATIONS

by

Jung-Yup Kang

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(ELECTRICAL ENGINEERING (COMPUTER ENGINEERING))

May 2004

Copyright 2004 Jung-Yup Kang
Dedication
To my parents...
Acknowledgements
I am so much blessed and I am so grateful...
First of all, this dissertation would not have been possible without the advisement I
received from my committee members.
My deepest gratitude goes to my advisor, Professor Jean-Luc Gaudiot, who has shown
me so much insight and enthusiasm, and has given me so much encouragement and
support. He has taught me what is important during the course of research and life. He
also introduced me to the joy of flying which allowed me to enjoy the skies of Southern
California in this very special way. It was his support and encouragement that replenished
me with faith to write this dissertation.
I would like to express my gratitude to Professor Sandeep Gupta for listening to my
worries and concerns when I first arrived at USC. It was his assertive comments and his
advice that helped me to solidify mere ideas. I would like to thank Professor Shahabi
for his cooperation and valuable feedback. Also, I would like to express my respect to
Professor Monte Ung for teaching me that whatever happens in the course of life, I win.
Finally, my appreciation goes to Professor Jonckheere for his help and generosity.
I would like to acknowledge the National Science Foundation for supporting my
project. I would also like to mention Wes Hansford and Tom Vernier for supporting
me during the times I spent at MOSIS.
I would like to salute the USC-PDPC and UCI-PASCAL group members for their
valuable feedback and comments. My gratitude also goes to the CREDO cell members
for their support.
My deepest respect goes to my parents. Without their love, teaching, support, and embrace, I would not have been able to endure the long process. My father has shown me so much bravery, strength, persistence, insight, and love that he has always been the person I want to be when I grow up. My mother has given me everything she has and more. It was her dedication and sacrifice that reminded me to carry through the long process.

I am much obliged to my brothers, Sang-Yup and Byoung-Yup, for giving me support and for taking over the role of first son, which I vacated for such a long time while I was studying.

My deepest appreciation goes to my wife, Yon-Hong. She has given me so much love, encouragement, support, and, most importantly, patience. It was her support and patience that allowed me to concentrate on this thesis. Most of all, I am in eternal debt to her for raising Matthew and Rachel mostly by herself. I also would like to show my gratitude to her family for helping her support me.
Lastly, I would like to thank the One who always listens, overlooks, guides and fulfills.
Thank you, GOD.
Contents

Dedication  ii
Acknowledgements  iii
List Of Tables  vii
List Of Figures  viii
Abstract  x

1 Introduction  1

2 Background Study  8
  2.1 The Kernels of Data-Intensive Applications  8
  2.2 Memory Wall Problem  12
    2.2.1 Reasons behind Memory Latency and Bandwidth Problem  14
    2.2.2 Architectural Techniques to Hide Memory Latency  16
  2.3 Processor-In-Memory Techniques  18
  2.4 Related Works  21
    2.4.1 IRAM  22
    2.4.2 Active Pages  24
    2.4.3 FlexRAM  25
    2.4.4 DIVA  26
    2.4.5 Comparisons  26

3 Hardware/Software Co-Design Computing Using PIM  29
  3.1 Computing Paradigm  29
  3.2 Programming  31
  3.3 Synchronization and Interface with the Host Processor  32
  3.4 Architecture  33
    3.4.1 Data-Intensive Computing System Architecture  33
    3.4.2 Computation-In-Memory Module Architecture  34
    3.4.3 Application-Specific Computing Memory Element Architecture  35
  3.5 Execution Flow  36
  3.6 Hardware Cost  37

4 Benchmark Applications for DCS Architecture  39
  4.1 Motion Estimation of MPEG Encoding  39
    4.1.1 Motion Estimation Algorithm and Characteristics  40
      4.1.1.1 Algorithm  40
      4.1.1.2 Characteristics  42
      4.1.1.3 Estimated Computation Complexity  43
    4.1.2 Conventional Execution  44
    4.1.3 Execution in DCS Computing Paradigm  45
      4.1.3.1 Data Placement  45
      4.1.3.2 FSM Controlled Execution  54
      4.1.3.3 Hardware Cost  55
      4.1.3.4 Data Distribution  55
  4.2 The Kernels of BLAST  60
    4.2.1 BLAST Algorithm and Characteristics of its Kernel Operations  61
      4.2.1.1 Algorithm  62
      4.2.1.2 Kernels and Characteristics  63
      4.2.1.3 Estimated Computation Complexity  64
    4.2.2 Conventional Execution Models  67
    4.2.3 Execution in DCS Computing Paradigm  67
      4.2.3.1 Data Placement  67
      4.2.3.2 FSM Controlled Executions  69
      4.2.3.3 Hardware Cost  71
      4.2.3.4 Data Distribution  72

5 DCS Performance Evaluation Methodology  84
  5.1 Simplescalar Simulator  84
  5.2 DCS Architectural Simulator  85
  5.3 Required Tasks Prior to Actual Simulations  86
  5.4 Simulation Steps for Comparing Performance  87

6 Simulation Results and Analysis  89
  6.1 Motion Estimation  89
  6.2 BLAST Kernels  99
  6.3 Analysis Summary  104
  6.4 Advantage of DCS Computing Paradigm  106

7 Summary and Future Opportunities  109

Bibliography  116
List Of Tables

2.1 Memory Bandwidth Comparison  10
2.2 Memory Bandwidth and Computation Requirement for Various Frame Sizes of Motion Estimation  11
2.3 Related PIM Architectures Comparisons  27
4.1 Data Distribution Table for Motion Estimation  57
4.2 Data Arrangement of a Reference Window  77
4.3 BLAST Computation Requirement for Searching Whole Database  79
4.4 Hardware Requirement for FSM Executions in Each ACME for BLAST Kernels  80
6.1 Performance Comparisons for BLAST Kernels Executions for Various System Configurations  100
List Of Figures

2.1 Two different styles of PIM  22
3.1 The DCS Architecture  34
3.2 Internal Organization of CIMM: ACME Connections  35
3.3 Inside of ACME  36
3.4 CIMM Execution Flow  37
4.1 A Reference Window in Frame F(n) for a Macro Block Bi in Frame F(n+1)  42
4.2 Pixels to be Stored in Memory Modules Corresponding to Sub-Frames Shown for a Corner Sub-Frame, a Sub-Frame on an Edge, and a Sub-Frame Internal to the Frame  46
4.3 Dividing a Frame into Sub-Frames with Identified Sections  48
4.4 Reference Window for each Sub-Frame  50
4.5 Corner Portion Sharing by 4 different ACMEs  51
4.6 Side Portion Sharing by 2 different ACMEs  52
4.7 ACME-Memory Data Placement for Motion Estimation  53
4.8 CE for Motion Estimation  73
4.9 FSM for Motion Estimation  74
4.10 CIMM with an Efficient Data Distributor  75
4.11 Modified Memory Tree Decoder for CIMM Data Distribution  76
4.12 Three Steps of BLAST Algorithm  78
4.13 ACME-Memory Data Placement for BLAST  79
4.14 Data Structures for Query Word and Match  80
4.15 FSM to Control Database Scanning  81
4.16 FSM for Match Extension  82
4.17 Memory Tree Decoder for CIMM Data Distribution with Broadcasting  83
5.1 Program Modifications to be Executed on DCS Computing Paradigm  88
6.1 Performance Improvement for Motion Estimation  90
6.2 Memory Access Reduction for Motion Estimation  92
6.3 Performance Improvement for Motion Estimation with Address Generation  96
6.4 Memory Access Reduction for Motion Estimation with Address Generation  97
6.5 CIMM with Data Distribution and Address Generation  98
6.6 Performance Improvement for BLAST using DCS Computing  101
6.7 Memory Access Reduction for BLAST using DCS Computing  102
6.8 Performance Improvement by Number of ACMEs Executing in Parallel for BLAST Kernels  103
Abstract

Data-intensive applications such as media and graphics applications have gained so much importance that they have changed the way processors are now designed. Indeed, the special characteristics of data-intensive applications are not easily matched to the capabilities of general-purpose processors. They also impose a prodigious amount of data transfers and computations. In these data-intensive applications, there are kernel sections which dominate the overall execution times.

The processor-memory performance gap has been increasing and is now the primary obstacle to any performance improvement in computer systems. This performance gap is particularly critical for data-intensive applications, which require a large amount of data transfers between the processor and memory.

Therefore, this dissertation presents a hardware/software co-design computing paradigm that uses an efficient PIM (Processor-In-Memory) architecture to efficiently execute the kernels of data-intensive applications. The computing paradigm used in this dissertation not only reduces the memory latencies and increases the memory bandwidth, but also executes the operations inside of the memory where the data are located, thereby reducing the amount of memory interactions. It also executes the operations in parallel by dividing the memory into small segments and by having each of these segments execute the operations concurrently.

This dissertation also demonstrates that there are data sharing and address generation overheads involved when executing operations in parallel using the computing paradigm. Therefore, several architectural techniques are introduced to overcome such overheads. With the computing paradigm used in this dissertation, the memory-access- and computation-intensive kernel sections of data-intensive applications are executed more efficiently. A reduction of up to 2034x in the number of memory accesses and a performance improvement of up to 439x for the execution of the kernel sections of these data-intensive applications have been obtained in our simulations.
Chapter 1
Introduction
Data-intensive applications such as multimedia workloads require real-time response, intensive continuous data, significant fine-grained parallelism, and very high memory bandwidth [12]. It has also been reported that such workloads have been increasing in importance during the past two decades and that they have become top priorities in the design of modern microprocessors [12].

In such data-intensive applications, there is a small kernel section of the program that dominates the total execution time. This section normally adheres to the "90/10 rule," which states that a section of a program that occupies about 10% of the program absorbs about 90% of its total execution time. This is the basic motivation behind a number of ASIC (Application Specific Integrated Circuit) architectures [50, 45, 37, 25, 44] that were developed to efficiently execute these kernel sections.
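The leverage implied by the 90/10 rule can be made concrete with Amdahl's law. The small C sketch below is our own illustration, not part of the original analysis: it computes the overall speedup obtained when only the kernel section, which absorbs 90% of the execution time, is accelerated.

#include <stdio.h>

/* Overall speedup when a kernel that absorbs 'frac' of the total
 * execution time is accelerated by a factor 'k' (Amdahl's law). */
static double overall_speedup(double frac, double k)
{
    return 1.0 / ((1.0 - frac) + frac / k);
}

int main(void)
{
    /* With the 90/10 rule, frac = 0.9. */
    printf("kernel 10x faster  -> overall %.2fx\n", overall_speedup(0.9, 10.0));   /* ~5.26x */
    printf("kernel 100x faster -> overall %.2fx\n", overall_speedup(0.9, 100.0));  /* ~9.17x */
    return 0;
}

Even a modest acceleration of the kernel section thus captures a large share of the maximum attainable speedup, which is why so much effort is directed at these small sections.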
The operations in these kernel sections (kernel operations or core operations) have several common characteristics. First, they handle small data elements such as pixels, vertices, or frequency/amplitude values [12]. Second, they are very simple operations such as addition, subtraction, comparison, AND operations, and absolute-value finding. Third, they require a large number of data transfers in and out of memory. Lastly, they contain a tremendous amount of inherent data parallelism.
These unique characteristics make general-purpose processors (GPPs) not quite suitable for executing the kernel sections, and this is another reason behind the development of the above-mentioned ASIC hardware modules. Therefore, current microprocessors are normally equipped with some kind of media extensions [12, 4], co-processors, or special-purpose hardware modules.

However, it turns out that these media extensions perform poorly [48, 4] due to the overhead involved in preparing the media extension executions. They also incur large penalties due to the memory wall problem [7, 8, 39], which is caused by the limited chip pinout and the tremendous amount of memory accesses.

The memory wall problem, also known as the "Processor-Memory Performance Gap" problem [39], has now become the primary obstacle to increasing computer system performance [39, 7]. The penalties due to these problems are quite significant because each memory access can cost more than a thousand clock cycles [42, 7, 9].

For data-intensive applications that require a prodigious amount of data transfers, the "Processor-Memory Performance Gap" is thus an obviously serious problem because of the extremely large amount of repeated memory-to-processor data transfers.

Therefore, the four main objectives of this dissertation are:

1. To understand the characteristics of the execution of data-intensive applications: First, the identification of the kernels is imperative. Then, it is important to understand the behavior of the operations involved in the kernels, such as computations and memory access operations. Some issues are:
• What kind of operations are involved?
• What is the complexity of such operations?
• How often are these operations executed?
• Do these operations have parallelism?
• How much parallelism?
• On what operands are these operations performed?
• What are the characteristics of the operands?
• How often do memory accesses happen?
• Can any data locality be exploited?
• Can the current memory technology sufficiently support the bandwidth needs?
• Can the current processors provide computing power to execute the operations?
2. To design a computing paradigm that can efficiently execute the kernels of such data-intensive applications;

3. To design an architectural simulator for such a computing paradigm;

4. To evaluate and analyze the performance of such a computing paradigm.
Therefore, in this dissertation, we present a hardware/software co-design paradigm that uses application-specific PIM (Processor-In-Memory) modules to efficiently execute the kernel sections of data-intensive applications. This approach differs from the conventional hardware/software approach in that the conventional approach suffers from the drawbacks of limited memory bandwidth, caused by the limited number of pins, and of insufficient storage space inside of the hardware modules.
The PIM modules reduce the memory access penalty caused by the large amount of
memory accesses and the limited memory bandwidth. In the computing paradigm used
in this dissertation, the PIM module is segmented and the kernel operations are executed
in each of these segmented PIMs in parallel. Therefore, the most time-consuming and
the most memory bandwidth hungry kernel operations can be executed inside of the
memory where data is located instead of transferring data each time to and from the host
processor.
The key idea behind the PIM architecture used in the computing paradigm of this dissertation comes from the following analysis: since each kernel operation requires only simple computations, a small-sized CE (Computing Element) can be designed to perform the operation efficiently. Due to the small size of such a CE, a number of CEs can be incorporated at appropriate locations within the memory tree. The memory tree is fragmented according to the size of one ACME (Application-Specific Computational Memory Element). Each ACME contains its own CE and memory block, and the CE in each ACME executes its lightweight operations in parallel with the CEs of the other ACMEs. Since this obviates the need to transfer data to the GPP, the amount of data transferred between processor and memory is greatly decreased.
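To make this idea concrete, the sketch below models the partitioned execution in plain C. It is only an illustrative software analogue of the hardware organization, not the design itself; the names NUM_ACME, ACME_BYTES, acme_kernel, and cimm_execute are our assumptions, and each loop iteration stands in for one ACME's CE operating on its private memory block (in hardware, all ACMEs run concurrently).

#include <stdint.h>
#include <stddef.h>

#define NUM_ACME   64        /* hypothetical number of ACMEs in one CIMM    */
#define ACME_BYTES 4096      /* hypothetical size of each ACME memory block */

/* One ACME: a private memory block plus a small result register,
 * standing in for the CE attached to that block. */
typedef struct {
    uint8_t  mem[ACME_BYTES];
    uint32_t result;
} acme_t;

/* The lightweight kernel each CE runs on its own data, e.g. an
 * accumulation over 8-bit elements, the kind of simple operation
 * that makes up the kernels of data-intensive applications. */
static void acme_kernel(acme_t *a, size_t nbytes)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < nbytes; i++)
        acc += a->mem[i];            /* the data never leaves the memory block */
    a->result = acc;
}

/* Host-side view: every ACME executes the kernel on its own segment;
 * only the small per-ACME results ever cross back to the host processor. */
void cimm_execute(acme_t acmes[], size_t nbytes)
{
    for (int k = 0; k < NUM_ACME; k++)
        acme_kernel(&acmes[k], nbytes);
}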
This dissertation makes several contributions as summarized below:
• Identification of the kernels of data-intensive applications and the characteristics of such kernels. The behaviors of the operations involved, the complexity of the operations, the memory access patterns, the sizes of the operands, and the hardware cost are studied. It is observed that the operations are simple operations such as addition, subtraction, absolute-value finding, comparison, and accumulation, and that they are based on small data elements such as pixels, vertices, or frequency/amplitude values, which are normally 8 bits. Our analysis shows that they require a relatively small amount of hardware. It is also observed that the data access patterns are regular and that the memory addresses can be found with a simple stride calculation (a sketch of such address generation follows this list).
• Introduction of a new computing paradigm that uses an efficient PIM architecture to accelerate the execution of such kernels. By moving the data-intensive kernels inside the memory, the tremendous amount of memory interactions between the processor and the memory can be removed, and therefore the possible penalties resulting from the memory wall problem can be eliminated. It is also observed that by dividing the PIM modules into many small pieces and executing the kernels inside each small piece, the inherent parallelism of the kernel operations can be efficiently exploited. A reduction of up to 2034x in the number of memory accesses and a performance improvement of up to 439x for the execution of the kernel sections of the example data-intensive applications can also be observed.
• Presentation of a data replication mechanism to efficiently support the parallel execution of the kernels. It is observed that when executing the kernels in parallel among many small PIM modules, each such module may require data from other small PIM modules. Our analysis shows that if this data sharing were to be handled by the host processor, it could work against the goal of our computing paradigm by increasing the amount of communication (by up to 60%) between the processor and memory. Therefore, we have devised an automatic and intelligent data distribution mechanism to efficiently distribute data to multiple small PIM modules.
• Presentation of an architectural simulator for the computing paradigm used in this research. In order to evaluate the performance of the computing paradigm, we have modified the Simplescalar architectural simulator to simulate our computing paradigm. The memory of Simplescalar has been modified to emulate the PIM module, and the core acts as the host processor. We have used the Simplescalar annotation feature to include new instructions that interface with the PIM modules.
• Presentation of a methodology to implement the computing paradigm. As future work for this dissertation, we have defined a prototype implementation methodology for the PIM architecture.
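As an illustration of the stride-based address generation mentioned in the first contribution above, the sketch below shows how a per-ACME address generator could step through a regular access pattern without help from the host processor. The structure and names (stride_gen_t, next_address, and its fields) are hypothetical and are not the actual hardware interface of the architecture described later in this dissertation.

#include <stdint.h>

/* A minimal stride address generator, of the kind that lets each
 * ACME compute its own operand addresses locally. */
typedef struct {
    uint32_t base;    /* first address of the operand region          */
    uint32_t stride;  /* distance in bytes between consecutive items  */
    uint32_t count;   /* how many addresses remain to be generated    */
    uint32_t next;    /* internal state: the next address to emit     */
} stride_gen_t;

static void stride_init(stride_gen_t *g, uint32_t base,
                        uint32_t stride, uint32_t count)
{
    g->base = g->next = base;
    g->stride = stride;
    g->count = count;
}

/* Returns the next address in the pattern, or 0 when exhausted
 * (assuming, for this sketch, that 0 is never a valid operand address). */
static uint32_t next_address(stride_gen_t *g)
{
    if (g->count == 0)
        return 0;
    uint32_t a = g->next;
    g->next += g->stride;
    g->count--;
    return a;
}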
In Chapter 2, a detailed analysis of the kernels of data-intensive applications is first presented. Second, the memory wall problem is described in more detail, including how it developed and the techniques used to hide it, and then the PIM technique is described. Finally, the related PIM architectures are described.

In Chapter 3, the details of the computing paradigm of this dissertation are presented. First, the programming model for the computing paradigm is described. Second, the synchronization and interface between the host processor and the PIM architecture are explained. Third, the architecture is described, from the system level down to the leaf-level PIM modules. Fourth, the execution flow of the computing paradigm is explained, and lastly, the hardware cost of the PIM architecture is described.

In Chapter 4, the benchmark applications and their kernels (motion estimation of MPEG encoding and the kernels of BLAST) are explained in detail. The conventional execution models are explained, and then the exact executions of the kernels on our PIM module are described.

In Chapter 5, the evaluation methodology is explained: first the architectural simulator built for the computing paradigm of this dissertation, and then the simulation steps taken in order to compare performance. In Chapter 6, the simulation results and their analysis are presented.

In Chapter 7, the summary of this dissertation and suggestions for future work are presented.
Chapter 2
Background Study
2.1 The Kernels of Data-Intensive Applications
In many data-intensive applications, such as image and graphics applications, there is a small kernel (core) section of the program that dominates the execution of the application. As mentioned in Chapter 1, this section normally adheres to the 90/10 rule. For instance, signal and image processing applications often consist of small loops or kernels that dominate the overall processing time [12]. Therefore, many ASIC architectures [50, 45, 37, 25, 44] were developed to execute those kernel sections efficiently.
As mentioned in Chapter 1, these kernel sections have several common characteristics:

• They handle small data elements (normally 8 bits) such as pixels, vertices, or frequency/amplitude values [12]
• They consist of simple operations such as addition, subtraction, comparison, AND operations, and absolute-value finding
• They require a large number of data transfers in and out of memory
• They contain a tremendous amount of inherent data parallelism
One good example is motion estimation in MPEG encoding. Motion estimation is the most time-consuming part of the operations in MPEG: it absorbs about 90% of the total execution time of the entire MPEG encoding process [10, 11, 25]. The operations involved in motion estimation consist of huge numbers of iterations of simple comparisons, subtractions, and additions on small 8-bit pixel data.
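For concreteness, the heart of block-matching motion estimation is a sum-of-absolute-differences (SAD) loop like the C sketch below. It is a generic illustration of the kernel's character (8-bit operands, subtraction, absolute value, accumulation), not code taken from any particular MPEG encoder; the function name sad_16x16 and the stride parameter are our assumptions.

#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 macro block of the
 * current frame and a 16x16 candidate block of the reference frame.
 * 'stride' is the width in bytes of one frame row. */
static uint32_t sad_16x16(const uint8_t *cur, const uint8_t *ref, int stride)
{
    uint32_t sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++) {
            int d = (int)cur[x] - (int)ref[x];   /* 8-bit subtraction          */
            sad += (uint32_t)abs(d);             /* absolute value, accumulate */
        }
        cur += stride;
        ref += stride;
    }
    return sad;
}

The encoder evaluates this loop for every candidate position in the reference window and keeps the candidate with the smallest SAD as the motion vector.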
The operations involved in the kernels tend to impose a great burden on the processor because not only do they require so many computations, but they also require a prodigious amount of data transfers between the processor and the memory. For instance, in the motion estimation case, a frame size of 352 x 288 pixels at 15 frames/sec, with 16 x 16 macro blocks and a 16-pixel displacement, requires a data bandwidth of roughly 3.1 billion bytes/sec. For HDTV applications, whose frames are 1,920 x 1,080 pixels, the numbers increase dramatically. This is greater than what the current high-end memory products can deliver. The maximum bandwidths of the current high-end memory products are shown in Table 2.1 [18].
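The 3.1 billion bytes/sec figure can be reproduced with a back-of-the-envelope estimate. The sketch below assumes an exhaustive full search over a 32 x 32 grid of candidate positions per macro block and counts two byte accesses (one current-frame pixel and one reference-frame pixel) per comparison; these assumptions are ours and are meant only to show the order of magnitude.

#include <stdio.h>

int main(void)
{
    const double width = 352, height = 288;   /* CIF frame                  */
    const double fps = 15;                    /* frames per second           */
    const double mb = 16;                     /* macro block is 16x16        */
    const double positions = 32 * 32;         /* assumed full-search window  */

    double blocks = (width / mb) * (height / mb);       /* 22 x 18 = 396    */
    double bytes_per_block = positions * mb * mb * 2;   /* 2 reads/compare  */
    double bytes_per_sec = blocks * bytes_per_block * fps;

    printf("%.2f billion bytes/sec\n", bytes_per_sec / 1e9);  /* ~3.11      */
    return 0;
}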
Table 2.1: Memory Bandwidth Comparison

Type               Bus Width [Bytes]   Frequency [MHz]   Maximum Transfer Rate [MB/s]
EDO DRAM           4/8                 66                264/528
SDRAM              8                   133               1064
DDR SDRAM          8                   266               2128
D-RDRAM (Rambus)   2                   800               1600

Table 2.2 shows the memory bandwidth and computation requirements of the motion estimation operations for various frame sizes. The frame rates are 15 frames/sec, except for HDTV, which is 30 frames/sec; the macro blocks are 8 x 8 with an 8-pixel displacement. Three of the example frame sizes cannot be supported by the current high-end memories of Table 2.1 (DDR at 2.1 billion bytes/sec), which in turn is far below the 32 billion bytes/sec that HDTV requires from memory. The table also shows that the current high-end processors cannot meet the computation requirements of motion estimation for the larger frame sizes.
This implies that processors must find a way to provide very high memory bandwidth and must tolerate long memory latencies. Hence, many researchers have attempted to increase the size of the cache to reduce the latency and provide higher bandwidth. However, while handling such media data, caches get polluted rapidly, making them less effective for the other tasks in execution, and caches alone are never enough. There have also been attempts to utilize data prefetching and cache bypassing; however, these approaches bring design complexity and side effects. These will be discussed in the next section.

Also, these operations waste computing resources when executed on a conventional GPP, since their operands are small in byte size compared to the operands of the regular functional units of a GPP. For instance, in the motion estimation case, the basic computation size is 8 bits, whereas in current GPPs the operand size is normally 32 or 64 bits.
Table 2.2: Memory Bandwidth and Computation Requirement for Various Frame Sizes of Motion Estimation

Requirements                      352 x 288         800 x 600         1280 x 1024        1920 x 1088
                                  (15 f/sec)        (15 f/sec)        (15 f/sec)         (30 f/sec)
Total Bandwidth Requirement       778 M Bytes/sec   3.6 G Bytes/sec   10.2 G Bytes/sec   32 G Bytes/sec
Total Computation Requirement     389 M inst/sec    1.8 B inst/sec    5.1 B inst/sec     16 B inst/sec
Further, these workloads contain a huge amount of fine-grained data parallelism that cannot easily be exploited by executing on a GPP. For instance, in the motion estimation case, ideally all the pixel comparisons for each macro block can be executed concurrently, and all the macro blocks can be compared with their reference blocks in parallel. However, a GPP offers only limited parallelism.

One other noticeable characteristic of these core operations is that once a piece of data has been used, it is seldom used again. In other words, data is normally streamed in and out: consecutive data elements arrive, are used by the computation, and pass through. This special characteristic does not benefit from the control-flow style of execution of a GPP, whereas a data-flow style of execution with minimal control overhead provides efficient execution for applications with this feature. In such stream-based execution, the cache will not be able to help much, since the data brought into the cache will not be accessed again, resulting in constant cache misses.
All of the above characteristics of data-intensive applications demand an architecture that can tolerate long memory accesses and provide much more parallel execution.
2.2 Memory Wall Problem
It is well known [39] that microprocessor speed has been advancing at a rate of 60% per year, whereas memory speed has gone up at a rate of only 7% per year (ever since the two were separated into different fabrication lines). According to Patterson's projections [42] of the performance gap between processors and memory from 1980 to 2005, the gap grows continuously and the performance difference is expected to reach over 1000x by the year 2005.

This ever-increasing performance gap has been dubbed the "Memory Wall" (or the "Processor-Memory Performance Gap") problem. The difference in performance improvement is mainly due to the two totally different objectives of microprocessor and memory manufacturers: the microprocessor industry has been using every possible approach to improve performance, while the memory industry has focused its effort on increasing memory density. Not surprisingly, memory speed has not kept pace with the speed of microprocessors. Worse yet, memory has been placed in a separate package from the microprocessor, and the off-chip communication has been causing more traffic and delays due to the limited bandwidth, which has worsened the latency. As a result, the memory wall problem has now become the primary bottleneck [39]. In the same vein, it has also been predicted by Burger [7] that a microprocessor may soon be able to issue hundreds or even thousands of instructions while a single memory access is performed. It will not be easy to fill such a large gap with program-flow-independent instructions, even with help from the compiler.
There have been many attempts to bridge the gap by applying architectural techniques (caching, prefetching, speculation, out-of-order execution, and multithreading). However, it has been claimed by Burger [7] that, although all those techniques have somewhat improved performance, they all had adverse effects on memory latency and bandwidth. In fact, his analysis has shown that, for many benchmark programs, the time spent idling the processor because of high memory latency and low bandwidth was almost half of the total execution time [7].

PIM architectures such as IRAM [39], Active Pages [37], FlexRAM [24], and DIVA [16], on the other hand, have demonstrated that by placing memory and processor on the same chip, the memory-related problems can be greatly alleviated: 1) the latency can be reduced dramatically by shortening the distance between memory and logic, thereby eliminating chip-to-chip communications; 2) the bandwidth can be improved dramatically since the potential memory bandwidth, normally wasted by going through a limited number of pins, can be fully exploited by the processors; 3) more storage can be provided, since DRAMs offer 25 to 50 times more storage in the same space than SRAMs [41].

Thus, PIM techniques are extremely attractive for data-intensive applications (such as multimedia), not only for their latency and bandwidth advantages, but also because these applications require a memory capacity that can handle a large volume of data [12].
In the following sections, the memory latency and bandwidth problems and the reasons behind them are explained first. Then, the architectural techniques developed to hide or reduce the latency, and the reasons why these techniques are not efficient, are presented. Next, a short history and introduction of PIM architectures is given. Finally, several related PIM architectures are presented.
2.2.1 Reasons behind Memory Latency and Bandwidth Problem
It has been reported that a new generation of DRAM, which reduces its memory cell area to 40% of that of the prior generation, has been developed every three years. The improvement mainly comes from the minimum feature size shrinking to 70% of its previous value, which reduces the cell area to 49%; the rest comes from improvements in cell design. Overall, there are approximately 3.75x more memory cells in each newer generation of DRAM [18].

Looking into the memory cell design helps to better understand the performance gap between memory and processor. The design goal of a memory cell is to make it as small as possible so that more cells fit into a given die space. A memory cell requires a large capacitance, and memory fabrication processes use many polysilicon layers to form this capacitance. Effectively providing the required capacitance in the memory cell is the most critical part of memory design [18]: the cell capacitance should use as small an area as possible while still having a capacitive value large enough to handle parasitics and noise. Ironically, however, the resistance of these polysilicon layers causes huge RC (resistance-capacitance) delays. Since the objective of the memory process is to provide denser storage cells and thus more bit storage, this delay was not a big concern for the memory manufacturers, at least not until they realized that it had become the bottleneck for the performance of current computer systems [7, 39].
On the other hand, microprocessor manufacturers have used more metal lines to accomplish faster transfers and more silicon resources to gain performance; silicon space was not as important a concern for microprocessor designers as performance. The totally different design goals of these two parties have contributed to making the performance gap wider and wider.

Besides the memory latency problem, there is the problem that memory cannot provide sufficient bandwidth to microprocessors. Memory chips internally have enough bandwidth to keep up with an average high-end microprocessor; however, most of this bandwidth is wasted when the data leaves the memory chip. The number of pins, a limitation imposed by packaging, is the main cause of the bandwidth problem.

Despite advances in packaging technology, its progress has not kept up with the transistor density of memory and the performance increase of microprocessors. It is expected that the bandwidth problems caused by crossing the chip boundary will become the greater limit to high performance [6], and it is obvious that applications that require more data accesses will suffer more from this handicap.

Because of the latency and bandwidth problems, off-chip accesses will in the future be so expensive that all system memory will reside on one or more processor chips [7].
2.2.2 Architectural Techniques to Hide Memory Latency
There are many architectural techniques to hide memory latency. However, many of these latency-tolerance techniques can have adverse effects in terms of the absolute amount of memory traffic, because they fetch more data in order to hide or reduce the latency. In this section, we explain those techniques and the reasons why they may adversely affect memory latency and bandwidth.

Caches have been the major approach for reducing memory latency. By placing a small memory between the processor and the main memory, the usual latency from the processor to the main memory is greatly reduced; these caches are SRAMs located in the processor. By increasing the size of the cache, more useful data is located near the processor, which leads to a reduction in latency. Caches are organized as a first level, a second level, and so on. This hierarchy also helps with the memory latency problem since, if a memory access cannot be fulfilled by a small higher-level cache, the larger lower-level caches may be able to provide the requested data before going out to the off-chip memory.

The main role of the caches is to efficiently maintain the data that the microprocessor will use in the near future. To do so, many architectural techniques have been developed to keep the caches filled with useful data; among them are prefetching, speculation, and multithreading.

Prefetching is used to pre-fill the cache with data that the microprocessor is likely to use in the very near future of program execution. However, it may increase the overall traffic to the main memory: it may prefetch data too early, fill up the cache, and still end up evicting the prefetched data before it can be used, only to bring it back again when it is actually needed. Alternatively, it could force the eviction of data in use in order to keep the prefetched data. Either way, it may cause cache misses and additional memory traffic.
Prefetching combined with speculation techniques may increase memory traffic severely, particularly when the speculation is incorrect. In this case, not only the memory bandwidth but also the speculatively executed work is wasted.

Multithreading is an advanced technique that hides latency by switching to a different thread when the currently executing thread encounters a long-latency event such as a memory stall. However, this context switching increases the total memory traffic, because in order to supply the microprocessor with a ready-to-execute thread that was waiting, threads must be swapped to and from memory. This multi-context swapping leads to an increase in cache misses and in total memory traffic, which may cancel out the effectiveness of the technique.

The above techniques, such as prefetching, speculation, and multithreading, draw their advantages from the use of more on-chip memory and caches, but they all may aggravate the memory bandwidth problem.
Burger has analyzed the effects of memory bandwidth and latency [8], and in particular the correlation between the use of advanced architectural techniques and memory bandwidth and latency. He points out that the techniques reduced the overall execution time for every benchmark program. However, for most benchmark programs, the absolute amount of memory problems in the configurations that use the advanced techniques remained the same as in the configurations without them. Moreover, memory latency and bandwidth became a larger portion of the total execution time in the configurations that use the advanced techniques (even becoming the dominant factor of the overall execution time).

The increased memory bandwidth demand thus becomes a more serious problem for overall performance (it grew to more than 50% of the overall execution time). Indeed, Burger observes that in many cases these aggressive latency-hiding or latency-reducing techniques must be used with discretion, because they have the potential to increase the memory bandwidth requirements.
We argue that these are physical limitations that cannot be efficiently compensated for by higher-level, indirect, architectural solutions that have the potential to bring more memory overhead. To efficiently hide or reduce the memory problem, a more fundamental solution is needed to circumvent it, and we believe that PIM techniques are such a solution.

By implementing PIM technology, memory bandwidth increases dramatically and memory latencies are dramatically reduced. Furthermore, by reallocating memory-access-intensive applications to the processing logic inside of the memory, the interactions between processor and memory can be drastically reduced.
2.3 Processor-In-Memory Techniques
PIM techniques have opened a new pathway to solving the "Processor-Memory Performance Gap." By physically placing the processor adjacent to the memory on the same chip, these styles of computer architecture actively attack the "Processor-Memory Performance Gap" problem and deliver a phenomenal amount of memory bandwidth and reduced memory latency compared to the conventional processor-memory relationship.

The idea of putting logic and memory onto a single chip is not new. However, it was not practicable until recently, when fabrication processes advanced to the point that the desired combination of speed, memory density, and yield could be achieved; we may now have reached an era in which it becomes cost-effective [7].

PIM techniques are also gaining interest for data-intensive applications such as multimedia, not only for their latency and bandwidth advantages, but also because these applications require a memory capacity that can handle a large volume of data [12].
Several research groups have put their efforts into the development of computer architectures using PIM techniques. For instance, Patterson [39] has proposed the IRAM architecture. Kogge [26] has proposed a PIM architecture for the HTMT (Hybrid Technology MultiThreaded) machine. Chong [37] has reported his RADRAM architecture using Active Pages. Torrellas [24] and his group have proposed the FlexRAM architecture.

Industrial organizations have also been manufacturing processors with integrated DRAM. These are DRAM-embedded processors, which embed DRAM and a processor in a single chip. Mitsubishi Electronics developed "eRAM/SI," a hybrid custom LSI chip with built-in DRAM and micro-controller logic, which went into mass production in September 1997. In the year 2000, eRAM, which uses 0.12 um HyperDRAM technology, was projected to embed 256 Mbytes of DRAM and 5 million gates of logic in a single chip [32].
Ever since Mitsubishi’s first success to develop a merged technology that incorporates
high-capacity memory and logic circuits on a single chip, there have been several others
followed. Hitachi has released its process technology which combines logic and DRAM
for multimedia equipments [17]. IBM also started their first combined logic and memory
on a single die in year 1999. It has used its trench-capacitor DRAM technology along
with its copper interconnects to merit its most advanced logic and memory technologies
on a single die [20]. IBM announced its Blue Logic SA-27E which was developed for
high-function and high-density applications [19]. TSMC has also been developing its
embedded high-density memory technologies. PIM technology is also the foundation to
SOC (System-On-Chip) technology. Companies such as Neomagic have been providing
SOC solutions to various multimedia applications. The MiMagic architecture of Neomagic
combines a RISC CPU, memory(DRAM), multimedia engine, and I/O functions all into
a single chip [36].
As listed, there have been many movements in academy and in industry as well to
incorporate the processor into the memory. They all realized the demand for tremendous
amount of memory in their applications and the need for a solution to the “Processor
and Memory Performance Gap” problem.
The idea behind all the PIM architectures instead of Memory-In-Processor architec
tures is that the on-chip memory capacity is greatly increased by using DRAM technology
instead of much less dense SRAM memory cells. It is cited that SRAM to DRAM density
ratio is from 25:1 to 50:1 [29]. By increasing the size of the memory, it will be easier for
the processor to handle big applications.
The advantages of having the PIM technology are:
20
R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission.
• Higher Bandwidth
• Low Latency
• Energy Efficiency (primarily from not going off-chip)
• Memory Size and Width
• Board Space
Despite the current success of PIM technologies, there are potential problems with the techniques, such as: the area and speed of logic in a DRAM process, the area and power impact of increasing bandwidth to the DRAM core, the retention time of the DRAM core when operating at high temperature, applications that are larger than the allocated DRAM, and the yield after combining the memory with the logic.

However, as technology advances, those problems will be minimized, if not eliminated altogether. Therefore, in this dissertation, I concentrate on making the best of the opportunity offered by PIM technology and on designing a computer architecture that benefits most from it, achieving phenomenal performance improvements for a group of data-intensive applications. I believe that PIM technology will become the most important design principle of future computer architectures.
2.4 Related Works
Most PIM architectures can be categorized into two different configuration groups:

• Island-style PIM (Figure 2.1(a)): places the logic immediately next to the memory (IRAM, DIVA).

• Completely-integrated PIM (Figure 2.1(b)): attaches the logic to each small memory block of the memory tree (Active Pages, FlexRAM).

Figure 2.1: Two different styles of PIM: (a) island-style PIM, (b) completely-integrated PIM
Four different PIM architectures are discussed so as to understand how each architecture
tries to solve the memory wall problem and improve the performance.
2.4.1 IRAM
IRAM (Intelligent Random Access Memory) is one of the first PIM architectures introduced to improve performance by increasing the bandwidth between the processor and the memory. IRAM has chosen vector architecture principles to better utilize the increased bandwidth [40]. Memory latency improvements by factors of 5 to 10 and memory bandwidth improvements by factors of 50 to 100 have been reported [40]. It has also been reported that Vector-IRAM outperforms conventional SIMD media extensions and DSP processors by factors of 4.5 to 17 [28].
However, because IRAM keeps the memory separate from the processor, it still requires the data to travel from the memory to the vector processing unit in order to be processed. The overhead involved in the conventional computing paradigm of bringing the data to the processing unit still exists in the IRAM model, although it is minimized.

Compared to island-style structures, completely-integrated PIM structures can provide a far more effective memory bandwidth increase because 1) by placing the logic within each sub-memory module, the latency is greatly reduced, since the distance the data travels is minimized and a smaller memory is faster to access, and 2) all the sub-memory modules are accessed in parallel. This increased effective bandwidth is a critical factor for data-intensive applications such as BLAST.

However, although much parallelism can be achieved by having several pipeline stages running in parallel in Vector-IRAM, the degree of parallelism Vector-IRAM can provide is lower than that of the completely-integrated PIM structures, which enable each sub-memory module to execute in parallel with integrated pipelining capabilities. Data-intensive applications benefit greatly from such increased parallelism.

Since data-intensive applications can benefit more from completely-integrated PIM structures than from island-style PIM structures, this dissertation turns its attention to completely-integrated PIM architectures. The following two examples (Active Pages and FlexRAM) are two such architectures.
2.4.2 Active Pages
Active Pages, a completely-integrated PIM developed by Chong [37], is the architecture that most closely resembles the computing paradigm of this dissertation. The RADRAM (Reconfigurable Architecture DRAM) is a chip module that encapsulates 128 Active Pages, each with a memory subarray of 512 Kbytes [37] and 256,000 transistors of logic. It is a memory system based upon the integration of DRAM and reconfigurable logic (using FPGA). The logic inside an Active Page can change its function to one of the several functions it supports. A 1000x speedup over a conventional memory system was reported for several applications using the RADRAM/Active Page system [37].

Active Pages originally had LUT (Look-Up Table)-level reconfiguration. This very fine-grained reconfiguration costs time and area, as reported for the more recent versions of Active Pages [38], which is why a VLIW version of Active Pages had to be developed. Module-level reconfiguration, instead of bit-level or LUT-level reconfiguration, would have provided better performance and simpler reconfiguration by trading off flexibility. Active Pages also had an inefficient communication medium. For example, in the FPGA version of Active Pages, the host processor must intervene in every communication between the memory modules during execution. This intervention doubles the communication required between the processor and memory, thereby exacerbating the memory wall problem.
2.4.3 FlexRAM
FlexRAM, developed by Torrellas [24], uses a completely-integrated PIM structure to boost the performance. A computer system using FlexRAM would consist of three levels of hierarchy. At the highest level, a 6-issue complex superscalar host processor would manage the overall execution of the program. This host processor would be located outside of the FlexRAM. The second- and third-level processors would be integrated in the FlexRAM. At the second level, a two-way superscalar processor with floating-point support would manage all the third-level processors. At the bottom level, 64 32-bit fixed-point RISC processors would be located and would execute in a parallel fashion. Each third-level processor would be associated with a 1 Mbyte memory bank.
FlexRAM provides communication mechanisms throughout the system hierarchy. The third-level processors have a ring connection which allows the processors to communicate. Systems using FlexRAM are scalable in that several FlexRAMs can be attached to the host processor. An interconnection dedicated to inter-FlexRAM communications is also provided. It has been reported that a system with FlexRAMs gains about 25-40x performance over an equivalent system without FlexRAMs [24].
However, FlexRAM is designed to support general-purpose applications. The 32-bit operations inside the third-level processors may be wasteful for the operations involved in most data-intensive applications, which only require operations on small operands. This is especially true for BLAST, where most operations are performed on bytes.
One interesting point about FlexRAM is that it does not use caches inside the third-level processors, so as to better exploit the potential bandwidth of memory. Instead, it uses
the row buffers of the memory to temporarily store several memory rows. However, it still requires the data to be loaded into registers to be processed.
2.4.4 DIVA
Kogge developed PIM-based building blocks [49, 26, 27] for multiprocessor systems. Based
on these, the DIVA architecture [16] can be used as a single processor and extended to
a scalable multiprocessor. The building blocks of DIVA are island-style PIMs. DIVA
is equipped with a PIM-to-PIM communication mechanism to reduce the communication overhead in the host processor. It is equipped with this feature because its goal
is to support irregular (and data-intensive) applications that require many remote data
accesses.
One other noticeable characteristic of DIVA is that it is targeted at applications which execute extremely heavy-duty operations such as floating-point operations, vector operations, and complete MPP operations in the PIM modules. This is rather different from the general characteristics of data-intensive applications.
2.4.5 Comparisons
The comparison of the above example PIM architectures is shown in Table 2.3. It is practically impossible to directly compare the performance of these PIM architectures since they target different applications and their performance results are measured using different benchmark programs. Thus, the focus is on the purpose, characteristics, complexity, and target applications of their PIM architectures.
Table 2.3: Related PIM Architectures Comparisons

                         IRAM              Active Pages      FlexRAM           DIVA               CIMM
                         (UC Berkeley)     (UC Davis)        (UIUC)            (Notre Dame/ISI)   (USC/UCI)

PIM Type                 Island            Completely        Completely        Island             Completely
                                           Integrated        Integrated                           Integrated

Role of PIM              CPU               Functional        Multiprocessor    CPU                Functional
                                           Unit              (two levels of                       Unit
                                                             processor
                                                             hierarchy)

Characteristics of       Vector            Reconfigurable    General           General            Application-
Logic in PIM             Processing        Logic (FPGA)/     Processing        Processing         Specific
                                           VLIW

Complexity of            Simple to         Simple            Simple to         Complex            Simple
Logic in PIM             Complex                             Complex

Communication            No                No                Yes (ring)        Yes (dedicated     Yes (limited)
Inside PIM                                                                     comm mechanism)

Target Application       General Purpose   Data-Intensive    General Purpose   Scientific and     Data-Intensive
                         and Vector                                            Irregular
                         Processing
First of all, IRAM and DIVA are island-style PIMs whereas Active Pages and FlexRAM are completely-integrated PIMs. As mentioned before, completely-integrated PIMs offer a potential for much greater memory bandwidth improvements and possible performance improvement for applications with tremendous parallelism. Only Active Pages is designed to assist the host processor by executing specific sets of operations. All the other architectures are intended to execute general-purpose applications, thereby replacing the host processor. One other noticeable thing is that Active Pages is the only reconfigurable (and VLIW) architecture.
FlexRAM and DIVA have dedicated communication mechanisms to communicate between PIM modules. However, these two architectures are multiprocessor architectures and, since they share the execution of an application, they require some form of communication. The Active Pages architecture divides the work among many active pages but does not have any inter-active-page communication mechanism. Active Pages relies on the host processor to retrieve data from one Active Page and deliver it to the one requesting it.
Chapter 3
Hardware/Software Co-Design Computing Using PIM
In this chapter, I describe the details of the architecture used in this dissertation. First, the computing paradigm used in this dissertation is explained along with its goal and approach. Then, the programming model for a computer system based on this computing paradigm is described. The detailed system-level and PIM-module-level architectures are explained next. Finally, the general execution flow of the computing paradigm is explained.
3.1 Computing Paradigm
The computing paradigm used in this dissertation (the DCS (Data-intensive Computing System) computing paradigm) is different from the conventional general-purpose computing paradigm. It is based on a hardware/software co-design computing paradigm where the hardware executes the kernels of data-intensive applications and the software executes the rest of the program. Therefore, its main goal is to assist the host processor by executing the most time-consuming (computation-intensive and memory access-intensive) sections of data-intensive applications in hardware (PIM).
The hierarchical approaches used in the computing paradigm are described as follows:
1. Hardware/Software Co-Design:
The first approach is to extract the kernel sections (the most computation-intensive and memory access-intensive sections) from data-intensive applications and execute them in an efficiently designed PIM module. The rest of the application is executed by the host processor.
The execution happens where the data are located instead of bringing the data to the host processor. By doing so, the enormous amount of data transfers can be limited and, therefore, the critical penalties due to these transfers can be reduced. At the same time, the host processor is relieved not only of memory access-intensive but also of computation-intensive kernel operations.
2. Application-Specific Execution:
The second approach is to accelerate the execution inside the PIM modules by having kernel-specific hardware. This is possible because the kernels of data-intensive applications consist of simple operations. It is also appropriate because the kernel operations exhibit a stream-based execution behavior. The control-flow style of execution of general-purpose processors suffers from the overhead of bringing the data to the memory, then to the cache, then to the registers, and finally to the functional units during such execution. In such cases, the cache does not contribute much to the performance due to the lack of data locality. Therefore, the kernel operations inside PIM modules are carried out with kernel-specific Finite State Machines.
3. Parallel Execution:
The third approach is to execute the kernel operations in each sub-memory module in a parallel fashion. The kernel operations have a huge amount of inherent parallelism, and this parallelism can be exploited by executing the kernel operations in many small sub-memory modules in parallel. Since the kernel operations can be executed with specific hardware and in parallel, there will be a tremendous performance improvement.
3.2 Programming
Programming for the DCS computing paradigm is different from the conventional programming style of general-purpose processors. In the DCS computing paradigm, an application is partitioned into two sections: the software part (everything other than the kernel section) and the hardware part (the kernel section). The software part that has to be executed by the host processor is programmed in exactly the same way as in the original program for the application. Only the hardware part (which is executed by the PIM architecture) is separated from the original program and modified in order to be executed in the PIM architecture.
The programmer must specifically write instructions to put the appropriate data into the CIMM (Computation-In Memory Module), and the host processor must signal the CIMM to begin the execution of the kernels once all the necessary data are available in the CIMM. The execution in the CIMM is carried out under the control of a finite state machine (FSM) controller. CIMM shares the address space with the main memory so that CIMM can
be a part of the normal memory for the host processor when not executing the kernel
operations.
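To make the partitioning concrete, the following is a minimal host-side sketch written in C under assumptions that are not part of the actual design: a hypothetical memory-mapped window for the CIMM, hypothetical synchronization-variable addresses, and hypothetical helper names. It only illustrates the style of programming described above.

    /* Minimal sketch of the DCS programming style (illustrative only).
     * CIMM_BASE, CIMM_START_FLAG and CIMM_DONE_FLAG are assumed,
     * memory-mapped locations inside the CIMM address space. */
    #include <stdint.h>
    #include <stddef.h>

    #define CIMM_BASE       ((volatile uint8_t  *)0x80000000u) /* assumed data window   */
    #define CIMM_START_FLAG ((volatile uint32_t *)0x80FFFFF0u) /* assumed sync variable */
    #define CIMM_DONE_FLAG  ((volatile uint32_t *)0x80FFFFF4u) /* assumed sync variable */

    /* The application calls this in place of the original (software) kernel. */
    static void run_kernel_on_cimm(const uint8_t *in, size_t n,
                                   uint8_t *out, size_t m)
    {
        size_t i;
        for (i = 0; i < n; i++)           /* place the working set into the CIMM  */
            CIMM_BASE[i] = in[i];         /* with ordinary store instructions     */
        *CIMM_START_FLAG = 1u;            /* signal the CIMM to start its FSMs    */
        while (*CIMM_DONE_FLAG == 0u)     /* wait for the kernel to finish        */
            ;
        for (i = 0; i < m; i++)           /* read the results back with ordinary  */
            out[i] = CIMM_BASE[i];        /* load instructions                    */
    }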
3.3 Synchronization and Interface with the Host Processor
There are two different kinds of communications involved in the DCS computing paradigm. First, the host processor must communicate with the CIMM. Second, the ACMEs inside a CIMM need to communicate with each other.
The host processor and the CIMM communicate through synchronization variables (specific memory locations) located inside the memory in the CIMM. The host processor writes into a specific memory location to notify the CIMM that it can begin its operation. Then, the CIMM sends "ready" signals to each ACME to begin its operations. After completing the work, each ACME flags a specific memory location (or a special register) to notify the CIMM that it has finished its share of the work. This is explained in more detail in a later section.
The host processor treats the CIMM as a normal memory when not executing the kernel operations. When the CIMM acts as additional memory, its speed and functionality should be compatible with those of any other off-the-shelf memory.
Communication between the ACMEs can be achieved either by a processor-mediated approach or by a distributed-control approach. With a processor-mediated approach, the host processor mediates all communications between the ACMEs. If data from a remote ACME is needed by an ACME, it requests the data from the host processor and, upon receiving the request, the host processor grabs the data from the remote ACME
and delivers it to the requesting ACME. This is a simple approach. However, it is not practical for applications that require frequent communications among ACMEs, because the penalty paid is high: each data request may require up to four inter-chip communications.
With a distributed-control approach, each ACME requests the data directly from the remote ACME. A dedicated network provides the transfer of the data. Although efficient, this kind of protocol is very complex and expensive compared to the former one and requires a dedicated interconnection network protocol to communicate.
The approach used in this research is a modified version of the former because of the cost and the complexity involved with a dedicated communication protocol inside the memory. The approach used in this dissertation is limited (because it does not support direct ACME-to-ACME communications); however, it is cost-effective for data-intensive applications since it can be used to efficiently support simple and regular memory access patterns.
3.4 Architecture
In this section, the computer architecture for the DCS computing paradigm is described
from the system level to the leaf-level PIM modules.
3.4.1 Data-intensive Computing System Architecture
A general DCS (Data-intensive Computing System), a computer system based on the DCS computing paradigm, consists of a state-of-the-art GPP which acts as the host processor, a CIMM which is the PIM that executes the kernels, one or more coprocessors, MM
Figure 3.1: The DCS Architecture
(Main Memory), and I/O devices. All the components are connected by the system bus
(Figure 3.1). The number of components can be varied independently to obtain different
DCS configurations.
CIMM communicates with the host processor through the system bus. The host
processor accesses the memory of a CIMM in the same way as it would access ordinary
DRAM.
3.4.2 Computation-In Memory Module Architecture
CIMM consists of many ACME (Application-specific Computing Memory Element) cells.
Each ACME consists of a CE, an ACME-memory, and an FSM. Figure 3.2 shows the
internal structure of a CIMM and its sub-memory module. CIMM has a normal memory
decoding interconnection as an off-the-shelf memory would have. All the memory modules
within a CIMM are connected via a tree decoder and appropriate data and address busses
to a set of pins of the chip(s) implementing the leaf modules (ACMEs). Via these pins,
the host processor will access the data in any memory module within each ACME as it
would access data in any location of an off-the-shelf memory chip/bank. There are 256
ACMEs in the example CIMM configuration shown in Figure 3.2. However, the number
Figure 3.2: Internal Organization of CIMM: ACME Connections
(and the size) of ACME can be changed according to the requirements of the target
applications and what the technology has to offer.
3.4.3 Application-specific Computing Memory Element Architecture
Figure 3.3 shows the internal details of an ACME. There are two ways to access the ACME-memory: one is from outside of its own module and the other is from its own FSM (when accessed by the FSM, the data go to either the FSM or the CE). The FSM in each ACME governs its overall execution.
Figure 3.3: Inside of ACME
3.5 Execution Flow
The execution flow for kernels in the DCS computing paradigm is depicted in Figure 3.4. The host processor stores all the necessary data to the CIMM using normal store instructions ((1) in Figure 3.4). After all the data are stored, the host processor signals the CIMM to begin the kernel operations ((2)). The CIMM initiates the "ready" signal to each ACME ((3)) and, upon receiving this signal, each ACME executes the operations according to its FSM ((4)). Upon completing its own operations, each ACME notifies the CIMM that it has finished its portion of the work ((5)). After gathering the "DONE" signals from all the ACMEs that were working on the kernel operations, the CIMM notifies the host processor that the results are ready ((6)) for further execution. The host processor retrieves the results using normal memory instructions and continues with the remaining portions of the application.
Figure 3.4: CIMM Execution Flow
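The CIMM-side counterpart of this flow can be pictured with a short behavioural sketch. This is not RTL and not the actual controller: the flag layout, the ACME count, and the structure below are assumptions used only to restate steps (2) through (6) in code form.

    /* Behavioural sketch of the CIMM-level control for steps (2)-(6).
     * The flag layout and ACME count are illustrative assumptions. */
    #include <stdint.h>

    #define NUM_ACME 256

    typedef struct {
        volatile uint8_t ready; /* set by the CIMM control: "begin your FSM"  */
        volatile uint8_t done;  /* set by the ACME when its share is finished */
    } acme_flags_t;

    void cimm_run_kernel(acme_flags_t acme[NUM_ACME],
                         volatile uint32_t *host_start,
                         volatile uint32_t *host_done)
    {
        int i;
        while (*host_start == 0u)           /* (2) wait for the host's signal  */
            ;
        for (i = 0; i < NUM_ACME; i++)      /* (3) send "ready" to each ACME   */
            acme[i].ready = 1u;
        /* (4) each ACME now executes under its own FSM, in parallel.          */
        for (i = 0; i < NUM_ACME; i++)      /* (5) gather the "DONE" flags     */
            while (acme[i].done == 0u)
                ;
        *host_done = 1u;                    /* (6) results are ready for the host */
    }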
3.6 Hardware Cost
According to the year 2002 edition of the ITRS [22] (International Technology Roadmap for Semiconductors), in the year 2003 a 4 Gbit (4.29 Gbit) memory chip is introduced with a chip size of 364 mm². If about half of the chip area is allocated for logic, there can be 2 Gbits of memory and 182 mm² of die space for logic. With MPU production technology in 2003, there can be 153 million transistors in 140 mm² of die space. Projected to 182 mm² of die space, there can be 199 million transistors.
It is known [21] that when dividing Gbit DRAMs into smaller modules, for the current technology, 512 Kbytes is an efficient sub-array size in terms of saving power and reducing latency. Therefore, I have chosen 512 Kbytes for the size of each ACME-memory
and there can be 512 ACMEs with the above technology. If the transistor budget is divided by 512, each ACME can have about 388,000 transistors.
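As a check, the budget follows directly from the figures above:

\[
\frac{2\ \text{Gbits}}{512\ \text{Kbytes}} = \frac{256\ \text{Mbytes}}{512\ \text{Kbytes}} = 512\ \text{ACMEs},
\qquad
153\times 10^{6}\ \text{transistors}\times\frac{182\ \text{mm}^2}{140\ \text{mm}^2} \approx 199\times 10^{6}\ \text{transistors},
\]
\[
\frac{199\times 10^{6}\ \text{transistors}}{512\ \text{ACMEs}} \approx 388{,}000\ \text{transistors per ACME}.
\]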
Chapter 4
Benchmark Applications for DCS Architecture
The DCS computing paradigm is better explained with examples. In order to show how the execution is carried out and to prove the effectiveness of the computing paradigm, a set of data-intensive applications has been evaluated with our computing paradigm: MPEG (Moving Picture Experts Group) encoding and BLAST (Basic Local Alignment Search Tool).
In this chapter, we first identify the kernels of the benchmark applications. Then, we explain the algorithms of the applications and show how they are conventionally executed. Then, we demonstrate how the kernels are executed in the DCS computing paradigm. We do this separately for each application. Let us now first go into the details of MPEG encoding and then BLAST.
4.1 Motion Estimation of MPEG Encoding
MPEG is an increasingly demanding application in the computing world and it has been
examined by many researchers [33, 14, 10, 11, 25, 50, 47]. There are seven identifiable
steps in the family of MPEG encoding algorithms [33, 14] and motion estimation is the
most time-consuming stage: it absorbs up to 90% of the total execution time [10, 11, 25]. The reason is that it deals with a huge amount of data and, consequently, the aggregated penalty due to the memory wall problem [7, 8, 39, 41] is huge.
The importance of motion estimation in MPEG has led to the development of a large number of ASIC (Application-Specific Integrated Circuit) modules [11, 50, 47, 25]. Another reason behind the development of the hardware modules is the unique characteristics of motion estimation operations, which make general-purpose processors not quite suitable. Therefore, most current microprocessors are normally equipped with some kind of media extension [12, 4]. However, it turns out that those media extensions show little performance improvement [48, 4] due to the overhead involved in preparing the execution for the media extension.
4.1.1 Motion Estimation Algorithm and Characteristics
As mentioned above, motion estimation is the kernel section for most MPEG encodings [33, 14]. It is not only massively data-intensive but also a computationally-intensive problem for which parallel computing is quite appropriate. The operations involved in motion estimation consist of a large number of iterations of simple comparisons, subtractions, additions, and absolute value extractions. Generally, these operations deal with small 8-bit pixel data.
4.1.1.1 Algorithm
Among all the techniques used for motion estimation, the block matching motion estimation technique is the most common. First, each frame is divided into small pieces called
macro blocks. Since successive frames are tightly correlated, it tries to find the motion
vectors for each macro block in the current frame from the matching block in the previous
frame (also from the next frame when finding motion vectors for a certain type of frame).
It then delivers the motion vectors instead of sending the full frame data which often has
many redundancies from the previous frame.
In the block matching motion estimation algorithm, video frame number n + 1, say F(n + 1), is partitioned into macro blocks of size b x b. The objective is to find, for each macro block B_i from frame F(n + 1), the b x b pixel area in frame F(n) with which B_i has the best match. The algorithm finds the Mean Absolute Difference (MAD) between the compared blocks. The block with the smallest MAD is then determined to be the best match. Equation 4.1 is used to find the MAD, where a(i, j) is the pixel value in the current frame, b(i + m_i, j + n_j) is the pixel value in the previous frame, and (m_i, n_j) represents the corresponding displacement.
\[
\mathrm{MAD} = \sum_{i}^{M} \sum_{j}^{N} \left| a(i, j) - b(i + m_i,\; j + n_j) \right| \qquad (4.1)
\]
The vector describing the separation, in horizontal and vertical pixels, between the b x b macro block B_i in F(n + 1) and its best-matching b x b pixel area in F(n) is called the motion vector for B_i. Such a motion vector is computed for each B_i in F(n + 1). The search for the best match for B_i is limited to a certain maximum displacement, d. Figure 4.1 shows the macro block and the corresponding reference window.
Figure 4.1: A Reference Window in Frame F(n) for a Macro Block B_i in Frame F(n + 1)
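As an illustration of the search just described, the following is a minimal C sketch of full-search block matching for one macro block. It assumes an 8-bit, row-major frame layout and illustrative parameter names (b for the block size, d for the maximum displacement); it is not the code used in the evaluations.

    #include <stdint.h>
    #include <stdlib.h>

    /* Full-search block matching for one macro block (cf. Equation 4.1).
     * cur, prev: W x H 8-bit frames, row-major.  (bx, by): top-left corner of
     * B_i in the current frame.  Writes the motion vector with the smallest
     * MAD into (*mvx, *mvy); comparing sums is equivalent to comparing means
     * because every candidate uses the same number of pixels. */
    static void full_search(const uint8_t *cur, const uint8_t *prev,
                            int W, int H, int bx, int by, int b, int d,
                            int *mvx, int *mvy)
    {
        long best = -1;
        for (int dy = -d; dy <= d; dy++) {
            for (int dx = -d; dx <= d; dx++) {
                int rx = bx + dx, ry = by + dy;
                if (rx < 0 || ry < 0 || rx + b > W || ry + b > H)
                    continue;                      /* candidate falls outside F(n) */
                long sad = 0;
                for (int i = 0; i < b; i++)
                    for (int j = 0; j < b; j++)
                        sad += labs((long)cur[(by + i) * W + (bx + j)]
                                  - (long)prev[(ry + i) * W + (rx + j)]);
                if (best < 0 || sad < best) {
                    best = sad;
                    *mvx = dx;
                    *mvy = dy;
                }
            }
        }
    }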
4.1.1.2 Characteristics
The common characteristics of data-intensive applications were listed in Chapter 2. Motion estimation has similar characteristics:
• It consists of simple operations (comparison, subtraction, absolute value finding, and accumulation) on small operands (8-bit pixel data).
• Its operations are highly parallelizable. Each macro block can be searched in parallel (no dependency among macro blocks). Each block in a reference window can be searched in parallel (no dependency among blocks in the reference window). Each pixel comparison in each block search can be executed in parallel (no dependency on neighboring pixels). Even for a small frame (352 x 288) with a macro block size of 16 x 16 and a 16-pixel displacement, there are ((22 x 18) macro blocks x 16 x 16 pixels
in each macro block x 32 x 32 reference window positions) ≈ 104 million operations that can be executed in parallel. This number increases as the frame size increases.
• It requires a tremendous amount of memory accesses and bandwidth as frames continuously arrive in a stream-based fashion, causing many block replacements.
Because of these unique characteristics, GPPs are not suitable to execute motion estimation:
• There exists a tremendous amount of parallelism. However, the maximum parallelism that can be obtained from GPPs is only equal to the depth of the pipeline and the width of issue of the superscalar machine.
• Execution in GPPs is highly penalized by the memory wall problem due to the continuously incoming data and the saving of temporary frames, while the cache can hold only portions of a frame (whereas many frames need to co-exist in the memory).
• Execution in GPPs entails bringing the data to the memory, then to the cache, then to the registers, and then to the functional units. This process works against motion estimation because motion estimation needs to process the next frames, which are coming from I/O. This does not utilize the data locality which is, after all, heavily used in the architecture of modern GPPs.
4.1.1.3 Estimated Computation Complexity
It is better to explain the complexity of the computations involved in motion estimation by examples. For example, consider a small case where B_i is 8 x 8 and d = 8. In this
case, the best match is found between a macro block B_i in frame F(n + 1) and every 8 x 8 pixel area in F(n) that lies within a reference window that extends by d = 8 on each side of the location of B_i. In other words, as shown in Figure 4.1, the reference window will be a 24 x 24 pixel area in F(n) whose center corresponds to the location of B_i in F(n + 1). Therefore, the motion vector is found by scanning the macro block inside its reference window.
In a full-search method, the number of pixel comparisons involved in motion estimation grows as the product J x K x L x M x N, where J is the number of macro blocks in the current frame that need motion vectors from the previous frame, K is the number of block rows in the search window of the previous frame, L is the number of block columns in the search window of the previous frame, M is the number of pixel rows in the macro block, and N is the number of pixel columns in the macro block.
Table 2.2 shows the memory bandwidth requirement for motion estimation of various frame sizes. The table also shows the computation requirements.
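Plugging in the small-frame example from Section 4.1.1.2 (a 352 x 288 frame, 16 x 16 macro blocks, and 32 x 32 candidate positions per reference window) gives

\[
J \cdot K \cdot L \cdot M \cdot N = (22 \times 18) \times 32 \times 32 \times 16 \times 16 \approx 1.04 \times 10^{8}
\]

pixel comparisons per frame, which matches the roughly 104 million parallel operations cited earlier.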
4.1.2 Conventional Execution
The high memory bandwidth requirement of motion estimation means that processors must find a way to provide very high memory bandwidth and must tolerate long memory latencies. Hence, many projects have attempted to expand the size of the cache to reduce the latency and provide higher bandwidth. However, the limited size of the cache, coupled with the overhead of data storing and retrieving, prohibits GPPs from meeting the performance goal. There have been attempts at utilizing data prefetch and cache bypass
techniques. However, these approaches have a large design complexity and significant
side effects [7, 8].
GPPs with some kind of special hardware such as MMX were able to execute several
instructions during a single clock cycle. However, it has been reported that MMX and
other media extensions are not as effective as expected because of the overhead involved
in preparing the execution for the media extension [48, 4].
Several ASIC (Application-Specific Integrated Circuit) designs [50, 47] were developed to meet real-time computation requirements. The pipelined architecture proposed
in [50] is commonly used to execute motion estimation, since it reduces the required
memory bandwidth by temporarily storing pixels and scheduling computations in such a
manner that their values are reused until they are no longer needed. However, as Yang [50]
has reported, there are physical limitations (Pin Count) and addressing problems which
prevent cascading many chips together.
4.1.3 Execution in DCS Computing Paradigm
Motion estimation follows the sequence of the DCS computing paradigm execution flow introduced in Figure 3.4. Therefore, in this section, the data placement in ACME-memory and the FSM execution are described in more detail.
4.1.3.1 Data Placement
First, the suitable placement of pixels in each ACME must be prepared. One simple approach is to partition each frame into a power-of-two number of sub-frames and place each sub-frame in the memory modules on which the corresponding ACME operates, as in Figure 4.1. However,
Figure 4.2: Pixels to be Stored in Memory Modules Corresponding to Sub-Frames Shown
for a Corner Sub-Frame, a Sub-Frame on an Edge, and a Sub-Frame Internal to the
Frame.
as shown in Figure 4.1, the computation for the macro blocks near an edge/corner of a sub-frame in F(n + 1) not only requires pixels from the corresponding sub-frame in F(n) but also some pixels from adjacent sub-frames. A dedicated communication mechanism inside the CIMM is a possible solution.
Alternatively, one can replicate some of the pixel data such that the memory modules that belong to each ACME contain all the pixels required for the computation carried out by the ACME. The dotted boxes in Figure 4.2 illustrate the data that need to be placed in each ACME's memory module to cover the reference window for all the macro blocks in a sub-frame. The amount of data to be replicated depends on the frame size, the number of sub-frames, the macro block size (b x b), and the displacement (d). For a 352 x 288 frame size, 16 sub-frames, and d = 8, the overhead is around 16%. However, if d = 16, the overhead increases to about 30%.
Due to constraints on the bandwidth, it might be beneficial to reduce the number of write cycles required when the frame is being loaded into the various ACME-memory modules. This is a problem with which the DCS computing paradigm had to deal, and the approach used to solve it is explained in a later section. For the time being, we will focus on how the data are arranged in each ACME-memory, assuming that each ACME contains all the data that it needs to execute the motion estimation operations.
First, we illustrate with an example how much data (a sub-frame and its reference window) are stored in each ACME. In Figure 4.3, a frame is divided into nine distinct sub-frames. Sub-frame 5 is highlighted with green color in Figure 4.3. There is one more step of division done in Figure 4.3: the sub-frame was divided into sections that are shared by either 4 ACMEs, 2 ACMEs, or 1 ACME (this additional division of a sub-frame is used for the data distribution mechanism introduced in a later section).
Each rectangle (portion of a sub-frame) is named and divided as follows:
• Each sub-frame has a distinct number that represents the sub-frame identity. This distinct number is attached to each rectangle as a superscript. For instance, in Figure 4.3, all the rectangles with 1 as a superscript belong to sub-frame 1, all the rectangles with 2 as a superscript belong to sub-frame 2, and so on.
• Each rectangle represents a portion of a sub-frame. The first letter represents whether the rectangle is located in the top (T), middle (M), or bottom (B) section of a sub-frame. The second letter represents whether the rectangle is located in the left (L), center (C), or right (R) of a sub-frame. As mentioned, the number
Figure 4.3: Dividing a Frame into Sub-Frames with Identified Sections
represents the identity. For instance, B_L^2 (B is for Bottom, L is for Left, and 2 is the sub-frame identity) indicates the Bottom Left portion of sub-frame 2.
• The portions are divided in such a way that each of the corner portions (T_L^n, T_R^n, B_L^n, B_R^n) is the size of one macro block. This sizing is made so that the reference windows are easily obtained: if a sub-frame is divided in such a way, each of the corner portions is shared by 4 ACMEs. Each of the side portions (T_C^n, M_L^n, M_R^n, B_C^n) is shared by 2 ACMEs, and M_C^n is exclusively owned by the nth ACME. For instance, in Figure 4.5, T_L^5 is a corner portion of the sub-frame allocated to ACME 5 and it is shared by ACMEs 1, 2, 4, and 5. A side portion, M_L^5, is shared by ACMEs 4 and 5 (Figure 4.6).
As mentioned before, a simple assignment would be to assign each sub-frame to its corresponding ACME number. For instance, sub-frame 2 would be allocated to ACME number 2 to compute its motion estimations. However, the reference window (the shaded portion of Figure 4.4) for a sub-frame is bigger than the sub-frame itself, and each ACME needs to access data that belong to other ACMEs. For instance, as in Figure 4.4, the boundary rectangles of the reference window for sub-frame 5 (B_R^1, B_L^2, B_C^2, B_R^2, B_L^3, T_R^4, M_R^4, B_R^4, T_L^6, M_L^6, B_L^6, T_R^7, T_L^8, T_C^8, T_R^8, and T_L^9) come from sub-frames that belong to different ACMEs.
Therefore, each ACME should hold a group of macro blocks (a sub-frame, the colored portion of Figure 4.3) and the reference window (the colored portion of Figure 4.4) that corresponds to all the macro blocks in the sub-frame.
Figure 4.4: Reference Window for each Sub-Frame
Figure 4.5: Corner Portion Sharing by 4 different ACMEs
Figure 4.6: Side Portion Sharing by 2 different ACMEs
Figure 4.7: ACME-Memory Data Placement for Motion Estimation
For the evaluations, each frame is divided into 16 x 16 macro blocks (with a displacement of 16) and each ACME has nine macro blocks as a sub-frame for which to find motion vectors. Therefore, each ACME also has to contain a reference window equal in size to 25 macro blocks.
The data arrangement in each ACME-memory is shown in Figure 4.7. Each row of ACME-memory is 256 bytes. Therefore, 8 bits are used for column accessing and 11 bits are used for row accessing. The macro blocks take up 2,304 bytes of memory and the reference window takes up 6,400 bytes of memory. The motion vectors take up 18 bytes. Therefore, the memory space used for storing frame data is below 2%. If there are both backward and forward motion estimations, as for a B-frame, four sets of the reference windows and macro blocks are needed.
4.1.3.2 FSM Controlled Execution
Once the data is ready in each ACME, the FSM of each ACME starts executing motion
estimation.
Yang’s motion estimation ASIC design [50, 47] was borrowed and slightly modified to
design the CE used in each ACME. Figure 4.8 depicts the hardware organization of the
CE that executes the motion estimation operations in each ACME. It is executed in a
pipelined fashion reusing loaded data.
Figure 4.9 shows the FSM that controls the execution of motion estimation. Each motion vector is found in a pipelined fashion and it takes a total of 1051 CIMM clock cycles to find the motion vector for a macro block. It takes 22 cycles (22-stage pipelining) to fill the pipeline and there are 1024 blocks that need to be searched to find the block that most closely resembles the macro block. The FSM works as follows: once each ACME receives a "begin" signal, in St0 it initializes the variables to prepare the execution. In St1, the control checks if there are more macro blocks for which to find motion vectors. If there are more macro blocks, the control goes to St2. Otherwise, it finishes its execution by going to the EXIT state. In St2, it gets the next row of the macro block. In St3, it passes the row (obtained in St2) to the CE to store the row into registers in the CE. If there are more rows in the macro block, it goes back to St2 and continues the cycle until the CE is filled with the macro block. Once the macro block is loaded, it transfers the control to St4 to load reference blocks into the CE. In St4-St5, the process of loading the reference blocks continues. Once the last reference block has been delivered to the CE, the control goes to St6. In St6, it waits until the pipeline is completely finished and then transfers the control to
St1. In St1, it saves the motion vector to the ACME-memory and checks again to see if there are more macro blocks.
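To make the control flow explicit, the following is a behavioural C sketch of this St0-St6 loop. It is not RTL and not the actual FSM: the constants and the helper functions (load_mb_row, push_row_to_ce, load_ref_block, pipeline_done, read_motion_vector, store_motion_vector) are hypothetical stand-ins for the ACME-memory and CE interfaces.

    /* Behavioural model of the per-ACME motion estimation FSM (St0-St6). */
    #define NUM_MACRO_BLOCKS 9     /* nine macro blocks per ACME (Section 4.1.3.1) */
    #define MB_ROWS          16    /* rows per 16 x 16 macro block                 */
    #define NUM_REF_BLOCKS   1024  /* 32 x 32 candidate blocks per macro block     */

    /* Hypothetical hooks standing in for the ACME-memory and CE interfaces. */
    extern void load_mb_row(int mb, int row);
    extern void push_row_to_ce(int row);
    extern void load_ref_block(int mb, int ref);
    extern int  pipeline_done(void);
    extern int  read_motion_vector(void);
    extern void store_motion_vector(int mb, int mv);

    typedef enum { ST0, ST1, ST2, ST3, ST4, ST5, ST6, ST_EXIT } state_t;

    void acme_motion_estimation_fsm(void)
    {
        state_t s = ST0;
        int mb = 0, row = 0, ref = 0;

        while (s != ST_EXIT) {
            switch (s) {
            case ST0:                                    /* initialize                 */
                mb = 0; s = ST1; break;
            case ST1:                                    /* save MV, more blocks?      */
                if (mb > 0) store_motion_vector(mb - 1, read_motion_vector());
                row = 0; ref = 0;
                s = (mb < NUM_MACRO_BLOCKS) ? ST2 : ST_EXIT;
                break;
            case ST2:                                    /* fetch next macro-block row */
                load_mb_row(mb, row); s = ST3; break;
            case ST3:                                    /* hand the row to the CE     */
                push_row_to_ce(row);
                s = (++row < MB_ROWS) ? ST2 : ST4;
                break;
            case ST4:                                    /* fetch a reference block    */
                load_ref_block(mb, ref); s = ST5; break;
            case ST5:                                    /* feed the CE pipeline       */
                s = (++ref < NUM_REF_BLOCKS) ? ST4 : ST6;
                break;
            case ST6:                                    /* drain the pipeline         */
                if (pipeline_done()) { mb++; s = ST1; }
                break;
            default:
                s = ST_EXIT; break;
            }
        }
    }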
4.1.3.3 Hardware Cost
Let us now estimate the hardware cost (in terms of the number of transistors) of the CE which executes motion estimation. According to Rabaey [43], a 1-bit master-slave D flip-flop consumes 12 transistors and a 1-bit full-adder consumes 24 transistors.
Assuming that this flip-flop is used as the basic unit to build the registers used in the FSM and CE and that the full-adder is used to build the adders, subtractors, and comparators, the total number of transistors used by the logic in each ACME is 212,000. The CE is equipped with 16 x 16 tiles of special functional units (Figure 4.8). Each of these units is equipped with an adder, an accumulator, a multiplexer, and a flip-flop. This number is below the transistor budget (which is 388,000) of each ACME.
4.1.3.4 Data Distribution
As mentioned in the previous section, in order to execute the motion estimation operations among many ACMEs in parallel, some frame data located in each ACME must be copied to several neighboring ACMEs. Not only does this cause additional memory usage but also additional data transfers between the host processor and the CIMM. This is a serious problem because it implies up to 30% additional interactions (60% if the data have to be brought into the host processor and then delivered to the appropriate ACME) between the host processor and the memory.
This is exactly the opposite of what the DCS computing paradigm was meant to
obtain: the goal was instead to eliminate the interactions between the processor and the
memory and reduce memory access penalties.
Therefore, we devised an efficient data distribution mechanism that minimizes this overhead of data replication. Figure 4.10 shows a CIMM with a data distribution unit (Data Distributor). The data distributor is accessed only when a piece of data has to be replicated to several ACMEs; other memory accesses to the CIMM are transparent to the data distributor. When a piece of data arrives at a CIMM along with distribution signals (needed to identify which portion of a sub-frame that data belongs to), it is sent to the "data distributor" hardware located on top of the memory tree of the CIMM. This data distributor looks up the distribution table next to the distributor and retrieves distribution address information.
The distribution table holds the ACME addresses and offset addresses that are needed to calculate the actual addresses inside each ACME. Once the distributor obtains this information, it finds the appropriate ACME numbers and the memory addresses for each ACME found. Once the distribution addresses are found, the data is simultaneously sent to the multiple addresses by using the modified decoder depicted in Figure 4.11.
Table 4.1 shows the distribution table for motion estimation. When a frame is divided into sub-frames, there are K sub-frames in a row of sub-frames. Each sub-frame is allocated to consecutive ACMEs. For ACME number N, each pixel datum in sub-frame N, along with its portion information, is looked up in this table to find the other ACME addresses (ACME numbers) and ACME-memory addresses (the address in each ACME, the portion offset in Table 4.1). After finding these sets of values, the corresponding destination
Portion   Portion     Replication #1            Replication #2            Replication #3
          Identity    ACME#   Portion Offset    ACME#   Portion Offset    ACME#   Portion Offset
T_L       0000        N-K-1   25                N-K     22                N-1     10
T_C       0001        N-K     23                -       -                 -       -
T_R       0010        N-K+1   21                N-K     24                N+1     6
M_L       0011        N-1     15                -       -                 -       -
M_C       0100        No Replication Needed
M_R       0101        N+1     11                -       -                 -       -
B_L       0110        N+K-1   5                 N+K     2                 N-1     20
B_C       0111        N+K     3                 -       -                 -       -
B_R       1000        N+K+1   1                 N+K     4                 N+1     16

N : the number of the ACME itself
K : the number of sub-frames in a sub-frame row of a frame

Table 4.1: Data Distribution Table for Motion Estimation
addresses are found by adding these ACME addresses and ACME-memory addresses to the remaining offset addresses of the pixel data.
This table places each pixel datum at the correct ACME-memory addresses, as in Figure 4.4. Each ACME-memory should contain the data of the reference window in order, as in Table 4.2. This table shows how the reference window in Figure 4.4 is stored in ACME-memory. From the top left portion (B_R^1) to the bottom right portion (T_L^9) of the reference window of sub-frame 5, each portion is stored in the ACME-memory in top-to-bottom and left-to-right order.
Let us explain this further with an example. The reference window for the sub-frame located in ACME 5 (the colored portion in Figure 4.3) is shown in Figure 4.4. This reference window includes portions from many other ACMEs (ACMEs 1, 2, 3, 4, 6, 7, 8, and 9). Instead of considering receiving data from other ACMEs, if giving a portion of a sub-frame to other ACMEs is considered, the problem becomes easier to handle: it becomes a matter of dividing a sub-frame into portions that are shared by 1, 2, or 4 ACMEs. This is exactly what the data distribution table does.
When data in B_R^1 is being stored into ACME 1, the data distribution table is accessed to see if there are other ACMEs that have to receive it. It finds that data belonging to a B_R portion should be sent to three additional addresses. According to Table 4.1, the first replication (replication #1) gives the ACME address as 5 (because N = 1 and K = 3) and the offset address as 1. This means that the data also has to be sent to ACME 5 at ACME-memory address position 1.
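The lookup can be pictured with the short sketch below. The table contents mirror Table 4.1 (with the ACME# column stored as a delta relative to N), while the structure and function names, and the small 3 x 3 example, are purely illustrative of the hardware behavior.

    #include <stdio.h>

    #define K 3   /* sub-frames per sub-frame row in the 3 x 3 example frame */

    typedef struct { int dacme; int offset; } repl_t;        /* ACME# delta, portion offset */
    typedef struct { repl_t r[3]; int count; } table_row_t;  /* up to three replications    */

    /* Rows indexed by portion identity 0000..1000 (T_L = 0, ..., B_R = 8), after Table 4.1. */
    static const table_row_t dist_table[9] = {
        { { {-K-1, 25}, {-K, 22}, {-1, 10} }, 3 },  /* T_L */
        { { {-K,   23} },                     1 },  /* T_C */
        { { {-K+1, 21}, {-K, 24}, {+1,  6} }, 3 },  /* T_R */
        { { {-1,   15} },                     1 },  /* M_L */
        { { { 0,    0} },                     0 },  /* M_C: no replication needed */
        { { {+1,   11} },                     1 },  /* M_R */
        { { {+K-1,  5}, {+K,  2}, {-1, 20} }, 3 },  /* B_L */
        { { {+K,    3} },                     1 },  /* B_C */
        { { {+K+1,  1}, {+K,  4}, {+1, 16} }, 3 },  /* B_R */
    };

    int main(void)
    {
        int N = 1, portion = 8;            /* a pixel in the B_R portion of sub-frame 1 */
        const table_row_t *row = &dist_table[portion];
        for (int i = 0; i < row->count; i++)
            printf("replicate to ACME %d, portion offset %d\n",
                   N + row->r[i].dacme, row->r[i].offset);   /* prints ACME 5, 4 and 2 */
        return 0;
    }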
Table 4.2 shows how the reference window (Figure 4.4) is stored in ACME-memory. The portions in Figure 4.4 are stored in top-to-bottom and left-to-right order; the numbering in Table 4.2 is only used to show this order.
Likewise, all the data from other ACMEs are sent to the multiple addresses to which the data must go according to the data distribution table. For the sub-frames that are located either at the corners or along the sides of a frame, the reference windows go beyond the scope of the frame itself. In such cases, some portions of Table 4.2 will have no values because those portions are out of the bounds of the frame. Because of such cases, before loading frame data, each ACME-memory should be initialized to all "0"s. This can be easily achieved by resetting the ACME-memory.
The hardware requirements of the data distributor include only 6 adders, some control logic, and a look-up table. The outputs of the data distributor are the destination memory addresses (1, 2, or 4 addresses). Since every access to the data distributor automatically requires extra time to access the table and perform the 6 parallel additions, there are 1 or 2 extra CIMM cycles of penalty. The extra cycle(s) can be avoided if the distribution and the actual access can be pipelined into two different stages. Furthermore, a series of data in the same portion does not need to access the table again because the information from the table will be the same.
This data distribution can be generalized. By updating the distribution table according to an application's distribution needs, the distributor can support several different applications. Or, if there were several tables as in Figure 4.10, several different applications or different core operations of an application could be supported by simply switching to a different table.
This data distribution mechanism is efficient and automatic. It eliminates multiple sequential data transfers from the processor to the memory, thereby efficiently supporting parallel execution. Although the distribution mechanism is very restricted, the simplicity and regularity of motion estimation data access patterns allow the data distribution to be done at an affordable cost.
However, the host processor must initially generate the ACME-memory address for each pixel in a sub-frame and store that pixel to the corresponding ACME-memory address. The host processor also has to identify the portion of the sub-frame to which the pixel belongs. Only then does the data distributor recognize the necessity of distribution and generate
the additional addresses to which the data must be sent. Therefore, there are additional computations involved in calculating the addresses.
4.2 The Kernels of BLAST
BLAST (Basic Local Alignment Search Tool) [1, 2, 35] is a molecular biology application where a database of protein (or DNA) molecule sequences is compared against a query sequence representing a reference protein molecule so as to uncover similarities. It is a highly time-consuming application simply because it deals with a large database of sequences and hence must perform a large number of memory accesses and computations [2, 34].
In the past, a variety of high-performance machines such as NOWs (Networks Of Workstations) or massively parallel computers [31, 15, 30] have been used to speed up the execution of BLAST. In such systems, each node, which is a general-purpose processor (GPP), is responsible for its share of the database. However, GPPs are not efficient in executing BLAST because they are susceptible to the memory wall problem [8, 7, 6, 39, 41] due to the large number of memory accesses in BLAST. At the same time, their limited concurrency prohibits them from exploiting the tremendous amount of parallelism inherent in BLAST. Executing in GPPs is also wasteful because the basic operand size of the operations of BLAST (one byte) does not fully utilize the 32- or 64-bit functional units of GPPs. Moreover, since the majority of its operations are simple comparisons, accumulations, and subtractions, the complex functional units of GPPs, such as multipliers, dividers, and floating-point units, remain unused during the execution of BLAST.
Therefore, by using the DCS paradigm, the PIM module reduces the memory access penalty caused by the large number of memory accesses by relocating the data-intensive operations to where the data are located instead of transferring the data to be operated upon. It is also more convenient and effective to exploit the inherent parallelism of BLAST with the DCS computing paradigm.
In the next sections, a description of BLAST and its most important characteristics is presented. The estimation of the complexity of its execution and the reasons why GPPs are not efficient are shown as well. Then, its execution with the DCS computing paradigm is presented.
4.2.1 BLAST Algorithm and Characteristics of its Kernel Operations
There are many different versions of BLAST (all from the National Center for Biotechnology Information (NCBI)), each with different goals and characteristics. Some of the common versions include the original BLAST (ungapped), BLAST 2 (advanced ungapped), Gapped BLAST (with gaps), Psi-BLAST (to find very distantly related proteins), Phi-BLAST (to find sequences with a certain pattern), and Mega-BLAST (to compare two large sets of sequences against each other).
There is also a multi-processing version of BLAST and a stand-alone version. These
BLAST algorithms have been under development at NCBI for almost three decades and
all of these BLAST programs are based on the original algorithm designed by Altschul [1].
In this dissertation, the focus is on the original BLAST because it is still being used
and because it is the core of all the other versions. Consequently, the techniques developed
in this dissertation can be applied to the other versions.
4.2.1.1 Algorithm
BLAST [1, 46] is based on a rapid heuristic algorithm which attempts to approximate rather than to find the optimal similarity, as is the case for dynamic programming approaches. The reason BLAST is based on a heuristic algorithm is that, as shown by Altschul, it is impractical to base a search of a large database on a dynamic programming algorithm [1]. However, it is still a significantly time-consuming application since the database is normally quite large. Indeed, the database has been reported to contain about 24 million sequence entries (where a typical protein sequence entry occupies around 250 bytes) as of April 2003 [34], and every year its size increases by a factor of 1.5 to 2 [31].
There are three major steps in the BLAST algorithm (the details of each will be further described in the next section):
1. Compilation of Query Words (Figure 4.12(a)): the first step consists of compiling a list of high-scoring (higher than some threshold "T") query words, using a substitution score matrix table, based on the query sequence. For a protein sequence, one query word is normally three bytes [46, 2].
2. Database Scanning (Figure 4.12(b)): the second step consists of scanning the database (each sequence one by one) with the query words compiled in the first step to detect if there are matches (hits) to the query words at any position in the searched sequence.
3. Match (Hit) Extension (Figure 4.12(c)): the last step consists of extending the matches found in the second step to locate high-scoring pairs (HSPs) that score
over a threshold, S. The extensions occur to the left and to the right of the matches
to find HSPs.
4.2.1.2 Kernels and Characteristics
It has been reported that, among the three steps, the match extension step consumes 92% of the total execution time of the BLAST algorithm [2]. This is the kernel section of BLAST and it consists of a tremendous number of simple comparisons, additions, subtractions, and accumulation operations on 8-bit operands. This is another good example of the 90/10 rule.
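To give a concrete picture of this operation mix (byte loads, additions, comparisons), the following is a minimal, simplified sketch of ungapped hit extension. The score() lookup and the simplified drop-off rule are assumptions for illustration only; this is not NCBI's implementation.

    #include <stddef.h>

    /* Hypothetical substitution-matrix lookup (e.g. BLOSUM), one byte per residue. */
    extern int score(unsigned char a, unsigned char b);

    /* Ungapped extension of a word hit at query position qp / subject position sp.
     * Extends right and then left, keeping the best running score, and stops when
     * the running score drops more than `dropoff` below the best seen so far
     * (a simplified drop-off rule).  The returned best score is what would be
     * compared against the threshold S to decide whether the hit is an HSP. */
    int extend_hit(const unsigned char *q, size_t qlen,
                   const unsigned char *s, size_t slen,
                   size_t qp, size_t sp, size_t wordlen, int dropoff)
    {
        int run = 0, best;
        for (size_t k = 0; k < wordlen; k++)       /* score the seed word itself */
            run += score(q[qp + k], s[sp + k]);
        best = run;

        int right = run;                           /* extend to the right        */
        for (size_t i = qp + wordlen, j = sp + wordlen;
             i < qlen && j < slen && right > best - dropoff; i++, j++) {
            right += score(q[i], s[j]);
            if (right > best) best = right;
        }

        int left = best;                           /* extend to the left         */
        for (size_t i = qp, j = sp; i > 0 && j > 0 && left > best - dropoff; ) {
            i--; j--;
            left += score(q[i], s[j]);
            if (left > best) best = left;
        }
        return best;
    }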
The two major steps (the second and the third steps) of BLAST have characteristics similar to those of other data-intensive applications.
1. They consist of simple operations (comparison, addition, and subtraction) on small operands (1 byte).
2. Their operations are highly parallelizable in the database scanning step, since each query word may scan each sequence concurrently. This means that the potential parallelism is (# of query words) x (# of sequences in the database). Also, in the match extension step, the extension of each match can be executed in parallel.
3. They require a tremendous amount of memory accesses as sequences must be moved from the memory to the processing unit in a stream-based fashion, causing many block replacements.
That GPPs are not suitable to execute the kernels of BLAST becomes obvious when one considers the above characteristics in the light of GPPs' capabilities.
1. There are 24 million sequences and each of these sequences may be searched independently. However, the maximum parallelism we can obtain from GPPs is only
equal to the depth of the pipeline and the width of issue of the superscalar machine.
2. Execution in GPPs is highly penalized by the memory wall problem due to the fact
that the cache can hold only a small fraction of the database at any given time.
3. Execution in GPPs entails bringing the data to the memory, then to the cache, then
to the registers, and then to the functional units. This process is especially wasteful
in BLAST because BLAST needs to search every sequence and hence cannot utilize
locality upon which all modern GPPs rely very heavily for performance.
4. Simple byte-wise operations (comparison, addition, subtraction) in BLAST are not
directly supported by GPPs and are executed by underutilizing a fraction of the
functional units, while many larger functional units remain unused.
4.2.1.3 Estimated Computation Complexity
As mentioned before, it is inefficient to execute the BLAST kernel operations in GPPs. Consider an example scenario where BLAST is executed on a single 2.0 GHz microprocessor. The computing time required for a query sequence to search through the whole database with such a microprocessor is calculated below.
According to Altschul [2], the last step (the match extension step) consumes 92% of the total execution time and the remaining 8% is shared by other operations, including the first and the second steps of the algorithm. The average sequence length is assumed to
be approximately 250 bytes [46] (throughout the calculations, the length of a sequence
whether query or database, is assumed to be 250 bytes).
For the first step (the compilation of the query word list), the query sequence is scanned from the beginning of the sequence to the end (with a window size of 3 bytes and a stride of 1 byte) to generate an initial list of query words. This is shown on the left side of the "Substitution Score Matrix Table" of Figure 4.12(a). At the end of the extraction of the initial query words, 248 initial query words are found. Each of these initial query words is looked up in the substitution scoring matrix table (such as BLOSUM or PAM [1, 46]) to find neighborhood query words that score more than a given threshold score T. For example, if one of the initial query words were "PQG" and T were 13, then the substitution matrix finds the neighborhood query words that score at least 13 by substitution. If the neighboring query words are (PQG with a score of 18, PEG with a score of 15, PRG with a score of 14, PKG with a score of 14, PNG with a score of 13, PDG with a score of 13, PHG with a score of 13, PMG with a score of 13, PSG with a score of 13, PQA with a score of 12, PQN with a score of 12, ...), then the ones with a score equal to or greater than 13 are added to the final list of query words. In general, there are about 50 such query words for each initial query word [46]. The right side of the "Substitution Score Matrix Table" in Figure 4.12(a) shows the ultimately compiled list of query words. Assuming 50 neighborhood query words for each initial query word, there are about 12,500 query words in total that are used in the subsequent database scanning step.
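A sketch of this neighborhood-word generation for one initial word is shown below. The 20-letter amino-acid alphabet is standard, but the blosum() lookup and the emit() callback are placeholders for the actual scoring matrix and the query-word list; the sketch is illustrative rather than NCBI's code.

    /* Generate, for one initial 3-byte query word w, all neighborhood words
     * whose substitution score is at least T (e.g. T = 13).  blosum() stands
     * in for the substitution score matrix lookup; emit() stands in for
     * appending the word to the compiled query-word list. */
    extern int  blosum(char a, char b);
    extern void emit(const char word[3]);

    static const char AA[] = "ARNDCQEGHILKMFPSTWYV";  /* 20 amino acids */

    void neighborhood_words(const char w[3], int T)
    {
        char cand[3];
        for (int a = 0; a < 20; a++)
            for (int b = 0; b < 20; b++)
                for (int c = 0; c < 20; c++) {
                    cand[0] = AA[a]; cand[1] = AA[b]; cand[2] = AA[c];
                    int s = blosum(w[0], cand[0]) + blosum(w[1], cand[1])
                          + blosum(w[2], cand[2]);
                    if (s >= T)
                        emit(cand);   /* e.g. PQG, PEG, PRG, ... for w = "PQG" */
                }
    }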
Assume that it takes 250 cycles (equal to the average length of each sequence) to scan one sequence from the database with a query word. (Actually, it would take many more cycles in a GPP because there is a large overhead, such as bringing the data to the
65
R eproduced with perm ission of the copyright owner. Further reproduction prohibited without perm ission.
memory, then to the cache, and then to the registers). There is also the overhead of
saving information if matches are found. Since it involves comparing only three bytes,
there is also the overhead of controlling these specific compare operations). Then, in
order to scan the entire database with all of the query words, it takes (250 cycles x
12,500 query words x 24 million sequences =) 75 trillion cycles {Tcycles).
Let us also assume that the database scanning step takes up most of the 8% (the remainder of the total execution time of BLAST without the match extension step). If that 8% took 75 trillion cycles, we can extrapolate these figures to estimate the time to execute the other 92% of BLAST. The match extension step, which is 92% of the total execution time, will take 862.5 trillion cycles. In total, it takes 937.5 trillion cycles for a 2.0 GHz processor to execute a BLAST search of the whole database. This is about 5.4 days of dedicated processing for a 2.0 GHz processor (Table 4.3).
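The arithmetic behind Table 4.3 can be reproduced with a few lines of C. The figures come directly from the assumptions stated above (250 cycles per sequence, 12,500 query words, 24 million sequences, an 8%/92% split); nothing here is measured.

#include <stdio.h>

int main(void)
{
    double cycles_per_seq = 250.0;     /* one scan of one sequence      */
    double query_words    = 12500.0;   /* compiled list of query words  */
    double sequences      = 24.0e6;    /* sequences in the database     */
    double clock_hz       = 2.0e9;     /* 2.0 GHz host processor        */

    /* Database scanning (assumed to dominate the 8% "other" portion). */
    double scan_cycles  = cycles_per_seq * query_words * sequences;  /* 75e12    */
    /* Match extension is 92% of the total, scanning roughly 8%.      */
    double total_cycles = scan_cycles / 0.08;                        /* 937.5e12 */
    double ext_cycles   = total_cycles * 0.92;                       /* 862.5e12 */

    printf("scanning : %.1f Tcycles\n", scan_cycles / 1e12);
    printf("extension: %.1f Tcycles\n", ext_cycles  / 1e12);
    printf("total    : %.1f Tcycles (%.2f days at 2.0 GHz)\n",
           total_cycles / 1e12, total_cycles / clock_hz / 86400.0);
    return 0;
}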
During the processing, the whole database (estimated at around 6 Gbytes = 24 million sequences x 250 bytes) has to be brought (multiple times) into memory and into the cache (normally less than 512 Kbytes), and there is not much locality because every sequence has to be processed in turn. This causes a tremendous number of cache misses and therefore of memory operations. The overhead of such memory accessing cycles is not even considered in the above calculations. This huge amount of memory accesses is quite critical because of the memory wall problem, which is further explained in the next section.
In real-life BLAST searches, one would restrict the search to portions of the whole database by choosing specific sub-databases. However, such searches still involve a large database
(normally 1 ~ 2 Gbytes in gzipped format) and thus an extremely large amount of computation and memory accesses.
4.2.2 Conventional Execution Models
Since BLAST involves a large database and is thus quite time-consuming, it has traditionally been implemented for execution on large-scale computer systems such as Networks of Workstations (NOW) and massively parallel computers [31]. LLASAP [15] is one example of this class.
In such networks of workstations or clusters, each node, which is a normal general-purpose processor, is responsible for matching its share of the database against the search sequence. GPPs are not well suited to executing the operations of BLAST, and therefore the focus of this dissertation is on improving the performance of each node of such parallel systems when executing the kernels of BLAST.
4.2.3 Execution in DCS Computing Paradigm
BLAST kernels follow the execution flow for kernels in the DCS computing paradigm in
Figure 3.4. Therefore, the next issues are how the data are arranged and how the kernels
are executed in each ACME.
4.2.3.1 Data Placement
First, let us look at how data are prepared in the DCS computing paradigm. The database is divided into smaller databases (of sizes that fit into each ACME-memory), and each small database is allocated to an ACME so that it can be searched
separately and concurrently. Each of these small databases may have a different size,
because each sequence (which is unique) can be of a different length.
After the host processor completes the compilation of the list of query words, the list is sent to the CIMM. The CIMM needs to send this list to each ACME. The CIMM also receives the query sequence and the substitution scoring table and must send them to each ACME as well. Figure 4.13 shows the data placement within each ACME-memory. First, the query sequence is stored at address 0, then the scoring table, and so on, as shown in the figure. 20,000 bytes of memory space is allocated for sequences from the database. It has been reported that in the "E. coli genomic nucleotide sequences" database (the small example database given by NCBI) the longest sequence is 560 bytes. Therefore, assuming that any sequence can be up to 560 bytes, about 35 sequences can reside in 20,000 bytes of memory space.
The data structure used for query words during the execution was defined using the syntax of the "C" programming language, as in Figure 4.14(a). Each such structure requires 8 bytes. Since there can be up to 12,500 query words for a query sequence, the list of query words may consume 100,000 bytes.
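The 8-byte figure can be checked against a packed version of the query-word record of Figure 4.14(a). The packing pragma and the int32_t type below are additions made for this sketch (a compiler would otherwise pad the record to 12 bytes); the dissertation simply quotes 8 bytes per record.

#include <stdio.h>
#include <stdint.h>

/* Packed rendering of the query-word record; 3 + 4 + 1 = 8 bytes. */
#pragma pack(push, 1)
struct query_word {
    unsigned char word[3];   /* the 3-letter query word      */
    int32_t       pos;       /* byte position in the query   */
    unsigned char score;     /* initial score of the word    */
};
#pragma pack(pop)

int main(void)
{
    printf("query_word: %zu bytes\n", sizeof(struct query_word));      /* 8       */
    printf("12,500 words: %zu bytes\n", 12500 * sizeof(struct query_word)); /* 100,000 */
    return 0;
}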
After each ACME safely receives all the data, the host processor signals the CIMM to begin executing the database scanning step of the BLAST algorithm. After receiving the "begin" signal from the host processor, the CIMM signals each ACME to begin its execution of database scanning on its share of the database. During the scanning of its share of the database, each ACME stores each match found into a part of its ACME-memory, shown in Figure 4.13 as "matches." The data structure for each match is defined in
Figure 4.14(b). Each match data structure requires 32 bytes, and the number of matches varies from execution to execution (depending upon the sequences in the database and the query words); therefore, matches are located at the top of the ACME-memory and appended one after another. After all matches are found, the match extension step of the BLAST algorithm begins execution in each ACME.
4.2.3.2 FSM-Controlled Executions
Now, let us examine the FSMs that control database scanning and match extension in more detail. Figure 4.15 shows the FSM that controls the database scanning step. Figure 4.16 shows the FSM for the match extension step. Both of these FSMs access the ACME-memory, retrieve a row (there is no column access), and store the row into temporary registers (of exactly the size of a memory row) in the CE. The FSMs then use the temporary registers to obtain the necessary data and continue with their executions. The states where decisions are made are circled with thicker lines. The description that overlaps each state describes the actions taken in that state. All the states numbered in hundreds (such as St200, St300, St400, ...) are memory access states.
Briefly describing the steps in the FSM for database scanning (Figure 4.15): in St0, if there are more query words that need to be scanned against the database sequences, the control goes to St1; otherwise the control exits the FSM. In St1, if there is a need to access the ACME-memory for the next query word, the control goes to St200 and accesses the ACME-memory for the next set of query words. (16 query words are accessed at once. Recall that the ACME-memory is accessed only by rows and each row is saved into a temporary register. One row is 128 bytes and there can be 16 query words within 128
bytes.) Otherwise, the control goes to St2 and obtains the next query word from the temporary register holding the query words. In St3, if there are more sequences in the database, the control goes to St4; otherwise, the control goes to St0 to start a new scan with a new query word.
Continuing with the FSM, in St4, if there is a need to access the ACME-memory for the next database sequence, the control goes to St300; otherwise, it goes to St5 to initialize the beginning of a scan. Scanning then proceeds through the subsequent states (St6 ~ St10) until a match is found in the sequence being searched. If a match is found, then in St12 the match is collected, and once four matches have been collected, in St13 the padded versions of the four matches (128 bytes) are stored into the ACME-memory. Matches are written back in groups of four in order to minimize the interactions with the ACME-memory. This process of scanning continues until there are no more query words with which to search the database. A software rendering of this loop is sketched below.
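The sketch below is purely illustrative; the real control is a 19-state FSM operating on 128-byte memory rows, not C code, and the store_matches() routine here merely stands in for the FSM's row write into the ACME-memory.

#include <stdio.h>
#include <string.h>

struct match { int qpos, seq_no, dpos; };

/* Stand-in for the FSM's 128-byte row write into ACME-memory. */
static void store_matches(const struct match *buf, int n)
{
    for (int i = 0; i < n; i++)
        printf("match: query pos %d, seq %d, offset %d\n",
               buf[i].qpos, buf[i].seq_no, buf[i].dpos);
}

/* Scan one database sequence for exact 3-byte matches to one query
 * word, buffering matches and flushing them four at a time, as the
 * FSM in Figure 4.15 does.                                          */
static void scan_sequence(const char *seq, int seq_len, int seq_no,
                          const char *qword, int qpos)
{
    struct match buf[4];
    int pending = 0;

    for (int i = 0; i + 3 <= seq_len; i++) {
        if (memcmp(seq + i, qword, 3) == 0) {      /* 3-byte exact match   */
            buf[pending].qpos   = qpos;
            buf[pending].seq_no = seq_no;
            buf[pending].dpos   = i;
            if (++pending == 4) {                  /* flush in groups of 4 */
                store_matches(buf, 4);
                pending = 0;
            }
        }
    }
    if (pending > 0)
        store_matches(buf, pending);               /* leftover matches */
}

int main(void)
{
    const char *qword = "PQG";                     /* example query word   */
    const char *seq   = "AAPQGTTPQGCC";            /* stand-in db sequence */
    scan_sequence(seq, (int)strlen(seq), 0, qword, 7);
    return 0;
}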
The FSM for match extension works as follows. Each match is first extended to the left (St6 ~ St11) until the score reaches its maximum, and is then extended to the right (St12 ~ St17) in order to obtain a high score for the extension; this process continues until each match has had a chance to be extended. The other states are similar to the states in the database scanning FSM in that they check whether the ACME-memory has to be accessed to obtain the next matches and database sequences. In St6, if the query word is located at the very first byte position of the query sequence, or if the corresponding position in the database sequence is the first byte, then the left extension is skipped and the control goes to the right extension. In St7, the byte positions one position to the left in the query sequence and in the database sequence are found, and the corresponding letters from those positions
are kept. In St8, using the letters kept from St7, the corresponding indexes into the scoring table are found, and using the scoring table the score is retrieved. In St9, the current score is updated with the sum of the current score and the score retrieved from the table. The current score is initialized to the score of the query word (St6) and updated each time an extension occurs (St9). In St10, if the current score is higher than the high score so far, the high score is replaced by the current score and the left value is replaced by the current left byte position. The current score can be smaller than the high score since the scoring table can return a negative score. Then, in St11, the cutoff criterion used by BLAST is checked, as well as whether either the query sequence or the database sequence has been extended left to its very first byte position. If these conditions are not met, the FSM checks whether it has to access the ACME-memory and goes to St7 for further left extension. On the other hand, if the conditions in St11 are met, the control goes to St12 to begin the right extension. The right extension goes through steps (St12 ~ St17) similar to those of the left extension (St6 ~ St11). In St200, before the next matches are accessed, the updated information (such as the left extended amount, the right extended amount, the score, and so on) about the previously accessed matches is written back to the ACME-memory by storing back the temporary register (which contains the matches).
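The left-extension logic just described (St6 ~ St11 of Figure 4.16) can be summarized by the following C sketch. It is a simplification under stated assumptions: the 24-symbol alphabet, the index_of() helper, and the all-zero scoring table used in the demonstration are placeholders for the dissertation's scoring-table index-finder and substitution matrix.

#include <stdio.h>
#include <string.h>

#define ALPHABET "ABCDEFGHIKLMNPQRSTVWXYZ*"   /* assumed 24-symbol alphabet */

static int index_of(char c)
{
    const char *p = strchr(ALPHABET, c);
    return p ? (int)(p - ALPHABET) : 0;
}

/* Extend one match to the left, tracking the best score seen and how
 * far left it was reached, and stopping on the drop-off criterion or
 * when either sequence hits its first byte.                           */
static int extend_left(const char *query, int qpos,
                       const char *dbseq, int dpos,
                       int word_score, const int table[24][24],
                       int cutoff, int *left_amount)
{
    int current = word_score, high = word_score, left = 0;
    *left_amount = 0;

    while (qpos - left > 0 && dpos - left > 0) {
        left++;
        current += table[index_of(query[qpos - left])]
                        [index_of(dbseq[dpos - left])];
        if (current > high) {              /* keep the best prefix seen */
            high = current;
            *left_amount = left;
        }
        if (current <= high - cutoff)      /* drop-off (cutoff) criterion */
            break;
    }
    return high;
}

int main(void)
{
    static const int table[24][24] = { { 0 } };   /* all-zero stand-in matrix */
    const char *query = "MPQGKL";
    const char *db    = "AAPQGKL";
    int left = 0;
    int high = extend_left(query, 1, db, 2, 18, table, 20, &left);
    printf("high score %d after extending %d to the left\n", high, left);
    return 0;
}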
4.2.3.3 Hardware Cost
The amount of hardware (in terms of the number of transistors) required by these FSMs and functional units is measured using the same transistor (TR) counts used in the calculations for motion estimation. The total number of transistors used by the computations is 106,752. This number is less than one-third of the number of transistors allocated
to each ACME for logic. One uncommon computing unit used in the FSM is the scoring table index-finder. A scoring table index-finder compares one 8-bit datum with 24 possible match candidates to find the index of the match. Therefore, it can be considered as 24 simultaneous comparisons of 8-bit data.
4.2.3.4 Data Distribution
There is one undesirable complication involved in executing the BLAST kernels with the DCS computing paradigm. The query sequence, the list of query words found in the first step, and the scoring table are common to every ACME. This is a problem similar to that of data sharing in motion estimation.
It means that the host processor must send all the common data to each ACME sequentially. This is evidently an overhead because, with 512 ACMEs (as in the example architecture), the number of store instructions for storing the common data increases by 512x. This works against the goal of removing (and reducing) the interactions between the processor and memory, and degrades the performance.
Therefore, we have modified the data distributor in Figure 4.10 to support broadcasting for BLAST. When broadcasting is needed, the data is sent to every ACME, to the same addresses in each ACME, using the modified decoder. When the broadcast signal of a decoder is on, the decoder delivers the data in both (left and right) directions regardless of the address bits. Each ACME receives the data and stores it at the address given by the remaining address bits (the ACME-memory address); therefore, the data go to the same address in each ACME.
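A behavioural sketch of this broadcast-capable tree decoder (Figure 4.17) is given below. The node structure, the ACME_BITS width, and the acme_write() routine are modelling assumptions made for illustration; they are not the actual hardware description.

#include <stdio.h>

#define ACME_BITS 12                       /* assumed ACME-memory address width */

struct node {
    struct node *left, *right;             /* subtrees; NULL at a leaf (ACME)   */
    int acme_id;                           /* valid only at a leaf              */
};

static void acme_write(int acme_id, unsigned local_addr, unsigned char data)
{
    printf("ACME %d <- addr 0x%03x = 0x%02x\n", acme_id, local_addr, data);
}

/* The bits above ACME_BITS steer the tree; the low ACME_BITS select the
 * location inside an ACME-memory.  With broadcast on, steering bits are
 * ignored and the data goes down both sides of every decoder.           */
static void route(struct node *n, unsigned addr, int level,
                  unsigned char data, int broadcast)
{
    if (n->left == NULL && n->right == NULL) {             /* reached an ACME */
        acme_write(n->acme_id, addr & ((1u << ACME_BITS) - 1), data);
        return;
    }
    if (broadcast) {
        route(n->left,  addr, level - 1, data, 1);
        route(n->right, addr, level - 1, data, 1);
    } else if (addr & (1u << level)) {
        route(n->right, addr, level - 1, data, 0);
    } else {
        route(n->left,  addr, level - 1, data, 0);
    }
}

int main(void)
{
    struct node a0 = {NULL, NULL, 0}, a1 = {NULL, NULL, 1};
    struct node a2 = {NULL, NULL, 2}, a3 = {NULL, NULL, 3};
    struct node n0 = {&a0, &a1, -1}, n1 = {&a2, &a3, -1};
    struct node root = {&n0, &n1, -1};

    route(&root, 0x1005, ACME_BITS + 1, 0xAB, 0);  /* normal store: one ACME      */
    route(&root, 0x0005, ACME_BITS + 1, 0xCD, 1);  /* broadcast: every ACME gets  */
    return 0;                                      /* the same local address 005  */
}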
Figure 4.8: CE for Motion Estimation
Figure 4.9: FSM for Motion Estimation
Figure 4.10: CIMM with an Efficient Data Distributor
Figure 4.11: Modified Memory Tree Decoder for CIMM Data Distribution
Table 4.2: Data Arrangement of a Reference Window (ACME-memory address order of the reference-window portions to be stored; the listed numbers are representative relative memory addresses)
Figure 4.12: Three Steps of BLAST Algorithm: (a) 1st step, compilation of query words; (b) 2nd step, database scanning; (c) 3rd step, match extension (an initial match is extended left and right into a high-scoring pair)
Table 4.3: BLAST Computation Requirement for Searching Whole Database

  Operation                          Other than Match Extension    Match Extension    Total
  Percentage                         8%                            92%                100%
  Cycles (Trillion Cycles)           75 Tcycles                    862.5 Tcycles      937.5 Tcycles
  Time (for one 2.0 GHz processor)   0.43 days                     4.99 days          5.42 days
Figure 4.13: ACME-Memory Data Placement for BLAST (relative addresses: query sequence at 000, scoring table at 005, database sequences at 00A, list of query words at 0A7; 7 bits are used for column addressing)
struct query_word {
    unsigned char word[3];    // query word
    int pos;                  // byte position in the query word
    unsigned char score;      // the beginning score
};

struct match {
    int pos;           // byte pos of query word
    int db_seq_no;     // sequence number in DB
    int pos_db_seq;    // byte pos in seq in DB
    int extended;      // was it extended?
    int ext_left;      // how far to left
    int ext_right;     // how far to right
    int score;         // score after extension
    int hsp;           // is this hsp?
};

(a) Data Structure for a Query Word          (b) Data Structure for a Match
Figure 4.14: Data Structures for Query Word and Match
Table 4.4: Hardware Requirement for FSM Executions in Each ACME for BLAST Kernels

  Number of FSM states: 19 (database scanning) + 25 (match extension) = 44
  Computational units (20,928 TRs): 18 1-byte comparators (3,456), 4 3-byte comparators (2,304),
    1 4-byte comparator (768), 7 3-byte adders (4,032), 3 1-byte adders (576),
    3 1-byte subtracters (576), 2 scoring table index-finders (9,216)
  Registers (85,824 TRs): 17 1-byte registers (1,632), 3 3-byte registers (864),
    7 4-byte registers (2,688), 1 8-byte register (768), 2 128-byte registers (24,576),
    1 576-byte register (55,296)
  Total number of TRs: 106,752
Figure 4.15: FSM to Control Database Scanning
Figure 4.16: FSM for Match Extension
Figure 4.17: Memory Tree Decoder for CIMM Data Distribution with Broadcasting
Chapter 5
DCS Performance Evaluation Methodology
In order to measure the performance, an architectural simulator for DCS has been constructed. The applications are divided into a hardware section and a software section. The hardware section is analyzed and transformed into an FSM model that is used by each ACME. Then, the synchronization between the host processor and the CIMM is constructed.
We have gone through a series of simulation steps to compare the performance with and without the CIMM executing the kernels. In this chapter, let us first introduce the details of the DCS simulator and then show what other steps are required to evaluate the performance. Then, we can examine what simulation steps were performed to compare the performance.
5.1 Simplescalar Simulator
Simplescalar [5, 3] is a complete architectural-level simulation tool package which consists of compiler, assembler, linker, simulation, and visualization tools for the SimpleScalar
architecture. It is an execution-driven simulation tool and it has been used by many researchers to simulate real programs on a wide range of modern processors and systems.
It provides a range of simulation environments, from a fast functional simulator to a detailed, out-of-order issue processor model that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction.
5.2 DCS Architectural Simulator
First, in order to evaluate and analyze the performance of DCS, an architectural simulator
has been designed by modifying the Simplescalar 3.0 simulator [5, 3].
In the DCS simulation, the processor core of the Simplescalar acts as the host processor
of the DCS and the memory model is modified to play the role of CIMM. The Simplescalar
annotation technique is used to add instructions that are used for communication between
the host processor and CIMM.
Once the simulator encounters the new instruction which signals the CIMM to begin
execution, the control goes to the CIMM to execute the kernel operations in each ACME
and the host processor waits until the “DONE” signal is received from CIMM.
While executing the kernel operations, the CIMM keeps track of the cycle count of the kernel operations. When finished with the kernel operations, the CIMM multiplies the cycle count by 8 and returns it to the host processor along with the "DONE" signal. The host processor adds this cycle count to the cycle count it had when control was shifted to the CIMM and resumes its operations.
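This hand-off and cycle accounting can be thought of as the following C sketch. The function name cimm_execute_kernel(), the host_cycles counter, and the returned value are illustrative; the actual change lives inside the Simplescalar memory model.

#include <stdint.h>
#include <stdio.h>

#define CLOCK_RATIO 8                 /* host frequency / CIMM frequency */

static uint64_t host_cycles = 0;      /* host processor cycle counter    */

/* Placeholder for the FSM execution inside the ACMEs; returns the
 * number of CIMM cycles the kernel took.                              */
static uint64_t cimm_execute_kernel(void)
{
    return 1000;                      /* illustrative value */
}

/* The host issues the "begin" instruction, stalls until "DONE", and
 * charges the CIMM cycles, converted to host cycles, to its counter. */
static void host_issue_cimm_begin(void)
{
    uint64_t cimm_cycles = cimm_execute_kernel();
    host_cycles += cimm_cycles * CLOCK_RATIO;
}

int main(void)
{
    host_issue_cimm_begin();
    printf("host cycles charged: %llu\n", (unsigned long long)host_cycles);
    return 0;
}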
5.3 Required Tasks Prior to Actual Simulations
Since the DCS computing paradigm is different from the conventional computing paradigm, there are tasks that have to be completed prior to the actual simulations. Those tasks are as follows:
1. Understanding the application & identifying the kernels,
2. Understanding the source code of the application and dividing the program into a hardware section and a software section,
3. Designing FSM prototypes for the kernels & arranging the data in ACME-memory,
4. Interfacing the CIMM with the host processor,
5. And finally, simulating.
According to the above list of tasks, Figure 5.1 shows how the program is modified so
as to be executable within the bounds of the DCS computing paradigm. The left side of
the middle arrow shows the division of a program into hardware (kernels) and software
sections (pre-processing and post-processing). The right side represents the program flow
for DCS computing. The pre-processing and the post-processing are slightly modified to
store data into the ACME memory and retrieve the results from the ACME-memory. The
synchronization instructions are also added to communicate between the host processor
and the CIMM.
Once the program is ready, it is compiled with the Simplescalar compiler, including the annotations needed to recognize the new instructions. Once the binary is ready, it is executed on our DCS architecture simulator.
5.4 Simulation Steps for Comparing Performance
A series of simulations was performed to evaluate the performance. First, the performance of the plain Simplescalar architecture (the no_dcs configuration) on the benchmark programs was measured.
Second, the performance of DCS with the CIMM executing the kernels of the benchmark programs was measured. In this first DCS architecture, the host processor sequentially delivers shared data to multiple ACMEs. This configuration is referred to as the "dcs_no_dist" configuration.
Third, DCS with data distribution inside the CIMM (the "dcs_dist" configuration) was evaluated. In the "dcs_dist" configuration, the data distributor distributes shared data to multiple ACMEs simultaneously.
Figure 5.1: Program Modifications to be Executed on DCS Computing Paradigm (the program is divided into a hardware (kernel) section and a software section; the pre-processing is slightly modified to store data into the ACME-memory and the post-processing to receive results from the ACME-memory)
Chapter 6
Simulation Results and Analysis
Let us present the simulation results and analysis for each benchmark application. First, let us go into the details of the performance evaluation for motion estimation. We will then show the performance and the additional steps we took to improve it. Next, we present the performance for the BLAST kernels along with its analysis. Finally, we summarize this chapter and demonstrate the advantages of the DCS computing paradigm.
6.1 Motion Estimation
For motion estimation, six different frame sizes (352 x 288, 640 x 480, 720 x 480, 800 x 640, 1280 x 720, and 1920 x 1080) are simulated and examined for each system configuration to evaluate the performance. The simulated macro block size is 16 x 16 and the displacement is 16. When dividing a frame among ACMEs, it is divided into sub-frames of 3 x 3 macro blocks. The operating frequency of the CIMM (and therefore of each ACME) is one-eighth of the frequency of the host processor.
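The ACME counts listed alongside Figure 6.1 follow directly from this partitioning. The short C check below (the ceiling-division helper and the rounding-up of partial macro blocks and partial sub-frames are assumptions of this sketch) reproduces all six values.

#include <stdio.h>

#define MB  16                        /* macro block is 16 x 16 pixels     */
#define SUB 3                         /* sub-frame is 3 x 3 macro blocks   */

static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Number of ACMEs = number of 3x3-macro-block sub-frames in a frame. */
static int acme_count(int width, int height)
{
    int mb_x = ceil_div(width,  MB);
    int mb_y = ceil_div(height, MB);
    return ceil_div(mb_x, SUB) * ceil_div(mb_y, SUB);
}

int main(void)
{
    int sizes[6][2] = { {352, 288}, {640, 480}, {720, 480},
                        {800, 640}, {1280, 720}, {1920, 1080} };
    for (int i = 0; i < 6; i++)
        printf("%4dx%-4d -> %3d ACMEs\n",
               sizes[i][0], sizes[i][1],
               acme_count(sizes[i][0], sizes[i][1]));
    return 0;   /* prints 48, 140, 150, 238, 405, 920 */
}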
Figure 6.1: Performance Improvement for Motion Estimation
(Number of ACMEs executing in parallel: 352 x 288 -> 48, 640 x 480 -> 140, 720 x 480 -> 150, 800 x 640 -> 238, 1280 x 720 -> 405, 1920 x 1080 -> 920)
The performance improvement from using the DCS computing paradigm for motion estimation is shown in Figure 6.1. The left set of bars represents the performance improvement from the no_dcs configuration to the dcs_no_dist configuration. The right set of bars represents the performance improvement from the no_dcs configuration to the dcs_dist configuration. The table to the right of the graph shows the number of ACMEs executing motion estimation in parallel.
There is an obvious performance improvement from the left set of bars to the right set of bars. This improvement comes from using a data distributor inside the CIMM. The performance improvement is about 300% for 352 x 288 frame sized motion estimations, 100% for both 640 x 480 and 720 x 480, 200% for 800 x 640, 190% for 1280 x 720, and 190% for 1920 x 1080. Therefore, by using the data distributor, the performance of motion estimation on the DCS architecture can be boosted.
One noticeable thing about both sets of bars is that the 1280 x 720 frame sized motion
estimation showed less performance improvement when compared to the 800 x 640 frame
sized motion estimations. This is apparently odd since larger frames are using more
ACMEs to exploit more parallelism, and therefore, there should have been a greater
performance improvement for 1280 x 720 sized frames compared to the 800 x 640 frames.
Repeated simulations showed the same behavior. Therefore, an in-depth analysis of the programs and careful simulations were undertaken to uncover where most of the cycles were spent. This led to the discovery that, when dividing a frame into sub-frames, 1280 x 720 sized frames were particularly expensive for the frame-dividing algorithm used for the DCS computing paradigm.
Overall, however, the simulation results are quite disappointing. When the dcs_no_dist configuration is used (compared to the conventional execution), there is almost no performance gain when the frame sizes are less than 720 x 480. There is barely over a 1.5x performance improvement for the larger frames. There is even a negative performance improvement for a 352 x 288 sized motion estimation.
Even when the dcs_dist configuration is used, the performance gain is nominal (a maximum of 3.3x). This is a disappointing improvement even considering the slower clock frequency of CIMM architectures. For instance, for a 352 x 288 sized motion estimation, although 48 ACMEs are executing in parallel, the performance improvement was negative whether the distribution was used or not.
Considering that the FSM of each ACME was specifically tuned to execute the operations involved in motion estimation, that unnecessary memory transactions for loading frame data were removed, and that hundreds of ACMEs were executing in parallel, the expectation for the performance was indeed quite high.
Figure 6.2: Memory Access Reduction for Motion Estimation
However, for smaller frames there are negative performance improvements, and for larger frames the performance improvements are quite small. The maximum performance improvement is about 3.3x, obtained when 920 ACMEs were executing motion estimation in parallel for the 1920 x 1080 sized frame.
Figure 6.2 shows the memory access reductions. The left set of bars and the right set of bars represent the same system configuration comparisons as the performance improvements in Figure 6.1.
This also shows quite unexpected results. Considering that the memory access instructions were removed (because the operations happen where the data are located) and that each ACME accesses its own ACME-memory, the memory access reduction is expected to be high. However, the number of memory accesses actually increases for a 352 x 288 sized motion estimation. The reductions for the other sizes are less than 5.2x.
Further simulations and analysis show that the amount of computation (which involves many calculations and memory interactions) needed to find the proper ACME-memory addresses and to group the data forms another big portion of the overhead of the DCS computing paradigm.
This is indeed what was not anticipated. In trying to boost the performance by executing the kernels in parallel, we neglected to observe that the overhead involved in preparing for the parallel execution of the kernels was so great that it ruined the potential improvement from the parallel execution itself.
We therefore designed a special-purpose piece of hardware to generate the initial ACME-memory address for each pixel that the host processor has to store into each ACME-memory for the kernel executions. This hardware is placed at the top of the CIMM tree, before the data distributor, so that the address generator generates an address for each datum and then sends it to the data distributor, which can then properly generate the other addresses.
By having this address generation hardware, the workload of generating the addresses is moved from the host processor to the data distributor, so that the preparation for parallel execution is almost transparent to the host processor. This configuration, a DCS architecture with a CIMM equipped with data distribution and address generation, is referred to as the "dcs_dist_addrgen" configuration.
The host processor has to signal the CIMM that, from this point on, pixel data are being stored to the memory and that the CIMM should receive each pixel and generate an ACME-memory address at which to store it. While the host processor is
sending the pixel data, it has to exclusively own the system bus to send all the pixel data.
Alternatively, each pixel store instruction can be specifically distinguished by using a bit.
The flow for storing frame data into the CIMM (and eventually into each ACME-memory) with address generation and data distribution is as follows:
1. The host processor sends the necessary information to the CIMM. The CIMM
receives the information and saves it into registers used for its address generator.
This information is sent only once at the beginning of MPEG encoding and includes:
• The frame size (width, length)
• The sub-frame size (width, length)
• The number of sub-frames in the frame
• The macro block size (width, length)
2. The host processor signals the CIMM that it is about to send the frame data.
3. The memory controller lets the CIMM receive the data. CIMM continues to receive
data from this moment to the end of transmission.
4. The CIMM receives the data and generates the corresponding ACME-memory address for each pixel.
5. The address generator sends the data to the data distributor for further processing.
The host processor just needs to know where in the CIMM (in each ACME) the
motion vectors are stored. The host processor does not need to know how and where the
frames are stored in the CIMM.
Talla [48] has pointed out that, in order to improve the performance of media applications, address generation and data grouping should be handled more efficiently. Our findings are somewhat analogous to his report. However, his focus was on adding hardware between the processor and the media extension units so that address generation can happen in hardware rather than in the host processor. In such configurations, the system still suffers from the memory wall problem.
Since the DCS computing paradigm exploits a much greater amount of parallelism, the amount of calculation involved in address generation and data grouping is much greater. Therefore, the benefit from using a hardware address generator is much greater for the DCS computing paradigm. The simulation results show a huge performance improvement from using this technique for motion estimation.
The performance improvements after applying the address generation technique are shown in Figure 6.3. There is up to a 439x performance improvement when the DCS architecture with address generation is used compared to the conventional no_dcs execution. The rightmost bars show the performance improvement from no_dcs to dcs_dist_addrgen. There is also up to a 2034x reduction in the number of memory accesses from the host processor to memory, as shown in Figure 6.4.
Now, let us explain how the address generator generates an ACME-memory address for a pixel. As pixels are received, the address generator keeps track of the number of pixels received. At the same time, it identifies which ACME the data belongs to. It also identifies which portion of the sub-frame (belonging to that ACME) the data falls in. Then, it finds the ACME-memory address by adding the ACME address,
Figure 6.3: Performance Improvement for Motion Estimation with Address Generation
the portion address, and the tracking number. Figure 6.5 shows the CIMM with the data distribution and address generation hardware embedded.
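A behavioural sketch of this address generation is given below. The specific parameters, row-major pixel arrival, 48 x 48-pixel sub-frames, and a fixed per-ACME frame buffer size, are assumptions made for this sketch and not the hardware's exact layout.

#include <stdio.h>

#define MB        16                    /* macro block width/height              */
#define SUB       (3 * MB)              /* 48 x 48-pixel sub-frame per ACME      */
#define ACME_SIZE 4096                  /* assumed bytes of frame buffer per ACME */

/* Given the running pixel count of a frame arriving in row-major order,
 * determine which ACME the pixel belongs to and where inside that
 * ACME's frame buffer it goes, then combine the two into one address.  */
static unsigned pixel_to_address(unsigned pixel_no, unsigned frame_width)
{
    unsigned x = pixel_no % frame_width;       /* pixel coordinates          */
    unsigned y = pixel_no / frame_width;

    unsigned acmes_per_row = (frame_width + SUB - 1) / SUB;
    unsigned acme   = (y / SUB) * acmes_per_row + (x / SUB);   /* which ACME  */
    unsigned offset = (y % SUB) * SUB + (x % SUB);             /* where in it */

    return acme * ACME_SIZE + offset;           /* ACME base + portion offset */
}

int main(void)
{
    /* First pixel of the second sub-frame row of a 352-pixel-wide frame. */
    printf("address = 0x%x\n", pixel_to_address(48 * 352, 352));
    return 0;
}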
The hardware required for the address generation involves quite a bit of computation, including divisions, multiplications, and modulo operations. It consumes about 100,000 TRs. It also takes 15 CIMM cycles to generate an ACME-memory address for each pixel. The number of transistors is not a concern because each ACME has that many transistors to spare (and recall that there are 512 ACMEs) and the address generator is only needed at the top of the tree.
Figure 6.4: Memory Access Reduction for Motion Estimation with Address Generation
However, the number of cycles is a tremendous overhead for the ACME computations. Therefore, the address generation is pipelined so that an ACME-memory address can be generated every cycle. The transistor count above includes this pipelining.
Until now, the evaluation has focused solely on motion estimation. Now, let us take a look at the overall performance improvement for MPEG encoding. Assuming that motion estimation takes up about 90% of the total execution time of MPEG encoding, the overall performance improvement from executing in the DCS computing paradigm is about 9.7x. This is calculated using Amdahl's law [42] (shown in equation 6.1).
Figure 6.5: CIMM with Data Distribution and Address Generation
Speedup_{overall} = \frac{1}{(1 - Fraction_{enhanced}) + \frac{Fraction_{enhanced}}{Speedup_{enhanced}}}    (6.1)
6.2 BLAST Kernels
Since BLAST is too big a program to simulate fully with a software architectural simulator, we have reduced the problem size. A small database (ecoli.nt), which is recommended by NCBI for examining their programs, is used for the benchmark. It is assumed that there are 560 query words that are scanned against the database sequences. Other simulation inputs are taken from NCBI. An example test sequence (560 bytes) given in the NCBI documentation (the README for stand-alone BLAST) is used as the query sequence. The BLAST program is also simplified so as to focus more on the execution of the algorithm itself rather than on supporting additional features.
The simulation results with the above configuration are shown in Table 6.1. It shows the number of memory references and the simulation cycles taken by each architecture configuration. The numbers in the first three columns are from a general superscalar processor (the no_dcs configuration), the fourth column is from executing with the DCS without distribution, and the last column is from executing with the DCS with distribution.
The simulation results show that the execution of the match extension section takes up almost all (97%) of the overall execution time of BLAST. They also show that about two-thirds (66%) of the total instructions were memory access instructions. When executed with the DCS computing paradigm, there is about a 242x performance
Table 6.1: Performance Comparisons for BLAST Kernel Executions for Various System Configurations

                        NO_DCS                                                 DCS (NO_DIST)        DCS_DIST
                        No. of            No. of memory       Simulation       CIMM FSM             CIMM FSM
                        Instructions      references          Cycles           Execution (Cycles)   Execution (Cycles)
  Database_Scanning     23.9 B (8.8%)     6.6 B (4%)          14.6 B (2.4%)    0.95 M (0.0004%)     0.95 M (0.0004%)
  Match_Extension       244 B (90%)       160 B (95%)         583 B (97%)      2.3 B (93%)          2.3 B (93%)
  Overall Execution     270 B (100%)      168 B (100%)        600 B (100%)     2.48 B (100%)        2.47 B (100%)
improvement and a 1410x reduction in the number of memory accesses. This is shown graphically in Figure 6.6.
In Figure 6.6, the left bar again represents the performance improvement when comparing the conventional superscalar processor (the no_dcs configuration) to a DCS without data distribution (the dcs_no_dist configuration). The right bar represents the performance improvement when the data distribution (broadcasting for the BLAST kernels) is used in the DCS architecture (the dcs_dist configuration). Both DCS designs show a huge performance improvement compared to the conventional superscalar processor execution.
The number of memory accesses is also hugely reduced, as can be seen in Figure 6.7. The left bar and the right bar show the same system configuration comparisons as in Figure 6.6.
These simulation results are contrary to the experience we gleaned from the evaluation of motion estimation [23]. In the BLAST simulations, it turns out that broadcasting does not yield as much additional performance improvement for the common data (this comparison is from each
Figure 6.6: Performance Improvement for BLAST using DCS Computing
left bar to each right bar in Figure 6.6). This is because the majority of the overall memory references come from executing the kernels in each ACME rather than from the memory references (for storing the common data) that happen prior to the kernel operations. Sequentially storing the common data is not as great a factor in the total number of memory accesses. In motion estimation, there is the overhead of sending data to multiple ACMEs at variable locations. This creates a huge amount of computation and data grouping/ungrouping, which negatively affected the performance. Worse yet, these multiple variable data replications happen repeatedly for each frame. In BLAST, however, the common data distribution (broadcasting) takes place only once (just before the execution of the kernels). Compared to the memory accesses happening during the execution of the kernels, the broadcasting overhead is nominal. Additionally, the fact that each ACME stores the common data at the same addresses (in its own ACME-memory) means that calculating the addresses costs only a nominal overhead.
Figure 6.7: Memory Access Reduction for BLAST using DCS Computing
Figure 6.8 shows the performance improvement according to the number of ACMEs executing the match extension operations in parallel. When there is only one ACME executing, the performance shows a negative improvement. This is due to the fact that the CIMM runs at one-eighth of the frequency of the general-purpose processor. If only raw cycle counts were considered, the FSM of an ACME would execute five times faster. With two or more ACMEs running in parallel, the performance increases linearly (in proportion to the number of ACMEs running in parallel). The bars in Figure 6.8 show the performance improvement when the numbers are 1, 128, 256, 384, and 512, respectively, from left to right. The curved line shows the trend of the performance improvement. Of course, the maximum performance gain is achieved when the kernel is executed in all 512 ACMEs. In the example simulation (Table 6.1), there were 400 ACMEs running in parallel.
Figure 6.8: Performance Improvement by Number of ACMEs Executing in Parallel for BLAST Kernels
Now, let us look at how much faster the FSMs in each ACME can execute the kernel operations for a given problem size. The analysis shows that the FSM for the match extension executes about five times faster than the general-purpose execution (the FSM for the database scanning executes about 38 times faster, but since database scanning consumes only about 2.4% of the total execution time, the overall performance depends mainly on the performance of the match extension by Amdahl's law [42]; therefore, the improvement from database scanning is ignored). Although this is an advantage, the frequency disadvantage of being a DRAM reduces the speedup. However, as shown in Figure 6.8, the performance goes up as the number of ACMEs executing the kernels in parallel increases.
Now, let us look at the performance improvement for the overall execution of BLAST. The overall performance, according to Amdahl's law, depends on how much of the overall execution time this performance improvement applies to. The analysis shows that
the match extension takes up about 97% of the total execution time. However, as the input/output handling of BLAST has been simplified and the focus is on the algorithm, this percentage will be slightly lower once the time required for input/output handling is included. Altschul's analysis [2] of the computational complexity of BLAST (which states that match extension takes up about 92% of the total execution time of BLAST) shows somewhat analogous results.
With all of the above three factors combined, the overall performance improvement from using the DCS computing paradigm can be calculated. Assuming that 92% of BLAST (the portion executed for match extension) is executed by the CIMM and the rest (8%) is executed sequentially by the host processor, and that all 512 ACMEs are running in parallel, then according to Amdahl's law [42], as shown in equation 6.1, Fraction_enhanced is 0.92, Speedup_enhanced is 324, and Speedup_overall becomes about 12. Therefore, the gain is about a 12x performance improvement when using the DCS computing paradigm.
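These overall figures can be checked with a direct application of equation 6.1. The small C sketch below uses the 0.92/324 values quoted above for BLAST; the second line applies the same formula to MPEG encoding using the maximum 439x motion-estimation speedup as an approximation, which gives roughly the 9.7x quoted in Section 6.1.

#include <stdio.h>

/* Amdahl's law, equation 6.1. */
static double amdahl(double fraction_enhanced, double speedup_enhanced)
{
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced);
}

int main(void)
{
    /* BLAST: 92% of the run accelerated by a factor of 324.        */
    printf("BLAST overall speedup: %.1f\n", amdahl(0.92, 324.0));  /* ~12.1 */
    /* MPEG encoding: 90% accelerated, using 439x as an estimate.   */
    printf("MPEG  overall speedup: %.1f\n", amdahl(0.90, 439.0));  /* ~9.8  */
    return 0;
}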
6.3 Analysis Summary
The simulation results show that each benchmark program displays different characteristics. Both motion estimation and the BLAST kernels have data that are shared by many ACMEs. However, the amount and the frequency of sharing greatly affect the performance.
For motion estimation, data sharing happens as each frame arrives. Although the data sharing is regular, it still requires quite a bit of computation to generate the addresses for sharing. In the BLAST kernels, however, the data distribution happens only at
the very beginning, before the BLAST kernels execute, and it never happens again. Furthermore, the distribution goes to the same ACME-memory address in each ACME and therefore requires little computation.
Accordingly, the simulation results show that data distribution boosts performance for motion estimation. The performance improvement varies depending on the frame size, but at least a 100% performance improvement is observed (Figure 6.1). In contrast, the simulation results for the BLAST kernels show that only a 0.2% performance improvement is achieved by adding data distribution (Figure 6.6).
However, the BLAST kernels already gain over a 240x performance improvement just using the primitive DCS computing paradigm (the dcs_no_dist configuration). On the other hand, motion estimation gains below a 3.3x performance improvement, and even negative improvements in some cases.
This shows that each application should be treated differently. That is why we took one more step and designed the address generation unit inside the CIMM to truly benefit from executing motion estimation in parallel among many ACMEs. The address generator greatly helps to improve the performance of motion estimation. However, the data distributor and the address generator obviously do not provide a significant benefit for the BLAST kernels.
6.4 Advantage of DCS Computing Paradigm
The DCS computing paradigm is different from the conventional hardware/software approach in the sense that the conventional approach suffers from the drawback of limited memory bandwidth caused by the limited number of pins and insufficient storage space. The CIMM architecture is also different from conventional PIM architectures (IRAM [39, 13, 29, 41, 40], FlexRAM [24], DIVA [16]), since their approaches were to replace the existing GPPs, whereas the goal of the CIMM architecture is to assist the host processor by executing the key operations inside the memory.
The DCS computing paradigm can be used to significantly accelerate a large class of applications in a cost-effective manner due to the following features.
• Significantly Increased Memory Bandwidth:
- First, the DCS computing paradigm significantly decreases the average processor-to-memory latency by using a completely integrated PIM technique.
- Second, the DCS computing paradigm further increases the effective memory bandwidth by accessing the ACME-memory in each ACME in parallel. At the same time, it reduces the latency again, since each ACME contains smaller memory elements, which in turn shortens the distance the data travels.
- Third, the DCS computing paradigm significantly reduces memory accesses by removing the data-intensive computations from the host processor. It migrates such computations to where the data are located.
• Faster Computation:
- First, the DCS computing paradigm accelerates the computations by employing kernel-specific hardware to execute the kernels. This hardware execution is generally much faster than software execution on the host processor.
- Second, the DCS computing paradigm further accelerates the computations by exploiting the inherent parallelism of the kernels among many ACMEs.
- Third, because of the above two factors, the amount of computation performed by software on the host processor decreases, thereby accelerating the operations performed in software.
• Efficient Co-Design of Software and Hardware:
- First, the DCS computing paradigm allows efficient co-design of the software implementation (to be executed on the host processor) of those operations of a given application that cannot be cost-effectively moved to the CIMM, with the hardware implementation (within the CIMM) of the remaining operations.
- Second, when not executing the kernel operations in each ACME of the CIMM, the host processor accesses the data stored in any ACME-memory by viewing the modules as one large memory. The time required to access memory in the CIMM, if properly designed, can be made comparable to that required for a traditional memory access, which is also typically organized into submodules to minimize delays. This allows the host processor fast access to memory in the CIMM, which enables the efficient implementation in software of operations that are exceedingly expensive to implement in hardware. This also provides a great
deal of flexibility and hence makes the DCS computing paradigm suitable for
a wide range of applications.
Chapter 7
Summary and Future Opportunities
In this dissertation, we have presented a PIM architecture that is highly efficient in executing the kernel operations of data-intensive applications.
We have shown that data-intensive applications have quite different characteristics from conventional applications and that general-purpose processors are indeed not well suited to executing them. We have also shown that the memory wall problem is extremely critical for data-intensive applications. Further, we have demonstrated that inside such data-intensive applications there are kernels that occupy over 90% of the total execution time.
We have therefore argued that using efficiently designed PIM architectures to execute such kernels is an effective approach, and we have presented a new hardware/software co-design computing paradigm which uses a family of PIM architectures to accelerate the processing of the kernels.
The major findings of this dissertation are thus as follows:
• Data-intensive applications, and especially the kernels of such applications, have characteristics that are quite inappropriate for general-purpose processing. Important characteristics of the kernel operations are:
- They consist of simple operations
- They operate on small operands
- They inherently contain a tremendous amount of parallelism
- They entail a huge amount of data accesses
- They follow a stream-based style of execution
Motion estimation in MPEG encoding consists of additions, subtractions, absolute value calculations, and accumulations on 8-bit pixel data. It possesses a tremendous amount of inherent parallelism (over 104 million pixel-comparison operations can be executed in parallel) and it constantly receives new frame data. The BLAST kernels consist of additions, subtractions, comparisons, and table look-up operations on 8-bit characters. They also have a tremendous amount of parallelism (each sequence can be searched in parallel, and there are 24 million of them; each query word can be searched in parallel as well) and they constantly search new sequences.
• The current processor and memory structure is limited in that it cannot easily support larger-sized motion estimations in real time. Current memory technology can support motion estimation only for frames up to 800 x 600. However, a
CIMM architecture can provide sufficient memory bandwidth (even for HDTV-sized
motion estimation) by accessing each ACME-memory in parallel.
• By executing the data-intensive kernel operations inside the CIMM rather than
bringing the data to the host processor, there is a significant reduction in the number of
memory accesses. The analysis shows that there are about 1410x fewer memory
accesses for the BLAST kernels and a reduction by a factor of about 2034x
for motion estimation.
• When executing the kernel operations in parallel among many ACMEs, some data
are shared among multiple ACMEs, which creates a distribution problem. There can be
up to 60% more processor and memory interactions if the host processor has to deliver
the shared data to multiple ACMEs. By having an efficient data distribution mechanism
inside the CIMM, this overhead is removed. The mechanism is limited, but it is
cost-effective for the simple and regular access patterns of the motion estimation and
BLAST kernels.
• Detailed analysis shows that for motion estimation, the computations required for
address generation are a significant overhead in the DCS computing paradigm, whereas
the performance of the BLAST kernels is not much affected by address generation.
Adding efficient address-generation hardware inside the CIMM yields a large
performance boost: the improvement is up to 439x for motion estimation and 324x for
the BLAST kernels. (The sketch following this list shows the kind of address arithmetic
this hardware removes from the kernel's critical path.)
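To make these characteristics concrete, the following is a minimal C sketch of a full-search block-matching (SAD) kernel of the kind summarized above. The frame dimensions, block size, and search range (W, H, B, R) are assumed example parameters, not the exact configuration evaluated in this dissertation; the sketch only illustrates that the kernel is built from simple operations on 8-bit operands, that every candidate block can be evaluated independently (and hence in parallel across ACMEs), and that the per-pixel address arithmetic in the inner loops is precisely the overhead that dedicated address-generation hardware inside the CIMM removes.

    /* Minimal full-search SAD (sum of absolute differences) sketch.
     * W, H, B, and R are assumed example parameters (frame width and
     * height, macroblock size, and search range), not the settings
     * used in this dissertation. */
    #include <stdint.h>
    #include <stdlib.h>

    #define W 720   /* frame width in pixels (assumed)        */
    #define H 480   /* frame height in pixels (assumed)       */
    #define B 16    /* macroblock size (assumed)              */
    #define R 16    /* search range of +/- R pixels (assumed) */

    /* SAD between one current block and one candidate block of the
     * reference frame.  Note the per-pixel address arithmetic
     * (row * W + column): this is the address-generation work that the
     * CIMM's dedicated hardware takes off the critical path. */
    static uint32_t sad(const uint8_t *cur, const uint8_t *ref,
                        int cx, int cy, int rx, int ry)
    {
        uint32_t acc = 0;
        for (int y = 0; y < B; y++)
            for (int x = 0; x < B; x++) {
                int c = cur[(cy + y) * W + (cx + x)];  /* 8-bit operand */
                int r = ref[(ry + y) * W + (rx + x)];  /* 8-bit operand */
                acc += (uint32_t)abs(c - r);  /* add/sub/abs/accumulate */
            }
        return acc;
    }

    /* Best SAD for one macroblock.  Every (dx, dy) candidate is
     * independent of all the others, which is the parallelism an array
     * of ACMEs can exploit when each holds part of the search window. */
    static uint32_t best_sad(const uint8_t *cur, const uint8_t *ref,
                             int cx, int cy)
    {
        uint32_t best = UINT32_MAX;
        for (int dy = -R; dy <= R; dy++)
            for (int dx = -R; dx <= R; dx++) {
                int rx = cx + dx, ry = cy + dy;
                if (rx < 0 || ry < 0 || rx + B > W || ry + B > H)
                    continue;               /* stay inside the frame */
                uint32_t s = sad(cur, ref, cx, cy, rx, ry);
                if (s < best)
                    best = s;
            }
        return best;
    }

A conventional processor must move every cur[] and ref[] byte referenced above across the processor-memory interface; executing this loop next to the ACME-memory that already holds the data is what eliminates the bulk of those accesses and motivates the reduction factors reported earlier.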
This dissertation leads to many different future research opportunities for DCS
architectures:
• Performance Evaluations using a Variety of Benchmark Programs
As we learned from motion estimation [23] and the BLAST kernels, different benchmarks
will exhibit different execution behavior. By studying more applications, the
CIMM architecture can better support a variety of data-intensive applications. We
will be applying our methodology to such benchmarks as ATR (Automatic Target
Recognition) and ocean current simulation.
• VLSI Implementation
Implementation is an important step for the CIMM architectures because physical
information such as die area, transistor count, power consumption, and operating
speed should be part of the analysis.
Our next plan would thus be to design the CIMM using FPGA fabric. Prototyping
with FPGA synthesis is useful for two main reasons:
- The prototype design can be quickly implemented to evaluate performance and
cost
- Since we are considering reconfiguration in the CIMM architecture, implementing
the CIMM architecture in FPGA fabric is valuable experience and may have
implications for its reconfigurability
• Architectural Issues Study
Several architectural issues need to be further scrutinized. First, efficient
reconfiguration mechanisms for the DCS computing paradigm should be analyzed.
Supporting several different applications at the performance level of ASIC modules,
without the execution overhead of general-purpose processing, is attractive.
It would also be attractive if several different kernels in an application could be
supported with a limited amount of hardware. This is especially true when the kernels
have local dependencies within an ACME. For instance, both BLAST kernels have such
dependencies, and if the same hardware could be used for both kernels, it would be
more efficient and cost-effective. Similarly, for MPEG encoding, since the output of
motion estimation is used by the DCT and IDCT operations, reconfiguring the CE
for DCT and IDCT after executing motion estimation would not only
eliminate the interactions between the host processor and the CIMM (for the motion
estimation results) but would also provide the performance that only hardware
optimized for DCT and IDCT can deliver.
However, an efficient mechanism to support reconfiguration should be studied so
that the reconfiguration-penalty problems that Active Pages encountered can be
avoided. Module-level reconfiguration, in particular, should be thoroughly evaluated.
Second, the granularity of ACMEs and the hierarchy of the memory tree have to be
investigated further. Different applications require different amounts of memory and
logic to execute their operations. Therefore, when supporting multiple applications,
choosing an efficient memory structure and ACME granularity is important.
Third, supporting unavoidable communication between ACMEs is another issue
that needs investigation. The current, limited data distributor is suitable for the
simple and regular access patterns of the example applications. However, if complex
communication is necessary, the CIMM needs to be equipped with a dedicated
communication protocol. It also needs to be evaluated whether such communication
support is really cost-effective for data-intensive applications.
Fourth, the interactions between the host processor's cache and the PIM module
should be efficiently controlled. The majority of data-intensive applications are
stream-based and seldom reuse data they have already operated upon. However,
conventional cache-block replacement schemes are not suited to these characteristics.
Therefore, an efficient cache-control mechanism is necessary; with such a scheme, the
cache can be better utilized to hold useful data and prevented from doing wasteful
work (a minimal software analogue of this idea is sketched below).
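As a point of reference only, the following minimal C sketch shows a software analogue of such a cache-control policy: copying streaming data with non-temporal (cache-bypassing) stores so that data which will never be reused does not evict useful cache lines. It relies on the standard SSE2 intrinsics and an assumed 16-byte-aligned buffer layout; it is not the hardware mechanism envisioned for the CIMM, merely an illustration of the principle of keeping stream data out of the cache.

    /* Software analogue of cache bypassing for streaming data, using
     * SSE2 non-temporal stores.  src and dst are assumed to be 16-byte
     * aligned; n_vec is the number of 128-bit vectors to move. */
    #include <emmintrin.h>
    #include <stddef.h>

    void stream_copy(const __m128i *src, __m128i *dst, size_t n_vec)
    {
        for (size_t i = 0; i < n_vec; i++) {
            __m128i v = _mm_load_si128(&src[i]); /* normal (cached) load    */
            _mm_stream_si128(&dst[i], v);        /* non-temporal store:     */
                                                 /* bypasses the cache so   */
                                                 /* the stream does not     */
                                                 /* evict useful lines      */
        }
        _mm_sfence(); /* make the weakly-ordered stores globally visible */
    }

A hardware cache-control mechanism in the DCS paradigm would pursue the same goal, marking CIMM-bound streams as non-cacheable or low-priority so the host cache is reserved for data that is actually reused.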
• Compiler Support
The CIMM architecture relies on the programmer to divide the work between the
host processor and the CIMM. Automatic partitioning of the work between the two
would be very convenient for programmers. This is also very important because it
would draw strong attention from application programmers.
Automatic kernel-to-FSM translation would also be very useful. If the programmer
identifies the kernels and a translator generates the FSM and the required
functional units (hardware), it would make the DCS computing paradigm much easier
to adopt (a hypothetical example of such a generated controller is sketched below).
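Purely as an illustration of what such a translator might emit, the following C sketch shows a toy finite-state-machine controller for a block-processing kernel. The state names and the step functions (load_block, compute_block, write_result) are hypothetical placeholders, not part of any existing toolchain for this work; the point is only that a kernel identified by the programmer can be lowered into a small, explicitly sequenced controller of this shape, whose steps would map onto the ACME's address generator, functional units, and result buffer.

    /* Toy FSM controller of the kind a kernel-to-FSM translator might
     * generate.  All names here are hypothetical placeholders; the step
     * functions are stubs so the sketch is self-contained. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef enum { S_IDLE, S_LOAD, S_COMPUTE, S_WRITE, S_DONE } state_t;

    static bool load_block(int id)    { printf("load block %d\n", id);    return true; }
    static void compute_block(int id) { printf("compute block %d\n", id); }
    static void write_result(int id)  { printf("write result %d\n", id);  }

    void run_kernel(int n_blocks)
    {
        state_t state = S_IDLE;
        int block = 0;

        while (state != S_DONE) {
            switch (state) {
            case S_IDLE:                  /* wait for work */
                state = (n_blocks > 0) ? S_LOAD : S_DONE;
                break;
            case S_LOAD:                  /* fetch operands */
                state = load_block(block) ? S_COMPUTE : S_DONE;
                break;
            case S_COMPUTE:               /* run the kernel datapath */
                compute_block(block);
                state = S_WRITE;
                break;
            case S_WRITE:                 /* store the result, advance */
                write_result(block);
                block++;
                state = (block < n_blocks) ? S_LOAD : S_DONE;
                break;
            default:
                state = S_DONE;
                break;
            }
        }
    }

    int main(void) { run_kernel(3); return 0; }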
• CIMM Module Compiler Implementation
Since a CIMM is a kind of memory, if a CIMM-module compiler (similar to the
memory compilers used by many design companies) is designed, CIMM modules
can be automatically generated for each different set of applications.
Therefore, by having several different kinds of CIMM modules, different kinds of
data-intensive applications can be efficiently supported. The CIMM modules can
also be used as ordinary memory when their internal computation logic is not in use.
Bibliography
[1] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic Local
Alignment Search Tool. Journal of Molecular Biology, 215:403-410, 1990.
[2] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and
D. J. Lipman. Gapped BLAST and PSI-BLAST: A New Generation of Protein
Database Search Programs. Nucleic Acids Res., 25:3389-3402, 1997.
[3] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An Infrastructure for Computer
System Modeling. IEEE Computer, 35(2):59-67, 2002.
[4] R. Bhargava, L. K. John, B. L. Evans, and R. Radhakrishnan. Evaluating MMX
Technology Using DSP and Multimedia Applications. In International Symposium
on Microarchitecture, pages 37-46, 1998.
[5] D. Burger, T. M. Austin, and S. Bennett. Evaluating Future Microprocessors: The
SimpleScalar Tool Set. Technical Report CS-TR-1996-1308, University of
Wisconsin-Madison, 1996.
[6] D. Burger, J. R. Goodman, and A. Kagi. The Declining Effectiveness of Dynamic
Caching for General-Purpose Microprocessors. Technical Report CS-TR-95-1261,
University of Wisconsin-Madison, January 1995.
[7] D. Burger, J. R. Goodman, and A. Kagi. Memory Bandwidth Limitations of Future
Microprocessors. In Proceedings of the 23rd International Symposium on Computer
Architecture, pages 78-89, 1996.
[8] D. Burger, J. R. Goodman, and A. Kagi. Limited Bandwidth to Affect Processor
Design. IEEE Micro, 17(6):55-62, 1997.
[9] A. Cataldo. MPU Designers Target Memory to Battle Bottlenecks. EE Times,
30(9):43-45, October 2001.
[10] F. Cavalli, R. Cucchiara, M. Piccardi, and A. Prati. Performance Analysis of MPEG-
4 Decoder and Encoder. In International Symposium on Video/Image Processing and
Multimedia Communications, 2002.
[11] J. Corbal, R. Espasa, and M. Valero. Implementation of Motion Estimation IP Core
for MPEG Encoder. In International Technical Conference on Circuits/Systems,
Computers and Communications, pages 72-80, 1999.
[12] K. Diefendorff and P. K. Dubey. How Multimedia Workloads Will Change Processor
Design. IEEE Micro, 30(9):43-45, 1997.
[13] R. Fromm, S. Perissakis, N. Cardwell, C. E. Kozyrakis, B. McGaughy, D. A. Patterson,
T. E. Anderson, and K. A. Yelick. The Energy Efficiency of IRAM Architectures.
In Proceedings of the 24th International Symposium on Computer Architecture, pages
327-337, 1997.
[14] D. L. Gall. MPEG: A video compression standard for multimedia applications. In
Communications of the ACM, volume 34, pages 47-58, April 1991.
[15] E. Glemet. LASSAP: A LArge Scale Sequence compArison Package. Bioinformatics,
32(2):137-143, 1997.
[16] M. Hall, P. Kogge, J. Koller, P. Diniz, J. Chame, J. Draper, J. LaCoss, J. Granacki,
A. Srivastava, W. Athas, J. Brockman, V. Freeh, J. Park, and J. Shin. Mapping
Irregular Applications to DIVA, A PIM-based Data-intensive Architecture. In
Proceedings of Supercomputing 1999, 1999.
[17] Hitachi Ltd. Hitachi Combines DRAM and Logic on a Single Silicon, 1997.
[18] E. Hjalmarson. High-Speed Gigabit Memories, 2001.
[19] IBM Microelectronics. Blue Logic SA-27E ASIC, 1999.
[20] IBM Microelectronics. IBM Chip Advance Spurs (System-on-a-Chip) Products,
1999.
[21] K. Itoh. Limitations and Challenges of Multigigabit DRAM Chip Design. IEEE
Journal of Solid-State Circuits, 32(5):624-634, 1997.
[22] ITRS. International Technology Roadmap for Semiconductors 2002 Update Overall
Roadmap Technology Characteristics. ITRS, 2002.
[23] J.-Y. Kang, S. Shah, S. Gupta, and J.-L. Gaudiot. An Efficient PIM (Processor-In-Memory)
Architecture for Motion Estimation. In IEEE 14th International Conference
on Application-specific Systems, Architectures and Processors, pages 273-283,
June 2003.
[24] Y. Kang, M. Huang, S. Yoo, Z. Ge, D. Keen, V. Lam, P. Pattnaik, and J. Torrellas.
FlexRAM: Toward an Advanced Intelligent Memory System. In International
Conference on Computer Design (ICCD), 1999.
[25] J. Kneip, S. Bauer, J. Vollmer, B. Schmale, P. Kuhn, and R. Bosch. The MPEG-4
Video Coding Standard - A VLSI Point of View. In 1998 IEEE Workshop on
Signal Processing Systems (SiPS): Design and Implementation, pages 43-52,
October 1998.
[26] P. Kogge, J. Brockman, and V. Freeh. PIM Architectures to Support Petaflops Level
Computation in the HTMT Machine. In 1999 International Workshop on Innovative
Architecture for Future Generation High-Performance Processors and Systems, pages
35-44, 2000.
[27] P. Kogge, T. Sunaga, H. Miyataka, K. Kitamura, and E. Retter. Combined DRAM
and Logic Chip for Massively Parallel Systems. In Proceedings of the Sixteenth
Conference on Advanced Research in VLSI, pages 4-16. IEEE Computer Society
Press, Los Alamitos, CA, 1995.
[28] C. E. Kozyrakis. A Media-Enhanced Vector Architecture for Embedded Memory
Systems. Technical Report UCBCSD-99-1059, University of California, Berkeley,
July 1999.
[29] C. E. Kozyrakis, S. Perissakis, D. Patterson, T. Anderson, K. Asanovic, N. Cardwell,
R. Fromm, J. Golbus, B. Gribstad, K. Keeton, R. Thomas, N. Treuhaft, and
K. Yelick. Scalable Processors in the Billion-Transistor Era: IRAM. Computer,
30(9):75-78, 1997.
[30] D. Lavenier. SAMBA: Systolic Accelerator for Molecular Biological Applications.
Technical Report RR-2845, Irisa, 1996.
[31] D. Lavenier and J.-L. Pacherie. Parallel Processing for Scanning Genomic Data-Bases.
In Parallel Computing: Fundamentals, Applications and New Directions,
Proceedings of the Conference ParCo'97, 19-22 September 1997, Bonn, Germany,
volume 12, pages 81-88, 1997.
[32] Mitsubishi Semiconductor. Mitsubishi Semiconductor eRAM/System Integration,
1999.
[33] MPEG.ORG. http://www.mpeg.org/MPEG/index.html.
[34] National Center for Biotechnology Information. NCBI News, Spring 2003. NCBI
News, 25:8, 2003.
[35] NCBI. National Center for Biotechnology Information,
http://www.ncbi.nlm.nih.gov/.
[36] Neomagic. The Neomagic MiMagic Product Family, 2001.
[37] M. Oskin, F. T. Chong, and T. Sherwood. Active Pages: A Computation Model
for Intelligent Memory. In The 1998 Annual International Symposium on Computer
Architecture, pages 192-203. IEEE Computer Society, June 1998.
[38] M. Oskin, J. Hensley, D. Keen, F. T. Chong, M. Farrens, and A. Chopra. Exploiting
ILP in Page-Based Intelligent Memory. In The 1998 International Symposium on
Microarchitecture, pages 208-218. IEEE Computer Society, June 1999.
[39] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis,
R. Thomas, and K. Yelick. A Case for Intelligent RAM. IEEE Micro, 17(2):34-44,
1997.
[40] D. Patterson, K. Asanovic, A. Brown, R. Fromm, J. Golbus, B. Gribstad, K. Keeton,
C. Kozyrakis, D. Martin, S. Perissakis, R. Thomas, N. Treuhaft, and K. Yelick.
Intelligent RAM (IRAM): the Industrial Setting, Applications, and Architecture. In
ICCD '97 International Conference on Computer Design, 1997.
[41] D. Patterson, N. Cardwell, T. Anderson, R. Fromm, K. Keeton, C. E. Kozyrakis,
R. Thomas, and K. Yelick. Intelligent RAM (IRAM): Chips That Remember and
Compute. In ICCD '97 International Conference on Computer Design, pages 224-225,
1997.
[42] D. A. Patterson and J. L. Hennessy. Computer Architecture: A Quantitative Approach.
Morgan Kaufmann, San Mateo, CA, 2002.
[43] J. Rabaey. Digital Integrated Circuits: A Design Perspective. Prentice-Hall,
February 2002.
[44] H. Schmit. Incremental Reconfiguration for Pipelined Applications. In IEEE
Symposium on FPGAs for Custom Computing Machines, pages 47-55. IEEE Computer
Society Press, 1997.
[45] H. Singh, M.-H. Lee, C. Lu, F. J. Kurdahi, N. Bagherzadeh, and E. M. C. Filho.
MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive
Applications. IEEE Transactions on Computers, 49(5):465-481, 2000.
[46] M. Sternberg. Protein Structure Prediction: A Practical Approach. Oxford University
Press, Oxford, England, January 1997.
[47] M. Sun and K. Yang. A Flexible VLSI Architecture for Full-search Block-Matching
Motion Vector Estimation. In IEEE Int. Symp. on Circuits and Systems, pages
179-182, May 1989.
[48] D. Talla. Architectural Techniques to Accelerate Multimedia Applications on
General-Purpose Processors. Ph.D. thesis, University of Texas at Austin, 2001.
[49] Y. Tian, E. Sha, C. Chantrapornchai, and P. Kogge. Efficient Data Placement for
Processor-in-Memory Array Processors. In Ninth IASTED International Conference
on Parallel and Distributed Computing and Systems, 1997.
[50] K. Yang, M. Sun, and L. Wu. A Family of VLSI Designs for Motion Compensation
Block Matching Algorithm. IEEE Transactions on Circuits and Systems,
36(10):1317-1325, 1989.