VECTORIZED INTERPROCESSOR COMMUNICATION AND
DATA MOVEMENT IN SHARED-MEMORY MULTIPROCESSORS
by
Dhabaleswar Kumar Panda
A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)
December 1991
Copyright 1991 Dhabaleswar Kumar Panda
UMI Number: DP22828
All rights reserved
INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if material had to be removed,
a note will indicate the deletion.
Dissertation Publishing
UMI DP22828
Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author.
Microform Edition © ProQuest LLC.
All rights reserved. This work is protected against
unauthorized copying under Title 17, United States Code
ProQuest LLC.
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90089-4015
This dissertation, written by
Dhabaleswar Kumar Panda
under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of
DOCTOR OF PHILOSOPHY
Dean of Graduate Studies
Date: 1991
DISSERTATION COMMITTEE
Chairperson
DEDICATED TO
my loving parents
Bapa and Maa
for all their encouragement and sacrifices
Acknowledgements
First of all, I would like to thank my thesis advisor Prof. Kai Hwang for all his guidance, encouragement, and support. It has been a great learning experience to work with him as a research and teaching assistant. His continuous encouragement and critical feedback on my research ideas have always driven me to accept challenging problems and to provide simple solutions to them. His strong confidence in me and his expectation of seeing me as an established researcher have made me work harder throughout my Ph.D. student life. My deepest gratitude to have him as my teacher and a good friend.

I am indebted to Prof. Viktor Prasanna for giving me an opportunity to pursue my Ph.D. career at USC, providing me with initial support and guidance during my early work for the Ph.D., exposing me to the field of theoretical computer science, and building a strong research foundation with me. I am also indebted to my thesis committee members Prof. Michel Dubois, Prof. C. S. Raghavendra, and Prof. Ming-Deh Huang for their help, criticism, ideas, and suggestions. I would like to thank the members of the VISCOM project team for their critical comments on my initial ideas and work, which later gave rise to this thesis.

For an international student, life during graduate study comes with frustration, agony, and pain. I was fortunate to have many friends who used to come forward when needed to extend help in spite of their own difficulties. This friendly help and these discussions have made my USC life full of memorable moments. My sincere gratitude to all these friends: Suresh, Rajagopal, Sunanda, Saumya, Sharad, Rachna, Meera, Santosh, Mao, Rajendra, and Aarti. This thesis would remain incomplete without mentioning my deep gratitude to Lucille Stivers and Gandhi Puvvada. I am very indebted to them for providing help which nobody other than family members would have done.

I gratefully acknowledge the financial support provided by NSF grant No. MIPS 89-04172.

My parents and family members have provided continuous encouragement and moral support to me during my higher studies, including my stay in the USA. I feel very delighted to fulfill their dream of seeing me with a doctorate degree. It has been very memorable to have Debashree come into my life two and a half months before my graduation as my life partner. It has been her love, affection, companionship, and understanding which gave a big push to this thesis work in its later stages. I feel very happy to share my joy of writing this thesis with my wife, parents, and other family members.
Contents

Dedication ii
Acknowledgements iii
List of Figures viii
List of Tables xi
Abstract xii

1 Introduction 1
   1.1 Data Movement in Multiprocessors 1
   1.2 Vectorized Memory Access 2
   1.3 On-the-fly Data Manipulation 3
   1.4 Communication Vectorization 5
   1.5 Organization of the Thesis 7

2 Interleaved Memory Access in Multiprocessors 9
   2.1 Introduction 9
   2.2 Three Multiprocessor Configurations 9
      2.2.1 Bus-based Multiprocessor 11
      2.2.2 Crossbar-Connected Multiprocessor 11
      2.2.3 Orthogonally-Connected Multiprocessor 11
   2.3 Data Allocation in Shared Memory 12
   2.4 Interleaved Memory Access Types 13

3 Data Manipulation Hardware 19
   3.1 Introduction 19
   3.2 Vector Register Windows 19
      3.2.1 Organization 20
      3.2.2 Reconfigurability 20
      3.2.3 Data Coherency 23
   3.3 Index Manipulator 24
      3.3.1 Organization 24
      3.3.2 Interleaved-Read-Write with On-the-fly Indexing 25
   3.4 Parallel Data Movement and Manipulation 27
   3.5 Software Interface and Programmability 31

4 Fast Data Manipulation with On-the-fly Indexing 32
   4.1 Introduction 32
   4.2 Equivalence to a Generalized Network 32
   4.3 Data Movement Cost Model 34
      4.3.1 Similarity Between OMP and MCC 35
      4.3.2 Using Clos Network to Analyze Cost 35
      4.3.3 A Modified Clos Network 38
   4.4 Data Movement Complexity Analysis 40
      4.4.1 Comparing OMP with CCM 41
      4.4.2 Reduction of Data Movement Steps on OMP 44
      4.4.3 Comparing OMP with MCC 48
   4.5 Simulation Experiments and Results 51
      4.5.1 A CSIM-based Multiprocessor Simulator 52
      4.5.2 Simulation Experiments 53
      4.5.3 Simulation Results and Implications 53

5 Vectorized Interprocessor Communication 58
   5.1 Introduction 58
   5.2 From Message Passing to Shared-Variable Communications 59
      5.2.1 Message-Passing Steps 59
      5.2.2 Primitive Message-Passing Patterns 61
      5.2.3 Shared-Variable Mailbox Communication 62
   5.3 Vectorizing Communication Patterns 64
      5.3.1 Operational Digraphs 67
      5.3.2 The Vectorization Procedure 68
      5.3.3 An Example Program Conversion 74
   5.4 Communication Bandwidth 78

6 Program Conversion from Multicomputers to Multiprocessors 80
   6.1 Introduction 80
   6.2 Converting Hypercube Programs 81
      6.2.1 Vectorizing Communication Patterns 81
      6.2.2 Hypercube Communication Complexity 83
      6.2.3 Vectorized Communication Complexity 86
      6.2.4 Reduction in Communication Complexity 87
   6.3 Simulation Experiments and Results 91
      6.3.1 Simulation Experiments Performed 91
      6.3.2 Simulation Results and Implications 93
      6.3.3 Tradeoffs in Computations and Communications
   6.4 Converting Mesh Programs
      6.4.1 Mesh with Boundary Wrap-around
      6.4.2 Mesh with Generalized Wrap-around

7 Conclusions and Suggested Future Research
   7.1 Summary of Research Contributions
   7.2 Suggestions for Future Research

Bibliography

Appendix A: Architecture Modeling CSIM Macros
Appendix B: Simulation Program Listings
List of Figures

1.1 A typical shared-memory multiprocessor configuration with shared interleaved memories. 2
1.2 Use of vectorized data access in multiprocessors in supporting (a) computation, (b) data manipulation, and (c) interprocessor communication. 3
1.3 Traditional scalar mailbox approach for interprocessor communication in shared-memory multiprocessors. 5
2.1 Three shared-memory multiprocessor configurations using interleaved memory organizations: (a) single bus-based multiprocessor, (b) crossbar-connected multiprocessor, and (c) orthogonally-connected multiprocessor. 10
2.2 Shuffled data allocation of an (8 x 8) matrix onto a (4 x 4) memory array in example multiprocessor organizations with 1-D and 2-D memory interleaving. 14
2.3 Possible different scalar and vector memory accesses. 15
2.4 Bus protocols and corresponding access times to implement different scalar and vector memory accesses. 16
3.1 Block diagram organization of vector register windows with on-the-fly index manipulator. 21
3.2 Reconfigurability in vector register windows. 22
3.3 Address mapping scheme for vector register windows. 22
3.4 Functional organization and operating principles of an example index manipulator with 4 memory modules on an interleaved bus. 26
3.5 Illustrative examples of selected classes of data movement operations with a (4 x 4) matrix data structure. 29
4.1 Equivalence of on-the-fly indexing scheme to a generalized switch. 34
4.2 Logical connectivity in the distributed memories of a mesh-connected computer. 35
4.3 Clos network models to analyze data movements on a (4 x 4) OMP: (a) a 3-stage Clos network model supporting alternate row and column operations, (b) a modified Clos network modeling row, column, row-column, and column-row operations. 37
4.4 Basic data movement steps in the orthogonal multiprocessor and the associated state transition. 40
4.5 Single-step or multiple-substep realization of an intra-column data movement operation on an example 3-processor crossbar-connected multiprocessor with 9 memory modules and 3 interleaved buses. 42
4.6 Source tag generation for linear array data routing in MCC from Clos network switch settings. 49
4.7 Factors of reduction in (a) computation overhead (instructions used for computation and data manipulation) and (b) total execution time by using on-the-fly index manipulation to shift 5 columns of a (P x P) matrix on different orthogonal multiprocessor configurations. 54
4.8 Comparison of (a) instruction overheads and (b) total execution time in shifting a (P x P) matrix by 5 columns on a 16-processor orthogonal multiprocessor. 56
4.9 Comparison of (a) computational overhead and (b) total execution time in performing row and column shifts of a (P x P) matrix by using on-the-fly index manipulation on OMP and CCM. 57
5.1 Message-passing steps in a sample program for a 4-node multicomputer. 60
5.2 (a) Communication graphs for primitive message-passing patterns and (b) transformation to memory-write and memory-read graphs. 63
5.3 Memory access types and corresponding memory-read and memory-write graphs. 65
5.4 (a) Connectivity graph and (b) operational digraphs of crossbar-connected multiprocessor configuration. 68
6.1 Different clustering options for a 32-node hypercube to convert its program to run on a 16-processor multiprocessor. 82
6.2 Comparison of communication complexities on 16-processor systems for various problem sizes. 95
6.3 Comparison of timing complexities for three problems on different hypercube, orthogonal multiprocessor, and crossbar-connected multiprocessor configurations. 98
6.4 Comparison of timing complexities for two problems on different hypercube, orthogonal multiprocessor, and crossbar-connected multiprocessor configurations. 100
6.5 Converting a (4 x 4) mesh with wrap-around connections onto a 4-processor OMP. 102
6.6 Converting a (4 x 4) generalized mesh program onto a 4-processor OMP by mapping each column of 4 nodes to run on a single processor of the OMP. 103
List of Tables

2.1 Possible Scalar and Vector Memory Accesses with Interleaved Memory Organization and Corresponding Access Time.
3.1 Reconfigurability in vector register windows for variable matrix sizes.
4.1 Comparison of Maximum Time Overheads for Implementing Data Movement Operations with n^2 data.
5.1 Possible Message-Passing Patterns.
5.2 Memory-Access Types Leading to Minimal Access Time for Implementing Primitive Patterns on Three Multiprocessors Using Shared-Variable Communication Vectorization.
5.3 Estimated Vectorized Communication Bandwidth in Mbytes/sec on a 32 x 32 Orthogonal Multiprocessor for Varying Message Lengths.
5.4 Estimated Communication Bandwidth in Mbytes/sec for Different OMP Sizes.
6.1 Equivalent Memory Accesses to Implement Primitive Message Patterns on Multiprocessors by Communication Vectorization.
6.2 Communication Complexities for Primitive Patterns on an m-node Hypercube Using Circuit-Switched Communication.
6.3 Time Complexities and Associated Parameters to Implement Primitive Patterns on Two Multiprocessors Using Communication Vectorization.
6.4 Analytically Estimated Asymptotic Percentage Reductions in Communication Complexity by Converting a 16-processor Hypercube Program onto 16-processor Multiprocessors.
6.5 Percentage Reductions in Communication Complexity Derived by Simulation while Converting a 16-processor Hypercube Program to Run on 16-processor Multiprocessors.
Abstract

Vectorized memory access schemes have traditionally been used in multiprocessors to enhance computational efficiency. However, applications requiring dense communication and data manipulation are unable to take advantage of these memory access schemes. In this thesis, we take a new approach to vectorized shared-memory access with the objective of implementing processor-memory data movement, memory-to-memory data manipulation, and processor-processor communication, all in a vectorized manner.

This thesis has two major contributions. The first contribution lies in developing a novel vectorized memory access scheme to blend with interleaved memory organization. During vector data transfer between a processor and the interleaved memory system, this scheme allows data elements of a vector to be manipulated on-the-fly under program control. Using this scheme, we develop the new concept of an atomic vector read-modify-write cycle and demonstrate parallel data manipulation with minimal overhead from processors. With two-dimensional interleaved memory organization, we demonstrate up to 75% savings in computational bandwidth in implementing matrix shifts and rotations. This scheme demonstrates potential to achieve concurrent computation and data manipulation.

The second contribution is in developing a new concept of memory-based vectorized interprocessor communication on multiprocessors with interleaved shared memories. We configure this shared memory as a collection of vector mailboxes. With a suitable allocation of these mailboxes, we demonstrate that processors can exchange messages by vector memory-write and memory-read accesses. Similar to vectorizing computational steps, this approach allows the communication steps of a parallel program to be vectorized. We present a communication vectorization scheme. This scheme vectorizes the interprocessor communication steps of distributed-memory multicomputer programs and implements them on a shared-memory multiprocessor. Due to vector-oriented communication, such program conversion leads to a significant reduction in communication complexity. Three multiprocessor configurations are evaluated in their capabilities to support this vectorization. Communication complexities in these multiprocessors are compared with those of a hypercube system using circuit-switched message passing. For applications requiring all-to-all types of dense message patterns, communication complexity reduces by a factor of two to four when a hypercube system is compared with a shared-memory multiprocessor of the same size.
Chapter 1

Introduction

1.1 Data Movement in Multiprocessors
Efficient processor-memory data movement often leads to significant performance enhancement in shared-memory multiprocessors [Bai87, BC90, YTL87]. The traditional approach in multiprocessor architecture development has been to implement high-bandwidth, low-latency processor-memory data movement. This objective has been geared towards matching the computational bandwidth of the system with memory bandwidth and achieving high-performance computation. However, many applications like numerical simulations, large-scale matrix computations, and image processing demand operations like shift, rotate, and row-column exchanges on matrix/image data. These data movement operations either occur between computational steps or are required between consecutive phases of a multiphase algorithm while solving a problem. Data exchange between processors during computation is well known as interprocessor communication. The latter type of data movement occurs when a given data allocation in memory does not work as an optimal allocation for the parallel program to operate on; this requires a change in data allocation. In the absence of any computation or processing, these types of data movement are known as data manipulation. Hence, data movement plays an important role in all three aspects of multiprocessing: computation, interprocessor communication, and data manipulation.
1.2 Vectorized Memory Access
Interleaved memory organization supports vectorized data movement between memory modules and processors. Vector supercomputers use this memory interleaving extensively to implement pipelined data transfers between interleaved memory modules and pipelined computational units. Memory interleaving is also used in multiprocessors to support high-bandwidth, low-latency shared-memory access.

Consider multiprocessor configurations supporting interleaved shared memories, as shown in Fig. 1.1. Each processor has its own local memory. The system interconnect facilitates vectorized data transfer between processors and interleaved memory modules. These vectorized memory accesses have traditionally been used to read vector data from memory, compute on the data, and write back results to memory as vectors. Such processor-memory vector data transfer facilitates computation, as indicated in Fig. 1.2a.
Figure 1.1: A typical shared-memory multiprocessor configuration with shared interleaved memories.
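The interleaving idea in this section can be sketched in a few lines of Python. This is a toy model under the assumption of low-order interleaving (word i living in module i mod n), not the bus protocols derived in Chapter 2; all names are illustrative.

```python
# Toy model of n-way low-order memory interleaving: the n elements of a
# unit-stride vector fall in n distinct modules, so they can be fetched
# in a single interleaved (pipelined) cycle with no module conflicts.

N_MODULES = 4

def module_of(addr):
    """Module holding a given word address under low-order interleaving."""
    return addr % N_MODULES

def vector_read(memory, base, length):
    """Read a unit-stride vector, grouping the accesses by interleaved
    cycle; each cycle touches every module at most once."""
    cycles = []
    for start in range(0, length, N_MODULES):
        burst = [memory[base + start + k]
                 for k in range(min(N_MODULES, length - start))]
        cycles.append(burst)
    return cycles

# An 8-element vector needs 8/4 = 2 interleaved cycles, and an aligned
# cycle hits modules 0..3 exactly once.
memory = list(range(100, 116))          # 16 words of toy shared memory
cycles = vector_read(memory, 0, 8)
assert len(cycles) == 2
assert [module_of(a) for a in range(4)] == [0, 1, 2, 3]
```

The same grouping argument underlies the access-time derivations of Chapter 2: a vector of length L costs about L/n interleaved cycles instead of L scalar ones.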
In this thesis, our goal is to investigate the effective usage of vectorized memory access in implementing efficient memory-to-memory data manipulation (Fig. 1.2b) and processor-to-processor communication (Fig. 1.2c). We consider multiprocessors with one- and two-dimensional memory interleaving. Multiprocessors with single or multiple buses support one-dimensional memory interleaving. Two-dimensional memory interleaving is supported by the orthogonally-connected multiprocessor [HTK89]. Using these multiprocessor configurations as test beds, we compare and contrast the effect of memory interleaving in supporting vectorized data manipulation and interprocessor communication.
Figure 1.2: Use of vectorized data access in multiprocessors in supporting (a) computation, (b) data manipulation, and (c) interprocessor communication.
1.3 On-the-fly Data Manipulation
One simple approach to using vectorized memory access for data manipulation is a three-step sequence: processors read data from shared memory using vector-read accesses, manipulate the data in their respective data buffers by instruction execution, and finally write the manipulated data back to shared memory with vector-write accesses. In this sequence, processors take an active part in the data manipulation operation, which diverts them from regular computational duties. Moreover, data manipulation operations are themselves defined as operations requiring neither processing nor computation. This leads to the problem of determining whether data manipulation operations can be implemented using vectorized memory access with minimal (or no) participation from processors.
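The baseline three-step sequence can be sketched as follows. This is a toy Python model (illustrative names; n interleaved modules each holding one element of a matrix row), intended only to show where the processor's instructions are spent.

```python
# Illustrative baseline: the three-step, processor-mediated manipulation
# sequence. Every element passes through the processor's buffer, costing
# instructions that could otherwise have gone to computation.

def three_step_column_shift(modules, row, shift):
    """Shift one n-element row of interleaved memory right by `shift`."""
    # Step 1: vector-read the row into the processor's data buffer.
    buffer = [m[row] for m in modules]
    # Step 2: the processor manipulates the buffer by instruction execution.
    n = len(buffer)
    shifted = [buffer[(j - shift) % n] for j in range(n)]
    # Step 3: vector-write the manipulated data back to shared memory.
    for j, m in enumerate(modules):
        m[row] = shifted[j]

mods = [[0], [1], [2], [3]]        # 4 interleaved modules, one row each
three_step_column_shift(mods, 0, 1)
assert [m[0] for m in mods] == [3, 0, 1, 2]
```

Step 2 is the part the on-the-fly indexing scheme of this section seeks to eliminate from the processor's workload.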
We take an on-the-fly indexing approach to solve this problem. During a vector-read access, data elements fetched from interleaved memory modules are written to data buffers associated with the respective processors. Similarly, a vector-write access reads the associated data buffers and writes their contents to the interleaved memories. We introduce a scheme for indexing these data buffers. During vector-read and vector-write accesses, the indexing scheme provides the flexibility to select appropriate data buffers. This selection is programmable and is implemented on-the-fly during vector-read and vector-write accesses.
Using this scheme, we develop the concept of a vector-read-modify-write access. With n-way memory interleaving, this scheme reads n data elements in one interleaved-read cycle and writes them back with any desired mapping in the following interleaved-write cycle. These two back-to-back cycles implement an atomic vector-read-modify-write access. A data manipulation operation using this atomic access is initiated by the processor at a cost of only a few instructions. The rest of the operation needs no attention from the processor and is implemented concurrently with computation. We show that such an atomic access provides the functionality of a generalized interconnection network, as characterized by Thompson [Tho78]. The effectiveness of this scheme is evaluated by analyzing the time complexity of realizing permutations and generalized mappings. The two-dimensionally interleaved orthogonal multiprocessor is shown to be more powerful with this scheme than the one-dimensionally interleaved crossbar-connected multiprocessor. Simulation experiments indicate that as the degree of memory interleaving increases, on-the-fly indexing becomes more powerful and provides significant savings in computational bandwidth.
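Assuming the buffer-selection semantics just described, the atomic vector-read-modify-write access can be modeled in a few lines; the function and its arguments are hypothetical stand-ins for the hardware mechanism detailed in Chapter 3.

```python
# Toy model of the atomic vector-read-modify-write access: with n-way
# interleaving, an interleaved-read cycle latches one element per module
# into a buffer, and a programmable index map selects which buffer each
# module writes back on the following interleaved-write cycle.

def vector_read_modify_write(modules, row, index_map):
    """Atomically remap one n-element row across n memory modules.

    modules   -- list of n module arrays (each a list of words)
    row       -- which word within each module to access
    index_map -- index_map[j] = buffer whose value module j writes back;
                 a permutation gives a shift/rotate, while a many-to-one
                 map gives a generalized (copying) mapping.
    """
    buffers = [m[row] for m in modules]          # interleaved-read cycle
    for j, m in enumerate(modules):              # interleaved-write cycle
        m[row] = buffers[index_map[j]]

# Rotating a 4-element row left by one costs a single atomic access,
# not a read/compute/write sequence through the processor.
mods = [[10], [11], [12], [13]]                  # 4 modules, 1 row each
vector_read_modify_write(mods, 0, [1, 2, 3, 0]) # rotate-left mapping
assert [m[0] for m in mods] == [11, 12, 13, 10]
```

Because index_map may be many-to-one, the model captures the generalized-network functionality claimed above, not just permutations.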
1.4 Communication Vectorization
Processors of a shared-memory system have traditionally used memory-based mailboxes [Zho90] to communicate with each other by passing messages, as shown in Fig. 1.3. These communication operations are primarily scalar in nature and have been used for synchronization and semaphore implementations. The mailbox type of scalar communication works well for parallel programs requiring only one-to-one or permutation types of communication. However, most scientific and numerical applications use dense message patterns like broadcast, multicast, and personalized multicast [HJ89, Kum88, LEN90, LN90]. Several routing schemes have been proposed in the literature [CE90, LS90] for implementing these communication-intensive message patterns efficiently on distributed-memory multicomputers.

These patterns can easily be implemented through memory-based message passing by breaking them into multiple scalar communication steps. However, this approach limits communication efficiency due to network and memory-access conflicts [Map90]. In this thesis, we take up the challenge of finding new schemes for fast implementation of broadcast and multicast types of message patterns on shared-memory multiprocessors.

Figure 1.3: Traditional scalar mailbox approach for interprocessor communication in shared-memory multiprocessors.
We solve this problem by using memory-based communication and vectorized memory access techniques. A methodology is provided to convert the send and receive operations of an interprocessor communication step into equivalent write and read accesses to memory-based mailboxes. Instead of scalar mailboxes, we develop the new concept of vector mailboxes. These vector mailboxes are allocated to interleaved memory modules in such a way that the maximum number of mailbox accesses are implemented as vector accesses. This facilitates fast implementation of broadcast and multicast types of message patterns.
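A toy model of the vector-mailbox idea, under the assumption that mailbox (i, j) holds the message from processor i to processor j; the class name and the dictionary-based allocation are illustrative, not the allocation scheme developed in Chapter 5.

```python
# Toy model of vector-mailbox communication: a send deposits a whole
# message with one vector memory-write, and a receive retrieves it with
# one vector memory-read, so a dense pattern like broadcast becomes a
# small number of vector accesses instead of many scalar ones.

class VectorMailboxes:
    """Shared memory organized as vector mailboxes; box (src, dst) is
    the message buffer from processor src to processor dst."""

    def __init__(self):
        self.box = {}                     # (src, dst) -> message vector

    def send(self, src, dst, message):
        """One vector memory-write deposits the whole message."""
        self.box[(src, dst)] = list(message)

    def recv(self, src, dst):
        """One vector memory-read retrieves it."""
        return self.box[(src, dst)]

    def broadcast(self, src, dsts, message):
        """Broadcast modeled as repeated vector-writes; with a suitable
        mailbox allocation it can collapse to fewer accesses."""
        for d in dsts:
            self.send(src, d, message)

mb = VectorMailboxes()
mb.broadcast(0, [1, 2, 3], [5, 6, 7, 8])
assert mb.recv(0, 2) == [5, 6, 7, 8]
```

The point of the vectorization scheme below is precisely to choose the mailbox-to-module allocation so that each send/recv maps to a conflict-free vector access.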
We determine the primitive message patterns used in parallel programs and present a communication vectorization scheme. This scheme analyzes the processor-memory interconnection network and interleaved memory organization of a shared-memory system, allocates vector mailboxes to memory modules, and determines the optimal vector and scalar memory accesses to implement a message pattern in minimal time.

Compared to link-based communication, memory-based message passing requires low software overhead [WB91]. Vectorized message passing further reduces the communication overhead of a parallel program by taking advantage of pipelined message transfer to vector mailboxes. This leads to the use of shared memory as an effective interprocessor communication medium. The processors considered in our multiprocessor models have local memories attached to them. Hence, the communication vectorization concept leads to new ways of implementing distributed-memory parallel programs on shared-memory multiprocessors by using shared memory to implement vectorized communication. This broadens the scope of a shared-memory system as a flexible system architecture [Gha89] supporting both the shared-memory and message-passing models of parallel computation [BM89].
We demonstrate such program conversion from various distributed-memory multicomputers, such as the hypercube and mesh, to run on crossbar-connected and orthogonally-connected multiprocessors. We concentrate mostly on the circuit-switched hypercube [Bok91a], due to its popularity among parallel computing researchers. We convert several hypercube programs representing primitive message patterns. The associated reductions in communication complexity are determined both analytically and through simulations. Significant reductions in communication complexity are observed by converting hypercube programs using one-to-all, all-to-one, and all-to-all dense interprocessor communication.
1.5 Organization of the Thesis
This dissertation is organized into seven chapters. Chapter 2 focuses on interleaved memory access in multiprocessors. We present three multiprocessor configurations using one-dimensional and two-dimensional memory interleaving. Different data allocation and memory access schemes with interleaved memory organization are discussed, and access times for these memory accesses are derived.
Data manipulation using vector-read-modify-write access requires efficient data
buffer organization associated with processor. We present such a hardware buffer
organization, defined as vector-register-windows, in Chapter 3. This data buffer
organization is reconfigurable to m atch with degree of memory interleaving. We also
discuss about the capability of this buffer organization to support data coherency in
multiprocessors with shared-memories. Hardware design and programmable controls
of an index manipulator to support on-the-fly indexing and vector-read-modify-write
access are presented. We illustrate few parallel data movement and manipulation
operations on crossbar-connected and orthogonally-connected multiprocessors using
on-the-fly indexing scheme.
Chapter 4 emphasizes analyzing the data manipulation capability of the on-the-fly
indexing scheme. We demonstrate that vector-read-modify-write access is functionally
equivalent to realizing interconnections with a generalized interconnection
switch box. We analyze the data manipulation capability of crossbar-connected and
orthogonally-connected multiprocessors using this indexing scheme and compare it
with that of a mesh-connected computer of comparable size. We take a Clos network
modeling approach in this comparison. Related simulation experiments and
results are presented.

Chapter 5 centers around the theme of vectorized interprocessor communication.
Modeling of interprocessor communication steps and determining primitive message
patterns are presented. We introduce vector mailboxes and associated memory-based
communication. A communication vectorization methodology is presented
to determine optimal vector and scalar memory accesses to implement a message
pattern with minimal access time. Conversion of multicomputer programs to run
on multiprocessors is emphasized in Chapter 6. Analytical and simulation results
indicate reductions in communication complexity.

Chapter 7 concludes this dissertation and suggests future research directions.
The first part summarizes results and contributions. The second part indicates
short-term and long-term future research directions.
Chapter 2
Interleaved Memory Access in Multiprocessors
2.1 Introduction
Memory interleaving leads to pipelined data transfer between a processor and memory
modules. For a multiprocessor with interleaved memory organization, it is critical
that data be allocated efficiently to these interleaved memories to facilitate computation.
It is also important to identify the vector and scalar access types feasible with
a given memory interleaving. In this chapter, we present three multiprocessor
configurations. The first two use one-dimensional memory interleaving, while the third
one uses two-dimensional memory interleaving. We discuss allocating matrix/image
data to these two different interleaved memory organizations to facilitate data access
by rows and columns. We analyze the basics of memory interleaving, identify all
possible memory accesses, and determine their respective access times.
2.2 Three Multiprocessor Configurations
Figure 2.1 illustrates three shared-memory configurations: single bus-based, crossbar-
connected, and orthogonally-connected. All these configurations use interleaved
shared memories.
Figure 2.1: Three shared-memory multiprocessor configurations using interleaved
memory organizations: (a) single bus-based multiprocessor, (b) crossbar-connected
multiprocessor, and (c) orthogonally-connected multiprocessor (Pi: Processor, Mi or
Mi,j: Memory module, Bi: Interleaved access bus, MCi: Memory Controller).
2.2.1 Bus-based Multiprocessor
Consider an n-processor system with n memory modules as shown in Fig. 2.1(a).
The memory modules, M0, M1, ..., Mn−1, are connected to a single bus B0. The
processors, P0, P1, ..., Pn−1, are connected to the bus through n identical memory
controllers, MC0, MC1, ..., MCn−1. In addition to scalar memory access, each
memory controller supports vector-read and vector-write accesses. Memories are
n-way one-dimensionally interleaved. The system provides fully shared-memory
capability. An optional interprocessor interrupt bus is used to enable fast
synchronization among the processors.
2.2.2 Crossbar-Connected Multiprocessor
Figure 2.1b represents an n-processor system with n² memory modules, M0,0, M0,1,
..., Mn−1,n−1. These memory modules are connected to n parallel buses. The
processors are connected to these n buses through a system interconnect. This
system interconnect can be a crossbar, multiple buses, or any multi-stage
interconnection network. Without any loss of generality, we assume a crossbar
interconnect for our analysis. This configuration supports one-dimensional memory
interleaving and provides fully shared-memory capability. We identify such a
configuration as a crossbar-connected multiprocessor (CCM) throughout the thesis.
2.2.3 Orthogonally-Connected Multiprocessor
Figure 2.1c shows an orthogonally-connected multiprocessor configuration with n
processors, 2n buses, and n² memory modules using two-dimensional memory
interleaving. This architecture concept was originally reported in [HTK89]. A design
implementation of this orthogonal multiprocessor was reported in [HDP+90]. We
identify this configuration as either an orthogonally-connected multiprocessor or
orthogonal multiprocessor (OMP). A group of n row buses, RB0, RB1, ..., RBn−1,
are directly connected to the processors in the horizontal dimension. The remaining
n column buses, CB0, CB1, ..., CBn−1, are distributed across the two-dimensional
memory organization in an orthogonal way. Consider indices in the range
0 ≤ i, j, k ≤ n − 1. A memory module Mi,j is connected to two buses, RBi and CBj.
Both RBi and CBi are controlled by the same memory controller MCi. Both row and
column buses support n-way interleaving. This configuration allows a memory module
to be shared between two processors Pi and Pj.

A processor Pk is capable of accessing n memory modules Mk,j (or Mj,k) in row (or
column) mode. This orthogonal access mode underlines the basic principle behind
the OMP and provides conflict-free shared-memory access [HTK89, SM89]. This
memory organization makes the multiprocessor a partially shared-memory system.
In this thesis, we emphasize only the two-dimensional OMP, but the principle of
the orthogonal access mode can be extended to higher dimensions. Readers are
referred to [HDP+90, HK88] for OMP(n, k), a multidimensional orthogonal
multiprocessor architecture with dimension n and multiplicity k along each dimension.
2.3 Data Allocation in Shared Memory
Each interleaved bus accesses a group of n memory modules in a pipelined manner.
S-access interleaving, with a fixed address offset to all n memory modules on the
bus, corresponds to an implicit vector load/store operation with a vector length of
n. We define the content of a memory location as an element. During a vector-read
(write) access, a group of n elements forming a vector are read from (written to) n
memory modules in a pipelined manner.
Both the crossbar-connected multiprocessor and the OMP use a two-dimensional
(n × n) memory organization. Consider a (p × p) matrix A = (ai,j) with p > n.
Allocation of the p² matrix elements to the (n × n) memory modules can be
made in several ways. A uniform allocation distributes p²/n² elements to each
memory module. Figure 2.2 illustrates a shuffled partition approach in distributing
the elements of an (8 × 8) matrix onto a (4 × 4) memory array. Each row of the
target matrix is folded onto two consecutive address offsets. In this example, the
capacity of each memory module is assumed to be four words.
Figure 2.2a shows the effect of 1-D row interleaving corresponding to the
crossbar-connected multiprocessor. A vector-read row access with address offset 1
results in vectors with elements (a0,4, a0,5, a0,6, a0,7) and (a2,4, a2,5, a2,6, a2,7)
for processors P0 and P2, respectively. Figure 2.2b shows 2-D row and column
interleaving. Similar to the row access, a vector-read column access with address
offset 1 results in fetching two vectors with elements (a0,4, a1,4, a2,4, a3,4) and
(a0,6, a1,6, a2,6, a3,6) for processors P0 and P2, respectively.
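The shuffled allocation can be captured by a small mapping sketch. This is our
reconstruction under the stated assumptions (a (p × p) matrix folded across p/n
consecutive address offsets on an (n × n) module array); it reproduces the row and
column fetches quoted above:

```python
# Our reconstruction of the shuffled allocation: element a[i][j] of a
# (p x p) matrix lands in module (i mod n, j mod n) at address offset
# (i div n) * (p/n) + (j div n), folding each row over p/n offsets.

def shuffled_allocation(i, j, p, n):
    return (i % n, j % n, (i // n) * (p // n) + (j // n))

p, n = 8, 4
layout = {}                 # (module_row, module_col, offset) -> (i, j)
for i in range(p):
    for j in range(p):
        layout[shuffled_allocation(i, j, p, n)] = (i, j)

# Row vector-read by P0 at offset 1 (modules M[0][0..3]) fetches a[0][4..7]:
assert [layout[(0, c, 1)] for c in range(n)] == [(0, 4), (0, 5), (0, 6), (0, 7)]
# Column vector-read on CB_0 at offset 1 (modules M[0..3][0]) fetches a[0..3][4]:
assert [layout[(r, 0, 1)] for r in range(n)] == [(0, 4), (1, 4), (2, 4), (3, 4)]
```

Each (module, offset) slot holds exactly one element, so the mapping is a
uniform allocation of p²/n² elements per module, as required.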
2.4 Interleaved Memory Access Types
The design of the memory subsystem in all three multiprocessor configurations is
assumed to support both scalar and vector memory accesses. This flexibility gives
rise to six different types of memory write accesses, as shown in Fig. 2.3. Without
loss of generality, we assume the bus width to be equal to the width of a memory
word. Scalar-write allows a processor to write one data element to a fixed memory
module in a single access step. Block-scalar-write allows a block of l data elements
to be written to l consecutive addresses of a single memory module. A single data
element is written to identical addresses of all memory modules on a bus using
broadcast access. Block-broadcast allows a block of data to be broadcast to all
memory modules.

Multiple data elements (or a vector of data) are written to the same address of a
set of memory modules connected to a single bus by vector-write access. This access is
Figure 2.2: Shuffled data allocation of an (8 × 8) matrix onto a (4 × 4) memory
array in example multiprocessor organizations with 1-D and 2-D memory interleaving:
(a) one-dimensional row interleaving in the crossbar-connected multiprocessor;
(b) two-dimensional row and column interleaving in the orthogonal multiprocessor.
alternatively referred to as interleaved-write access. This access is extensively
used in traditional vector supercomputers to support high-bandwidth data transfer
between pipelined computational units and the memory system. Block-vector-write
access, or block-interleaved-write access, allows multiple vectors to be written to
consecutive addresses of interleaved memory modules. Such a memory subsystem also
supports the corresponding read accesses: scalar-read, block-scalar-read,
vector-read, and block-vector-read. The vector read accesses are also referred to
as interleaved-read and block-interleaved-read in this thesis.
Figure 2.3: Possible different scalar and vector memory accesses: (a) scalar-write,
(b) block-scalar-write, (c) broadcast, (d) block-broadcast, (e) vector-write or
interleaved-write, and (f) block-vector-write or block-interleaved-write.
These memory accesses take different amounts of access time. Consider a
conflict-free scalar memory access by a processor Pi to an interleaved memory
module Mi,0. As shown in Fig. 2.4a, there is a fixed time α to initiate this memory
access. This time includes the time to activate the memory controller, the time for
the memory controller to put the required address on the bus, and the address
propagation delay on the bus. Let β represent the time needed for memory access and
data propagation delay on the bus. Thus, a single scalar memory access (either read
or write) takes (α + β) time.
Figure 2.4: Bus protocols and corresponding access times to implement different
scalar and vector memory accesses: (a) scalar-read or scalar-write, (b) vector-read
or vector-write, and (c) block-vector-read or block-vector-write.
Consider a vector access as shown in Fig. 2.4b. The vector length is assumed
equal to the degree of memory interleaving; for n-way interleaved memories, this
is n. For a read access, it takes (α + β) time to read the first data element from
the memory module Mi,0. Successive data elements from the other memory modules
are streamed out of the bus in (n − 1) minor cycles with a pipelined cycle time
of τ. Similarly, for a write access, data elements are first streamed into registers
associated with the memory modules via the bus. A parallel write control is activated
afterwards to implement the vector-write. The time for this parallel-write access is
included in β as the memory access time. Thus, the overall time to perform a single
vector access is (α + β + (n − 1)τ).
Figure 2.4c shows the protocol for a block-vector access. A block of l vectors is
written to consecutive addresses of the memory modules in l consecutive interleaved
cycles. We assume an overhead of δ to increment the subsequent addresses of the
memory modules. For a typical implementation, τ < δ < β < α. Thus, a
block-vector-read or a block-vector-write access takes (α + β + (l − 1)δ + (nl − 1)τ)
time.
Consider a case where a processor reads n data elements from an interleaved
bus and writes them back immediately to the same address offset after rearranging
the data elements. With appropriate hardware support, these two interleaved (vector)
accesses can be implemented as one atomic step. The bus protocol for this cycle
is similar to that of Fig. 2.4c with a value of l = 2. The time δ is used to switch
from read to write access and change an index set (we define the concept of an index
set in Chapter 3). The first memory access is an interleaved-read and the following
access is an interleaved-write. Similar to a read-modify-write scalar access, this
cycle implements vector-read-modify-write access. We define this atomic vector
access as an Interleaved-Read-Write (IRW) step. The overall time to perform an IRW
step is (α + β + (2n − 1)τ + δ).
Table 2.1 identifies all possible memory accesses and their associated access
times. Scalar/vector accesses with a single word are identified as A1–A5. The
corresponding block memory accesses are identified as BA1–BA5. Memory access A6
represents the atomic IRW step. Using current bus technology, one can have
τ = 50 nsec, α = 800 nsec, β = 200 nsec, and δ = 200 nsec for n ≤ 32 [HDP+90].
Table 2.1: Possible Scalar and Vector Memory Accesses with Interleaved Memory
Organization and Corresponding Access Time (α = fixed access time, β = memory
access and bus transfer time, τ = pipelined bus transfer time, δ = address change
overhead in block access, l = number of words/vectors in a block access, n =
vector length and degree of memory interleaving).

Identifier  Access Types                        Access Time
A1          scalar-read                         α + β
A2          scalar-write                        α + β
BA1         block-scalar-read                   α + lβ
BA2         block-scalar-write                  α + lβ
A3          broadcast                           α + β
BA3         block-broadcast                     α + lβ
A4          vector-read or interleaved-read     α + β + (n − 1)τ
A5          vector-write or interleaved-write   α + β + (n − 1)τ
BA4         block-vector-read                   α + β + (nl − 1)τ + (l − 1)δ
BA5         block-vector-write                  α + β + (nl − 1)τ + (l − 1)δ
A6          interleaved-read-write              α + β + (2n − 1)τ + δ
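As a numerical check, the formulas above can be evaluated with the example bus
parameters quoted in the text; the function and constant names below are ours:

```python
# Access-time formulas from Table 2.1, evaluated with the example
# parameters quoted above (all times in nanoseconds); names are ours.
ALPHA, BETA, TAU, DELTA = 800, 200, 50, 200

def t_scalar():            return ALPHA + BETA                    # A1-A3
def t_block_scalar(l):     return ALPHA + l * BETA                # BA1-BA3
def t_vector(n):           return ALPHA + BETA + (n - 1) * TAU    # A4/A5
def t_block_vector(n, l):  # BA4/BA5
    return ALPHA + BETA + (n * l - 1) * TAU + (l - 1) * DELTA
def t_irw(n):              # A6: atomic interleaved-read-write
    return ALPHA + BETA + (2 * n - 1) * TAU + DELTA

n = 16
print(t_scalar())       # 1000 ns for one scalar access
print(t_vector(n))      # 1750 ns for one 16-element vector access
print(t_irw(n))         # 2750 ns for one atomic IRW step
# An IRW step costs the same as a block-vector access with l = 2,
# as the similarity of the bus protocols in Fig. 2.4 suggests:
assert t_irw(n) == t_block_vector(n, 2)
```

Note how the vector access amortizes the fixed cost (α + β) over n elements:
16 scalar accesses would take 16000 ns, against 1750 ns for one vector access.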
Chapter 3
Data Manipulation Hardware
3.1 Introduction
Interleaved memory organization provides vectorized data access to and from memory
modules. Data buffers, attached to the processors, are used to store these vector
data during memory access. The functionality associated with the data buffer design
facilitates efficient vector access. We take such an approach in this chapter in
designing a new type of data buffer organization, defined as vector register windows.
The design of an associated index manipulator is presented to implement on-the-fly
data manipulation during vector memory access. We illustrate parallel data movement
operations on multiprocessors using this data manipulation hardware.
3.2 Vector Register Windows
Consider an n-way interleaved memory organization and processors with vector
computation capability. If vector data resides in shared memory, then all
arithmetic-logic operations on vector data are done from vector registers to vector
registers attached to the processors. An input vector operand must first be loaded
from memory into a vector register before it can be used by a processor. Similarly,
an output vector operand must be stored into a vector register first and written to
memory later.
We consider a data buffer organization consisting of several windows of such vector
registers. We define this organization as Vector Register Windows (VRWs). The
elements of VRWs are accessed by a processor in its local memory address space.
The only memory transfers possible with VRWs are from memory to vector registers
(vector-read) and from vector registers to memory (vector-write). A read (write)
access to (from) a designated vector register window from (to) the shared memory is
specified by the programmer and is implemented as a DMA-like operation.
3.2.1 Organization
The VRW organization associated with each processor is shown in Fig. 3.1. There
are two components in this organization. We first consider the data buffer
organization; the index manipulator is discussed later. Each interleaved memory
access results in a vector of n elements. A set of v row or column vectors is
grouped together to form a window of vectors. The VRWs consist of w windows.
Consider a multiprocessor system with n = 16 and memory elements 4 bytes long.
Using 64 KBytes of static RAM memory, one can implement VRWs with a capacity of
1K vectors.
3.2.2 Reconfigurability
Vector registers are dynamically reconfigured into variable-size windows as
illustrated in Fig. 3.2 and Fig. 3.3. For a target application using (p × p)
matrices, any matrix row or column is allocated to the interleaved memories as row
or column vectors as shown in Fig. 2.2. A row or column of this matrix becomes
equivalent to ⌈p/n⌉ vectors. Hence, v = ⌈p/n⌉ provides a natural window size for
the VRWs. The address mapping scheme for vector register windows works as follows:
the two least significant bits, A0 and A1, are used to select one of the four bytes
within an element. The four bits A2–A5
Figure 3.1: Block diagram organization of vector register windows with on-the-fly
index manipulator (ATU: Address Translation Unit, TRS: Transceiver, MC: Memory
Controller).
are used to select an element within a vector. The remaining 10 bits, A6–A15, are
dynamically partitioned between the vector field and the window field. Depending
on the size of the application matrix, log2⌈p/n⌉ bits are used for identifying a
vector within a window. The remaining 10 − log2⌈p/n⌉ bits are used to identify a
window.

Consider using VRWs with 64 KBytes capacity. Table 3.1 shows the number of
available windows (w) and the window size (v) on the example multiprocessor system
for a wide range of application matrix sizes. The VRWs provide a sufficient number
of windows for handling large matrices to encapsulate locality of data references.
The dynamic reconfigurability of window size supports programming flexibility in
applications involving multiple matrices with different dimensions. A detailed
hardware design to support this reconfigurability feature is presented in [PHRH90].
Figure 3.2: Reconfigurability in vector register windows.

Figure 3.3: Address mapping scheme for vector register windows (fields from A15
down to A0: Window | Vector | Element | Byte, with a reconfigurable partition
between the window and vector fields).
Table 3.1: Reconfigurability in vector register windows for variable matrix sizes
(window size = number of vector components in each window).

Matrix Size    Window Size (v)    Number of Windows
16 × 16        1                  1024
32 × 32        2                  512
64 × 64        4                  256
128 × 128      8                  128
256 × 256      16                 64
512 × 512      32                 32
1024 × 1024    64                 16
2048 × 2048    128                8
4096 × 4096    256                4
8192 × 8192    512                2
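The window-size arithmetic behind Table 3.1 can be sketched as follows. This is
our reconstruction, assuming n = 16 and a 1K-vector VRW capacity as stated in the
text:

```python
import math

N = 16            # vector length / degree of interleaving
CAPACITY = 1024   # 1K vectors in the 64-KByte VRWs (16 elements x 4 bytes)

def window_config(p):
    """Window size v = ceil(p/N) and resulting window count for a
    (p x p) application matrix; the 10 bits A6-A15 split into
    log2(v) vector-select bits and the rest window-select bits."""
    v = math.ceil(p / N)
    w = CAPACITY // v
    vector_bits = int(math.log2(v)) if v > 1 else 0
    return v, w, vector_bits, 10 - vector_bits

for p in (16, 64, 1024, 8192):
    v, w, vb, wb = window_config(p)
    print(f"{p} x {p}: v = {v}, w = {w} ({vb} vector bits, {wb} window bits)")
# e.g. 64 x 64 gives v = 4, w = 256, matching Table 3.1.
```

Since v × w always equals the 1K-vector capacity, doubling the matrix dimension
halves the number of available windows, exactly as the table shows.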
3.2.3 Data Coherency
From a processor's perspective, the VRWs are partitioned into two separate address
spaces to treat global read-write and global read-only data separately. These two
address spaces divide the VRWs into noncacheable and cacheable spaces,
respectively. The caching boundary allows a user to partition the VRWs into
cacheable space: VRW addresses falling below the caching boundary are cacheable.
This boundary is programmable and can be dynamically moved during the execution of
an application program [PHRH90].

If a vector operand corresponds to the global read-only data type, it is read into
a vector register in the cacheable space of the VRWs. Otherwise, it is read into
the noncacheable space. So, when the processor accesses elements of a global
read-only vector from the VRWs, the elements are allowed to be cached in the
internal data cache of the processor, if available. Noncacheable data are directly
accessed from the VRWs and bypass the internal cache. This scheme offers the
following advantages:

1. If a read-only vector is used many times by an application, the computation
becomes fast by keeping the data in the internal cache of the processor.

2. In case of context switching, all modified register windows in the
noncacheable address space of the VRWs are written back to the shared memory. The
cacheable register windows containing global read-only data are not written back
to the shared memory.

3. Only a few selected windows need to be flushed for small applications. This
provides a program-controlled, selective, and fast flushing scheme.
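A minimal sketch of the caching-boundary test described above; the class name,
the boundary register, and the address values are illustrative assumptions, not
the actual hardware interface:

```python
# Illustrative sketch of the programmable caching boundary: VRW addresses
# below the boundary are cacheable (global read-only data), the rest
# bypass the internal cache. The class and values are assumptions.

class VRWCachingBoundary:
    def __init__(self, boundary):
        self.boundary = boundary       # programmable, may move at run time

    def is_cacheable(self, vrw_address):
        return vrw_address < self.boundary

    def space_for(self, global_read_only):
        # read-only vectors load below the boundary, read-write above it
        return "cacheable" if global_read_only else "noncacheable"

vrw = VRWCachingBoundary(boundary=0x4000)
assert vrw.is_cacheable(0x3FFF) and not vrw.is_cacheable(0x4000)
assert vrw.space_for(global_read_only=True) == "cacheable"
# On a context switch, only modified windows in the noncacheable space
# need to be written back; cacheable (read-only) windows may be discarded.
```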
3.3 Index Manipulator
3.3.1 Organization
Figure 3.1 shows the block diagram of the index manipulator hardware. An index
memory, mapped to the processor's local address space, allows a number of index
sets to be resident at any point in time. The frequently used index sets, once
generated, remain resident in this index memory until overwritten by other index
sets. These index sets are computed off-line at compile time and downloaded to the
processor before the program execution starts. An index set is required to be
resident in the index memory before its associated bus transfer starts.

The index manipulator operates in three modes. The first mode allows a processor
to access the index memory in its local memory address space to store index sets.
The second mode corresponds to pipelined bus access, where an index set controls
the on-the-fly memory read/write operation. The third operational mode allows the
processor to access (both read and write) the data buffers in its local address
space without any indexing. In the second mode, entries of the required index set
are read out from the index memory by a log2 n-bit counter during the n minor
cycles of the interleaved access. The selected index set entries pass through
address translation logic to generate effective addresses for the data buffers. The
accesses to index memory and data buffers are pipelined, i.e., the address
generated by the index memory during one minor cycle is used as the address to the
data buffers in the next minor cycle. Hence, the pipelined cycle time τ is
constrained only by the bus propagation delay; it does not include index memory
access time, counter incrementing time, etc.

The present technology allows one to build such an index manipulator for
τ = 50 nsec using data buffers and index memory with fast static RAMs of 25 and
15 nsec access
times, respectively [HDP+90, HP91]. In this thesis, we only consider a single set
of n data buffers with one word each. Multiple sets of data buffers can also be
used to implement complex data movements. Under this circumstance, the indexing
scheme demonstrates flexibility to index data buffers from different sets. For
large data sizes, p > n, the index manipulation concept can be easily extended to
achieve data manipulation across several rows or columns mapped to the same set of
interleaved memory modules. Readers are referred to [PH90] for a reconfigurable
vector register windows organization supporting such index manipulation for large
data sizes.
3.3.2 Interleaved-Read-Write with On-the-fly Indexing
Consider pipelined data transfers during an interleaved-read access by a processor.
The n elements are loaded from n memory modules into a set of n data buffers
associated with the processor. During a load operation, the data elements are
written to different buffers. Similarly, during a store operation, the data
elements are read from different buffers and written to the memory modules. Let e,
0 ≤ e ≤ n − 1, be the indices of these buffers. Consider any permutation or
mapping (p) on the index set E = {0, 1, ..., n − 1} with these indices.

As the data elements are transmitted (or received) to (or from) the interleaved
bus in a pipelined manner, the source (or destination) data buffers can be selected
based on p. We define p as an index set and the operation of selecting the
appropriate buffer during a load/store operation as indexing. These index sets are
generated off-line at compile time as specified by the programmer and stored in the
index memories associated with the VRWs. During each memory access, a desired
index set is selected from the index memory to implement the required data
movement.
Figure 3.4a shows the functional organization of the index manipulation logic. The
interleaved bus is assumed to have 4 memory modules, with each processor having
Figure 3.4: Functional organization and operating principles of an example index
manipulator with 4 memory modules on an interleaved bus: (a) functional
organization, (b) a read operation from address offset 1 with index set
{0, 3, 1, 2}, and (c) a write operation to address offset 1 with index set
{1, 1, 2, 0}.
4 data buffers. Figure 3.4b shows the on-the-fly index manipulation scheme during
an interleaved-read access with an index set {0, 3, 1, 2}. As the data elements are
read from the pipelined bus in sequence, they are written into buffers 0, 3, 1,
and 2, respectively. Similarly, during an interleaved-write access with an index
set {1, 1, 2, 0}, the elements are read from buffers 1, 1, 2, and 0, respectively,
and written to the memory modules as shown in Fig. 3.4c.
These two interleaved operations can be combined together as a single IRW step
producing a manipulated output data set {g, g, h, d} from the input data set
{d, f, g, h}. For a given data manipulation to be performed, various combinations
of index sets can be used during the read and write accesses. Without any loss of
generality, we assume that the interleaved-read access is always performed with an
identity index set. An appropriate index set is used with the interleaved-write
access for the desired data manipulation.

This indexing scheme requires the least intervention from the processor to
implement data manipulation on vector data. The overhead δ, as discussed in
Section 2.4, is the minimal time to change an index set in the index memory. This
allows concurrent computation and data manipulation in the system. The scheme
avoids the costly procedure of duplicating data into different buffers for
broadcast and arbitrary-broadcast types of data manipulation. The duplication is
done on-the-fly under the control of an appropriate index set. Hence, fast
implementation of these data manipulation operations is possible using the
indexing scheme.
3.4 Parallel Data Movement and Manipulation
The proposed on-the-fly indexing scheme provides substantial power to overlap CPU
computation with data manipulation. In the absence of such a scheme, the required
manipulation can still be achieved by the CPU. In this case, after an
interleaved-read access, the CPU has to manipulate the data elements in its data
buffers through instruction execution and then write them back to the memories. In
the worst case, this manipulation requires 2(n + 1) accesses to the data buffers.
The present scheme completely avoids this CPU involvement; the CPU only initiates
the data movement operation by executing a few instructions.
The concept of index sets for data manipulation is identical to routing control
steps in an interconnection network. Based on a desired mapping/permutation, the
associated index sets for the processors are computed off-line at compile time and
stored in their respective index memories. Hence, the overhead associated with
index set generation does not affect the performance of on-the-fly indexing.

For structured data movements like shift and rotate, index set generation by a
compiler is straightforward, because the intermediate data movement steps are easy
to determine. In this section, we provide two examples of parallel data movement to
illustrate this concept. For non-structured data movements like arbitrary
permutation or generalized mapping, it is necessary to determine the intermediate
data movement steps first. We present a theoretical framework in the next chapter
to generate such intermediate data movement steps.
Consider a (4 × 4) matrix as shown in Fig. 3.5, on which four different data
movement operations are illustrated. We consider a (4 × 4) memory organization
with 4 interleaved buses and 4 processors. We assume the matrix is stored in the
memory, as shown in Fig. 2.2, with an address offset A0. We use another address
offset A1 for temporary locations. We present two algorithms. Algorithm 1
demonstrates a shift-left matrix operation on a crossbar-connected multiprocessor
using 1-D memory interleaving. The system is assumed to consist of 4 processors
and 16 memory modules. Two different index sets, identity and shift-2, are used.
The complexity of this shift operation is one IRW step.
Algorithm 1: Matrix Shift on a Crossbar-Connected Multiprocessor
;the algorithm shifts a (4 x 4) matrix by two columns left
;the following index sets are used:
;C0 = { 0,1,2,3 } indicating an identity permutation
;C1 = { 2,3,c0,c0 } indicating a shift-2 mapping, c0 is constant 0
begin
  Parbegin
    For each processor Pi, i = 0 to 3, do
    begin
      interleaved-read from bus Bi with address offset A0
        and index set C0;
      interleaved-write to bus Bi with address offset A0
        and index set C1;
    end;
  Parend;
end.
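As an illustrative sketch (ours, not the dissertation's code), the interleaved read/write pair of Algorithm 1 can be simulated in Python; the helper name `irw` is hypothetical:

```python
# Hypothetical simulation of Algorithm 1: each "processor" performs an
# interleaved-read of its row with the identity index set, then an
# interleaved-write of the same row through index set C1 = {2,3,c0,c0}.
C0_SET = [0, 1, 2, 3]          # identity permutation
C1_SET = [2, 3, "c0", "c0"]    # shift-left-by-2; c0 = constant 0

def irw(row, read_set, write_set):
    """One interleaved-read-write step on a single row."""
    buf = [row[k] for k in read_set]                        # interleaved-read
    return [0 if k == "c0" else buf[k] for k in write_set]  # interleaved-write

matrix = [[(i, j) for j in range(4)] for i in range(4)]
shifted = [irw(row, C0_SET, C1_SET) for row in matrix]
# Row 0 becomes [(0,2), (0,3), 0, 0]: the matrix shifted left by 2 columns,
# with zeros filled in and no wrap-around, in one IRW step per row.
```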
Figure 3.5: Illustrative examples of selected classes of data movement operations
with a (4 x 4) matrix data structure: (a) original matrix, (b) shift-left by 2 columns,
(c) shift-up-right by 2 rows and 1 column, (d) select 3rd row to 2nd column,
(e) row permutation.
Algorithm 2 implements a matrix manipulation on a 4-processor OMP. A select-row-to-column
operation (selecting the 3rd row to the 2nd column) on an OMP using 2-D
memory interleaving is demonstrated. Two different index sets are used. The
manipulation takes place in two phases. The first phase shifts the 3rd row
to the 2nd row. Then processor P2 moves this second row to the second column. This
operation takes 2 IRW steps.
Algorithm 2: Matrix Manipulation on an Orthogonal Multiprocessor
;the algorithm selects the 3rd row to the 2nd column of a (4 x 4) matrix
;the following index sets are used:
;C0 = { 0,1,2,3 } indicating an identity permutation
;C1 = { c0,c0,3,c0 } indicating a select-2 mapping, c0 is constant 0
begin
  Parbegin
    Switch to column mode;
    For each processor Pi, i = 0 to 3, do
    begin
      interleaved-read from bus CBi with address offset A0
        and index set C0;
      interleaved-write to bus CBi with address offset A1
        and index set C1;
    end;
  Parend;
  For processor P2 do
  begin
    Switch to row mode;
    interleaved-read from bus RB2 with address offset A1
      and index set C0;
    Switch to column mode;
    interleaved-write to bus CB2 with address offset A0
      and index set C0;
  end;
end.
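A minimal Python sketch (ours, not the author's code) of the two phases, modeling the 2-D memory as a 4 x 4 array at each of the two offsets A0 and A1:

```python
# Hypothetical simulation of Algorithm 2 on a 4x4 OMP memory.
# Phase 1 (column mode): every processor i reads its column at A0 and
# writes it back at A1 through C1 = {c0,c0,3,c0}, so row 2 at A1 ends
# up holding row 3 of A0.  Phase 2: P2 reads row 2 of A1 (row mode)
# and writes it into column 2 of A0 (column mode).
N = 4
C1_SET = ["c0", "c0", 3, "c0"]

A0 = [[(i, j) for j in range(N)] for i in range(N)]
A1 = [[0] * N for _ in range(N)]

# Phase 1: column-mode IRW by all processors in parallel.
for i in range(N):                        # processor / column index
    col = [A0[r][i] for r in range(N)]    # interleaved-read, identity
    for r, k in enumerate(C1_SET):        # interleaved-write via C1
        A1[r][i] = 0 if k == "c0" else col[k]

# Phase 2: processor P2 moves row 2 of A1 into column 2 of A0.
row2 = list(A1[2])                        # row-mode interleaved-read
for r in range(N):                        # column-mode interleaved-write
    A0[r][2] = row2[r]
# Column 2 of A0 now holds the original 3rd row, matching Fig. 3.5(d).
```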
3.5 Software Interface and Programmability
Data elements, once resident in the VRWs, are directly accessed by a processor in
its local address space in a pipelined fashion. These data accesses bypass the index
manipulator and the associated index manipulation logic. During these accesses, the
elements of the VRWs are physically addressed by the processor based on the array
addressing mechanism. To access an element in the VRWs, the window number, the
vector number, and the element number are made implicit in the processor's address.
As an example, consider a multiprocessor with 16-way memory interleaving and v
vectors per window. To access the ith element of the jth vector in the kth window,
the effective address generated by the processor (with assistance from the compiler) is
(k x 16 x v) + (j x 16) + i + base address of the VRWs.
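A one-line sketch of this address computation (the helper name `vrw_address` is ours, assuming the 16-way interleaving and v vectors per window described above):

```python
# Hypothetical effective-address computation for a VRW element,
# assuming 16-way memory interleaving and v vectors per window.
def vrw_address(k, j, i, v, base, interleave=16):
    """Effective address of element i of vector j in window k."""
    return (k * interleave * v) + (j * interleave) + i + base

# Window 2, vector 1, element 5, with v = 8 vectors/window, base 0x1000:
addr = vrw_address(2, 1, 5, v=8, base=0x1000)
```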
By adding an offset argument to the effective address computation, several abstract
data types are supported in the VRWs. This feature is used by the programmer
to define overlapped-window access operations with VRWs [PH90]. A window
provides a one-to-one mapping to a row (or column) of the target application matrix.
Several smaller windows can be grouped to form a larger window. This mechanism
provides scope to define higher-level abstract data structures, like block-rows
or block-columns. The flexibility also includes overlapped window access to define
data structures on a set of contiguous elements of adjacent rows (or columns) of
an application matrix. A window can be divided into multiple smaller windows to
support sub-vector data structures. This modular and structured abstraction of the
VRWs allows a programmer of a multiprocessor system to operate on a large application
matrix by defining the matrix as a small collection of vector windows rather
than a large collection of shared-memory row (or column) vectors.
Chapter 4
Fast Data Manipulation with On-the-fly Indexing
4.1 Introduction
Vector register windows with on-the-fly indexing support efficient data manipulation.
We illustrated this concept in the last chapter through examples. However, a
natural question arises about how to determine its capability in performing general
classes of data movement like permutation and generalized mapping. In this chapter,
we take a theoretical approach to determining the capability of the index manipulation
hardware. Using the on-the-fly data manipulator on interleaved buses, we perform a time
complexity analysis for different types of data movement. We compare this complexity
with that on a mesh-connected computer. Simulation experiments and
results are presented to indicate the capability of such data manipulation in preserving
computational bandwidth.
4.2 Equivalence to a Generalized Network
The index manipulation scheme is powerful due to its programmability.
Consider an (n x n) switch network. Let the n elements fetched during an interleaved-read
access be designated as inputs, and the n elements stored during the following
interleaved-write access with indexing as outputs of this switch network. The index
manipulation scheme has the capability to map any input to an arbitrary number of outputs,
as long as each output receives only one input. Thus, this switch realizes generalized
interconnections from inputs to outputs as suggested by Thompson [Tho78]. This
switch can also be viewed as arbitrary many-to-one connectivity from the outputs
to the inputs [Kum88]. We denote such a switch with generalized interconnection
capability as a generalized switch network. The capability of such a switch network
is much more powerful than that of a crossbar switch.
Theorem 1 Any one of the n^n possible mappings over n data elements can be
carried out in one interleaved-read-write step using the on-the-fly index manipulation
scheme with an n-way interleaved memory organization.
Proof: Consider a set D = {d_0, d_1, ..., d_{n-1}} of n data elements. Let a one-dimensional
interleaved memory organization have n memory modules M_0, M_1,
..., M_{n-1}. We distribute these data elements across the memory modules such that
d_i is stored in M_i, 0 ≤ i ≤ n-1, at a fixed address offset A. Let J = {0, 1, ..., n-1}
be an ordered set of n indices. Consider a mapping f from the set D to J as f(D) =
{j | d_i → j, ∀j ∈ J, d_i ∈ D}. Clearly there are n^n possible such mappings. Using
the on-the-fly index manipulation scheme, we perform an interleaved-read access to
address A and fetch n data elements into n data buffers with the identity permutation.
Subsequently, we perform an interleaved-write access to address A with an index set
f(D). Now, the data elements are stored in the memory modules with the desired
mapping f(D). ■
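The read-then-indexed-write step of the proof can be checked with a small sketch (our illustration; the name `irw_step` is hypothetical). The index set lists, for each module position, which buffer element is written there:

```python
# Hypothetical one-IRW-step realization of an arbitrary mapping over
# n data elements stored one per memory module.
def irw_step(modules, index_set):
    """Interleaved-read with identity, then interleaved-write via
    index_set: module position p receives buffer element index_set[p]."""
    buffers = list(modules)                  # interleaved-read
    return [buffers[k] for k in index_set]   # interleaved-write

data = ["d0", "d1", "d2", "d3"]
# A mapping that broadcasts d1 to two modules and drops d3 entirely,
# one of the n**n = 4**4 possible mappings, done in a single IRW step.
print(irw_step(data, [1, 1, 0, 2]))
```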
Corollary 1 The on-the-fly indexing scheme with an n-way interleaved memory
organization is functionally equivalent to realizing an (n x n) generalized switch
network as suggested by Thompson [Tho78].
In addition to realizing any of the n^n arbitrary mappings in a single IRW step,
the indexing scheme provides the flexibility to write back one or more memory locations
with single (or multiple) constants like 0, 1, etc. Without any loss of generality,
we consider duplication of only one constant 0, identified by the symbol c0. This
duplication is quite useful in parallel processing for filling patterns during shift and
rotate operations without any wrap-around. Hence, a one-step IRW operation can be
represented as an ((n+1) x n) generalized switch network with the capability of realizing
any of the (n+1)^n mappings. Figure 4.1a illustrates the potential of a (5 x 4) switch
box to implement any of the possible 5^4 mappings. Figures 4.1b and c show switch
settings to realize the mappings (2,0,c0,3) and (0,c0,0,c0) respectively. These switch
settings are nothing but the respective index sets in the indexing scheme needed to
realize the mappings.
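The ((n+1) x n) convention can be sketched as follows (our hypothetical helper; `c0` denotes the extra constant-0 input):

```python
# Hypothetical (n+1) x n generalized switch: the extra input "c0"
# supplies the constant 0, so index sets may fill positions with 0.
def irw_step_c0(modules, index_set):
    buffers = list(modules)                  # interleaved-read
    return [0 if k == "c0" else buffers[k] for k in index_set]

data = [10, 20, 30, 40]
a = irw_step_c0(data, [2, 0, "c0", 3])     # the mapping (2,0,c0,3)
b = irw_step_c0(data, [0, "c0", 0, "c0"])  # the mapping (0,c0,0,c0)
```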
Figure 4.1: Equivalence of the on-the-fly indexing scheme to a generalized switch
(c0 is a constant pattern 0): (a) generalized switch connections using on-the-fly
indexing, (b) a mapping (2,0,c0,3), (c) a mapping (0,c0,0,c0).
4.3 Data Movement Cost Model
From the point of view of data allocation and data movement, the similarity between
the OMP and the MCC is first shown. Data movements on the OMP using only intra-row and
intra-column operations are then analyzed using a Clos network model. This Clos
network is then modified to reflect row-column and column-row operations on the
OMP.
4.3.1 Similarity Between OMP and MCC
The two-dimensional memory organization of an OMP resembles an MCC organization
in many ways, as shown in Fig. 4.2. Each node, MP, of the MCC consists
of a local memory and a processor. These nodes are organized in a grid structure with
horizontal and vertical links. The horizontal links, n of them in a row, are joined
together to form a virtual row bus in an OMP. Similarly, the vertical links are
joined together to form a virtual column bus. The processing tasks on the n nodes in a
single row of the MCC are assigned to a single processor in the OMP. Data in the local
memory of node MP_{i,j} of the MCC are mapped to memory module M_{i,j} of the OMP, where
0 ≤ i, j ≤ n-1. Thus, for two-dimensional data allocation and data movement,
these two architectures are functionally identical.
Figure 4.2: Logical connectivity in the distributed memories of a mesh-connected
computer (MP = combined Memory and Processor module).
4.3.2 Using a Clos Network to Analyze Cost
We first analyze basic data movements on the OMP in terms of MCC data movements.
An IRW operation in an OMP can be implemented either in row or in column
mode. A single IRW step, due to its generalized data manipulation capability, can
simulate up to (n-1) row or column shift operations of an MCC. Consider data
movements to realize a permutation. Based on a 3-stage Clos network analysis,
Raghavendra and Prasanna Kumar [RK86] have demonstrated that any permutation
can be realized in 3 stages consisting of 3(n-1) steps on an (n x n) Illiac-IV
type network. Based on the rearrangeability property of a 3-stage Clos network,
the result remains valid for an MCC. Hence, any permutation over n^2 data can be
realized in a maximum of 3 row and column operations on an (n x n) MCC. Each
operation may take one or more row (column) shift steps. We consider this time
overhead aspect in the next section.
Consider a 3-stage Clos(n,n,n) network as shown in Fig. 4.3a with (n x n) permutation
switch boxes. Let these permutation switch boxes be replaced with generalized
switch boxes, which functionally represent the on-the-fly indexing scheme. We number
the inputs and outputs of the switch boxes of these 3 stages in row, column, and row
major order respectively. Let the switch boxes be numbered as S(i,j), 0 ≤ i ≤ 2, 0 ≤ j ≤
n-1. Since a Clos(n,n,n) network is rearrangeable, it can realize any permutation
between its n^2 inputs and n^2 outputs. The switch settings of the first stage
reflect the ordering of data within the jth row. Similarly, the switch settings of the
second and third stages reflect the ordering of data within the jth column and jth row
respectively. Let the switch setting of S(i,j) be the index set for processor j in the
ith stage. Data movement corresponding to the switch settings of one stage can
be realized in a single IRW time step by n processors working in parallel with the
associated index sets. Hence, we claim that any permutation over n^2 data can be
realized in a maximum of 3 IRW steps on an (n x n) OMP using alternate row and
column operations.
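A tiny sketch (ours; the stage settings are hand-picked for a known movement, not derived by a Clos routing algorithm) showing how a permutation over n^2 data decomposes into alternating column and row IRW stages, one index set per bus:

```python
# Hypothetical illustration: a cyclic shift of a 4x4 array down by one
# row and right by one column, realized as one column stage (CRCW)
# followed by one row stage (RRRW).  Each stage applies one index set
# per column (or row), exactly as a Clos-stage switch setting would.
N = 4

def apply_row_stage(m, index_sets):
    # Position p of row r receives element index_sets[r][p] of row r.
    return [[m[r][k] for k in index_sets[r]] for r in range(N)]

def apply_col_stage(m, index_sets):
    out = [[None] * N for _ in range(N)]
    for c in range(N):
        col = [m[r][c] for r in range(N)]
        for p, k in enumerate(index_sets[c]):
            out[p][c] = col[k]
    return out

m = [[(r, c) for c in range(N)] for r in range(N)]
down = [[(p - 1) % N for p in range(N)] for _ in range(N)]   # rotate down
right = [[(p - 1) % N for p in range(N)] for _ in range(N)]  # rotate right
result = apply_row_stage(apply_col_stage(m, down), right)
# result[r][c] == m[(r-1) % N][(c-1) % N]
```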
Next, we consider an arbitrary generalized mapping of n^2 elements on an (n x n)
OMP. Using (n x n) switches with general broadcast capability, Kumar [Kum88]
Figure 4.3: Clos network models to analyze data movements on a (4 x 4) OMP:
(a) a 3-stage Clos network model supporting alternate row and column operations,
(b) a modified Clos network model for row, column, row-column, and column-row
operations.
has shown that any generalized mapping can be realized on a 5-stage Clos(n,n,n)
network. The first 3 stages of this network perform the required data duplication
and the next 2 stages implement the permutation.
Theorem 2 Any generalized mapping over n^2 data elements can be realized in a
maximum of 5 IRW steps on an (n x n) OMP using alternate row and column
operations.
Proof: Consider the 5-stage Clos network and the associated switch settings as
proposed in [Kum88]. The switches considered in this network, with general broadcast
capability, are equivalent to our generalized switch associated with the indexing
scheme. The switch setting of S(i,j) can be treated as the index set of processor
j to realize the ith stage of data movement. Each stage of this Clos network can be
realized in a single IRW step on an OMP using the indexing scheme. ■
Corollary 2 Any generalized mapping of n^2 data can be realized in a maximum of
5 row and column operations on an (n x n) MCC.
But the question remains how many row (column) shift steps are needed to
implement an intra-row (intra-column) operation on the MCC, and how the
associated time overhead compares to that of an IRW step. We address this
aspect in the next section. Until then, we refer to MCC data movements in terms of
operations and OMP data movements in terms of IRW steps.
4.3.3 A Modified Clos Network
So far we have only considered IRW steps on an OMP which are equivalent to MCC
intra-row and intra-column operations. These row- and column-oriented IRW steps
on an OMP can be defined as Row-Read-Row-Write (RRRW) and Column-Read-Column-Write
(CRCW). An OMP, due to its two-dimensional memory organization,
provides the flexibility of two additional IRW step types, known as Row-Read-Column-Write
(RRCW) and Column-Read-Row-Write (CRRW).
Consider the Clos network shown in Fig. 4.3a. The inputs and outputs of each
stage of the Clos network are numbered based on alternate row and column major
ordering. Based on this ordering, data movements corresponding to RRCW and
CRRW operations provide straight links between adjacent stages of the Clos network.
Figure 4.3b provides a modified Clos network model to analyze all possible data
movement steps on an OMP. From a given step (stage), it allows two options to
go to the next step (stage). Any data movement operation on an OMP can start
with any of the RRRW, CRCW, RRCW, or CRRW basic steps and follow through a
sequence of these steps.
Though it is logically possible to go from one basic step to any other step, some
of these sequences are redundant. For example, an RRRW step followed by another
RRRW step can be combined into a single RRRW step. Figure 4.4a shows the
reducible two-step sequences. Figure 4.4b shows valid transition sequences between
these basic steps. Any data movement operation on an OMP makes a finite number of
transitions (5 transitions for generalized mapping according to Theorem 2) through
this state-transition diagram.
The reducible steps shown in Fig. 4.4a are valid only when all the processors are
active. The equivalence conditions do not hold when only a subset of the processors
is active or when the sets of active processors in two consecutive steps are different. For
example, if processor P0 alone performs an RRCW operation in step k and processor
P3 alone performs a CRRW operation in step (k+1), then these two steps are
independent. These two steps cannot be reduced to a single RRRW step. We
denote such reductions as conditional reductions. Figure 4.4b also distinguishes
between these two types of transitions. The solid line transitions are always valid.
The dotted line transitions are valid conditionally.
Figure 4.4: Basic data movement steps in the orthogonal multiprocessor and the
associated state transitions: (a) reducible two-step sequences, (b) valid transition
sequences (ST = start state; any other state is a final state; solid arrows are always
valid, dotted arrows are conditionally valid; equivalent operations are available in
the MCC). RRRW = Row Read Row Write, RRCW = Row Read Col Write,
CRRW = Col Read Row Write, CRCW = Col Read Col Write.
4.4 Data Movement Complexity Analysis
We compare the complexity of data movement on the OMP and the CCM. A
reduction technique is proposed to reduce certain classes of data movement operations
into fewer steps on an OMP. Finally, we compare the OMP with the MCC.
4.4.1 Comparing OMP with CCM
Consider the 4 basic data movement steps RRRW, CRCW, RRCW, and CRRW
on an OMP. The results derived in Theorem 2 use only RRRW and CRCW steps.
Both the OMP and the crossbar-connected multiprocessor, as shown in Fig. 2.2,
exhibit identical row interleaving. Hence, both multiprocessors perform equally well
with respect to an RRRW step. They differ in their capabilities to implement an
intra-column CRCW step. Hence, we analyze the complexity involved on a crossbar-connected
multiprocessor in implementing data movement equivalent to a CRCW step
on an OMP.
Since the crossbar-connected multiprocessor does not have independent column
buses, parallel intra-column data movement is not guaranteed to be implementable in
a single IRW step. We illustrate this idea by a simple 3-processor example, as shown
in Fig. 4.5. The switch setting of S(i,j) in the ith stage of the Clos network model
corresponds to the intra-column operation f(j) to be implemented by processor Pj
during the ith step of the data movement. If all f(j) are identical, as shown in Fig. 4.5a,
then there is no bus conflict. Hence, only one IRW step is sufficient to implement
such intra-column data movement. If they are different, as shown in Fig. 4.5b,
multiple substeps are needed, with the processors avoiding conflict in accessing the
buses during each substep.
The number of substeps required to implement an intra-column step depends on
whether the desired data movement operation is a permutation or a generalized mapping.
If the desired operation is a permutation, the required intra-column operations
are also permutations of the respective column data. In the case of a generalized mapping,
the intra-column operations may be either permutations or generalized mappings.
Additionally, the interleaved memory organization does not allow scalar access and
Figure 4.5: Single-step or multiple-substep realization of an intra-column data
movement operation on an example 3-processor crossbar-connected multiprocessor
with 9 memory modules and 3 interleaved buses (f(j), 0 ≤ j ≤ 2, reflects the intra-column
operation belonging to column j): (a) identical intra-column operations
implemented in a single step, (b) different intra-column operations implemented in
multiple substeps.
restricts all memory accesses to interleaved (vector) accesses. Algorithm 3 implements
an arbitrary intra-column generalized mapping on the crossbar-connected
multiprocessor. Two sets of n data buffers, V0 and V1 (each data buffer consisting of
a single word), are used for every processor. The elements of the index set are identified
by the notation Vx.y, indexing the yth data buffer of set Vx. The intra-column
operations are defined by functions f(k), 0 ≤ k ≤ n-1.
Algorithm 3: Intra-column Generalized Mapping on the
Crossbar-Connected Multiprocessor
1  parbegin
2    For each processor Pi, i = 0, 1, ..., n-1, do
3    begin
4      For k = 0 to (n-1) do
5      begin
6        Perform interleaved-read from bus Bi to data
           buffer set V0 with identity index set;
7        Perform interleaved-read from bus B_f(k)(i) to
           data buffer set V1 with identity index set;
8        If (k = 0) then
9          Perform interleaved-write to Bi with
             index set {V1.0, V0.1, ..., V0.n-1}
10       else
11       If (k = n-1) then
12         Perform interleaved-write to Bi with index
             set {V0.0, ..., V0.n-2, V1.n-1}
13       else
14         Perform interleaved-write to Bi with index
             set {V0.0, ..., V0.k-1, V1.k, V0.k+1, ..., V0.n-1};
15     end;
16   end;
17 parend;
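A compact Python sketch (ours, with hypothetical names) of the data movement performed by Algorithm 3; the bus-conflict accounting analyzed in the proof below is not modeled, and for simplicity source reads see the original memory contents:

```python
# Hypothetical simulation of Algorithm 3: realize intra-column
# mappings f[k] (source row for each destination row, per column k)
# on an n-bus crossbar-connected multiprocessor.  Processor i owns
# row i; buffer sets V0 and V1 each hold one full row.
def intra_column_mapping(mem, f):
    n = len(mem)
    new = [list(row) for row in mem]
    for i in range(n):                 # processor P_i
        for k in range(n):             # column index
            v0 = list(new[i])          # read own row (bus B_i)
            v1 = list(mem[f[k][i]])    # read source row (bus B_f(k)(i))
            v0[k] = v1[k]              # write back with V1.k at slot k
            new[i] = v0
    return new

mem = [[(r, c) for c in range(3)] for r in range(3)]
# Rotate every column down by one: destination row i reads row i-1.
f = [[(i - 1) % 3 for i in range(3)] for _ in range(3)]
rotated = intra_column_mapping(mem, f)
# rotated[0][c] == (2, c) for every column c
```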
Theorem 3 Using 2n data buffers per processor, an (n x n) crossbar-connected
multiprocessor takes up to a maximum of (n^2 + 2n)/2 IRW time steps to implement
an intra-column generalized mapping and 3n/2 IRW time steps to implement an
intra-column permutation, each of which is equivalent to a single CRCW operation on the
OMP.
Proof: Consider the steps in Algorithm 3. For a permutation, step 7 is implemented
in a conflict-free manner. In the presence of arbitrary broadcasting in a generalized
mapping, step 7 may take up to n interleaved-read steps to take care of bus conflicts.
This arises because a single datum might be required to be broadcast to many
locations in a particular column. Step 7 and the following interleaved-write represent an
atomic IRW step. The overhead of the interleaved-read operation in step 6 can be
approximated as 0.5 IRW time step for large n. Hence, Algorithm 3 takes up to
(n^2 + 2n)/2 IRW steps for generalized mapping and 3n/2 IRW steps for permutation.
For specific selected mappings, such as complete broadcast and intra-column broadcast,
the above complexity can be reduced to n or log n IRW steps. ■
By providing up to n^2 data buffers (n sets of n data buffers each) for each processor,
the above algorithm can be improved. Each processor can read a maximum
of n rows and then perform a final interleaved-write to its own row with the desired
data. According to the earlier results, any permutation (generalized mapping) over n^2
data can be realized on an OMP in 3 (5) alternate RRRW and CRCW steps. So
a maximum of 2 (3) of these steps can be CRCW steps. This leads to the following
results on permutation and generalized mapping of n^2 data elements on an (n x n)
crossbar-connected multiprocessor:
Corollary 3 (a). Using n^2 data buffers per processor, it takes (n+1)/2 IRW
time steps to implement an intra-column permutation or generalized mapping.
(b). Any permutation can be realized in a maximum of (n+2) IRW time steps with
n^2 data buffers per processor. With limited 2n data buffers, the same operation
takes a maximum of (9n+4)/2 IRW time steps.
(c). Any generalized mapping can be realized in a maximum of (3n+7)/2 IRW time
steps using n^2 data buffers per processor. With limited 2n data buffers, the
same operation takes a maximum of (3n^2 + 6n + 4)/2 IRW time steps.
4.4.2 Reduction of Data Movement Steps on the OMP
According to Theorem 2, data movement operations corresponding to both permutation
and generalized mapping can be implemented on the OMP using alternate
RRRW and CRCW steps. We denote such sequences of RRRW and CRCW steps
using a sequence operator ∘. An RRRW step followed by a CRCW step is denoted
as RRRW∘CRCW. Given a sequence, the steps are implemented from left to right.
Since any generalized mapping can be realized in a maximum of 5 intra-row and
intra-column steps, we consider sequences with a maximum of 5 steps. The possible
sequences of RRRW and CRCW steps to realize any data movement operation on an
OMP are as follows:

RRRW                                    CRCW
RRRW∘CRCW                               CRCW∘RRRW
RRRW∘CRCW∘RRRW                          CRCW∘RRRW∘CRCW
RRRW∘CRCW∘RRRW∘CRCW                     CRCW∘RRRW∘CRCW∘RRRW
RRRW∘CRCW∘RRRW∘CRCW∘RRRW                CRCW∘RRRW∘CRCW∘RRRW∘CRCW

Each of these sequences can be identified as a q-sequence, where q, 1 ≤ q ≤ 5,
is the length of the sequence. For example, RRRW∘CRCW is a 2-sequence and
RRRW∘CRCW∘RRRW∘CRCW is a 4-sequence. The length q also reflects the
corresponding number of intra-row and intra-column operations required on the
MCC to implement the required data movement.
The OMP provides additional RRCW and CRRW steps, as discussed earlier.
Hence, we consider the possible reduction of q-sequences into smaller sequences using
RRCW and CRRW steps. Consider an RRCW step on an OMP. Data elements from
one or multiple rows are read by the respective processors and written to their own
columns using the indexing scheme. Multiple RRRW and CRCW steps are needed
to implement such a transposition operation. This leads to the following sequencing
result:
Theorem 4 Each RRCW or CRRW step has an equivalent q-sequence, 2 ≤ q ≤ 5.
Proof: Consider an RRCW or CRRW step on the OMP. This step may correspond
to a permutation or to a generalized mapping. In the case of a permutation,
this step can be carried out in a maximum of 3 alternate RRRW and CRCW steps.
One RRCW or CRRW step can never be equivalent to a single RRRW or CRCW
step. Hence, in the case of a permutation, there exists an equivalent 2- or 3-sequence for
each RRCW or CRRW step. A similar argument holds for generalized mapping,
where the equivalent q-sequence is either a 4-sequence or a 5-sequence. ■
Consider the q-sequences with a single RRCW or CRRW equivalent. We define
these q-sequences as reducible sequences and categorize them into a reducible class
called R. Though it is important to formally characterize the class R, our objective
here is targeted towards showing the principles of reduction. We only
demonstrate the existence of the class R and demonstrate reduction techniques. For
certain classes of data movement operations, 2-sequence steps are reducible as shown
below.
Consider indices i, j, i', j' in the range 0 ≤ i, j, i', j' ≤ n-1. Let (i,j) and (i',j')
represent the indices of an element of a two-dimensional data array before and
after a data movement step, i.e., the data element with index (i,j) moves to the new
location with index (i',j'). An RRCW step can be functionally defined as i' = f(i,j)
and j' = i. If f(i,j) is a symmetric function of i and j, then this RRCW step can
be decomposed into two steps: a CRCW step followed by an RRRW step. As an
example, consider f(i,j) = (i + j + u) mod n for 0 ≤ u ≤ n-1. A CRCW step
moves data element (i,j) to (i'',j''), where j'' = j and i'' = (i + j + u) mod n.
The following RRRW step moves data element (i'',j'') to (i',j'), where i' = i''
and j' = (i'' - j'' - u) mod n. A similar symmetric property also makes an RRRW
operation followed by a CRCW operation reducible to a single CRRW operation.
A large number of characterizations can be used to demonstrate 3-, 4-, and 5-sequences
as reducible.
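This decomposition can be verified numerically (our sketch; the values of `n` and `u` are chosen arbitrarily):

```python
# Verify: for f(i,j) = (i+j+u) mod n, the RRCW step
# (i,j) -> (f(i,j), i) equals a CRCW step (i,j) -> ((i+j+u) % n, j)
# followed by an RRRW step (i'',j'') -> (i'', (i''-j''-u) % n).
n, u = 8, 3

def rrcw(i, j):
    return ((i + j + u) % n, i)

def crcw_then_rrrw(i, j):
    i2, j2 = (i + j + u) % n, j          # CRCW: move within column j
    return (i2, (i2 - j2 - u) % n)       # RRRW: move within row i''

ok = all(rrcw(i, j) == crcw_then_rrrw(i, j)
         for i in range(n) for j in range(n))
print(ok)  # True for every (i, j)
```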
We use the notation ⊢ to denote a potential reduction from a sequence
to an equivalent sequence involving a smaller number of steps. A ⊢ B indicates that the
operation sequence A may be reduced to the equivalent operation sequence B. The
reduction operations with 2-sequences can be defined by the following two notations:

D1 : (CRCW ∘ RRRW) ⊢ RRCW
D2 : (RRRW ∘ CRCW) ⊢ CRRW

where D1 and D2 are tags identifying the respective reduction operations. Figure 4.4a
shows reducible steps, e.g., an RRCW followed by a CRCW is equivalent to an RRCW.
Some of these conditionally reducible operations, relating RRCW and CRRW operations
followed by either RRRW or CRCW operations, can be defined using the
following notations:

D3 : (RRCW ∘ CRCW) ⊢ RRCW
D4 : (CRRW ∘ RRRW) ⊢ CRRW
D5 : (RRRW ∘ RRCW) ⊢ RRCW
D6 : (CRCW ∘ CRRW) ⊢ CRRW

The reduction types D1 and D2, combined with D3 to D6, give rise to
chain reductions. Depending on the type of data movement, sequences with alternate
RRRW and CRCW steps can be replaced with shorter sequences involving RRCW
and CRRW steps. Sometimes even a whole sequence of RRRW and CRCW steps
can be reduced to a single RRCW or CRRW step. An example of a 4-sequence
reduction (the possibility of reduction is checked from left to right) is as follows:

RRRW1 ∘ CRCW1 ∘ RRRW2 ∘ CRCW2 ⊢D1 RRRW1 ∘ RRCW1 ∘ CRCW2
                                ⊢D5 RRCW2 ∘ CRCW2
                                ⊢D3 RRCW3
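The chain reduction above can be replayed mechanically (our sketch; the Di rules are taken verbatim from the text, while which rule applies at which position depends on the particular data movement, which this sketch does not check):

```python
# Hypothetical replay of the dissertation's 4-sequence chain reduction.
# Each rule maps an adjacent pair of steps to a single equivalent step.
RULES = {
    "D1": (("CRCW", "RRRW"), "RRCW"),
    "D2": (("RRRW", "CRCW"), "CRRW"),
    "D3": (("RRCW", "CRCW"), "RRCW"),
    "D4": (("CRRW", "RRRW"), "CRRW"),
    "D5": (("RRRW", "RRCW"), "RRCW"),
    "D6": (("CRCW", "CRRW"), "CRRW"),
}

def apply_rule(seq, pos, tag):
    """Replace the pair at `pos` by the right-hand side of rule `tag`."""
    lhs, rhs = RULES[tag]
    assert tuple(seq[pos:pos + 2]) == lhs, "rule does not match here"
    return seq[:pos] + [rhs] + seq[pos + 2:]

s = ["RRRW", "CRCW", "RRRW", "CRCW"]
s = apply_rule(s, 1, "D1")   # -> RRRW o RRCW o CRCW
s = apply_rule(s, 0, "D5")   # -> RRCW o CRCW
s = apply_rule(s, 0, "D3")   # -> RRCW
```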
The possibility of chain reduction depends entirely on the data movement operation.
For certain data movement operations it might not be possible to reduce their
q-sequences at all. In this case, the complexity of realizing a permutation (generalized
mapping) on an OMP remains 3 (5) IRW steps. But using this reduction methodology,
data movement operations belonging to the reduction class R can be implemented
in fewer steps on an OMP involving RRCW or CRRW steps.
Now we determine the time overhead to implement an RRCW or a CRRW operation
on an OMP in number of IRW time steps. In the case of an RRCW or CRRW
operation, Tp is counted twice for two different addresses, one for the interleaved-read and
one for the interleaved-write. Besides, there is an additional overhead of γ for synchronization
before bus switching. Let the additional overhead compared to a normal
IRW cycle be identified as ε. Based on the OMP design parameters used
at USC [HDP+90], we have ε = 0.35 for n = 32, Tp = 1000 nsec, τ = 50 nsec,
β = 300 nsec, and γ = 500 nsec. This ε reduces to 0.2 for n = 64. Hence, each
RRCW or CRRW operation on the OMP is equivalent to (1 + ε) IRW time steps.
So we have the following result:
Corollary 4 Data movement operations belonging to the reducible class R can be
implemented on an OMP in (1 + ε) IRW time steps using row-column and column-row
operations.
4.4.3 Comparing OMP with MCC
We consider a centralized routing scheme in both architectures. For the 3-stage Clos
network model, switch settings can be generated for an arbitrary permutation by
using well-known Benes network algorithms. Similarly, switch settings on the 5-stage
Clos network model for an arbitrary generalized mapping can be determined using
the algorithm suggested in [Kum88].
For an OMP, the switch settings of each stage of this Clos network are translated
to appropriate index sets and implemented as alternate RRRW and CRCW
steps. Depending on the type of data movement, some of these RRRW and CRCW
steps can potentially be implemented as RRCW or CRRW steps.
In the case of an MCC, each stage of the Clos network is equivalent to a single
intra-row or intra-column operation. These operations are equivalent to linear array
operations. Consider the associated time overhead to realize either a permutation
or a generalized mapping on a linear array as T_linear. Given a switch setting for a
linear array operation, the nodes on that linear array can be programmed to receive
"HataTffom designated source nodes. We analyze the complexity of this linear array'
routing to determine overall tim e complexity for MCC data movements.
Consider a switch setting S(k, r) in the kth stage of the Clos network reflecting intra-row data movement operations in the rth row of an MCC. Figure 4.6a shows an example for a (4 x 4) MCC. For a given switch setting, each destination node can be programmed with its source-tag, i.e., the node number from which it will be receiving data, as shown in Fig. 4.6b. Each source node attaches its own tag to the data before data movement starts. This is equivalent to the random access read operation suggested in [NS81]. We consider a linear array with n processors, P_i, i = 0, 1, ..., n-1. Each processor is associated with two buffers, left (L) and right (R), which can be accessed by its respective neighboring nodes. The following algorithm routes the data to the appropriate nodes in two passes, both for permutation and generalized mapping.
Figure 4.6: Source tag generation for linear array data routing in MCC from Clos network switch settings: (a) an example switch setting S(k, r) for the rth row in the kth stage of a Clos network; (b) corresponding source tags (2, 0, 3, 0) for destination nodes 0-3.
Algorithm 4: Linear Array Data Routing for
             Permutation and Generalized Mapping
begin
  parbegin
    For all processors do
    begin
      For k = 1 to (n - 1) do
      begin
        Read data from left neighbor's buffer R;
        Compare received source-tag with its own source-tag;
        Keep the data in case of a match;
        Write data to its own buffer R;
      end;
    end;
  parend;
  parbegin
    For all processors do
    begin
      For k = 1 to (n - 1) do
      begin
        Read data from right neighbor's buffer L;
        Compare received source-tag with its own source-tag;
        Keep the data in case of a match;
        Write data to its own buffer L;
      end;
    end;
  parend;
end.
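As a concrete illustration, the two sweeps of Algorithm 4 can be emulated sequentially in Python. The buffer layout and the source-tag example follow Fig. 4.6, but the sequential emulation of the parallel loops and all function names are our own simplification, not the dissertation's implementation.

```python
def linear_array_route(data, source_tags):
    """Two-pass linear-array routing (a sketch of Algorithm 4).

    data[i]        -- the item initially held by processor i
    source_tags[j] -- the processor whose item destination j expects
    Returns the item delivered to each destination node.
    """
    n = len(data)
    # A node that is its own source already holds the desired item.
    result = [data[i] if source_tags[i] == i else None for i in range(n)]

    # Pass 1: tagged items ripple left-to-right through the R buffers.
    right = [(i, data[i]) for i in range(n)]
    for _ in range(n - 1):
        right = [right[0]] + right[:-1]       # each node reads its left neighbor's R
        for i, (tag, item) in enumerate(right):
            if tag == source_tags[i]:
                result[i] = item              # keep on tag match, still forwarded

    # Pass 2: tagged items ripple right-to-left through the L buffers.
    left = [(i, data[i]) for i in range(n)]
    for _ in range(n - 1):
        left = left[1:] + [left[-1]]          # each node reads its right neighbor's L
        for i, (tag, item) in enumerate(left):
            if tag == source_tags[i]:
                result[i] = item
    return result

# The generalized mapping of Fig. 4.6: destinations 0..3 want items from 2, 0, 3, 0.
print(linear_array_route(["a", "b", "c", "d"], [2, 0, 3, 0]))   # -> ['c', 'a', 'd', 'a']
```

Note that each pass takes exactly n-1 read/compare/write steps, which is where the 2(n-1) routing and compare counts in the proof of Theorem 5 come from.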
Theorem 5: The time overhead to realize an arbitrary generalized mapping over n² data elements on an (n x n) MCC using alternate row and column operations is comparable to 5 IRW time steps on an (n x n) OMP. Any permutation realization is comparable to 3 IRW time steps.

Proof: Algorithm 4 requires 2(n-1) routing steps and 2(n-1) compare operations. Let the time overheads for a near-neighbor link communication and a compare operation in the MCC be r' and c', respectively. Hence, T_linear = 2(n-1)c' + 2(n-1)r'. The minor cycle time τ of an interleaved bus operation on the OMP depends only on the bus propagation delay. Based on present-day bus technology [Bal84] and for n <= 64, it can be assumed that r' ≈ τ and T_p ≈ nc'. This leads to T_IRW ≈ T_linear. Since a maximum of 5 intra-row or intra-column operations is involved on an MCC for any generalized mapping, the overhead is equivalent to 5 linear array operations. This is comparable to 5 IRW steps. A similar argument holds for permutation. ∎
Table 4.1 provides a summary of all the comparisons [PH91b] reported in this section. The time overheads are normalized to IRW time steps. (1 + e) is the number of IRW time steps needed for a row-column or column-row operation on an OMP, as discussed in Section 6.2.
Table 4.1: Comparison of Maximum Time Overheads (normalized to IRW time steps on OMP) for Implementing Data Movement Operations with n² data. Configurations: MCC (n x n) with n² nodes and n data buffers per processor; OMP (n x n) with n processors, n² memories, and n² data buffers per processor; CCM (n x n) with n processors, n² memories, and 2n data buffers per processor (shown without reduction and with conditional reduction).

  Data Movement         (n x n)   (n x n)     (n x n) CCM          (n x n) CCM
  Operations            MCC       OMP         without reduction    with conditional reduction
  Permutation           3         3, 1 + e    n + 2                (9n + 4)/2
  Generalized Mapping   5         5, 1 + e    (3n + 7)/2           (3n² + 6n + 4)/2
4.5 Simulation Experiments and Results

We carried out simulations to determine the advantages of on-the-fly indexing. Our objectives were to observe the effect of vector length and the degree of memory interleaving on data manipulation using on-the-fly indexing.
4.5.1 A CSIM-based Multiprocessor Simulator

Our simulator runs on top of a simulation kernel called CSIM [Sch86]. CSIM is a process-oriented simulation language based on C. It runs under UNIX¹ and several of its vendor-specific variations. Our simulator is capable of running on both SUN-3 and SUN-4 platforms. This multiprocessor simulator is process-oriented and algorithm-driven [HC91, MCH+90]. All system delays included in this simulator are explicit values reflecting the hardware design. The simulator consists of three modules: the user-developed algorithm, an architecture specification file, and the CSIM simulation kernel. The architecture specification file defines the components of a system in terms of CSIM facilities, events, and C data structures. For example, a hypercube system is represented as a collection of node and link facilities. The communication buffers are represented as C data structures. Events reflect data-ready conditions indicating the arrival of data from sources to destinations. Similar facilities are used to model the processors, buses, and memory modules of a multiprocessor.

Macros in the architecture specification file reflect correct system operations. For example, a circuit-switched communication is represented as a sequence of operations: getting hold of the required links, reserving them for the duration of the communication time, performing the communication by duplicating data from the source-node buffer to the destination-node buffer, setting the data-ready event for the destination processor, and releasing the communication links in the reverse order they were reserved. In case of conflicting access to a facility by multiple processes, the underlying CSIM kernel grants it on a first-come-first-served basis. This makes the simulator completely deterministic. Since this simulator is built using C, it produces both the final results of a program and the timing estimates.

¹UNIX is a registered trademark of AT&T Bell Laboratories.
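The circuit-switched macro sequence described above can be sketched as follows. The link locks, buffer dictionaries, and function names are our own illustrative stand-ins for the CSIM facilities and events, not the simulator's actual API.

```python
import threading

# Hypothetical stand-ins for CSIM "facilities": one lock per communication link.
links = {name: threading.Lock() for name in ("L0", "L1", "L2")}
buffers = {"src": ["payload"], "dst": []}   # node communication buffers
data_ready = threading.Event()              # data-ready event for the receiver

def circuit_switched_send(path, src, dst):
    """Reserve every link on the path, copy the data, then release in reverse."""
    acquired = []
    for name in path:                        # get hold of the required links, in order
        links[name].acquire()                # blocks if another circuit holds the link
        acquired.append(name)
    try:
        buffers[dst].extend(buffers[src])    # duplicate data: src buffer -> dst buffer
        buffers[src].clear()
        data_ready.set()                     # signal the destination processor
    finally:
        for name in reversed(acquired):      # release links in reverse reservation order
            links[name].release()

circuit_switched_send(("L0", "L1", "L2"), "src", "dst")
data_ready.wait()
print(buffers["dst"])   # -> ['payload']
```

Acquiring the links in a fixed order and releasing them in reverse mirrors the macro sequence; a real CSIM facility additionally queues conflicting requests first-come-first-served, which is what makes the simulation deterministic.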
4.5.2 Simulation Experiments

We carried out two different sets of experiments to determine the capability of on-the-fly indexing. The following two problems were taken:

• Shifting a (P x P) matrix by x, 1 <= x <= P-1 columns.

• Shifting a (P x P) matrix by x, 1 <= x <= P-1 columns and y, 1 <= y <= P-1 rows.

The first set of experiments implemented shifting a matrix by 5 columns on different OMP configurations. We considered shifting both by processors executing instructions and by using on-the-fly data manipulation. OMP configurations with 8, 16, and 32 processors were considered. These configurations provided different degrees of memory interleaving and hence different vector lengths. We carried out 6 experiments on each of these processor configurations, for P = 32, 64, 128, 256, 512, and 1024.

The second set of experiments was carried out on 16-processor OMP and CCM configurations. We considered shifting both by processors executing instructions and by using on-the-fly data manipulation. For similar matrix sizes, we considered shifting a matrix by rows and columns.
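To make the benchmark concrete, the combined column-and-row shift can be expressed as a pure index remapping, which is the essence of on-the-fly indexing: no element is moved one scalar at a time, since each destination simply reads a remapped source location. The function below is our own illustration, not the simulator's code.

```python
def shift_matrix(A, x, y):
    """Circularly shift a P x P matrix A by x columns and y rows.

    With on-the-fly index manipulation, each destination (i, j) reads
    source ((i - y) mod P, (j - x) mod P) as part of a vector access,
    instead of executing explicit per-element move instructions.
    """
    P = len(A)
    return [[A[(i - y) % P][(j - x) % P] for j in range(P)] for i in range(P)]

A = [[0, 1], [2, 3]]
print(shift_matrix(A, 1, 0))   # shift by 1 column -> [[1, 0], [3, 2]]
```

On a two-dimensionally interleaved OMP both the row and the column remapping can ride on vector accesses; on a CCM, which interleaves only along rows, the column part of the remapping degenerates to scalar accesses, which is the effect measured in the second set of experiments.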
4.5.3 Simulation Results and Implications

In the first set of experiments, we observed two parameters: the reduction in the number of instructions by using the data manipulator, and the reduction in total execution time. Figure 4.7a shows the reduction factor in the number of instructions used for computation and data manipulation. This reduction increased with vector length. With a vector length of 32, we obtained up to a 75% reduction. Figure 4.7b shows the reduction factor in total execution time. This also increased with vector length. However, for small vector lengths, the reduction factor remained constant over different matrix sizes.
Figure 4.7: Factors of reduction in (a) computation overhead (instructions used for computation and data manipulation) and (b) total execution time by using on-the-fly index manipulation to shift 5 columns of a (P x P) matrix on different orthogonal multiprocessor configurations (OMP 8, OMP 16, and OMP 32; OMP = orthogonal multiprocessor). Both plots span matrix sizes P = 32 to 1024.
We compared the overheads for computation, data manipulation, memory access, and synchronization. Figure 4.8a shows a comparison of instructions for various matrix sizes. A significant reduction was observed in the number of instructions used for manipulating data. Figure 4.8b shows the respective execution times. Memory access constituted a significant factor in overall execution time. This explains the reduction factor remaining constant over matrix sizes, as shown in Fig. 4.7b.

The next set of experiments compared the effect of one- and two-dimensional memory interleaving on data manipulation. The problem considered both row- and column-wise data movements. Figure 4.9a shows the number of instructions used. For the two-dimensionally interleaved OMP, a minimal number of instructions was used for performing the task with on-the-fly indexing. Since the CCM does not support memory interleaving in the column dimension, column-oriented data movements took place in a scalar manner. This did not allow on-the-fly indexing to be used. Figure 4.9b shows the total execution time for the respective configurations. The OMP with on-the-fly indexing won in this case too.
Figure 4.8: Comparison of (a) instruction overheads and (b) total execution time in shifting a (P x P) matrix by 5 columns on a 16-processor orthogonal multiprocessor. Bars compare runs with and without on-the-fly index manipulation (OIM) for matrix sizes 128 x 128 through 1024 x 1024; instruction overheads are broken into computation plus data manipulation, shared-memory access, and synchronization.
Figure 4.9: Comparison of (a) computational overhead (instructions used for computation and data manipulation) and (b) total execution time in performing row and column shifts (5 columns and 3 rows) of a (P x P) matrix by using on-the-fly index manipulation on the orthogonal multiprocessor (OMP) and crossbar-connected multiprocessor (CCM). Curves compare 16-processor OMP and CCM configurations, each with and without OIM, for P = 32 to 1024.
Chapter 5

Vectorized Interprocessor Communication

5.1 Introduction

Memory-based mailboxes allow processors to communicate with each other by writing and reading messages through appropriate mailboxes. Vectorized memory access schemes in multiprocessors provide high-bandwidth and low-latency data transfers between processors and memory. In this chapter, we continue taking this vector-oriented approach. We develop a general framework of communication vectorization to implement memory-based interprocessor communication in shared-memory multiprocessors. We configure interleaved shared memory as a collection of vector mailboxes. Primitive message patterns are derived by modeling the interprocessor communication steps of a parallel program. A methodology is provided to convert the send and receive operations of message patterns into equivalent write and read memory-based mailbox access steps. We investigate the effect of architectural parameters of a shared-memory system, such as the interconnection network, memory access time, degree of memory interleaving, and data contention, on the efficiency of communication vectorization.
5.2 From Message Passing to Shared-Variable Communications

Programs written for a loosely-coupled multicomputer [LA81, Zho90] realize interprocessor communication by passing messages over communication links. Multicomputers such as the hypercube, torus, tree, and pyramid use these programs. Consider one such program consisting of m processes which are statically allocated to m nodes of a multicomputer. During the execution of this program, processors communicate using send and receive operations. We assume this message-passing scheme to be unblocked send and blocked receive [Kar87]. With this scheme, the sender processor sends a message and the receiver processor waits for the message to arrive. The program is assumed to be deadlock-free, so that no two processors at any instant of time wait for messages from each other.
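Under the stated assumptions, these semantics map directly onto a buffered channel: send deposits a message and returns immediately, while receive blocks until a message is available. The sketch below uses Python's standard queue module purely for illustration; the names are ours.

```python
import queue
import threading

# One mailbox (channel) per (sender, receiver) pair; unbounded, so send never blocks.
mailbox = queue.Queue()

def send(msg):
    """Unblocked send: deposit the message and return immediately."""
    mailbox.put(msg)

def receive():
    """Blocked receive: wait until a message arrives, then return it."""
    return mailbox.get()

# Sender and receiver running concurrently (threads here, for illustration).
received = []
t = threading.Thread(target=lambda: received.append(receive()))
t.start()
send("hello from N0")
t.join()
print(received)   # -> ['hello from N0']
```

The deadlock-freedom assumption corresponds to there being no cycle of processors all blocked in receive() waiting on one another.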
5.2.1 Message-Passing Steps

Without any loss of generality, we present an example for m = 4 on a 4-node multicomputer, as shown in Fig. 5.1. Each processor alternates between computation and communication steps. With the exchange of messages, processors get synchronized and program execution proceeds in a wave-like manner. The message-passing steps in this program demonstrate different characteristics. For example, the first step is a one-to-one personalized message exchange between nodes N0 and N1. The second step is a many-to-many personalized message exchange between two sets of nodes, (N1, N2) and (N3, N0). In the third communication step, node N0 broadcasts a message to the rest of the nodes. Node N0 receives personalized messages from all other nodes in the fourth communication step. The last step combines two operations: a multicast (broadcasting a message to selected others) operation from node N3 to nodes N1 and N2, and a many-to-one operation from nodes N1 and N2 to N0.
Figure 5.1: Message-passing steps in a sample program for a 4-node multicomputer (N0-N3: computing nodes). Each node alternates computation with five communication steps: (1) Send(msg,N1) at N0, Recv(msg,N0) at N1; (2) Send(msg,N2) at N1 and Send(msg,N0) at N3, with Recv(msg,N1) at N2 and Recv(msg,N3) at N0; (3) Send(msg,all) at N0, Recv(msg,N0) at N1-N3; (4) Send(msg1,N1), Send(msg2,N2), Send(msg3,N3) at N0, Recv(msg,N0) at N1-N3; (5) Send(msg,N0) at N1 and N2 and Send(msg,N1,N2) at N3, with Recv(msg1,N1) and Recv(msg2,N2) at N0 and Recv(msg,N3) at N1 and N2.
Depending on the source and destination node sets and the type of communication, i.e., broadcast or multicast, we categorize message-passing steps into various patterns. Table 5.1 lists all possible message patterns together with their characteristic identifiers, C1-C12. We use the notation (Nx, Ny) to represent a single message transfer from node Nx to node Ny. The notation ((Nx1, Ny1), (Nx2, Ny2)) represents multiple message transfers from a set of source nodes, Nx1 and Nx2, to a set of destination nodes, Ny1 and Ny2.
(s > 1, n_s > 1) or (d > 1, n_d > 1) ∈ MA_pv
(s = 1, n_s > 1) or (d = 1, n_d > 1) ∈ MA_v
(s > 1, n_s = 1) or (d > 1, n_d = 1) ∈ MA_ps
(s = 1, n_s = 1) or (d = 1, n_d = 1) ∈ MA_s
Definition 1: Based on Property 1, memory-write and memory-read graphs are grouped into memory-access sets for scalar, parallel-scalar, vector, and parallel-vector accesses as follows:

MA_s  = { MG1w, MG1r }
MA_v  = { MG3w, MG4w, MG5r }
MA_ps = { MG2w, MG5w, MG2r, MG3r, MG4r }
MA_pv = { MG6w, MG7w, MG6r, MG7r }
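The grouping above can be phrased as a small classifier over the parameters of a memory graph: s (or d) is the number of processors taking part, and n_s (or n_d) is the number of memory modules each one touches. The helper below is our own illustration of that rule, not part of the methodology's formal machinery.

```python
def memory_access_set(procs, modules_per_proc):
    """Classify a memory-write/read graph per Property 1 / Definition 1.

    procs            -- number of processors in the graph (s or d)
    modules_per_proc -- memory modules accessed per processor (n_s or n_d)
    """
    if procs > 1 and modules_per_proc > 1:
        return "MA_pv"   # parallel-vector access
    if procs == 1 and modules_per_proc > 1:
        return "MA_v"    # vector access
    if procs > 1 and modules_per_proc == 1:
        return "MA_ps"   # parallel-scalar access
    return "MA_s"        # scalar access

# An all-to-all-personalized write (MG7w) on n = 4 processors:
print(memory_access_set(4, 4))   # -> MA_pv
```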
5.3.1 Operational Digraphs

Consider a crossbar-connected multiprocessor as shown in Fig. 2.1b. The connectivity between n processors, n² memory modules, and n buses can be expressed by a connectivity graph, as shown in Fig. 5.4a. The bidirectional nature of the edges reflects both read and write capabilities. This graph, however, does not reflect the operational characteristics of the system. Different scalar and vector memory accesses are possible on this configuration. During each memory access, an electrical connectivity is established between processors, buses, and the associated memory modules. This connectivity constitutes a subgraph of the connectivity graph. For a conflict-free memory access, the associated subgraph is identified as an operational digraph.

Consider such an operational digraph. This subgraph represents a conflict-free read or write memory access. For a read access, a processor node has an indegree of 1 and a memory node has an outdegree of 1. For a write access, a memory node has an indegree of 1 and a processor node has an outdegree of 1. A bus is connected to either a single or multiple (maximum n, where n is the degree of memory interleaving) memory nodes, depending on whether the operational digraph represents a scalar or a vector access. This leads to the following property:

Property 2: An operational digraph satisfies the following constraints:
  processor node : (indegree = 1) or (outdegree = 1)
  bus node       : (1 <= indegree <= n, outdegree = 1) or
                   (indegree = 1, 1 <= outdegree <= n)
  memory node    : (outdegree = 1) or (indegree = 1)
Figure 5.4: (a) Connectivity graph and (b) a few operational digraphs of the crossbar-connected multiprocessor configuration (circles: processors; shaded boxes: buses; squares: memory modules; edge types: bidirectional read/write, write, and read).
Depending on the connectivity, an operational digraph represents one of the following memory accesses: scalar-read, scalar-write, vector-read, vector-write, parallel-scalar-read, parallel-scalar-write, parallel-vector-read, or parallel-vector-write. Figure 5.4b shows a few operational digraphs of the crossbar-connected multiprocessor. When the nodes corresponding to processors, memories, and buses in an operational digraph satisfy the above properties, the related memory access becomes conflict-free. This leads to the following definition:

Definition 2: Each operational digraph represents a single-step memory access.
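Property 2 lends itself to a mechanical check. The digraph encoding below (edge lists of (source, destination) pairs with a 'P'/'B'/'M' node-name prefix convention) is our own, chosen only to illustrate the degree constraints.

```python
def is_operational_digraph(edges, n):
    """Check Property 2 on a digraph given as (src, dst) pairs.

    Node-name convention (ours): 'P*' processor, 'B*' bus, 'M*' memory module;
    n is the degree of memory interleaving.
    """
    indeg, outdeg, nodes = {}, {}, set()
    for src, dst in edges:
        outdeg[src] = outdeg.get(src, 0) + 1
        indeg[dst] = indeg.get(dst, 0) + 1
        nodes.update((src, dst))
    for v in nodes:
        i, o = indeg.get(v, 0), outdeg.get(v, 0)
        if v.startswith("P") and not (i == 1 or o == 1):
            return False                       # processor degree constraint
        if v.startswith("M") and not (i == 1 or o == 1):
            return False                       # memory degree constraint
        if v.startswith("B") and not ((1 <= i <= n and o == 1) or
                                      (i == 1 and 1 <= o <= n)):
            return False                       # bus fans in or out up to n
    return True

# A vector-write: processor P0 drives bus B0, which feeds two memory modules.
print(is_operational_digraph([("P0", "B0"), ("B0", "M0"), ("B0", "M1")], n=2))  # -> True
```

A bus feeding more than n memory modules, for instance, violates the bus constraint and the access is no longer a single-step (conflict-free) access in the sense of Definition 2.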
5.3.2 The Vectorization Procedure

We now elaborate on steps 2 and 3 of the vectorization methodology in this subsection. Step 4 is illustrated in the next subsection with an example program conversion. For any multiprocessor configuration, memory graphs belonging to access set MA_s are always implemented as scalar accesses. For the other sets, MA_pv, MA_v, and MA_ps, we first check whether their members can be implemented as conflict-free parallel-vector, vector, and parallel-scalar accesses, respectively. In case of conflict, we determine alternate memory accesses with minimal increase in access time.
Consider a memory graph with potential for parallel-vector access. There are three possibilities: (a) the graph can be implemented on a configuration in a conflict-free manner, (b) it leads to conflict, or (c) it is not feasible to implement. Let T_pv be the associated access time to implement it in a conflict-free manner. In the presence of conflict or in the absence of feasibility, this access can alternatively be implemented as multiple steps of vector, parallel-scalar, or scalar access. Define N_v, N_ps, and N_s as the required numbers of steps to implement this parallel-vector access as alternate vector, parallel-scalar, or scalar accesses, respectively. Let T_v, T_ps, and T_s be the associated access times.

Lemma 1: A parallel-vector memory access step can be implemented as multiple steps of vector, parallel-scalar, and scalar access with the following time constraints:
    T_pv <= N_ps · T_ps <= N_v · T_v <= N_s · T_s    or
    T_pv <= N_v · T_v <= N_ps · T_ps <= N_s · T_s.
Proof: Consider a parallel-vector memory access involving m processors, with each processor accessing a vector of length n. Assume equivalent vector and parallel-scalar accesses are feasible. We have N_ps = n and N_v = m. If neither parallel-scalar nor vector access is conflict-free or feasible, it can alternatively be implemented as N_s = mn scalar access steps. Consider the respective access times, estimated in Table 2.1. We have (α + β) > τ. For m > n, the first constraint is satisfied. The second constraint holds for m < n. ∎

We identify a memory access as valid if the access is feasible and can be implemented in a conflict-free manner. Otherwise, it is denoted as invalid. Lemma 1 leads to the following theorem.
Theorem 6: Replacement of an invalid parallel-vector access with multiple steps of valid parallel-scalar or vector access leads to a minimal increase in access time. In case both accesses are invalid, a parallel-vector access is replaced with multiple steps of scalar access. Replacement of invalid parallel-scalar and vector accesses with multiple steps of scalar access leads to a minimal increase in access time.
Next we provide a vectorization procedure to determine replacements for invalid accesses on a given multiprocessor. Corresponding to a primitive pattern, this step requires the allocation of mailboxes to memory modules. Without loss of generality, we give priority to write vectorization. This allows us to allocate mailboxes during a write access so that it gets implemented as a vectorized access as far as possible. Depending on this allocation, the corresponding read access may or may not demonstrate vectorization.

We introduce four shared-variable communication sets: SC_pv, SC_ps, SC_v, and SC_s. These sets encapsulate the memory-write and memory-read graphs which can be implemented in minimal time using parallel-vector, parallel-scalar, vector, and scalar accesses, respectively.
Vectorization Procedure
; Memory-access sets MA_pv, MA_ps, MA_v, and MA_s,
; with memory-write and memory-read graphs as members, are used as inputs.
; Ways to implement these memory graphs with minimal access time are put into
; shared-variable communication sets SC_pv, SC_ps, SC_v, and SC_s.
; m = number of processors in a memory graph
; n = length of vector per processor
begin
  for 1 <= i <= 7 do
  begin
    for each graph in {MGiw, MGir} do
    begin
      ST: determine the memory-access set to which graph belongs;
      if access set is MA_s
        include the graph in SC_s;
      if access set is MA_ps
      begin
        if it is a write access
          replace it with multiple graphs of access type T1 (Fig. 5.3);
        else
          replace it with graphs of access type T2;
        allocate memory modules as mailboxes to support the required access types;
        check whether the graph is an operational digraph;
        if it is an operational digraph
          include the graph in SC_ps with the associated access type;
        else
        begin
          delete the graph from access set MA_ps;
          if (m > n)    (Lemma 1)
            add it to access set MA_v;
          else
            add it to access set MA_s;
          goto ST;
        end
      end
      if access set is MA_pv or MA_v
      begin
        if access set is MA_pv
          x = multiple;
        else
          x = single;
        if graph is a write operation
        begin
          if the write operation is a broadcast
            replace it with x graphs of access type T3;
          else
            replace it with x graphs of access type T4;
        end
        else
          replace it with x graphs of access type T5;
        allocate memory modules as mailboxes to support the required access type;
        check whether the graph is an operational digraph;
        if it is an operational digraph
        begin
          if access set is MA_pv
            include the graph in SC_pv with the associated access type;
          else
            include the graph in SC_v with the associated access type;
        end
        else
        begin
          if access set is MA_pv
          begin
            delete the graph from access set MA_pv;
            if (m > n)
              add it to access set MA_ps;
            else
              add it to access set MA_v;
          end
          else
          begin
            delete the graph from access set MA_v;
            if (m < n)
              add it to access set MA_ps;
            else
              add it to access set MA_s;
          end
          goto ST;
        end
      end
    end
  end
end
Consider the above vectorization procedure. It determines replacements for invalid memory graphs with minimal increase in access time. Consider an all-to-all personalized message pattern being vectorized on a crossbar-connected multiprocessor. This pattern, C7, is first transformed into memory graphs MG7w and MG7r. According to Definition 1, both these graphs belong to memory-access set MA_pv. Since the MG7w graph corresponds to a personalized access, the vectorization procedure replaces this graph with multiple graphs of access type T4. Next, mailboxes are allocated to support this access type. Memory modules M_i,j, 0 <= i, j <= n-1, are used as mailboxes between processors P_i and P_j. This allocation results in a valid operational digraph. Hence, MG7w gets included in SC_pv.

For MG7r, the mailbox allocation of its corresponding write graph MG7w is used. This allocation leads to a processor reading messages from its column memory modules. Since a crossbar-connected multiprocessor does not provide column-oriented vector access, this allocation of mailboxes does not give rise to an operational digraph. Assuming m = n, this graph MG7r is deleted from memory-access set MA_pv and added to MA_ps. This deletion and addition leads to a minimal increase in access time according to Lemma 1. The procedure finally includes this graph in set SC_ps. Similar analysis can be made for the other memory graphs corresponding to primitive patterns.
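The fallback chain at the heart of the procedure (an invalid access is demoted to the next-cheapest access set per Lemma 1 until a valid one is found) can be sketched as a small function. The validity test is abstracted into a caller-supplied predicate, and all names here are our own illustration rather than a transcription of the procedure.

```python
def place_graph(graph, access_set, m, n, is_valid):
    """Walk a memory graph down the access-set hierarchy until it is valid.

    access_set -- one of 'MA_pv', 'MA_v', 'MA_ps', 'MA_s'
    m, n       -- processors in the graph / vector length per processor
    is_valid   -- predicate(graph, access_set): feasible and conflict-free?
    Returns the shared-variable communication set ('SC_*') chosen.
    """
    while access_set != "MA_s" and not is_valid(graph, access_set):
        if access_set == "MA_pv":                 # invalid parallel-vector
            access_set = "MA_ps" if m > n else "MA_v"
        elif access_set == "MA_v":                # invalid vector
            access_set = "MA_ps" if m < n else "MA_s"
        elif access_set == "MA_ps":               # invalid parallel-scalar
            access_set = "MA_v" if m > n else "MA_s"
    return access_set.replace("MA", "SC")         # scalar access is always valid

# A parallel-vector graph with m > n, where only parallel-scalar access is
# conflict-free on the target machine:
print(place_graph("MG7w-like", "MA_pv", 4, 2, lambda g, a: a == "MA_ps"))  # -> SC_ps
```

Each demotion follows the m-versus-n comparisons of the procedure, so by Lemma 1 every step of the walk is the cheapest available replacement.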
Theorem 7: Memory-write and memory-read graphs corresponding to primitive patterns can be implemented on a crossbar-connected multiprocessor with minimal time according to the following shared-variable communication sets, corresponding to parallel-vector, vector, parallel-scalar, and scalar accesses, respectively:

SC_pv = { MG6w, MG7w }
SC_v  = { MG3w, MG4w }
SC_ps = { MG2w, MG2r, MG3r, MG4r, MG5w, MG6r, MG7r }
SC_s  = { MG1w, MG1r, MG5r }

Proof: Consider the communication vectorization procedure analyzing replacements for the memory graphs corresponding to primitive patterns on a crossbar-connected multiprocessor. For each memory-write and memory-read graph, this procedure terminates and determines the shared-variable communication set to which the graph belongs. The procedure starts its search with the access types demonstrating potential for maximum vectorization and/or parallelism, as determined by Definition 1. The search is prioritized among the alternative memory access types. During each search, a minimum increase in access time is guaranteed by Lemma 1. Thus, the final replacement always leads to minimal access time. Any other replacement leads to an increase in access time. ∎
Similar vectorization determines ways to implement memory graphs on OMP and single-bus-based multiprocessors with minimal access time. The vectorization results for primitive patterns on the three multiprocessors are summarized in Table 5.2.

5.3.3 An Example Program Conversion

The communication steps of a multicomputer program get converted [PH91c, PH91a] to memory-write and memory-read steps on a given multiprocessor configuration using the following steps:

1. Reduce the communication steps of a multicomputer program in terms of primitive patterns.

2. Implement communication vectorization of the primitive patterns on a given multiprocessor configuration and determine the shared-variable communication sets.

3. Convert the communication steps as combinations of write and read accesses corresponding to the shared-variable communication sets.

4. Restructure the memory-write and memory-read accesses by inserting synchronization primitives.
Table 5.2: Memory-Access Types Leading to Minimal Access Time for Implementing Primitive Patterns on Three Multiprocessors Using Shared-Variable Communication Vectorization (pv = parallel-vector access, v = vector access, ps = parallel-scalar access, s = scalar access). Each pattern is marked (X) under the access type used on each architecture (sub-columns pv, v, ps, s per architecture).

  Access Types                     Single-bus   Crossbar-connected   Orthogonally-connected
  one-to-one, write                X            X                    X
  one-to-one, read                 X            X                    X
  permutation, write               X            X                    X
  permutation, read                X            X                    X
  one-to-all-broadcast, write      X            X                    X
  one-to-all-broadcast, read       X            X                    X
  one-to-all-personalized, write   X            X                    X
  one-to-all-personalized, read    X            X                    X
  all-to-one, write                X            X                    X
  all-to-one, read                 X            X                    X
  all-to-all-broadcast, write      X            X                    X
  all-to-all-broadcast, read       X            X                    X
  all-to-all-personalized, write   X            X                    X
  all-to-all-personalized, read    X            X                    X
We illustrate these steps by converting the sample program presented in Fig. 5.1 to run on the three multiprocessors. First we consider the case of the crossbar-connected multiprocessor. The sample program consists of message patterns C1, C2, C3, C4, C8, and C10. First, the message patterns C8 and C10 are reduced to the respective primitive patterns C3 and C5. Theorem 7 leads to the communication vectorization for primitive patterns on the crossbar-connected multiprocessor.

Memory-write and memory-read accesses related to shared-variable communication get implemented in a producer-consumer manner with explicit synchronization between them. We assume a static barrier MIMD hardware synchronization scheme [OD90] on the interprocessor interrupt bus. Each receiver processor executes a wait-sync operation on the sender processor before a memory-read access. Similarly, every sender processor executes an activate-sync operation after each memory-write access. Since we assume our original multicomputer program to be deadlock-free, this synchronization scheme ensures no deadlock in the converted program. These synchronization primitives are included in the program code before execution starts. We denote these synchronization operations as sync primitives between memory-write and memory-read accesses. Memory graphs are identified by the corresponding primitive patterns with subscripts w and r. The allocation of mailboxes to memory modules is based on the shared-variable communication sets derived in Theorem 7. This mailbox activity is reflected by identifying the corresponding processor and memory module numbers. The communication steps of the sample program are converted as follows¹:

¹We use the following notations: (Px → My) for a scalar-write operation from processor Px to memory module My; (Px → My, Mz, ..., Mw) for vector-write; ((Px1 → My1), (Px2 → My2)) for parallel-scalar-write; ((Px1 → My1, Mz1, ..., Mw1), (Px2 → My2, Mz2, ..., Mw2)) for parallel-vector-write; (My → Px) for scalar-read; (My, Mz, ..., Mw → Px) for vector-read; ((My1 → Px1), (My2 → Px2)) for parallel-scalar-read; and ((My1, Mz1, ..., Mw1 → Px1), (My2, Mz2, ..., Mw2 → Px2)) for parallel-vector-read accesses.
A. Crossbar-Connected Multiprocessor System:
Step 1: C1w(P0 → M0,1), sync, C1r(M0,1 → P1)
Step 2: C2w((P1 → M1,2), (P3 → M3,0)), sync,
        C2r((M1,2 → P2), (M3,0 → P0))
Step 3: C3w(P0 → M0,1, M0,2, M0,3), sync,
        C1r(M0,1 → P1), C1r(M0,2 → P2), C1r(M0,3 → P3)
Step 4: C4w(P0 → M0,1, M0,2, M0,3), sync,
        C1r(M0,1 → P1), C1r(M0,2 → P2), C1r(M0,3 → P3)
Step 5: C1w(P1 → M1,0), C1w(P2 → M2,0), C3w(P3 → M3,1, M3,2), sync,
        C1r(M1,0 → P0), C1r(M2,0 → P0),
        C1r(M3,1 → P1), C1r(M3,2 → P2)

B. Orthogonally-Connected Multiprocessor System:
Step 1: C1w(P0 → M0,1), sync, C1r(M0,1 → P1)
Step 2: C2w((P1 → M1,2), (P3 → M3,0)), sync,
        C2r((M1,2 → P2), (M3,0 → P0))
Step 3: C3w(P0 → M0,1, M0,2, M0,3), sync,
        C3r((M0,1 → P1), (M0,2 → P2), (M0,3 → P3))
Step 4: C4w(P0 → M0,1, M0,2, M0,3), sync,
        C4r((M0,1 → P1), (M0,2 → P2), (M0,3 → P3))
Step 5: C5w((P1 → M1,0), (P2 → M2,0)), C3w(P3 → M3,1, M3,2), sync,
        C5r(M1,0, M2,0 → P0), C3r((M3,1 → P1), (M3,2 → P2))

C. Single Bus Multiprocessor System:
Step 1: C1w(P0 → M1), sync, C1r(M1 → P1)
Step 2: C1w(P1 → M2), C1w(P3 → M0), sync,
        C1r(M2 → P2), C1r(M0 → P0)
Step 3: C3w(P0 → M1, M2, M3), sync, C1r(M1 → P1),
        C1r(M2 → P2), C1r(M3 → P3)
Step 4: C4w(P0 → M1, M2, M3), sync, C1r(M1 → P1),
        C1r(M2 → P2), C1r(M3 → P3)
Step 5: C1w(P1 → M0), C1w(P2 → M0), C3w(P3 → M1, M2), sync,
        C1r(M0 → P0), C1r(M0 → P0),
        C1r(M1 → P1), C1r(M2 → P2)
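Step 1 of the converted program (a scalar mailbox write, a sync, then a scalar mailbox read) can be sketched with shared memory and an event standing in for the activate-sync/wait-sync pair. The data structures and names below are our own illustration, not the prototype's code.

```python
import threading

# Interleaved shared memory as a mailbox array; M[0][1] is the mailbox
# processor P0 uses to reach P1 (the allocation scheme of Theorem 7).
M = [[None] * 4 for _ in range(4)]
sync = threading.Event()   # stands in for the activate-sync/wait-sync pair

def p0():                  # sender: C1w(P0 -> M0,1), then activate-sync
    M[0][1] = "msg"
    sync.set()

def p1(out):               # receiver: wait-sync, then C1r(M0,1 -> P1)
    sync.wait()
    out.append(M[0][1])

out = []
t0 = threading.Thread(target=p0)
t1 = threading.Thread(target=p1, args=(out,))
t1.start(); t0.start()
t0.join(); t1.join()
print(out)   # -> ['msg']
```

Because the read is gated on the sync event set after the write, the pair behaves as a producer-consumer handoff, which is exactly why the deadlock-freedom of the original message-passing program carries over to the converted one.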
5.4 C o m m u n ica tio n B a n d w id th
We analyze the communication bandwidth of vectorized shared-variable communication between the processors of an n-processor OMP. Consider a message pattern in which the maximum number of destination processors from any source processor is d, 1 ≤ d ≤ n. Assume that processors exchange messages of uniform length l bytes each. This message pattern can be defined as a subset of the all-to-all pattern with vector length d. The memory sub-system design is assumed to support vector memory accesses with vector length d, 1 ≤ d ≤ n. Hence, this message pattern can be implemented as a block-vector-write access followed by a block-vector-read access. Since an n-processor OMP can support parallel memory data transfer over n buses, a maximum of nd messages with a total capacity of ndl bytes can be transferred with an overhead of 2(α + β + (dl − 1)τ + (l − 1)δ) + γ, as derived in Table 2.1. The parameter γ reflects the bus switching and synchronization overhead. This leads to a communication bandwidth of ndl / (2(α + β + (dl − 1)τ + (l − 1)δ) + γ) bytes/sec.
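As a quick numerical check, the overhead expression above and the rearranged form used for the bandwidth bound below are algebraically identical. The sketch below verifies the rearrangement; the parameter values are arbitrary placeholders, not the prototype's measured values.

```python
# Check that 2*(alpha + beta + (d*l - 1)*tau + (l - 1)*delta) + gamma
# equals    2*((alpha + beta - tau - delta) + (d*tau + delta)*l) + gamma,
# the form used for the peak-bandwidth bound.  Values are placeholders.

def overhead_direct(alpha, beta, tau, delta, gamma, d, l):
    return 2 * (alpha + beta + (d * l - 1) * tau + (l - 1) * delta) + gamma

def overhead_rearranged(alpha, beta, tau, delta, gamma, d, l):
    return 2 * ((alpha + beta - tau - delta) + (d * tau + delta) * l) + gamma

for d in (1, 4, 32):
    for l in (16, 256, 4096):
        a = overhead_direct(0.4, 0.5, 0.05, 0.2, 0.5, d, l)
        b = overhead_rearranged(0.4, 0.5, 0.05, 0.2, 0.5, d, l)
        assert abs(a - b) < 1e-9
```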
Consider the design parameters used in designing the orthogonal multiprocessor prototype at USC [HDP+90]. Table 5.3 shows the available bandwidth on a 32-processor OMP for varying message lengths and numbers of destinations. It has the potential to provide a raw bandwidth of 284 Mbytes/sec for a message length of 4K bytes. The above expression has a fixed overhead of (α + β − τ − δ). For large values of l, the pipelined transfer overhead (dτ + δ)l becomes dominant. This leads to an upper bound on the available bandwidth:
    Peak OMP Bandwidth = ndl / (2((α + β − τ − δ) + (dτ + δ)l) + γ)
                       ≈ nd / (2(dτ + δ))   for large l              (5.2)
Table 5.3: Estimated Vectorized Communication Bandwidth in Mbytes/sec on a 32x32 Orthogonal Multiprocessor for varying message lengths (d = maximum number of destinations in a message pattern).

          message length l in bytes
   d      16      64     256    1024    4096
   1   52.24   60.59   63.11   63.78   63.94
   2   89.82  101.89  105.43  106.36  106.59
   4  140.27  154.57  158.61  159.65  159.91
   8  195.05  208.45  212.09  213.02  213.26
  16  242.37  252.45  255.10  255.78  255.94
  32  275.82  282.24  283.89  284.31  284.41
Since d ≤ n, the peak bandwidth is proportional to n²/(2(nτ + δ)). The communication bandwidth increases with the message length l and the number of destinations d in a message pattern. Communication bandwidths for different OMP sizes are estimated in Table 5.4. The peak bandwidth increases as O(n) for large n. For every d pipelined cycles, there is an overhead of δ for changing to the next address. With a fixed δ, vectorized communication therefore becomes more efficient for larger values of d.
Table 5.4: Estimated Communication Bandwidth in Mbytes/sec for Different OMP Sizes (n = OMP size, d = maximum number of destinations in a message pattern, d ≤ n, NA = Not Applicable).

          OMP size n
   d      4       8      16      24      32
   1   8.00   16.00   32.00   48.00   64.00
   2  13.33   26.67   53.33   80.00  106.67
   4  20.00   40.00   80.00  120.00  160.00
   8     NA   53.33  106.67  160.00  213.33
  12     NA      NA  120.00  180.00  240.00
  16     NA      NA  128.00  192.00  256.00
  20     NA      NA      NA  200.00  266.67
  24     NA      NA      NA  205.71  274.29
  28     NA      NA      NA      NA  280.00
  32     NA      NA      NA      NA  284.44
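The large-l bound of Eq. (5.2) reproduces the entries of Table 5.4. The sketch below uses τ = 50 nsec, as stated in the text; the value δ = 200 nsec is an assumption chosen to be consistent with the tabulated numbers, not a quoted prototype parameter.

```python
# Peak OMP bandwidth for large l:  nd / (2 * (d*tau + delta))  bytes/sec.
# tau = 50 ns is the pipelined bus transfer time used in the text;
# delta = 200 ns is assumed here so the numbers match Table 5.4.

TAU = 50e-9     # pipelined bus transfer time (sec/byte)
DELTA = 200e-9  # address-change overhead per block (sec), assumed

def peak_bandwidth_mb(n, d):
    """Peak vectorized communication bandwidth in Mbytes/sec."""
    assert d <= n
    return n * d / (2 * (d * TAU + DELTA)) / 1e6

print(round(peak_bandwidth_mb(4, 1), 2))    # 8.0    (Table 5.4, n=4, d=1)
print(round(peak_bandwidth_mb(32, 32), 2))  # 284.44 (Table 5.4, n=32, d=32)
```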
Chapter 6

Program Conversion From Multicomputers to Multiprocessors
6.1 Introduction
Communication vectorization provides a mechanism to implement the interprocessor communication steps of a parallel program as vectorized shared-variable memory access steps. This allows a multiprocessor system to convert and implement distributed-memory parallel programs by using shared-memory mailbox communication. Such program conversion makes a shared-memory system a flexible architecture by allowing it to support both the shared-memory and message-passing models of parallel computation. However, the question arises whether such program conversion using communication vectorization is beneficial in terms of reducing communication complexity. In this chapter, we emphasize this aspect of communication complexity reduction. First, we discuss the issues in such program conversion. Since the hypercube is a popular multicomputer architecture, we take representative hypercube programs, convert them to run on crossbar-connected and orthogonally-connected multiprocessors, and compare the respective communication complexities involved. We present analytical and simulation results which confirm a significant reduction
in communication complexity for programs using dense interprocessor communication. Besides the hypercube, we also discuss converting mesh programs to run on the orthogonally-connected multiprocessor.
6.2 Converting Hypercube Programs

6.2.1 Vectorizing Communication Patterns
We consider a hypercube with circuit-switched routing [Bok91b]. A typical hypercube system supporting this routing scheme is the Intel iPSC-860 [Bok91a]. Since a single-bus multiprocessor demonstrates limited vectorization capability, we consider only the crossbar-connected and orthogonally-connected multiprocessors. For simplicity, we refer to the computational units of a hypercube as nodes and those of a multiprocessor as processors. Programs developed to run on an m-node hypercube are converted to run on an n-processor multiprocessor, where m ≥ n. For m = n, the computational task of each hypercube node is mapped to a single processor of the multiprocessor. Inter-node message-passing steps are vectorized and mapped onto shared-memory mailboxes.

For m > n, we group the m hypercube nodes into n clusters, with (m/n) nodes assigned to each cluster. Assume m = 2^p and n = 2^q for some integers p and q. Consider a mapping of the computational tasks (with their associated data allocation) of each cluster of 2^(p-q) nodes onto a single processor. This collapses the hypercube communication corresponding to (p - q) dimensions into intra-cluster communication. These intra-cluster communication steps are implemented as memory-write and memory-read accesses to the local memory attached to the processor. By restructuring the computational tasks of (m/n) processes into a single process, these
local memory communication steps become integrated into the computation. This leaves the inter-cluster communication steps belonging to the remaining q dimensions to be implemented using shared memory. Figure 6.1 illustrates the different clustering options. Based on the program characteristics, the most frequently used (p - q) dimensions may be mapped as intra-cluster communication to achieve minimal communication complexity.
[Figure: a 32-node hypercube with link dimensions ±x0 through ±x4, and the table of clustering options below.]

  Option   Intra-cluster   Inter-cluster
  1        +x0, -x0        +x1, -x1, +x2, -x2, +x3, -x3, +x4, -x4
  2        +x1, -x1        +x0, -x0, +x2, -x2, +x3, -x3, +x4, -x4
  3        +x2, -x2        +x0, -x0, +x1, -x1, +x3, -x3, +x4, -x4
  4        +x3, -x3        +x0, -x0, +x1, -x1, +x2, -x2, +x4, -x4
  5        +x4, -x4        +x0, -x0, +x1, -x1, +x2, -x2, +x3, -x3

Figure 6.1: Different clustering options for a 32-node hypercube to convert its program to run on a 16-processor multiprocessor.
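The clustering of Figure 6.1 amounts to deleting one bit of the node index: collapsing dimension xk maps each node to the cluster obtained by removing bit k, and a hypercube link becomes intra-cluster exactly when it is a ±xk link. A minimal sketch (the function names are illustrative, not from the dissertation):

```python
# Cluster a 2^p-node hypercube onto 2^(p-1) processors by collapsing
# link dimension k: delete bit k of the node index (option k+1 in Fig. 6.1).

def cluster_of(node, k):
    """Cluster index obtained by removing bit k from the node index."""
    high = (node >> (k + 1)) << k  # bits above k, shifted down one place
    low = node & ((1 << k) - 1)    # bits below k, unchanged
    return high | low

def is_intra_cluster(src, dst, k):
    """A hypercube link (src, dst) becomes local-memory traffic
    iff both endpoints fall in the same cluster."""
    return cluster_of(src, k) == cluster_of(dst, k)

# Dimension-k neighbors always share a cluster; other links go inter-cluster.
assert is_intra_cluster(5, 5 ^ (1 << 2), 2)       # a +/-x2 link: intra
assert not is_intra_cluster(5, 5 ^ (1 << 0), 2)   # a +/-x0 link: inter
```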
We consider hypercube messages to have a uniform length of l bytes each. Communication vectorization implements message-passing through shared-memory mailboxes. As derived in Table 5.2, each primitive pattern is implemented as a pair of memory-write and memory-read accesses. For m = n, these accesses are done in
blocks of l bytes. For m > n, intra-cluster communication accounts for all possible message exchanges between the nodes associated with a cluster. Thus, the effective message length becomes a multiple of the l-byte blocks. This effective message length depends on the primitive pattern and is derived later in this section.
For larger message lengths, we use block accesses as shown in Table 2.1. The total time to implement a memory graph depends on whether the access can be implemented as a single step or as multiple steps. As an example, consider the memory graphs corresponding to the all-to-all-personalized pattern. As shown in Table 5.2, the write graph is implemented as a parallel-vector access on both the crossbar-connected and orthogonally-connected multiprocessors. Since all processors work in parallel, the time to implement this access is the same as that of a vector access, as shown in Table 2.1. The corresponding read graph is implemented as a parallel-scalar access on the crossbar-connected multiprocessor. Since (n - 1) processors are involved in this parallel-scalar access, it is implemented as (n - 1) scalar accesses. Thus, the total time to implement this access is (n - 1) times that of a scalar access. For each memory graph, we use this approach to determine the total access time. Table 6.1 shows the equivalent memory access steps to implement the primitive message patterns on the crossbar-connected and orthogonally-connected multiprocessors.
6.2.2 Hypercube Communication Complexity
Consider the primitive patterns implemented on an m-node hypercube using circuit-switched communication. We consider message transfers with unblocked-send and blocked-receive characteristics [Bok91a]. The time complexity to communicate a message of l bytes to a node at a distance of d hops is:

    Th = th + l·τh + d·v                                             (6.1)
Table 6.1: Equivalent Memory Accesses to Implement Primitive Message Patterns on Multiprocessors by Communication Vectorization (the entry against each access type indicates the number of required memory access steps and, where marked "par", parallel access by the processors).

  One-to-one:
    Crossbar-Connected:     block-scalar-write (1) + block-scalar-read (1)
    Orthogonally-Connected: block-scalar-write (1) + block-scalar-read (1)
  Permutation:
    Crossbar-Connected:     block-scalar-write (1, par) + block-scalar-read (1, par)
    Orthogonally-Connected: block-scalar-write (1, par) + block-scalar-read (1, par)
  One-to-all (broadcast):
    Crossbar-Connected:     block-broadcast (1) + block-scalar-read (n-1)
    Orthogonally-Connected: block-broadcast (1) + block-scalar-read (1, par)
  One-to-all (personalized):
    Crossbar-Connected:     block-vector-write (1) + block-scalar-read (n-1)
    Orthogonally-Connected: block-vector-write (1) + block-scalar-read (1, par)
  All-to-one:
    Crossbar-Connected:     block-scalar-write (1, par) + block-scalar-read (n-1)
    Orthogonally-Connected: block-scalar-write (1, par) + block-vector-read (1)
  All-to-all (broadcast):
    Crossbar-Connected:     block-broadcast (1, par) + block-scalar-read (n-1, par)
    Orthogonally-Connected: block-broadcast (1, par) + block-vector-read (1, par)
  All-to-all (personalized):
    Crossbar-Connected:     block-vector-write (1, par) + block-scalar-read (n-1, par)
    Orthogonally-Connected: block-vector-write (1, par) + block-vector-read (1, par)
where th is the message start-up time, τh is the data transfer time per byte, and v is the average circuit-switching time per hop. These parameters on the iPSC/860 hypercube, empirically derived by Bokhari [Bok91a], are: th = 95.0, τh = 0.394, and v = 10.3 microseconds. We use these parameters in our analysis. This communication time Th can be expressed as a function (ch + dh·l), where ch reflects the fixed overhead independent of the message length l, and dh reflects the variable overhead per byte of message.
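A minimal sketch of Eq. (6.1) with the Bokhari parameters quoted above:

```python
# Hypercube message time, Eq. (6.1):  T_h = t_h + l*tau_h + d*v,
# using the iPSC/860 parameters measured by Bokhari (all in microseconds).

T_START = 95.0  # t_h: message start-up time
TAU_H = 0.394   # tau_h: data transfer time per byte
V_HOP = 10.3    # v: circuit-switching time per hop

def hypercube_time_us(l_bytes, hops):
    """Time in microseconds to send l_bytes across the given hop count."""
    return T_START + l_bytes * TAU_H + hops * V_HOP

print(round(hypercube_time_us(1024, 4), 3))  # 539.656 us for 1 KByte, 4 hops
```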
Table 6.2 shows the communication complexities for the primitive patterns. For the one-to-one pattern, we assume the communicating nodes to be farthest apart, at a distance of log m hops. At least two communicating nodes at a distance of log m hops are assumed in the case of permutation. One-to-all communication for the broadcast type of message is implemented as a tree structure of height log m. This communication for the personalized type of message exchange is implemented as a sequence of (m - 1) one-to-one message-passing substeps. The message length remains the same in each hop of this pattern. The total number of hops traversed in the (m - 1) substeps is m·log m/2; the average number of hops traversed per substep thus becomes m·log m/(2(m - 1)). The all-to-one pattern also gets implemented as a hierarchical tree structure. Nodes at each level merge the messages from their two children and send the result to their respective parents. Depending on the computation involved, merging may not increase the message length in each substep. One such example is histogramming, which we consider in our simulation experiments in the following section. There are log m communication substeps involved in implementing this pattern. Each communication substep covers a distance of only one hop.
An optimal circuit-switched procedure, developed by Bokhari [Bok91b], is used to implement all-to-all communication for both the broadcast and personalized types of message exchange. This procedure is implemented in (m - 1) substeps. During the ith
substep, 1 ≤ i ≤ m - 1, node j, 0 ≤ j ≤ m - 1, sends a message to node j ⊕ i, where ⊕ is the bitwise exclusive-or operation. This procedure leads to conflict-free communication in each substep. The message length remains unchanged in each substep, and all pairs of processors remain at identical distances from each other. This leads to an average of m·log m/(2(m - 1)) hops in each transmission substep.
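The properties of this exchange schedule are easy to verify mechanically: the pairing j ↔ j ⊕ i is a perfect matching in every substep, and the hop count of substep i is the bit count of i, which averages m·log m/(2(m - 1)) over the (m - 1) substeps. A sketch:

```python
# All-to-all exchange schedule: in substep i (1 <= i <= m-1), node j
# communicates with node j ^ i.  The pairing is a permutation of the
# nodes (conflict-free), and each pair is popcount(i) hops apart.
from math import log2

m = 16
total_hops = 0
for i in range(1, m):
    partners = [j ^ i for j in range(m)]
    assert sorted(partners) == list(range(m))  # perfect matching in substep i
    total_hops += bin(i).count("1")            # hop distance of substep i

average = total_hops / (m - 1)
# Matches the closed form m*log(m) / (2*(m-1)) derived in the text.
assert abs(average - m * log2(m) / (2 * (m - 1))) < 1e-12
```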
Table 6.2: Communication Complexities for Primitive Patterns on an m-node Hypercube using Circuit-Switched Communication (m = number of nodes, th = communication start-up time, τh = data transfer time per byte, v = circuit-switched communication time per hop, l = uniform message length, ch = fixed overhead, dh = variable overhead per byte of message transfer).

  One-to-one:
    Total: th + l·τh + log m · v
    ch = th + log m · v;  dh = τh
  Permutation:
    Total: th + l·τh + log m · v
    ch = th + log m · v;  dh = τh
  One-to-all (broadcast):
    Total: log m · (th + l·τh + v)
    ch = log m · (th + v);  dh = log m · τh
  One-to-all (personalized):
    Total: (m - 1)(th + l·τh + (m·log m/(2(m - 1)))·v)
    ch = (m - 1)(th + (m·log m/(2(m - 1)))·v);  dh = (m - 1)·τh
  All-to-one:
    Total: log m · (th + l·τh + v)
    ch = log m · (th + v);  dh = log m · τh
  All-to-all (broadcast and personalized):
    Total: (m - 1)(th + l·τh + (m·log m/(2(m - 1)))·v)
    ch = (m - 1)(th + (m·log m/(2(m - 1)))·v);  dh = (m - 1)·τh
6.2.3 Vectorized Communication Complexity
Consider implementing the primitive patterns of an m-node hypercube with message length l. Communication vectorization implements these patterns as pairs of memory accesses, as shown in Table 6.1. Consider a hypercube program solving a problem with an initial data allocation to each of the m nodes. In order to implement and map the same program on an n-processor system, we cluster and map (m/n) nodes to
a processor. This mapping allocates the data associated with (m/n) nodes to a single processor. Shared-variable communication between processors now reflects inter-cluster hypercube communication. This leads to different message lengths for (m/n) > 1. We denote this new message length as the effective message length L. For example, we have L = (m/n)²·l for one-to-all personalized communication.

Similar to expressing the hypercube communication complexity by the two parameters ch and dh, we introduce the parameters cc, dc, co, and do for the crossbar-connected and orthogonally-connected multiprocessors, respectively. The parameters cc and co reflect fixed overheads, while dc and do reflect the variable overhead per byte of message length. This leads to Tc = cc + dc·l and To = co + do·l, where Tc and To denote the total time complexity to implement a pattern on the crossbar-connected and orthogonally-connected multiprocessors, respectively. Table 6.3 shows the complexities of implementing the primitive patterns on the two multiprocessors with communication vectorization. The entries are computed from the results derived in Tables 2.1, 5.2, and 6.1.
6.2.4 Reduction in Communication Complexity
Now we compare the complexities of hypercube communication with those of vectorized shared-variable communication. The following theorem determines the reduction in communication complexity associated with vectorized shared-variable communication.

Theorem 8  Vectorizing the primitive patterns of an n-node hypercube program to run on n-processor crossbar-connected and orthogonally-connected multiprocessors leads to asymptotic relative reductions in communication complexity of ((dh - dc)/dh) and ((dh - do)/dh), respectively. For shorter messages, the respective relative reductions are ((ch - cc)/ch) and ((ch - co)/ch).
Proof: Compare the communication complexities derived in Tables 6.2 and 6.3. Each time complexity has a fixed overhead and a variable overhead. The parameters ch, cc,
Table 6.3: Time Complexities and Associated Parameters to Implement Primitive Patterns on the Two Multiprocessors using Communication Vectorization (m = size of hypercube, n = size of multiprocessor, l = message length in hypercube, L = effective message length on multiprocessors, cc, co = fixed overheads, dc, do = variable overheads, α = constant access time, β = memory access and bus transfer time, τ = pipelined bus transfer time, δ = address change overhead in block access, γ = synchronization and bus-switching time).

  One-to-one:  L = l
    Crossbar:   cc = 2α + γ;  dc = 2β
    Orthogonal: co = 2α + γ;  do = 2β
  Permutation:  L = (m/n)·l
    Crossbar:   cc = 2α + γ;  dc = 2(m/n)β
    Orthogonal: co = 2α + γ;  do = 2(m/n)β
  One-to-all (broadcast):  L = (m/n)·l
    Crossbar:   cc = nα + γ;  dc = mβ
    Orthogonal: co = 2α + γ;  do = 2(m/n)β
  One-to-all (personalized):  L = (m²/n²)·l
    Crossbar:   cc = nα + β - τ - δ + γ;  dc = (m²/n)τ + (m²/n²)δ + (n - 1)(m²/n²)β
    Orthogonal: co = 2α + β - τ - δ + γ;  do = (m²/n)τ + (m²/n²)δ + (m²/n²)β
  All-to-one:  L = l
    Crossbar:   cc = nα + γ;  dc = nβ
    Orthogonal: co = 2α + β - τ - δ + γ;  do = β + nτ + δ
  All-to-all (broadcast):  L = (m/n)·l
    Crossbar:   cc = nα + γ;  dc = mβ
    Orthogonal: co = 2α + β - τ - δ + γ;  do = (m/n)β + mτ + (m/n)δ
  All-to-all (personalized):  L = (m²/n²)·l
    Crossbar:   cc = nα + β - τ - δ + γ;  dc = (m²/n)τ + (m²/n²)δ + (n - 1)(m²/n²)β
    Orthogonal: co = 2(α + β - τ - δ) + γ;  do = 2(m²/n)τ + 2(m²/n²)δ
and co reflect the fixed overheads for the hypercube, the crossbar-connected multiprocessor, and the orthogonally-connected multiprocessor, respectively. The parameters dh, dc, and do, multiplied by the message lengths, determine the respective variable overheads. Consider vectorizing a hypercube message-passing step on the orthogonally-connected multiprocessor. The total reduction in communication complexity after vectorization is ((ch - co) + (dh - do)l). The relative reduction is ((ch - co) + (dh - do)l)/(ch + dh·l). For long messages (large l), we have dh·l ≫ ch, do·l ≫ co, and (dh - do)l ≫ (ch - co). This leads to an asymptotic relative reduction of ((dh - do)/dh). For short messages (small l, l ≤ 16), the fixed overhead dominates, leading to a relative reduction of ((ch - co)/ch). Similar arguments hold for the crossbar-connected multiprocessor. ■
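The two limiting regimes of Theorem 8 can be checked numerically. In the sketch below, ch and dh are the one-to-one hypercube overheads for m = 16 from Table 6.2 with the iPSC/860 parameters; the multiprocessor overheads co and do are arbitrary placeholders, not the prototype values.

```python
# Relative reduction from Theorem 8:
#   ((c_h - c_o) + (d_h - d_o)*l) / (c_h + d_h*l),
# tending to (d_h - d_o)/d_h for large l and (c_h - c_o)/c_h as l -> 0.

def relative_reduction(c_h, d_h, c_o, d_o, l):
    return ((c_h - c_o) + (d_h - d_o) * l) / (c_h + d_h * l)

c_h, d_h = 136.2, 0.394  # one-to-one, m = 16: t_h + 4v = 136.2 us; tau_h
c_o, d_o = 2.0, 0.1      # multiprocessor overheads (placeholders)

asymptotic = (d_h - d_o) / d_h   # large-l limit
short = (c_h - c_o) / c_h        # l -> 0 limit
assert abs(relative_reduction(c_h, d_h, c_o, d_o, 10**7) - asymptotic) < 1e-3
assert abs(relative_reduction(c_h, d_h, c_o, d_o, 0) - short) < 1e-12
```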
Table 6.4 estimates the percentage reductions in complexity for m = n = 16. The hypercube communication parameters used to calculate these reductions are based on the iPSC-860 hypercube system design [Bok91a]. The memory access parameters used in this estimation are the conservative parameters used in the prototype design of an orthogonally-connected multiprocessor [HDP+90]. The comparison results are sensitive to the memory access parameters α, β, δ, and τ. However, for large l, the result depends entirely on τ. We have taken a conservative approach in using τ = 50 nsec, corresponding to a 20 MHz data transfer rate on the bus. Current bus technology supports considerably higher data transfer rates; such high-performance bus designs will lead to still higher reductions.
For all patterns, significant reductions are indicated for small l. This reduction is mainly due to the large communication start-up time (th) in the hypercube compared to the small memory-access start-up time in the multiprocessors. The reductions vary for large l. Both one-to-one and permutation incur almost equal overheads on the hypercube and the multiprocessors. The one-to-all-broadcast pattern indicates a reduction
Table 6.4: Analytically Estimated Asymptotic Percentage Reductions (negative entries correspond to an increase) in Communication Complexity by Converting a 16-processor Hypercube Program onto 16-processor Multiprocessors (CCM = Crossbar-Connected Multiprocessor, OMP = Orthogonally-Connected Multiprocessor, l = message length).

                                 large l           small l
  Primitive Patterns           CCM     OMP       CCM     OMP
  One-to-one                  -1.5    -1.5      94.7    94.7
  Permutation                 -1.5    -1.5      94.7    94.7
  One-to-all (broadcast)     -81.0    74.6      87.0    98.5
  One-to-all (personalized)   36.5    80.5      96.0    98.9
  All-to-one                 -90.4    27.0      86.3    95.3
  All-to-all (broadcast)      49.0    79.8      96.7    98.9
  All-to-all (personalized)   32.0    66.0      95.8    98.2
on the orthogonally-connected multiprocessor and an increase on the crossbar-connected multiprocessor. This increase is due to the memory-read access being implemented as (n - 1) scalar accesses on the crossbar-connected multiprocessor. A similar explanation holds for the personalized type of one-to-all message exchange. Communication vectorization leads to significant reductions for the all-to-all pattern. For the personalized type of message, both the write and read steps can be vectorized on the orthogonally-connected multiprocessor. This leads to a 66% reduction in communication complexity. On the crossbar-connected multiprocessor, the write step is vectorized but the read step is implemented as a parallel-scalar access. This leads to a limited 32% reduction. For the broadcast message pattern, these reductions are 79.8% and 49.0%, respectively. This indicates that this pattern can be implemented on the multiprocessors with a communication overhead two to five times less than that of a hypercube system. This table illustrates the advantage of shared-variable vectorized communication for different message-passing patterns.
Next we consider reductions in communication complexity while converting hypercube programs for m ≫ n. Consider vectorizing the communication steps of an m-node hypercube program to run on an n-processor orthogonally-connected multiprocessor. There exists a communication reduction iff, for the given n, ((ch - co) + (dh - do)l) > 0. A similar argument holds for the crossbar-connected multiprocessor iff, for the given n, ((ch - cc) + (dh - dc)l) > 0. In this conversion, the computational tasks of (m/n) nodes are mapped to a single processor of the multiprocessor system. This leads to an increase in computational complexity. This increase affects the total complexity based on the ratio of computation to communication steps in the program. Hence, the effectiveness of communication vectorization for m > n varies with the message-passing pattern. We emphasize this aspect in the following section.
6.3 Simulation Experiments and Results
We have carried out simulation experiments to support our theoretical findings. Several programs involving the primitive patterns were executed on a simulated hypercube system. These programs were then converted to run on multiprocessors using communication vectorization. The converted programs were executed on a simulated crossbar-connected and on a simulated orthogonally-connected multiprocessor. The respective communication complexities obtained from simulation are reported below.
6.3.1 Simulation Experiments Performed

The following five problems were considered in our experimentation:
• Row-shuffle permutation of a (P x P) matrix on an m-node hypercube system. This problem uses permutation communication. Initially, (P/m) consecutive rows of the matrix are assigned to each node. The rows are then redistributed among the nodes based on a perfect shuffle of the node indices.

• Solving AY = B on an m-node hypercube system using the Gaussian-elimination with back-substitution algorithm. The vector Y consists of P variables. Each node is associated with (P/m) linear equations and (P/m) components of Y. The communication steps used in this algorithm, in both the elimination and back-substitution phases, are of the one-to-all-broadcast type.

• Computing the histogram of a (1024 x 1024) image with B grey levels on an m-node hypercube system. Each node is associated with (P/m) rows of the image. Partial histograms are first computed at each node, in parallel. These partial histograms are then merged by performing all-to-one communication. We use a tree-structured communication to implement this merge. This involves log m substeps of near-neighbor communication.

• Multiplying two matrices A x B = C on an m-node hypercube system. The matrices are of dimension (P x P) each. (P/m) rows of A and (P/m) columns of B are distributed to each node initially. Partial results are first computed at each node, in parallel. The columns of B are then communicated to all other nodes in an all-to-all-broadcast manner.

• Transposition of a (P x P) matrix on an m-node hypercube system. (P/m) rows of the matrix are distributed initially to the nodes. After transposition, each node receives (P/m) columns.
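The tree-structured all-to-one merge used in the histogramming problem can be sketched as a sequential simulation of the log m substeps (the actual implementation runs each substep in parallel on the nodes; the data values here are toy placeholders):

```python
# Tree-structured all-to-one merge of partial histograms on an m-node
# hypercube: in substep s, each node with bit s set sends its partial
# histogram one hop to the neighbor with bit s cleared, which adds it in.
# Note that merging does not grow the message length, as discussed above.
import math

m, B = 8, 4                               # nodes, grey levels
partial = {j: [j] * B for j in range(m)}  # toy per-node partial histograms

for s in range(int(math.log2(m))):        # log m substeps, one hop each
    for j in list(partial):
        if j & (1 << s):                  # sender in this substep
            dest = j ^ (1 << s)           # its one-hop neighbor
            partial[dest] = [a + b for a, b in zip(partial[dest], partial[j])]
            del partial[j]

assert list(partial) == [0]               # everything merged at node 0
assert partial[0] == [sum(range(m))] * B  # each bin totals 0+1+...+7 = 28
```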
Two sets of experiments were carried out. The first set considered identical system sizes and different problem sizes. The objective was to observe the reduction in communication complexity for varying message lengths while keeping the computational complexity the same for all systems. The programs were executed on a simulated 16-node hypercube, a simulated 16-processor crossbar-connected system, and a simulated 16-processor orthogonally-connected system. Experiments were conducted for varying matrix sizes: P = 32, 64, 128, 256, 512, and 1024. For the histogramming problem, we varied the number of grey levels B in a similar manner. Smaller P (B)
resulted in smaller message lengths during the communication steps. For the 16-processor systems, a value of P = 1024 led to a message length of 64K bytes for several message patterns. This reflected the effect of communication vectorization on longer messages. In total, 84 experiments were performed in this first set.

During the second set of experiments, we kept the problem size the same and varied the system sizes. We chose the largest problem size, P(B) = 1024. We observed the effect of communication vectorization in converting programs from large hypercube systems to run on smaller multiprocessors. For each problem, we executed programs on simulated hypercubes with 32, 64, 128, 256, 512, and 1024 processors. The converted programs were executed on simulated crossbar-connected and orthogonally-connected multiprocessors with 16 and 32 processors. An additional 38 experiments were performed in this set. Through these program conversions, we observed the tradeoff between the reduction in communication complexity and the increase in computational complexity.
6.3.2 Simulation Results and Implications
Figure 6.2a shows the effect of communication vectorization for the permutation type of communication. Since this pattern is inherently scalar in nature, we could not vectorize it. The communication complexities were found to be almost equal for all three systems. Figure 6.2b compares the communication complexities for one-to-all-broadcast communication. Compared to the hypercube, the OMP performed very well in reducing communication complexity. As the message length increased with problem size, the communication complexity on the crossbar-connected multiprocessor increased. This was due to conflicts in the memory-read accesses to the column-oriented vector mailboxes; this access was implemented as a sequence of (n - 1) scalar accesses. For longer messages, communication vectorization was not found effective on the crossbar-connected multiprocessor for this message pattern.
A comparison of communication complexities for all-to-all-broadcast communication is shown in Figure 6.2c. Both multiprocessors implemented the memory-write access as a parallel-broadcast operation. The corresponding memory-read access was implemented on the OMP in a vectorized manner. This led to a significant reduction in communication complexity. The crossbar-connected multiprocessor implemented this memory-read as a parallel-scalar access, which led to a smaller reduction in communication complexity. Figure 6.2d compares communication complexities for the dense all-to-all-personalized message pattern. This pattern was implemented as a parallel-vector-write followed by a vector-read access on the orthogonally-connected multiprocessor, and as a parallel-vector-write followed by a parallel-scalar-read access on the CCM. A significant reduction in communication complexity was observed on the OMP compared to the CCM.
Figure 6.2e compares the communication complexities for the all-to-one message pattern. This was implemented as a single-step parallel-scalar-write followed by a vector-read access on the orthogonally-connected multiprocessor. A significant reduction in communication complexity was observed in this case. For the CCM, the receiving processor performed (n - 1) scalar-read accesses to read the messages from the (n - 1) column-oriented mailboxes. As the message length increased, this resulted in higher communication complexity.
The results obtained in the first set of simulation experiments are summarized in Table 6.5. Experiments with P(B) = 32 represented problems with smaller message lengths. Experiments with P(B) = 1024 involved considerably larger message lengths. Consider T*h, T*c, and T*o representing the communication complexities derived
[Figure: five plots of communication time in milliseconds versus matrix size P (grey levels B for histogramming), each comparing OMP (o), CCM, and Hyp (*): (a) row-shuffle permutation (permutation), (b) gaussian-elimination (one-to-all-broadcast), (c) matrix multiplication (all-to-all-broadcast), (d) matrix transpose (all-to-all-personalized), (e) histogramming (all-to-one).]

Figure 6.2: Comparison of communication complexities on 16-processor systems for various problem sizes (OMP = orthogonal multiprocessor, CCM = crossbar-connected multiprocessor, Hyp = hypercube).
from the simulation experiments for the hypercube, crossbar-connected, and orthogonally-connected multiprocessors, respectively. We calculated the percentage reductions as (T*h - T*c)/T*h and (T*h - T*o)/T*h. These reductions, summarized in Table 6.5, closely match the asymptotic relative reductions estimated analytically in Table 6.4. However, there was a variation with the Gaussian elimination example.
While deriving the results in Table 6.4, we assumed that the effective message length becomes (m/n)l after clustering (m/n) hypercube nodes onto a single processor. This corresponds to combining the messages from all (m/n) nodes into a single large message. However, the Gaussian elimination algorithm works differently. Messages corresponding to each of the (m/n) nodes are sent separately during the row-elimination and back-substitution phases. Hence there were (m/n) broadcast operations. The broadcast operation corresponding to the jth row elimination broadcast a message of length (P - j + 2), 2 ≤ j ≤ P - 1. During the back-substitution phase, the message length remained restricted to one byte only. This combination of several non-uniform message communications led to the variation from the analytically predicted result.
Table 6.5: Percentage Reductions (negative entries correspond to an increase) in Communication Complexity Derived by Simulation while Converting a 16-processor Hypercube Program to run on 16-processor Multiprocessors (CCM = Crossbar-Connected Multiprocessor, OMP = Orthogonally-Connected Multiprocessor, l = message length).

                                 large l           small l
  Primitive Patterns           CCM     OMP       CCM     OMP
  Permutation                 -1.3    -1.3      81.8    81.8
  One-to-all (broadcast)      -8.1    87.7      89.1    98.7
  All-to-one                 -72.2    39.5      88.4    97.1
  All-to-all (broadcast)      42.7    77.6      88.8    95.5
  All-to-all (personalized)   52.4    67.2      98.2    99.3
6.3.3 Tradeoffs in Computations and Communications
Consider converting programs for an m-node hypercube to run on an n-processor multiprocessor for m > n. As discussed earlier, this conversion increases the computational complexity by a factor of (m/n). The question now is whether such conversion leads to a reduction in communication complexity significant enough to reduce the total execution time. We observed this tradeoff in our second set of simulation experiments.
Consider the matrix-row-shuffle problem requiring permutation communication. Figure 6.3a indicates the time complexities for various system sizes. This problem is communication-intensive. With larger hypercubes, the permutation was observed to be implemented faster. Hence, the program conversion was found not to be advantageous in this case.
Figure 6.3b shows the tradeoffs for the Gaussian elimination problem requiring one-to-all communication. With larger hypercubes, there was an increase in communication complexity and a decrease in computational complexity. The 16-processor OMP and CCM performed better than a 32-processor hypercube. Compared to a 1024-node hypercube, OMP-16 reduced the communication complexity by a factor of 21. However, the computational complexity increased by a factor of 59. This was expected because each processor of OMP-16 performed the computational tasks of 64 hypercube nodes. However, the total execution time increased by a factor of 3.9 only. Considering the (processor × time) complexity measure, the program conversion in this case was found to be very effective.
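The (processor × time) argument above can be made concrete with a small sketch (hypothetical C, not from the thesis): going from 1024 processors to 16 divides the processor count by 64, so even a 3.9× slowdown leaves the processor-time product about 16× smaller.

```c
/* Processor-time gain when converting a hypercube program to a smaller
   multiprocessor: divide the old processor count by the new count times
   the observed slowdown.  The factors used below are the simulation
   results quoted in the text; the function itself is our illustration. */
double proc_time_gain(double procs_before, double procs_after,
                      double time_slowdown)
{
    return procs_before / (procs_after * time_slowdown);
}
```

With the quoted factors, proc_time_gain(1024, 16, 3.9) is roughly 16.4, which is why the conversion looks effective under the (processor × time) measure despite the 59× rise in computational work per processor.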
The tradeoffs in implementing all-to-one communication in our histogramming problem are shown in Figure 6.3c. Converting programs for a 1024-node hypercube onto a 16-processor OMP led to a reduction in communication complexity by a factor of 4.1.
Since this problem is computationally intensive, this communication reduction did not help in reducing the total execution time complexity.

Figure 6.3: Comparison of timing complexities for three problems on different hypercube, orthogonal multiprocessor, and crossbar-connected multiprocessor configurations: (a) permuting the rows of a (1024 x 1024) matrix in a shuffle fashion (permutation); (b) Gaussian elimination with back-substitution for 1024-variable linear equations (one-to-all broadcast); (c) histogramming a (1024 x 1024) image with 1024 grey levels (all-to-one). The bars show computation, communication, wait time in the hypercube, and synchronization on the OMP and CCM; H16 = 16-processor hypercube, O = OMP, C = CCM.

The tradeoffs for all-to-all-broadcast communication, used in the matrix multiplication problem, are shown in Fig. 6.4a. This
tradeoff was found to be similar to that of the histogramming example. Though there was a significant reduction in communication complexity, it was hidden by the increase in computational complexity. However, considering the (processor × time) complexity measure, this program conversion was found to be beneficial. Fig. 6.4b shows the tradeoffs for matrix transposition involving all-to-all-personalized communication. This problem is more communication-intensive. For transposing a (1024 x 1024) matrix, a 64-node hypercube was found to be optimal. Both OMP-16 and CCM-16 performed well compared to this optimal hypercube size, and OMP-32 performed even better. Compared to a 1024-node hypercube, OMP-16 reduced the communication complexity by a factor of 17.7 and the computational complexity by a factor of 7.8. This led to an overall reduction of the total execution time by a factor of 17.6. Program conversion was found to be most effective for this dense message pattern.
6.4 Converting Mesh Programs
In this section, we present the conversion of mesh programs to run on the OMP. We emphasize mapping computational tasks to processors and vectorizing communication steps by allocating mailboxes to shared-memory modules. Reductions in communication complexity are not analyzed here; they can be determined analogously to the hypercube program conversion.
6.4.1 Mesh with Boundary Wrap-around
We consider mapping a mesh with m nodes onto an n-processor OMP. For m = n, the mapping is straightforward: there is a one-to-one mapping between the mesh nodes and the OMP processors. Memory modules Mij, 0 ≤ i, j ≤ n − 1, are used as mailboxes for communication between processors Pi and Pj.

Figure 6.4: Comparison of timing complexities for two problems on different hypercube, orthogonal multiprocessor, and crossbar-connected multiprocessor configurations: (a) multiplying two (1024 x 1024) matrices (all-to-all broadcast); (b) transposing a (1024 x 1024) matrix (all-to-all personalized). The bars show computation and communication; H16 = 16-processor hypercube, O = OMP, C = CCM.

Conversion of mesh programs for m = n^2 is interesting. Consider an (n x n) mesh with wrap-around (torus) interconnections. Four primitive communication steps are identified: east, west, north, and south shifts with rotation. The n^2 nodes are first grouped into n clusters, either by rows or by columns. Each cluster is allocated to a single processor of the OMP. Figure 6.5(a) shows an example of clustering by columns: the 16 nodes are grouped into 4 clusters and allocated to the 4 processors of the OMP.
If the nodes belonging to a column (or row) are grouped into a single cluster, all intra-column (or intra-row) communication reduces to intra-cluster communication on the OMP. The other communication patterns are implemented as inter-cluster communication. With the example shown, all north and south communications reduce to intra-cluster communication, while the east and west communication steps are implemented as inter-cluster communication.
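The ownership rule behind this clustering can be sketched in a few lines of C (our own illustration with hypothetical helper names, not code from the thesis): with column clustering, node (r, c) belongs to processor c, so vertical neighbours share an owner and horizontal neighbours do not.

```c
/* Column-oriented clustering of an (n x n) torus on an n-processor OMP:
   mesh node (r, c) is owned by processor c.  North/south neighbours then
   share an owner (intra-cluster); east/west neighbours do not. */
int owner_of(int r, int c, int n) { (void)r; (void)n; return c; }

int north_is_intra(int r, int c, int n)
{
    int rn = (r - 1 + n) % n;               /* north neighbour row */
    return owner_of(r, c, n) == owner_of(rn, c, n);
}

int east_is_intra(int r, int c, int n)
{
    int ce = (c + 1) % n;                   /* east neighbour column */
    return owner_of(r, c, n) == owner_of(r, ce, n);
}
```

For the (4 x 4) example of Figure 6.5(a), every north/south hop is intra-cluster and every east/west hop crosses clusters.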
The similarity of the mesh structure to the orthogonally-connected memory organization leads to a simplified scheme for implementing these inter-cluster communication steps. Similar to the IRW step discussed in Section 2.4, we introduce an Interleaved-Write-Read (IWR) access: an interleaved write followed by an interleaved read constitutes this access. Fig. 6.5(b) illustrates an example of implementing a +2 east communication in two IWR steps. Let each mesh node contain a single datum identified by its node number. In column-oriented clustering, each orthogonal processor contains the n data associated with its column. During the first IWR step, the processors, in parallel, perform an interleaved-write column access followed by an interleaved-read row access. We identify these interleaved accesses as column-write and row-read accesses, respectively. During the second step, the processors perform a row-write access followed by a column-read access. During this row-write access, on-the-fly indexing is used to manipulate the data based on the desired east or west communication.
Figure 6.5: Converting a (4 x 4) mesh with wrap-around connections onto a 4-processor OMP (P = processor node): (a) grouping the nodes in a column into a cluster; (b) implementing a +2 east communication in two IWR steps (column-write and row-read, then row-write and column-read).
This vectorized communication allows k east or west communication steps, 1 ≤ |k| ≤ n − 1, to be implemented with two IWR steps. During the column-write access in the first IWR step, the processors also have the flexibility to use on-the-fly indexing. This allows up to k column and k row operations, 1 ≤ |k| ≤ n − 1, to be combined in two IWR steps.
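The two IWR steps can be modeled with ordinary arrays (a simplified sketch of our own, not the simulator code; mem stands in for the two-dimensional orthogonal memory, and the rotated store models the on-the-fly index manipulation):

```c
#define N 4   /* mesh side; matches the (4 x 4) example */

/* Emulating a +k east shift on an (N x N) torus with two IWR steps.
   cluster[p][r] holds the datum of mesh node (r, p) at processor p. */
void east_shift(int cluster[N][N], int k)
{
    int mem[N][N];   /* models the 2-D orthogonal memory */
    int row[N][N];
    int p, r, c;

    /* IWR step 1: column-write, then row-read.
       Processor p writes its column p, then reads row p. */
    for (p = 0; p < N; p++)
        for (r = 0; r < N; r++)
            mem[r][p] = cluster[p][r];        /* column-write */
    for (p = 0; p < N; p++)
        for (c = 0; c < N; c++)
            row[p][c] = mem[p][c];            /* row-read */

    /* IWR step 2: row-write with on-the-fly rotation by +k,
       then column-read. */
    for (p = 0; p < N; p++)
        for (c = 0; c < N; c++)
            mem[p][(c + k) % N] = row[p][c];  /* rotated row-write */
    for (p = 0; p < N; p++)
        for (r = 0; r < N; r++)
            cluster[p][r] = mem[r][p];        /* column-read */
}
```

After the call, processor p holds the data of column (p − k) mod N: every node has received the datum of its k-th western neighbour, which is exactly a +k east shift.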
Communication vectorization also allows converting variations of mesh architectures, such as the mesh with broadcast [Bok84, Sto83], the mesh with multiple broadcast [KR87], and the generalized mesh [BA84]. Meshes with broadcast capability can be directly converted to the OMP due to the flexibility of data duplication associated with the on-the-fly index manipulator.
6.4.2 Mesh with Generalized Wrap-around
Consider a generalized mesh architecture as shown in Fig. 6.6. In addition to the regular mesh links, the architecture provides all possible interconnections between the nodes within each row and within each column. This generalized mesh architecture is a special case of the generalized hypercube developed by Bhuyan and Agrawal [BA84].

Consider partitioning an (n x n) generalized mesh into n clusters by columns, as shown in Fig. 6.6. Let these n clusters be allocated to the n processors of the OMP. The communication on the extra column links collapses to intra-cluster communication. Using the on-the-fly indexing associated with the data buffer, communication on the extra row links can be emulated in two IWR steps, as discussed in the previous subsection for the mesh. Similar results also hold for partitioning and clustering the generalized mesh by rows. Thus, an (n x n) generalized mesh program can be converted efficiently to run on an n-processor OMP using communication vectorization.
Figure 6.6: Converting a (4 x 4) generalized mesh program onto a 4-processor OMP by mapping each column of 4 nodes to run on a single processor of the OMP.
Chapter 7

Conclusions and Suggested Future Research
7.1 Summary of Research Contributions
Shared-memory multiprocessors with memory interleaving support vectorized data access between the processor and memory subsystems. Rather than restricting the use of this access to computation, our goal in this research has been to explore the possibility of supporting data manipulation and interprocessor communication using vectorized memory access. The key original contributions of this thesis are summarized below:

• Demonstrating the feasibility of memory-based vectorized interprocessor communication. Similar to vectorizing computational steps on vector supercomputers, we demonstrated that the interprocessor communication steps of a parallel program can be implemented as vector read and write steps on shared-memory multiprocessors with interleaved memory organizations. Chapter 5 emphasized this aspect.

• Demonstrating communication vectorization to be more efficient with the two-dimensional interleaved memory organization of the orthogonally-connected multiprocessor than with the one-dimensional organization of crossbar-connected and single-bus based systems. This conclusion is based on the analytical and simulation results presented in Chapter 6.

• Developing a communication vectorization procedure to convert and implement distributed-memory multicomputer programs on shared-memory multiprocessors with significant reductions in communication complexity. This procedure is described in Section 5.3.2.

• Introducing the new concept of atomic vector-read-modify-write access in multiprocessing. In Chapter 4, we demonstrated its use with an interleaved memory organization to implement efficient processor-memory data movement.

• Design and development of an on-the-fly index manipulator to work with an interleaved bus organization. This data manipulator, working with interleaved memories, implements the functionality of the generalized interconnection switch box proposed by Thompson [Tho78], as presented in Chapters 3 and 4.

• Development of a new hardware-based approach to fast data movement and manipulation in shared-memory multiprocessors, demonstrating a two-dimensional interleaved memory organization to be more capable of memory-to-memory data movement than a one-dimensional organization. We presented theoretical, analytical, and simulation results in Chapter 4 to support this claim.

• Design and development of vector register windows to work as data buffers attached to processors. These data buffers have the potential to be used as a replacement for cache in a partially-shared or non-uniform shared-memory system. Chapter 3 discusses their organization, functionality, and potential.
7.2 Suggestions for Future Research
No research is ever complete. During the course of this research, we have encountered several interesting problems, some of which have been addressed in this thesis. At this stage, we provide a list of suggestions for future research. Some of these can be treated as a continuation of the present work; the remaining ones are long-term problems that need substantial research.
A. Short-Term Problems:
• Proving the concept of communication vectorization over a large set of applications. In Chapter 6, we used representative applications involving primitive message patterns to demonstrate the feasibility and efficiency of communication vectorization. It will be interesting to observe program behavior and develop a vectorized communication model. This model can be used to predict the reduction in communication complexity for programs using mixed and irregular message patterns.
i
• Investigating the use of page-mode DRAMs to support virtual memory in ter-!
I
leaving, on-the-fly data m anipulation, and communication vectorization. A j
1 . 1
I page-mode DRAM with its page mode access supports pipelined data transfer j
similar to an interleaved bus. This indirectly leads to virtual memory inter-1
leaving, where the degree of memory interleaving is equal to the size of aj
page. Consider a multiprocessor with n processors, n buses, and n memory j
i
modules using page mode DRAMs. This architecture is significantly less com- j
plex compared to a crossbar-connected multiprocessor. We predict that data j
m anipulation and communication vectorization can be implemented on this \
multiprocessor with a performance identical to th at of a crossbar-connected j
multiprocessor. j
I
• Analyzing the feasibility and associated gain of communication vectorization on hierarchical multiprocessor systems with hierarchical buses or cluster-based organizations. Both data manipulation and communication vectorization can be implemented on these multiprocessors using local and global data exchanges.

B. Long-Term Problems:
• Using the vector-read-modify-write cycle to achieve register- or memory-based vectorized synchronization in multiprocessors. In a large-scale multiprocessor system, processors can write multiple synchronization variables to shared memory modules. These variables can be compared with existing count variables to implement fast barrier synchronization. Using a few barrier variables, this scheme will allow arbitrary many-to-many synchronization efficiently.
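The counting logic behind this proposal can be sketched for a single barrier variable (our own hypothetical C, not thesis code; the thesis would update many such counters at once with an atomic vector-read-modify-write):

```c
/* One barrier variable: initialized to the number of synchronizing
   processes; each arrival decrements the count, and the last arrival
   opens the barrier.  The decrement-and-test is assumed to be done
   atomically by the memory hardware. */
struct barrier_var {
    int count;   /* arrivals still expected */
    int open;    /* set once all processes have arrived */
};

void barrier_init(struct barrier_var *b, int nprocs)
{
    b->count = nprocs;
    b->open = 0;
}

/* returns 1 only for the arrival that releases the barrier */
int barrier_arrive(struct barrier_var *b)
{
    b->count -= 1;
    if (b->count == 0) {
        b->open = 1;
        return 1;
    }
    return 0;
}
```

With a vector-read-modify-write, one memory cycle could apply this arrival logic to a whole vector of barrier variables, which is what would make many-to-many synchronization cheap.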
• Considering the use of vector register windows and the associated data manipulator in providing vector support for RISC processors. Present-day RISC processors depend on the internal cache organization to support indirect vector computation. However, the block size of an internal cache limits the vector length. The vector register windows presented in this thesis support large vector lengths due to their reconfigurability. Hence, they can be used to alleviate the problems associated with internal caches.
• Using the on-the-fly indexing scheme with link-based data transfer. Our proposed index manipulator in Chapters 3 and 4 indexes the data buffers associated with the processors instead of the memory modules. Hence, it can be used at any place where data transfer takes place in a streaming manner. It will be interesting to analyze the capability of this manipulator to work with multicomputers such as hypercubes, meshes, and pyramids.
• Investigating the potential of mixed-mode communication (shared-memory and message-passing) in scalable shared-memory systems. The design of scalable shared-memory systems is taking a shape in which clusters of shared-memory multiprocessors are linked through communication links to achieve scalability. Our proposed communication vectorization fits well into this research domain. The vectorization technique allows a user to start with a multicomputer program and break it into inter-cluster and intra-cluster communication steps. Inter-cluster communication steps can be implemented by message passing, and intra-cluster steps can be converted to vectorized shared-variable communication. Another alternative is to have intra-cluster steps as message passing and inter-cluster steps as vectorized communication. It will be interesting to observe the interplay between the design of scalable shared-memory systems and their communication models.
Reference List

[BA84] L. N. Bhuyan and D. P. Agrawal. Generalized Hypercube and Hyperbus Structures for a Computer Network. IEEE Transactions on Computers, C-33(4):323-333, April 1984.

[Bai87] D. H. Bailey. Vector Computer Memory Bank Contention. IEEE Transactions on Computers, C-36:293-298, March 1987.

[Bal84] R. V. Balakrishnan. The Proposed IEEE-896 Futurebus - A Solution to the Bus Driving Problem. IEEE Micro, pages 23-27, August 1984.

[BC90] I. Y. Bucher and D. A. Calahan. Access Conflicts in Multiprocessor Memories: Queuing Models and Simulation Studies. In Proc. of ACM International Conference on Supercomputing, Amsterdam, The Netherlands, pages 428-438, June 11-15, 1990.

[BM89] A. Baratz and K. McAuliffe. A Perspective on Shared-Memory and Message-Memory Architectures. In J. L. C. Sanz, editor, Opportunities and Constraints of Parallel Processing, pages 9-10. Springer-Verlag, 1989.

[Bok84] S. H. Bokhari. Finding Maximum on an Array Processor with a Global Bus. IEEE Transactions on Computers, C-33(2):133-139, February 1984.

[Bok91a] S. H. Bokhari. Complete Exchange on the iPSC-860. Technical Report 91-4, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, January 1991.

[Bok91b] S. H. Bokhari. Multiphase Complete Exchange on a Circuit Switched Hypercube. Technical Report 91-5, Institute for Computer Applications in Science and Engineering, NASA Langley Research Center, January 1991.

[CE90] S. Chittor and R. Enbody. Performance Evaluation of Mesh-Connected Wormhole-Routed Networks for Interprocessor Communication in Multicomputers. In Proceedings of Supercomputing '90, New York, pages 647-656, November 1990.
[Gha89] K. Gharachorloo. Towards More Flexible Architectures. In J. L. C. Sanz, editor, Opportunities and Constraints of Parallel Processing, pages 49-53. Springer-Verlag, 1989.

[HC91] K. Hwang and C. M. Cheng. Simulated Performance of a RISC-Based Multiprocessor with Orthogonal-Access Memory. Journal of Parallel and Distributed Computing, September 1991.

[HDP+90] K. Hwang, M. Dubois, D. K. Panda, S. Rao, S. Shang, A. Uresin, W. Mao, H. Nair, M. Lytwyn, F. Hsieh, J. Liu, S. Mehrotra, and C. M. Cheng. OMP: A RISC-based Multiprocessor using Orthogonal-Access Memories and Multiple Spanning Buses. In Proc. of ACM International Conference on Supercomputing, Amsterdam, The Netherlands, pages 7-22, June 1990.

[HJ89] C. T. Ho and S. L. Johnsson. Optimum Broadcasting and Personalized Communication in Hypercubes. IEEE Transactions on Computers, 38(9):1249-1268, September 1989.

[HK88] K. Hwang and D. Kim. Generalization of Orthogonal Multiprocessor for Massively Parallel Computations. In Proceedings of the Conference on Frontiers of Massively Parallel Computations, Fairfax, Virginia, October 1988.

[HP91] K. Hwang and D. K. Panda. The USC Orthogonal Multiprocessor for Image Understanding. In V. K. Prasanna Kumar, editor, Parallel Architectures and Algorithms for Image Understanding. Academic Press, 1991.

[HTK89] K. Hwang, P. S. Tseng, and D. Kim. An Orthogonal Multiprocessor for Parallel Scientific Computations. IEEE Transactions on Computers, C-38(1):47-61, January 1989.

[Kar87] A. H. Karp. Programming for Parallelism. IEEE Computer, pages 43-57, May 1987.

[KR87] V. K. Prasanna Kumar and C. S. Raghavendra. Array Processor with Multiple Broadcasting. Journal of Parallel and Distributed Computing, 4:173-190, 1987.

[Kum88] Manoj Kumar. Supporting Broadcast Connections in Benes Networks. Technical Report RC 14063, IBM Research, May 1988.

[LA81] B. Lint and T. Agerwala. Communication Issues in the Design and Analysis of Parallel Algorithms. IEEE Transactions on Software Engineering, SE-7(2):174-188, March 1981.
[LEN90] Y. Lan, A. H. Esfahanian, and L. M. Ni. Multicast in Hypercube Multiprocessors. Journal of Parallel and Distributed Computing, 8:30-41, 1990.

[LN90] X. Lin and L. M. Ni. Multicast Communication in Multicomputer Networks. In Proc. International Conference on Parallel Processing, pages III:114-118, 1990.

[LS90] S. Lee and K. G. Shin. Interleaved All-to-All Reliable Broadcast on Meshes and Hypercubes. In Proceedings of the International Conference on Parallel Processing, pages III:110-113, August 1990.

[Map90] C. Maples. A High-Performance, Memory-Based Interconnection System for Multicomputer Environments. In Proceedings of Supercomputing '90, New York, pages 295-304, November 1990.

[MCH+90] S. Mehrotra, C. M. Cheng, K. Hwang, M. Dubois, and D. K. Panda. Algorithm-Driven Simulation and Projected Performance of the USC Orthogonal Multiprocessor. In Proc. of ICPP, St. Charles, IL, pages III:244-253, August 1990.

[NS81] D. Nassimi and S. Sahni. Data Broadcasting in SIMD Computers. IEEE Transactions on Computers, C-30(2):101-106, February 1981.

[OD90] M. T. O'Keefe and H. G. Dietz. Hardware Barrier Synchronization: Static Barrier MIMD (SBM). In Proceedings of the International Conference on Parallel Processing, pages I:35-42, August 1990.

[PH90] D. K. Panda and K. Hwang. Reconfigurable Vector Register Windows for Fast Matrix Manipulation on the Orthogonal Multiprocessor. In Proceedings of the International Conference on Application Specific Array Processors, Princeton, New Jersey, pages 202-213, September 5-7, 1990.

[PH91a] D. K. Panda and K. Hwang. Communication Vectorization in Multiprocessors with Interleaved Shared Memories. IEEE Transactions on Parallel and Distributed Systems, 1991 (under review).

[PH91b] D. K. Panda and K. Hwang. Fast Data Manipulation in Multiprocessors Using Parallel Pipelined Memories. Journal of Parallel and Distributed Computing, Special Issue on Shared-Memory Systems, pages 130-145, June 1991.

[PH91c] D. K. Panda and K. Hwang. Message Vectorization for Converting Multicomputer Programs to Shared-Memory Multiprocessors. In International Conference on Parallel Processing, pages I:204-211, August 1991.
[PHRH90] D. K. Panda, F. Hsieh, S. Rao, and K. Hwang. OMP Processor Board Design Specification. Technical report, Laboratory of Parallel and Distributed Computing, Dept. of Electrical Engineering-Systems, Univ. of Southern California, Los Angeles, CA, March 1990.

[RK86] C. S. Raghavendra and V. K. Prasanna Kumar. Permutations on Illiac IV-Type Networks. IEEE Transactions on Computers, C-35(7):662-669, July 1986.

[Sch86] H. D. Schwetman. CSIM: A C-Based, Process-Oriented Simulation Language. In Proceedings of the 1986 Winter Simulation Conference, pages 387-396, 1986.

[SM89] I. D. Scherson and Y. Ma. Analysis and Applications of the Orthogonal Access Multiprocessor. Journal of Parallel and Distributed Computing, 7(2):232-255, October 1989.

[Sto83] Q. F. Stout. Mesh-Connected Computers with Broadcasting. IEEE Transactions on Computers, C-32(9):826-830, September 1983.

[Tho78] C. D. Thompson. Generalized Connection Networks for Parallel Processor Intercommunication. IEEE Transactions on Computers, C-27(12):1119-1125, December 1978.

[WB91] K. H. Warren and E. D. Brooks. Gaussian Elimination: A Case Study on Parallel Machines. In Compcon, pages 57-61, 1991.

[YTL87] P. C. Yew, N. F. Tzeng, and D. H. Lawrie. Distributed Hot-Spot Addressing in Large-Scale Multiprocessors. IEEE Transactions on Computers, pages 388-395, April 1987.

[Zho90] J. X. Zhou. A Parallel Computer Model Supporting Procedure-Based Communication. In Proceedings of Supercomputing '90, New York, pages 286-294, November 1990.
Appendix A

Architecture Modeling CSIM Macros:
1. Bus-based Crossbar-Connected Multiprocessor and Orthogonally-Connected Multiprocessor
2. Hypercube multicomputer
CSIM Modeling macros for OMP and CCM

#include <stdio.h>
#include "csim.h"

/*
   Time Constants - Worst-case values
   Units of simulated time = microseconds
*/
#define LM_ACCESS      0.150  /* Local Memory access time */
#define SCAL_ACC_COST  1.0    /* Scalar access */
#define INCR_ACC_COST  0.050  /* Time for accessing successive items after a FIXED_ACC_COST */
#define FIXED_ACC_COST 0.800  /* fixed memory access cost */
#define ACC_COST       0.2    /* memory read/write access cost */
#define BLK_ACC_COST   0.2    /* cost for changing memory address in block vector access */
#define SBUS_BRDCST    0.300  /* Time for doing an SBUS broadcast; reflects hardware synchronization */
#define RISC_OP        0.0303 /* Time for int or fp RISC op on i860 (33 MIPS rating for 40 MHz chip) */
#define BUF_ACC        0.050  /* Time to read (write) one datum from (to) the processor board data buffer or communication data buffer */
/* CSIM Representation for OMP components and machine state */
FACILITY om[NCPUS*NCPUS];  /* orthogonal memory */
FACILITY s_bus;            /* synchronization bus */
FACILITY r_bus[NCPUS];     /* row buses for crossbar-connected multiprocessor */
EVENT row_mode[NCPUS];     /* EVENT var for row-access mode */
EVENT col_mode[NCPUS];     /* EVENT var for col-access mode */
EVENT g_done;              /* global EVENT decl to detect end */

#define CNTSIZ 16          /* Number of synchronization counters */

/* Array of pointers to synchronization semaphore structure */
struct t_COUNTER {
    int c_num;             /* number of processes to be synch'ed */
    int init;              /* flag to show initialization status */
} *cntr[CNTSIZ];

EVENT c_set[CNTSIZ];       /* EVENT variable to enforce synch */
/* Simulator variables */
int act, i;
float ts1[NCPUS], ts2[NCPUS];
float tc1[NCPUS], tc2[NCPUS];
float tsum;
float t_hold[NCPUS];       /* variable for consolidating holds */

#ifdef SUN4
#define CLOCK_ON  "!#clock_on"
#define CLOCK_OFF "!#clock_off"
#endif
#ifdef SUN3
#define CLOCK_ON  "|#clock_on"
#define CLOCK_OFF "|#clock_off"
#endif

/* structure for accumulating execution statistics for a process
   (in addition to CSIM's built-in mechanisms) */
struct pstat {
    long num_om;           /* orthogonal memory accesses */
    long num_lm;           /* local memory accesses */
    long num_buf;          /* data buffer accesses */
    long num_inst;         /* number of RISC instructions executed */
    long num_inst_comp;    /* number of inst used for computation */
    long num_inst_comm;    /* number of inst used for communication */
    long num_inst_sync;    /* number of inst used for synchronization */
    long num_synch;        /* number of synchronizations done */
    float p_time;          /* total time spent by the processor */
    float p_synctim;       /* process synchronization time */
    float p_commtim;       /* process communication time */
    float p_comptim;       /* process computation time */
    float p_wtcommtim;     /* process write communication time */
    float p_rdcommtim;     /* process read communication time */
} pstat_ar[NCPUS];

/* CSIM macros - Verbs for simulation only */

/* access Data Buffer for single datum on proc x */
#define _ACC_BUF(x) asm(CLOCK_OFF); \
    if (NCPUS > 1) { \
        hold(BUF_ACC); \
        pstat_ar[x].p_comptim += BUF_ACC; \
        pstat_ar[x].num_inst_comp += 1; \
        pstat_ar[x].num_buf += 1; } \
    else { \
        hold(LM_ACCESS); \
        pstat_ar[x].p_comptim += LM_ACCESS; \
        pstat_ar[x].num_inst_comp += 1; \
        pstat_ar[x].num_lm += 1; } \
    asm(CLOCK_ON)
/* access Local Mem for processor x */
#define _ACC_LM(x) asm(CLOCK_OFF); \
    hold(LM_ACCESS); \
    pstat_ar[x].p_comptim += LM_ACCESS; \
    pstat_ar[x].num_lm += 1; \
    pstat_ar[x].num_inst_comp += 1; \
    asm(CLOCK_ON)

/* account for Local Mem data fetch */
#define _FETCH_LM(x) asm(CLOCK_OFF); \
    t_hold[x] += LM_ACCESS; \
    pstat_ar[x].p_comptim += LM_ACCESS; \
    pstat_ar[x].num_lm += 1; \
    pstat_ar[x].num_inst_comp += 1; \
    asm(CLOCK_ON)

/* account for VRW data fetch */
#define _FETCH_VRW(x) asm(CLOCK_OFF); \
    t_hold[x] += BUF_ACC; \
    pstat_ar[x].p_comptim += BUF_ACC; \
    pstat_ar[x].num_buf += 1; \
    pstat_ar[x].num_inst_comp += 1; \
    asm(CLOCK_ON)

/* advance simulation time by n to account for compute activity on proc x */
#define _ADV_TM(n,x) asm(CLOCK_OFF); \
    t_hold[x] += n*RISC_OP; \
    pstat_ar[x].p_comptim += n*RISC_OP; \
    pstat_ar[x].num_inst_comp += n; \
    asm(CLOCK_ON)

/* advance simulation time by n to account for communication activity on proc x */
#define _ADV_TMC(n,x) asm(CLOCK_OFF); \
    t_hold[x] += n*RISC_OP; \
    pstat_ar[x].p_commtim += n*RISC_OP; \
    pstat_ar[x].num_inst_comm += n; \
    asm(CLOCK_ON)

/* advance simulation time by n to account for synchronization activity on proc x */
#define _ADV_TMS(n,x) asm(CLOCK_OFF); \
    t_hold[x] += n*RISC_OP; \
    pstat_ar[x].p_synctim += n*RISC_OP; \
    pstat_ar[x].num_inst_sync += n; \
    asm(CLOCK_ON)
/* increment index i for synchronization modulo CNTSIZ */
#define _INC(i) asm(CLOCK_OFF); \
    i = (i == (CNTSIZ - 1)) ? 0 : i + 1; \
    asm(CLOCK_ON)

/* initiate simulation run with model name mn */
#define _SIM_INIT(mn) asm(CLOCK_OFF); \
    sim() { \
        int $$idxx; \
        create("sim"); \
        set_model_name("mn"); \
        csim_init(); \
        asm(CLOCK_ON)
#define _REPORT report(); \
    omp_report(); \
    mdlstat(); \
    omp_sum_commtime()

omp_sum_commtime()
{
    int $$idxx;
    tsum = 0.0;
    for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) {
        tsum += pstat_ar[$$idxx].p_wtcommtim;
    }
    tsum += pstat_ar[NCPUS-1].p_rdcommtim;
    printf("\n\n Total communication time\n");
    printf("%12.3f", tsum);
}
115
ccm_sum_commtime()
<
int $$idxx;
tsum = 0.0;
for ($$idxx = 0; $$idxx < NCPUS; $$idxx++)
{
tsum += pstat ar[$$idxx].p wtcommtim;
}
tsum += pstat_ar[NCPUS-l].p_rdcommtim;
tsum += pstat_ar[0].p_rdcommtim;
tsum += ((4 * SIZE) + 2) * SBUS_BRDCST;
printf(“ \n\n Total communication time\n” );
printf<“ %12.3f’, tsum);
omp_report()
{
    int $$idxx;
    for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) {
        pstat_ar[$$idxx].num_inst = pstat_ar[$$idxx].num_inst_comp +
                                    pstat_ar[$$idxx].num_inst_comm +
                                    pstat_ar[$$idxx].num_inst_sync;
        pstat_ar[$$idxx].p_time = pstat_ar[$$idxx].p_comptim +
                                  pstat_ar[$$idxx].p_commtim +
                                  pstat_ar[$$idxx].p_synctim;
    }
    printf("\n\n--------------------- Execution Statistics ---------------------");
    printf("\n\n PID total_inst comp_inst comm_inst sync_inst om_acc lm_acc buf_acc no_synch\n");
    for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) {
        printf("%3d %8d %10d %8d %9d %8d %8d %6d %6d \n", $$idxx,
               pstat_ar[$$idxx].num_inst,
               pstat_ar[$$idxx].num_inst_comp,
               pstat_ar[$$idxx].num_inst_comm,
               pstat_ar[$$idxx].num_inst_sync,
               pstat_ar[$$idxx].num_om,
               pstat_ar[$$idxx].num_lm,
               pstat_ar[$$idxx].num_buf,
               pstat_ar[$$idxx].num_synch);
    }
    printf("\n");
    printf("\n\n PID total_time comp_time synch_time comm_time wtcomm_time rdcomm_time\n");
    for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) {
        printf("%3d %13.3f %13.3f %12.3f %12.3f %12.3f %12.3f\n", $$idxx,
               pstat_ar[$$idxx].p_time,
               pstat_ar[$$idxx].p_comptim,
               pstat_ar[$$idxx].p_synctim,
               pstat_ar[$$idxx].p_commtim,
               pstat_ar[$$idxx].p_wtcommtim,
               pstat_ar[$$idxx].p_rdcommtim);
    }
    printf("\n");
}
/* CSIM macros - Verbs */

/* create a parallel code segment named "k" for execution on i860 */
#define _CRT_TSK(k) asm(CLOCK_OFF); \
    create("k"); \
    asm(CLOCK_ON)

/* terminate parallel code segment on processor pn, record simulation time
   for the process, and set the global completion flag if needed */
#define _END_TSK(pn) asm(CLOCK_OFF); \
    hold(t_hold[pn]); \
    t_hold[pn] = 0.0; \
    act--; \
    if (act == 0) set(g_done); \
    asm(CLOCK_ON)
/* set processor x to row-mode; count instructions and expend time. "set" or
"clear" = 2 i860 instructions */
#define SET_ROWACC(x) asm(CLOCK_OFF); \
if (NCPUS > 1){ \
set(row_mode[x]); \
_ADV_TMC(2,x); \
clear(col_mode[x]); \
_ADV_TMC(2,x);} \
asm(CLOCK_ON)
/* set processor x to col-mode; count instructions and expend time. "set" or
"clear" = 2 i860 instructions. */
#define SET_COLACC(x) asm(CLOCK_OFF); \
if (NCPUS > 1){ \
set(col_mode[x]); \
_ADV_TMC(2,x); \
clear(row_mode[x]); \
_ADV_TMC(2,x);} \
asm(CLOCK_ON)
/* clear processor x from row-mode */
#define CLR_ROWACC(x) asm(CLOCK_OFF); \
if (NCPUS > 1){ \
clear(row_mode[x]); \
_ADV_TMC(2,x);} \
asm(CLOCK_ON)
/* clear processor x from col-mode */
#define CLR_COLACC(x) asm(CLOCK_OFF); \
if (NCPUS > 1){ \
clear(col_mode[x]); \
_ADV_TMC(2,x);} \
asm(CLOCK_ON)
/* set semaphore structure #k's counter to n and clear assoc EVENT flag
if no one else has done so yet (init = 0), then signal this fact (by setting init = 1).
Do this for process x. */
#define TSETCNT(k, n, x) asm(CLOCK_OFF); \
if (NCPUS > 1){ \
int $$i; \
hold(SBUS_REQ+t_hold[x]); \
t_hold[x] = 0.0; \
pstat_ar[x].num_inst_sync += 1; \
pstat_ar[x].p_synctim += RISC_OP; \
$$i = reserve(s_bus); \
hold(SBUS_RMW); \
pstat_ar[x].num_inst_sync += 2; \
pstat_ar[x].p_synctim += 2*RISC_OP; \
if (cntr[k]->init == 0) { \
hold(SBUS_RMW); \
pstat_ar[x].num_inst_sync += 2; \
pstat_ar[x].p_synctim += 2*RISC_OP; \
cntr[k]->c_num = n; \
pstat_ar[x].num_inst_sync += 2; \
pstat_ar[x].p_synctim += 2*RISC_OP; \
clear(c_set[k]); \
cntr[k]->init = 1; \
} ; \
release(s_bus); \
} ; \
asm(CLOCK_ON)
/* synchronize processes (or processors) on synch structure #k. Once synchronization
completes, this structure will be re-initialized and available for reuse (init = 0, c_set[k] cleared).
To be used for process x. "hold(SBUS_REQ)" = 1 i860 instruction (stretched); instruction count
for "wait(c_set)" is computed by timing the wait instruction counting during
synchronization time. */
#define SYNCH(k,x) asm(CLOCK_OFF); \
if (NCPUS > 1) \
{ \
int $$i; \
hold(SBUS_REQ+t_hold[x]); \
t_hold[x] = 0.0; \
pstat_ar[x].num_synch += 1; \
$$i = reserve(s_bus); \
hold(SBUS_RMW); \
cntr[k]->c_num--; \
pstat_ar[x].num_inst_sync += 2; \
pstat_ar[x].p_synctim += 2*RISC_OP; \
if (cntr[k]->c_num == 0) \
{ \
set(c_set[k]); \
pstat_ar[x].num_inst_sync += 2; \
pstat_ar[x].p_synctim += 2*RISC_OP; \
hold(SBUS_BRDCST); \
pstat_ar[x].num_inst_sync += 2; \
pstat_ar[x].p_synctim += 2*RISC_OP; \
cntr[k]->init = 0; \
release(s_bus); \
} \
else \
{ \
release(s_bus); \
ts1[x] = simtime(); \
wait(c_set[k]); \
ts2[x] = simtime(); \
pstat_ar[x].p_synctim += (ts2[x] - ts1[x]); \
} \
} \
asm(CLOCK_ON)
/* simulate a column oriented scalar read operation from OM in a Crossbar-connected
multiprocessor system. Similar to PIPE_IN operation of the OMP. rb indicates requests to the
row bus 'rb' and mm indicates access to the memory module 'mm' on the row bus. */
#define CROSS_ROW_IN(x,rb,mm) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0; \
pstat_ar[x].num_inst_comm += 4; \
pstat_ar[x].p_commtim += 4*RISC_OP; \
pstat_ar[x].num_om += 1; \
reserve(r_bus[rb]); \
reserve(om[rb*NCPUS+mm]); \
hold(SCAL_ACC_COST); \
pstat_ar[x].p_commtim += SCAL_ACC_COST; \
release(om[rb*NCPUS+mm]); \
release(r_bus[rb]); \
asm(CLOCK_ON)
/* simulate a column oriented blk_scalar read operation from OM in a Crossbar-connected
multiprocessor system. Similar to CROSS_ROW_IN operation except that it reads lg data
elements from a memory module in a single access. rb indicates requests to the row bus 'rb'
and mm indicates access to the memory module 'mm' on the row bus. */
#define CROSS_ROW_BLK_IN(x,rb,mm,lg) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0; \
pstat_ar[x].num_inst_comm += 4; \
pstat_ar[x].p_commtim += 4*RISC_OP; \
pstat_ar[x].p_rdcommtim += 4*RISC_OP; \
pstat_ar[x].num_om += 1; \
tc1[x] = simtime(); \
reserve(r_bus[rb]); \
reserve(om[rb*NCPUS+mm]); \
hold(FIXED_ACC_COST); \
hold(lg*ACC_COST); \
release(om[rb*NCPUS+mm]); \
release(r_bus[rb]); \
tc2[x] = simtime(); \
pstat_ar[x].p_commtim += (tc2[x] - tc1[x]); \
pstat_ar[x].p_rdcommtim += (tc2[x] - tc1[x]); \
asm(CLOCK_ON)
/* simulate a column oriented scalar write operation to OM in a Crossbar-connected
multiprocessor system. Similar to PIPE_OUT operation of the OMP. rb indicates requests
to the row bus 'rb' and mm indicates access to the memory module 'mm' on the row bus. */
#define CROSS_ROW_OUT(x,rb,mm) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
pstat_ar[x].num_inst_comm += 4; \
pstat_ar[x].p_commtim += 4*RISC_OP; \
pstat_ar[x].num_om += 1; \
reserve(r_bus[rb]); \
reserve(om[rb*NCPUS+mm]); \
hold(SCAL_ACC_COST); \
pstat_ar[x].p_commtim += SCAL_ACC_COST; \
release(om[rb*NCPUS+mm]); \
release(r_bus[rb]); \
asm(CLOCK_ON)
/* simulate a row-read vector operation from OM in a Crossbar-connected multiprocessor
system. Similar to VECTOR_IN operation with the exception that processors can access
row buses in a permutation manner without any conflict. rb indicates requests
to the row bus 'rb'. */
#define CROSS_VECTOR_IN(x,rb) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
{ \
int $$idxx; \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
reserve(om[rb*NCPUS+0]); \
hold(FIXED_ACC_COST+ACC_COST); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST + ACC_COST); \
release(om[rb*NCPUS+0]); \
pstat_ar[x].num_om += 1; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[rb*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[rb*NCPUS+$$idxx]); \
} ; \
} ; \
asm(CLOCK_ON)
/* simulate a row-write vector operation to OM in a Crossbar-connected multiprocessor system.
Similar to VECTOR_OUT operation with the exception that processors can access row buses
in a permutation manner without any conflict. rb indicates requests to the row bus 'rb'. */
#define CROSS_VECTOR_OUT(x,rb) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
{ \
int $$idxx; \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
reserve(om[rb*NCPUS+0]); \
hold(FIXED_ACC_COST+ACC_COST); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST + ACC_COST); \
release(om[rb*NCPUS+0]); \
pstat_ar[x].num_om += 1; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[rb*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[rb*NCPUS+$$idxx]); \
} ; \
} ; \
asm(CLOCK_ON)
/* simulate a row-read block vector operation from OM in a Crossbar-connected multiprocessor
system. Similar to BLK_VECTOR_IN operation with the exception that processors can access
row buses in a permutation manner without any conflict. x=processor, rb=requests to the row
bus 'rb', l=number of vectors, NCPUS=degree of memory interleaving. */
#define CROSS_BLK_VECTOR_IN(x,rb,l) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
{ \
int $$idxx,$$idyy; \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
hold(FIXED_ACC_COST + ACC_COST); \
pstat_ar[x].p_commtim += FIXED_ACC_COST + ACC_COST; \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[rb*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[rb*NCPUS+$$idxx]); \
} \
pstat_ar[x].num_om += l; \
for ($$idyy=1; $$idyy < l; $$idyy++) \
{ \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[rb*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[rb*NCPUS+$$idxx]); \
} \
hold(BLK_ACC_COST); \
pstat_ar[x].p_commtim += BLK_ACC_COST; \
} ; \
} ; \
asm(CLOCK_ON)
/* simulate a row-write block vector operation to OM in a Crossbar-connected multiprocessor
system. Similar to BLK_VECTOR_OUT operation with the exception that processors can
access row buses in a permutation manner without any conflict. x=processor, rb=requests to
the row bus 'rb', l=number of vectors, NCPUS=degree of memory interleaving */
#define CROSS_BLK_VECTOR_OUT(x,rb,l) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
{ \
int $$idxx,$$idyy; \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
hold(FIXED_ACC_COST + ACC_COST); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST + ACC_COST); \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[rb*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[rb*NCPUS+$$idxx]); \
} ; \
pstat_ar[x].num_om += l; \
for ($$idyy=1; $$idyy < l; $$idyy++) \
{ \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[rb*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[rb*NCPUS+$$idxx]); \
} ; \
hold(BLK_ACC_COST); \
pstat_ar[x].p_commtim += BLK_ACC_COST; \
} ; \
} ; \
asm(CLOCK_ON)
/* simulate a BLK_SCALAR_OUT access to the OM in row-mode. 'mm' indicates the
memory module to be written into. 'lg' indicates message length in words */
#define BLK_SCALAR_OUT(x,mm,lg) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0; \
pstat_ar[x].num_inst_comm += 4; \
pstat_ar[x].p_commtim += 4 * RISC_OP; \
pstat_ar[x].num_om += 1; \
reserve(om[x*NCPUS+mm]); \
hold(FIXED_ACC_COST+lg*ACC_COST); \
release(om[x*NCPUS+mm]); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST+lg*ACC_COST); \
asm(CLOCK_ON)
/* simulate a BLK_SCALAR_IN access from the OM in column-mode. 'mm' indicates the
memory module from which data is read. 'lg' indicates message length in words */
#define BLK_SCALAR_IN(x,mm,lg) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0; \
pstat_ar[x].num_inst_comm += 4; \
pstat_ar[x].p_commtim += 4 * RISC_OP; \
pstat_ar[x].p_rdcommtim += 4 * RISC_OP; \
pstat_ar[x].num_om += 1; \
reserve(om[mm*NCPUS+x]); \
hold(FIXED_ACC_COST+lg*ACC_COST); \
release(om[mm*NCPUS+x]); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST+lg*ACC_COST); \
pstat_ar[x].p_rdcommtim += (FIXED_ACC_COST+lg*ACC_COST); \
asm(CLOCK_ON)
/* simulate a BROADCAST write to a bus from proc x in row-mode or col-mode. A BROADCAST
is defined as a vector write of a single value. For a BROADCAST, data buffer access has to be
accounted for separately. "hold(SCAL_ACC_COST)" = 1 i860 instruction (stretched); checking
state flag = 3 i860 instructions */
#define BROADCAST(x) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
{ \
int $$idxx; \
if (state(row_mode[x])==OCC) \
{ \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
pstat_ar[x].p_wtcommtim += 3*RISC_OP; \
reserve(om[x*NCPUS+0]); \
hold(FIXED_ACC_COST+ACC_COST); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST+ACC_COST); \
pstat_ar[x].p_wtcommtim += (FIXED_ACC_COST+ACC_COST); \
release(om[x*NCPUS+0]); \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
pstat_ar[x].num_om += 1; \
for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[x*NCPUS+$$idxx]); \
release(om[x*NCPUS+$$idxx]); \
} \
} \
else if (state(col_mode[x])==OCC) { \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
pstat_ar[x].p_wtcommtim += 3*RISC_OP; \
reserve(om[0*NCPUS+x]); \
hold(FIXED_ACC_COST+ACC_COST); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST+ACC_COST); \
pstat_ar[x].p_wtcommtim += (FIXED_ACC_COST+ACC_COST); \
release(om[0*NCPUS+x]); \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
pstat_ar[x].num_om += 1; \
for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[$$idxx*NCPUS+x]); \
release(om[$$idxx*NCPUS+x]); \
} ; \
} ; \
} ; \
asm(CLOCK_ON);
/* simulate a Block BROADCAST write to a bus from proc x in row-mode or col-mode. The write
consists of lg vectors. A BROADCAST is defined as a vector write of a single value. For a
BROADCAST, data buffer access has to be accounted for separately.
"hold(SCAL_ACC_COST)" = 1 i860 instruction (stretched); checking state flag = 3 i860
instructions */
#define BLK_BROADCAST(x,lg) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; { \
int $$idxx; \
if (state(row_mode[x])==OCC) \
{ \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
pstat_ar[x].p_wtcommtim += 3*RISC_OP; \
reserve(om[x*NCPUS+0]); \
hold(FIXED_ACC_COST); \
pstat_ar[x].p_commtim += FIXED_ACC_COST; \
pstat_ar[x].p_wtcommtim += FIXED_ACC_COST; \
release(om[x*NCPUS+0]); \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
pstat_ar[x].p_wtcommtim += RISC_OP; \
pstat_ar[x].num_om += lg; \
for ($$idxx=0; $$idxx < lg; $$idxx++) { \
reserve(om[x*NCPUS+0]); \
hold(ACC_COST); \
pstat_ar[x].p_commtim += ACC_COST; \
pstat_ar[x].p_wtcommtim += ACC_COST; \
release(om[x*NCPUS+0]); \
} ; \
} ; \
if (state(col_mode[x])==OCC) \
{ \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
pstat_ar[x].p_wtcommtim += 3*RISC_OP; \
reserve(om[x*NCPUS+0]); \
hold(FIXED_ACC_COST); \
pstat_ar[x].p_commtim += FIXED_ACC_COST; \
pstat_ar[x].p_wtcommtim += FIXED_ACC_COST; \
release(om[x*NCPUS+0]); \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
pstat_ar[x].num_om += lg; \
for ($$idxx=0; $$idxx < lg; $$idxx++) { \
reserve(om[x*NCPUS+0]); \
hold(ACC_COST); \
pstat_ar[x].p_commtim += ACC_COST; \
pstat_ar[x].p_wtcommtim += ACC_COST; \
release(om[x*NCPUS+0]); \
} ; \
} ; \
} ; \
asm(CLOCK_ON)
/* simulate a PIPELINED READ from proc x in row or column mode. For a VECTOR_IN, we
have to account for data buffer accesses separately. "hold(SCAL_ACC_COST)" = 1 i860
instruction (stretched); checking state flag = 3 i860 instructions */
#define VECTOR_IN(x) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
{ \
int $$idxx; \
if (state(row_mode[x])==OCC) \
{ \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
reserve(om[x*NCPUS+0]); \
hold(FIXED_ACC_COST+ACC_COST); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST+ACC_COST); \
release(om[x*NCPUS+0]); \
pstat_ar[x].num_om += 1; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[x*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[x*NCPUS+$$idxx]); \
} ; \
} \
else if (state(col_mode[x])==OCC) \
{ \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
reserve(om[0*NCPUS+x]); \
hold(FIXED_ACC_COST+ACC_COST); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST+ACC_COST); \
release(om[0*NCPUS+x]); \
pstat_ar[x].num_om += 1; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[$$idxx*NCPUS+x]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[$$idxx*NCPUS+x]); \
} ; \
} ; \
} ; \
asm(CLOCK_ON)
/* simulate a PIPELINED WRITE from proc x in row or column mode. For a VECTOR_OUT, we
have to account for data buffer accesses separately. "hold(SCAL_ACC_COST)" = 1 i860
instruction (stretched); checking state flag = 3 i860 instructions */
#define VECTOR_OUT(x) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
{ \
int $$idxx; \
if (state(row_mode[x])==OCC) { \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
reserve(om[x*NCPUS+0]); \
hold(FIXED_ACC_COST+ACC_COST); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST+ACC_COST); \
release(om[x*NCPUS+0]); \
pstat_ar[x].num_om += 1; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[x*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[x*NCPUS+$$idxx]); \
} ; \
} \
else if (state(col_mode[x])==OCC) \
{ \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
reserve(om[0*NCPUS+x]); \
hold(FIXED_ACC_COST+ACC_COST); \
pstat_ar[x].p_commtim += (FIXED_ACC_COST+ACC_COST); \
release(om[0*NCPUS+x]); \
pstat_ar[x].num_om += 1; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[$$idxx*NCPUS+x]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[$$idxx*NCPUS+x]); \
} ; \
} ; \
} ; \
asm(CLOCK_ON)
/* simulate a BLOCK PIPELINED READ from proc x in row or column mode. For a
BLK_VECTOR_IN, we have to account for data buffer accesses separately.
"hold(SCAL_ACC_COST)" = 1 i860 instruction (stretched); checking state flag = 3 i860
instructions. x=processor, lg=number of vectors, NCPUS=degree of memory interleaving. */
#define BLK_VECTOR_IN(x,lg) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; { \
int $$idxx,$$idyy; \
if (state(row_mode[x])==OCC) \
{ \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
reserve(om[x*NCPUS+0]); \
hold(FIXED_ACC_COST); \
hold(ACC_COST); \
release(om[x*NCPUS+0]); \
pstat_ar[x].p_commtim += FIXED_ACC_COST; \
pstat_ar[x].p_commtim += ACC_COST; \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[x*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[x*NCPUS+$$idxx]); \
} ; \
pstat_ar[x].num_om += lg; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
for ($$idyy=1; $$idyy < lg; $$idyy++) \
{ \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[x*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[x*NCPUS+$$idxx]); \
} ; \
hold(BLK_ACC_COST); \
pstat_ar[x].p_commtim += BLK_ACC_COST; \
} ; \
} ; \
if (state(col_mode[x])==OCC) \
{ \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
reserve(om[0*NCPUS+x]); \
hold(FIXED_ACC_COST+ACC_COST); \
release(om[0*NCPUS+x]); \
pstat_ar[x].p_commtim += FIXED_ACC_COST + ACC_COST; \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[$$idxx*NCPUS+x]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[$$idxx*NCPUS+x]); \
} ; \
pstat_ar[x].num_om += lg; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
for ($$idyy=1; $$idyy < lg; $$idyy++) \
{ \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[$$idxx*NCPUS+x]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[$$idxx*NCPUS+x]); \
} ; \
hold(BLK_ACC_COST); \
pstat_ar[x].p_commtim += BLK_ACC_COST; \
} ; \
} ; \
} ; \
asm(CLOCK_ON)
/* simulate a BLOCK PIPELINED WRITE from proc x in row or column mode. For a
BLK_VECTOR_OUT, we have to account for data buffer accesses separately.
"hold(SCAL_ACC_COST)" = 1 i860 instruction (stretched); checking state flag = 3 i860
instructions. x=processor, lg=number of vectors, NCPUS=degree of memory interleaving. */
#define BLK_VECTOR_OUT(x,lg) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
{ \
int $$idxx,$$idyy; \
if (state(row_mode[x])==OCC) \
{ \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
reserve(om[x*NCPUS+0]); \
hold(FIXED_ACC_COST); \
hold(ACC_COST); \
release(om[x*NCPUS+0]); \
pstat_ar[x].p_commtim += FIXED_ACC_COST; \
pstat_ar[x].p_commtim += ACC_COST; \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[x*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[x*NCPUS+$$idxx]); \
} ; \
pstat_ar[x].num_om += lg; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
for ($$idyy=1; $$idyy < lg; $$idyy++) \
{ \
for ($$idxx=1; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[x*NCPUS+$$idxx]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[x*NCPUS+$$idxx]); \
} ; \
hold(BLK_ACC_COST); \
pstat_ar[x].p_commtim += BLK_ACC_COST; \
} ; \
} ; \
if (state(col_mode[x])==OCC) \
{ \
pstat_ar[x].num_inst_comm += 3; \
pstat_ar[x].p_commtim += 3*RISC_OP; \
reserve(om[0*NCPUS+x]); \
hold(FIXED_ACC_COST+ACC_COST); \
release(om[0*NCPUS+x]); \
pstat_ar[x].p_commtim += FIXED_ACC_COST + ACC_COST; \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[$$idxx*NCPUS+x]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[$$idxx*NCPUS+x]); \
} ; \
pstat_ar[x].num_om += lg; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].p_commtim += RISC_OP; \
for ($$idyy=1; $$idyy < lg; $$idyy++) \
{ \
for ($$idxx=0; $$idxx < NCPUS; $$idxx++) \
{ \
reserve(om[$$idxx*NCPUS+x]); \
hold(INCR_ACC_COST); \
pstat_ar[x].p_commtim += INCR_ACC_COST; \
release(om[$$idxx*NCPUS+x]); \
} ; \
hold(BLK_ACC_COST); \
pstat_ar[x].p_commtim += BLK_ACC_COST; \
} ; \
} ; \
} \
asm(CLOCK_ON)
INITIALIZATION routines
/* routine to initialize facilities */
csim_init()
{
int i;
/* declare extra event variables if necessary */
if (NCPUS <= 64) {
i = max_events(200);
i = max_facilities(5000);
i = max_servers(5000);
}
else {
i = max_events(550);
i = max_facilities(20000);
i = max_servers(20000);
} ;
/* declare facilities and events */
s_bus = facility("s_bus");
facility_set(om,"om",NCPUS*NCPUS);
facility_set(r_bus,"r_bus",NCPUS); /* included for crossbar-connected */
g_done = event("g_done");
event_set(row_mode,"row_mode",NCPUS);
event_set(col_mode,"col_mode",NCPUS);
event_set(c_set,"c_set",CNTSIZ);
/* instantiate and initialize semaphore structures */
for (i = 0; i < CNTSIZ; i++) {
cntr[i] = (struct t_COUNTER *) malloc(sizeof(struct t_COUNTER));
cntr[i]->c_num = 0;
cntr[i]->init = 0;
} ;
/* initialize the instruction counting structures and holds array */
for (i = 0; i < NCPUS; i++) {
pstat_ar[i].num_om = 0;
pstat_ar[i].num_lm = 0;
pstat_ar[i].num_buf = 0;
pstat_ar[i].num_inst = 0;
pstat_ar[i].num_inst_comp = 0;
pstat_ar[i].num_inst_comm = 0;
pstat_ar[i].num_inst_sync = 0;
pstat_ar[i].num_synch = 0;
pstat_ar[i].p_time = 0.0;
pstat_ar[i].p_comptim = 0.0;
pstat_ar[i].p_synctim = 0.0;
pstat_ar[i].p_commtim = 0.0;
pstat_ar[i].p_wtcommtim = 0.0;
pstat_ar[i].p_rdcommtim = 0.0;
t_hold[i] = 0.0;
} ;
} /* csim-init */
/*
CSIM Modeling of Hypercube
*/
#include <stdio.h>
#include <stdlib.h>
#include "csim.h"
/*
Time Constants - Worst case values
Units of simulated time = microseconds
*/
#define LM_ACCESS 0.150 /* Local Memory access time */
#define RISC_OP 0.0303 /* Time for int or fp RISC op on i860, */
                       /* 33 Mips rating for 40MHz chip */
#define BUF_ACC 0.050 /* Time to read(write) one datum from(to) processor board */
                      /* data buf or communication data buffer */
#define START_UP 95.0 /* start-up time for communication */
#define INCR_COMM 0.394 /* increment access time */
#define DIM_COST 10.3 /* incremental cost for each dim */
#define CNTRDY 3 /* number of ready signals */
CSIM Representation for OMP components and machine state
FACILITY link[NCPUS*DIM]; /* communication links in hypercube */
EVENT g_done; /* global EVENT decl to detect end */
EVENT data_rdy[CNTRDY][NCPUS]; /* event variable to indicate data ready situation */
Simulator variables
int act, i;
float ts1[NCPUS], ts2[NCPUS];
float tsum;
float tc1[NCPUS], tc2[NCPUS];
float t_hold[NCPUS]; /* variable for consolidating holds */
#ifdef SUN4
#define CLOCK_ON "!#clock_on"
#define CLOCK_OFF "!#clock_off"
#endif
#ifdef SUN3
#define CLOCK_ON "|#clock_on"
#define CLOCK_OFF "|#clock_off"
#endif
/* structure for accumulating execution statistics for a process (in addition to CSIM's
built-in mechanisms) */
struct pstat {
long num_om; /* orthogonal memory accesses */
long num_lm; /* local memory accesses */
long num_buf; /* data buffer accesses */
long num_inst; /* number of RISC instructions executed */
long num_inst_comp; /* number of inst used for computation */
long num_inst_comm; /* number of inst used for communication */
long num_inst_sync; /* number of inst used for synchronization */
long num_synch; /* number of synchronizations done */
float p_time; /* total time spent by the processor */
float p_synctim; /* process synchronization time */
float p_commtim; /* process communication time */
float p_comptim; /* process computation time */
long num_link; /* number of link communications */
long num_bytes; /* number of bytes transferred */
} pstat_ar[NCPUS];
CSIM macros - Verbs for simulation only
/* access Local Mem for processor x */
#define ACC_LM(x) asm(CLOCK_OFF); \
hold(LM_ACCESS); \
pstat_ar[x].p_comptim += LM_ACCESS; \
pstat_ar[x].num_lm += 1; \
pstat_ar[x].num_inst_comp += 1; \
asm(CLOCK_ON)
/* account for Local Mem data fetch */
#define _FETCH_LM(x) asm(CLOCK_OFF); \
t_hold[x] += LM_ACCESS; \
pstat_ar[x].p_comptim += LM_ACCESS; \
pstat_ar[x].num_lm += 1; \
pstat_ar[x].num_inst_comp += 1; \
asm(CLOCK_ON)
/* advance simulation time by n to account for compute activity on proc x */
#define _ADV_TM(n,x) asm(CLOCK_OFF); \
t_hold[x] += n*RISC_OP; \
pstat_ar[x].p_comptim += n*RISC_OP; \
pstat_ar[x].num_inst_comp += n; \
asm(CLOCK_ON)
/* advance simulation time by n to account for communication activity on proc x */
#define _ADV_TMC(n,x) asm(CLOCK_OFF); \
t_hold[x] += n*RISC_OP; \
pstat_ar[x].p_commtim += n*RISC_OP; \
pstat_ar[x].num_inst_comm += n; \
asm(CLOCK_ON)
/* advance simulation time by n to account for synchronization activity on proc x */
#define _ADV_TMS(n,x) asm(CLOCK_OFF); \
t_hold[x] += n*RISC_OP; \
pstat_ar[x].p_synctim += n*RISC_OP; \
pstat_ar[x].num_inst_sync += n; \
asm(CLOCK_ON)
#define REPORT report(); \
hyp_report(); \
mdlstat()
hyp_sum_commtime()
{
int $$idxx;
tsum = 0.0;
for ($$idxx = 0; $$idxx < NCPUS; $$idxx++)
{
tsum += pstat_ar[$$idxx].p_commtim;
}
printf("\n\n Total communication time\n");
printf("%12.3f", tsum);
}
count_start(x)
int x;
{
tc1[x] = pstat_ar[x].p_commtim;
}
count_end(x)
int x;
{
tc2[x] = pstat_ar[x].p_commtim;
tsum += tc2[x] - tc1[x];
}
hyp_sum_comm_report()
{
printf("\n\n Total communication time\n");
printf("%12.3f", tsum);
printf("\n");
}
hyp_report()
{
int $$idxx;
for ($$idxx = 0; $$idxx < NCPUS; $$idxx++)
{
pstat_ar[$$idxx].num_inst = pstat_ar[$$idxx].num_inst_comp +
pstat_ar[$$idxx].num_inst_comm +
pstat_ar[$$idxx].num_inst_sync;
pstat_ar[$$idxx].p_time = pstat_ar[$$idxx].p_comptim +
pstat_ar[$$idxx].p_commtim +
pstat_ar[$$idxx].p_synctim;
} ;
printf("\n\n--------------------- Execution Statistics ---------------------");
printf("\n\n PID total_inst comp_inst comm_inst lm_acc buf_acc link_acc bytes_xfered\n");
for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) {
printf("%3d %8d %10d %8d %9d %8d %8d %8d \n", $$idxx,
pstat_ar[$$idxx].num_inst,
pstat_ar[$$idxx].num_inst_comp,
pstat_ar[$$idxx].num_inst_comm,
pstat_ar[$$idxx].num_lm,
pstat_ar[$$idxx].num_buf,
pstat_ar[$$idxx].num_link,
pstat_ar[$$idxx].num_bytes);
}
printf("\n");
printf("\n");
printf("\n\n PID total_time comp_time wait_time comm_time\n");
for ($$idxx = 0; $$idxx < NCPUS; $$idxx++) {
printf("%3d %13.3f %13.3f %12.3f %12.3f\n", $$idxx,
pstat_ar[$$idxx].p_time,
pstat_ar[$$idxx].p_comptim,
pstat_ar[$$idxx].p_synctim,
pstat_ar[$$idxx].p_commtim);
printf("\n");
}
}
CSIM macros - Verbs
/* create a parallel code segment named "k" for execution on i860 */
#define CRT_TSK(k) asm(CLOCK_OFF); \
create("k"); \
asm(CLOCK_ON)
/* terminate parallel code segment on processor pn, record simulation time for process, and
set global completion flag if needed */
#define END_TSK(pn) asm(CLOCK_OFF); \
hold(t_hold[pn]); \
t_hold[pn] = 0.0; \
act--; \
if (act == 0) set(g_done); \
asm(CLOCK_ON)
/* Macro to take care of hypercube link communication. Processor x sends message to
processor y on dimension d. The length of the message is lg bytes */
#define COMMUNICATE(x,y,d,lg) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].num_link += 1; \
reserve(link[x*DIM+d]); \
hold(START_UP); \
hold(lg*INCR_COMM); \
hold(DIM_COST); \
pstat_ar[x].p_commtim += START_UP; \
pstat_ar[x].p_commtim += (lg*INCR_COMM); \
pstat_ar[x].p_commtim += DIM_COST; \
pstat_ar[x].num_bytes += lg; \
release(link[x*DIM+d]); \
asm(CLOCK_ON);
/* Macro to take care of hypercube link communication in a circuit switched manner.
x=source, y=destination, lg=length of message. */
#define COMMUNICATE_CIRCUIT(x,y,lg) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
pstat_ar[x].num_inst_comm += 1; \
pstat_ar[x].num_link += 1; { \
int $$idxx,$$idyy,$$idzz,$$idww; \
$$idxx = 0; \
$$idzz = x; \
for ($$idyy=0; $$idyy<=DIM; $$idyy++) \
{ \
if ($$idzz != y) \
{ \
$$idxx += 1; \
$$idww = FROUTE_D[x][$$idzz]; \
reserve(link[$$idzz*DIM+$$idww]); \
$$idzz = FROUTE[x][$$idzz]; \
} \
} \
hold(START_UP+$$idxx*DIM_COST+lg*INCR_COMM); \
pstat_ar[x].p_commtim += START_UP + $$idxx*DIM_COST +
lg*INCR_COMM; \
pstat_ar[x].num_bytes += lg; \
$$idzz = y; \
for ($$idyy=0; $$idyy \
asm(CLOCK_ON)
/* waits for data to be received in its input buffer */
#define WAIT_DATA_RDY(y,x) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
ts1[x] = simtime(); \
wait(data_rdy[y][x]); \
ts2[x] = simtime(); \
pstat_ar[x].p_synctim += (ts2[x] - ts1[x]); \
asm(CLOCK_ON);
#define CLR_DATA_RDY(y,x) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
clear(data_rdy[y][x]); \
asm(CLOCK_ON);
#define SET_DATA_RDY(y,x) asm(CLOCK_OFF); \
hold(t_hold[x]); \
t_hold[x] = 0.0; \
set(data_rdy[y][x]); \
asm(CLOCK_ON);
INITIALIZATION routines
/* routine to initialize facilities */
csim_init()
{
int i;
/* declare extra event variables if necessary */
if (NCPUS <= 64) {
i = max_events(200);
i = max_facilities(5000);
i = max_servers(5000);
}
else
{
if (NCPUS <= 512)
{
i = max_events(2200);
i = max_facilities(20000);
i = max_servers(20000);
}
else
{
i = max_processes(2000);
i = max_events(4000);
i = max_facilities(20000);
i = max_servers(20000);
}
} ;
tsum = 0.0; /* tsum initialization */
/* declare facilities and events */
s_bus = facility("s_bus");
facility_set(om,"om",NCPUS*NCPUS);
facility_set(r_bus,"r_bus",NCPUS); /* included for crossbar-connected */
facility_set(link,"link",NCPUS*DIM); /* communication links */
g_done = event("g_done");
event_set(data_rdy,"data_rdy",CNTRDY*NCPUS);
/* instantiate and initialize semaphore structures */
for (i = 0; i < CNTSIZ; i++) {
cntr[i] = (struct t_COUNTER *) malloc(sizeof(struct t_COUNTER));
cntr[i]->c_num = 0;
cntr[i]->init = 0;
} ;
/* initialize the data_ready signals */
for (i=0; i < NCPUS; i++)
{
clear(data_rdy[0][i]);
clear(data_rdy[1][i]);
clear(data_rdy[2][i]);
}
/* initialize the instruction counting structures and holds array */
for (i = 0; i < NCPUS; i++) {
pstat_ar[i].num_om = 0;
pstat_ar[i].num_lm = 0;
pstat_ar[i].num_buf = 0;
pstat_ar[i].num_inst = 0;
pstat_ar[i].num_inst_comp = 0;
pstat_ar[i].num_inst_comm = 0;
pstat_ar[i].num_inst_sync = 0;
pstat_ar[i].num_synch = 0;
pstat_ar[i].p_time = 0.0;
pstat_ar[i].p_comptim = 0.0;
pstat_ar[i].p_synctim = 0.0;
pstat_ar[i].p_commtim = 0.0;
pstat_ar[i].num_link = 0;
pstat_ar[i].num_bytes = 0;
t_hold[i] = 0.0;
} ;
} /* csim-init */
Appendix B
Simulation Program Listings:
1. Image shifting by columns and rows on OMP
2. Image shifting by columns and rows on OMP using on-the-fly indexing
3. Image shifting by columns and rows on CCM
4. Image shifting by columns and rows on CCM using on-the-fly indexing
5. Matrix transposition and multiplication on Hypercube
6. Matrix transposition and multiplication on OMP
7. Matrix transposition and multiplication on CCM
8. Matrix row shuffle on Hypercube
9. Matrix row shuffle on OMP and CCM
10. Histogramming on Hypercube
11. Histogramming on OMP
12. Histogramming on CCM
13. Gaussian-elimination on Hypercube
14. Gaussian-elimination on OMP
15. Gaussian-elimination on CCM
Translates a PxP image on an NxN OMP without using
index manipulation. The image is shifted by 'col-shift'
number of columns and 'row-shift' number of rows.
#define NCPUS 16
#define SIZE 1024
#define CNTSZ 16
#define col_shift 5
#define row_shift 3
#include "omp_simS.h"
#define N NCPUS
#define P SIZE
#define LEN (P/N)
#define CAP (LEN*LEN)
FILE *fptr, *outptr;
int OMDI[CAP][N][N];
int OMDO[CAP][N][N];
int VRW[N][P];
int TEMP[N][P];
input_arrays()
{
int i,j,k,row,offset1,offset;
fptr = fopen("Image", "r");
for (i=0; i
_ADV_TM(5,pn);
}
C LR_RO WACC (pn);
SYNCH (cnt, pn);
SET_COLACC(pn);
cnt=(cnt+i) % CNTSZ;
_ADV_TM(8,pn);
TSETCNT(cnt,N,pn);
ADV_TM(5,pn);
for (i=0; i < LEN; i++)
(
_ADV_TM(5,pn);
for (j=0; j < LEN; j++)
{
PIPE_IN(pn);
l1 = (j*LEN)+i;
_ADV_TM(3,pn);
for (k=0; k
_ADV_TM(5,pn);
}
_ADV_TM(5,pn);
for (y1=0; y1 < row_shift; y1++)
{
_FETCH_VRW(pn);
_ADV_TM(7,pn);
TEMP[pn][y1] = 0;
_ADV_TM(5,pn);
}
y2 = P-row_shift;
_ADV_TM(2,pn);
_ADV_TM(5,pn);
for (y1=0; y1 < y2; y1++)
{
y3 = y1 + row_shift;
_FETCH_VRW(pn);
_FETCH_VRW(pn);
/* Index manipulator init */
/* loop initialization */
/* loop increment */
/* overhead for mod operation */
/* loop initialization */
/* loop initialization */
/* Index manipulator init */
/* loop increment */
/* loop initialization */
/* VRW access */
/* instr overhead for filling pattern */
/* fills up 0 */
/* loop increment */
/* overhead for subtraction */
/* loop initialization */
/* VRW access */
/* VRW access */
TEMP[pn][y3] = VRW[pn][y1];
_ADV_TM(17,pn);
_ADV_TM(5,pn);
}
_ADV_TM(5,pn);
for (j=0; j < LEN; j++)
(
PIPE_OUT(pn);
l1 = (j*LEN)+i;
_ADV_TM(3,pn);
for (k=0; k
_ADV_TM(5,pn); /* loop increment */
CLR_ROWACC(pn);
SYNCH (cnt,pn);
SET_COLACC(pn);
cnt=(cnt+1) % CNTSZ;
_ADV_TM(8,pn);
TSETCNT(cnt,N,pn);
/* loop initialization */
/* Index manipulator init */
/* loop increment */
/* loop initialization */
/* No overhead due to OIM */
/* fills up 0 */
/* loop initialization */
/* additional overhead for */
/* selecting appropriate index set */
/* Index manipulator init */
/* overhead for mod operation */
_ADV_TM(5,pn);
for (i=0; i < LEN; i++)
{
_ADV_TM(5,pn);
for (j=0; j < LEN; j++)
{
PIPE_IN(pn);
l1 = (j*LEN)+i;
_ADV_TM(3,pn);
for (k=0; k
_ADV_TM(5,pn);
for (j=0; j < LEN; j++)
{
PIPE_OUT(pn);
_ADV_TM(10,pn);
l1 = (j*LEN)+i;
_ADV_TM(3,pn);
for (k=0; k
_ADV_TM(5,pn);
}
CLR_COLACC(pn);
SYNCH (cnt,pn);
END_TSK(pn);
/* loop initialization */
/* loop initialization */
/* Index manipulator init */
/* loop increment */
/* loop initialization */
/* no overhead for OIM */
/* loop initialization */
/* additional overhead for */
/* selecting appropriate index set */
/* Index manipulator init */
/* loop initialization */
/* loop increment */
sim()
{
int i;
create("sim");
csim_init();
input_arrays();
act = N;
for (i=0; i
_ADV_TM(5,pn);
}
}
CLR_COLACC(pn);
SYNCH (cnt,pn);
/* loop increment */
/* VRW access */
/* instr overhead for filling pattern */
/* fills up 0 */
/* loop increment */
/* overhead for subtraction */
/* loop initialization */
/* VRW access */
/* VRW access */
/* overhead for duplicating */
/* loop increment */
/* Index manipulator init */
/* loop increment */
END_TSK(pn);
sim()
{
int i;
create("sim");
csim_init();
input_arrays();
act = N;
for (i=0; i
trans(pn)
int pn;
{
int ni,j,k,l1,l2,l3,lg;
int z1,z2,z3,z4,z5,z6,msg_length,dst;
create("trans");
msg_length = LEN*LEN;
_ADV_TM(7,pn);
_ADV_TM(5,pn);
for (ni=1; ni
/* raise x to n-th power; n > 0 */
/* can be kept in a look-up table */
/* so no timing overhead */
routing(p,x,y) /* determines routing links */
int p,x,y;
{
int i,j,k,pr1 ,step,s_node,d_node,tempsrc,tempdst,cj;
step = 0;
s_node = x;
d_node = x;
for(i=0; i
}
}
}
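The body of routing(p,x,y) did not survive digitization; its role in the hypercube listings is to determine the sequence of links from node x to node y. A minimal e-cube (dimension-order) routing sketch, consistent with the XOR-based destinations (dst = pn ^ ni) used by the listings, with DIM a demo constant (NCPUS = 16 would imply DIM = 4):

```c
#include <assert.h>

#define DIM 4   /* hypercube dimension; demo assumption matching 2^4 = 16 nodes */

/* Dimension-order (e-cube) routing: resolve the address bits in which src
 * and dst differ, from dimension 0 upward; each resolved bit is one hop.
 * path[] receives every node visited, starting with src; the return value
 * is the hop count, which equals the Hamming distance of src and dst. */
static int ecube_route(int src, int dst, int path[DIM + 1])
{
    int node = src, hops = 0;
    path[hops] = node;
    for (int d = 0; d < DIM; d++) {
        int bit = 1 << d;
        if ((node ^ dst) & bit) {   /* addresses differ in dimension d */
            node ^= bit;            /* traverse the link in that dimension */
            path[++hops] = node;
        }
    }
    return hops;
}
```

For example, routing from node 0 to node 15 visits 0, 1, 3, 7, 15: four hops, one per dimension. This is a sketch of the standard technique, not a reconstruction of the listing's exact body.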
mult(pn)
int pn;
{
int temp, ni, j, k, src_blk_length, dest_blk_index;
int z1, z2, z3, y1, y2, y3, msgjength, cpn, dst;
createf'mult’ );
_ADV_TM(3,pn);
copy_buffer(pn);
BIND[pn] = pn;
_ADV_TM(5,pn);
compute(pn);
_ADV_TM(8,pn);
msg_length = P*LEN;
_ADV_TM(7,pn);
_ADV_TM(5,pn);
/* computing the complement */
/* sets own identified block */
/* sets the identification */
/* initiating compute operation */
/* determining message length */
/* loop initialization */
for (ni= 1; ni < M ; ni++)
{
dst = pn ^ ni;
_ADV_TMC(7,pn); /* determining the destination */
routing (pn,pn,dst); /* determines the routing path */
COMMUNICATE_CIRCUIT(pn,dst,msg_length); /* performs the communication */
for (z2 = 0; z2
fclose(fptr);
)
init_C()
{
int i,j,k;
for (i=0; i
SYNCH(cnt,pn);
END_TSK(pn);
}
sim()
{
int i,i1;
create("sim");
csim_init();
input_arrays();
init_C();
act = N;
for (i=0; i
_ADV_TM(5,pn);
}
SYNCH(cnt,pn);
END_TSK(pn);
}
sim()
{
int i,i1;
create("sim");
csim_init();
input_arrays();
init_C();
act = N;
for (i=0; i
wait(g_done);
act = N;
for (i1=0; i1
DEST[i] = dest;
SRC[dest] = i;
}
/* determines the MSB */
/* multiplies the src address */
/* adjusts the LSB */
/* stores destination addresses */
/* stores source address */
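The comments above describe the destination computation for the row shuffle: take the MSB of the source address, multiply the address by two, and fold the MSB back in as the LSB. That is a one-bit left rotation of the log2(N)-bit address, i.e. the perfect shuffle. A minimal sketch under that reading (the surviving shuffle_determine body is truncated, so this is an interpretation of its comments, with LOGN a demo constant):

```c
#include <assert.h>

#define LOGN  4             /* address width; demo assumption for 16 processors */
#define NPROC (1 << LOGN)

/* Perfect-shuffle destination: rotate the LOGN-bit source address left by
 * one position, so the MSB becomes the LSB ("multiplies the src address"
 * then "adjusts the LSB" in the listing's comments). */
static int shuffle_dest(int src)
{
    int msb = (src >> (LOGN - 1)) & 1;          /* determines the MSB */
    return ((src << 1) & (NPROC - 1)) | msb;    /* 2*src with MSB wrapped around */
}
```

For instance, node 5 (0101) shuffles to node 10 (1010), and node 8 (1000) to node 1 (0001).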
shuffle(pn)
int pn;
{
int i,j,l1,l2,src,dst,msg_length;
create("shuffle");
msg_length = LENT;
dst = DEST[pn];
_ADV_TM(6,pn); /* overhead for initialization */
if (dst != pn)
{
routing(pn,pn,dst); /* determining routing path */
COMMUNICATE_CIRCUIT(pn,dst,msg_length); /* performs the communication */
for (i=0; i<LEN; i++) /* copies data from matrix A to Buffer */
for (j=0; j
/* raise x to n-th power; n > 0 */
/* initializing p */
/* for loop initialization */
/* loop increment */
/* returning the value */
routing(p,x,y) /* determines routing links */
int p,x,y;
{
int i,j,k,pr1,step,s_node,d_node,tempsrc,tempdst,cj;
step = 0;
s_node = x;
d_node = x;
for(i=0; i
}
sim()
{
int xi;
create("sim");
csim_init();
input_arrays();
shuffle_determine();
act = M;
for (xi=0; xi
sim()
{
int i;
create("sim");
csim_init();
input_arrays();
shuffle_determine();
act = N;
for (i=0; i
fclose(fptr);
}
init_H()
{
int i,j,k;
for (i=0; i
/* raise x to n-th power; n > 0 */
/* initializing p */
_ADV_TM(5,pr);
for (i=1; i <= n; i++)
{
p = p*x;
_ADV_TM(5,pr);
}
_ADV_TM(3,pr);
return(p);
}
routing (p,x,y)
int p,x,y;
{
int i,j,k,pr1,step,s_node,d_node,tempsrc,tempdst,cj;
step = 0;
s_node = x;
d_node = x;
for(i=0; i
}
sim()
{
int xi;
create("sim");
/* for loop initialization */
/* loop increment */
/* returning the value */
/* determines routing links */
csim_init();
input_arrays();
init_H();
act = M;
for (xi=0; xi
SET_ROWACC(pn);
cnt = (cnt+1) % CNTSZ;
_ADV_TMS(8,pn);
TSETCNT(cnt,N,pn);
length = B;
_ADV_TMC(2,pn);
BLK_SCALAR_OUT(pn,0, length);
/* loop initialization */
/* initializes histogram values */
/* overhead for initialization */
/* loop condition checking */
/* loop initialization */
/* loop initialization */
/* loop initialization */
/* histogram on local data */
/* overhead for local histogramming */
/* loop exiting */
/* loop exiting */
/* overhead for mod operation */
/* length assignment */
/* writes a block of data into a single */
_ADV_TMC(10,pn);
for (i=0; i k )
{
temp1 = A[pn][l1][k]/temp;
_ADV_TM(16,pn);
_ADV_TM(5,pn);
for (I2=k; I2=0; i-)
{
_ADV_TM(5,pn);
for (j=M-1; j>=0; j--)
{
k = (i*M ) + j;
BRDCNT[pn] = 0;
_ADV_TM(5,pn);
_ADV_TM(4,pn);
if (tpn == i)
{
_ADV_TM(4,pn);
if (k == P-1)
{
X[j][i] = B[j][i]/A[j][i][k];
temp = X[j][i];
_ADV_TM(36,pn);
}
else
{
temp1 = B[tpn][i] - PSUM[tpn][i];
temp = temp1 / A[tpn][i][k];
X[j][i] = temp;
_ADV_TM(43,pn);
}
/* loop exiting */
/* wait condition for the loop */
/* loop exiting */
/* loop initialization */
/* loop initialization */
/* initialization overhead */
/* condition check */
/* condition check */
/* determines last element */
/* data to be broadcasted */
/* computational overhead */
/* computational overhead */
_ADV_TM(4,pn);
if (k != P-1)
{
WAIT_DATA_RDY(1,tpn);
}
length =1;
count_start(tpn);
_ADV_TMC(2,pn);
_ADV_TMC(6,pn);
while (BRDCNT[tpn] < DIM)
{
bcnt = BRDCNT[tpn];
_ADV_TMC(5,pn);
brdindex = power(2,bcnt,tpn);
t1 = tpn ^ brdindex;
_ADV_TMC(4,pn);
routing(pn,pn,t1);
COMMUNICATE_CIRCUIT(pn,t1,length);
BUF[t1][0] = temp;
bcnt += 1;
BRDCNT[tpn] = bcnt;
BRDCNT[t1] = bcnt;
_ADV_TMC(13,pn);
SET_DATA_RDY(0,t1);
}
count_end(tpn);
_ADV_TM(5,pn);
for (l1=0; l1 < LEN; l1++)
{
k1 = (l1*M) + tpn;
_ADV_TM(5,pn);
_ADV_TM(5,pn);
if (k1 < k)
{
PSUM[pn][l1] += A[pn][l1][k] * temp;
_ADV_TM(29,pn);
_ADV_TM(5,pn);
for (t1=0;t1
if (k != 0)
{
SET_DATA_RDY(1 ,next);
}
}
else
{
WAIT_DATA_RDY(0,pn);
length = 1;
_ADV_TMC(2,pn);
_ADV_TMC(6,pn);
while (BRDCNT[tpn] < DIM)
{
bcnt = BRDCNT[tpn];
_ADV_TMC(5,pn);
brdindex = power(2,bcnt,tpn);
t1 = tpn ^ brdindex;
_ADV_TMC(4,pn);
routing(tpn,tpn,t1);
COMMUNICATE_CIRCUIT(pn,t1,length);
BUF[t1][0] = BUF[tpn][0];
bcnt += 1;
BRDCNT[tpn] = bcnt;
BRDCNT[t1] = bcnt;
_ADV_TMC(13,pn);
SET_DATA_RDY(0,t1);
}
x_update = BUF[pn][0];
_ADV_TM(8,pn);
_ADV_TM(5,pn);
for (l1=0; l1
_ADV_TM(5,pn); /* loop exiting */
}
WAIT_DATA_RDY(2,pn); /* wait for processors to synchronize */
}
_ADV_TM(5,pn); /* loop exiting */
}
_ADV_TM(5,pn); /* loop exiting */
}
END_TSK(pn);
}
power(x,n,pr) /* raise x to n-th power; n > 0 */
int x,n,pr;
{
int i,p;
p = 1;
for (i=1; i <= n; i++)
{
p = p*x;
}
return(p);
}
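The broadcast loops above implement recursive doubling: while BRDCNT is below DIM, each holder of the value sends across dimension bcnt to partner t1 = tpn ^ power(2, bcnt, tpn), so the set of informed nodes doubles each round. A minimal self-contained sketch of that coverage pattern (DIM is a demo constant; `has[]` stands in for the listing's BUF/SET_DATA_RDY machinery):

```c
#include <assert.h>
#include <string.h>

#define DIM   3             /* demo hypercube dimension */
#define NPROC (1 << DIM)

/* Recursive-doubling broadcast: in round r, every node that already holds
 * the value forwards it across dimension r (partner = n ^ (1 << r)), so
 * the informed set doubles each round and all NPROC nodes are covered
 * after exactly DIM rounds. Returns the number of rounds used. */
static int broadcast_rounds(int root, int has[NPROC])
{
    memset(has, 0, NPROC * sizeof(int));
    has[root] = 1;
    for (int r = 0; r < DIM; r++)
        for (int n = 0; n < NPROC; n++)
            if (has[n] && !has[n ^ (1 << r)])
                has[n ^ (1 << r)] = 1;   /* one message on a dimension-r link */
    return DIM;
}
```

This mirrors why the listing's while loop terminates once BRDCNT reaches DIM: after DIM dimension-ordered exchanges there is no uninformed node left.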
routing(p,x,y) /* determines routing links */
int p,x,y;
{
int i,j,k,pr1,step,s_node,d_node,tempsrc,tempdst,cj;
step = 0;
s_node = x;
d_node = x;
for(i=0; i
sim()
{
int xi;
create("sim");
csim_init();
input_arrays();
init_X();
act = M;
for (xi=0; xi
else
{
SYNCH (cnt.pn);
SET_COLACC(pn);
src = j;
length = P-k+1;
_ADV_TMC(7,pn);
BLK_SCALAR_IN(pn,j,length);
for (l1=0; l1
{
temp1 = A[pn][l1][k] / temp;
_ADV_TM(16,pn);
_ADV_TM(5,pn);
for (I2=k; I2=0; i-)
_ADV_TM(5,pn);
for (j=N-1; j>=0; j--)
{
k = (i*N) + j;
_ADV_TM(5,pn);
_ADV_TM(4,pn);
if (tpn ==j)
{
_ADV_TM(4,pn);
if (k == P-1)
{
X[j][i] = B[j][i]/A[j][i][k];
temp = X[j][i];
_ADV_TM(36,pn);
}
else
{
temp1 = B[tpn][i] - PSUM[tpn][i];
temp = temp1 / A[tpn][i][k];
X[j][i] = temp;
_ADV_TM(43,pn);
}
BROADCAST(pn);
for (l1=0; l1
fclose(fptr);
}
init_X()
{
int i,j;
for (i=0; i < N; i++)
for (j=0; j < LEN; j++)
{
X[i][j] = 0.0;
PSUM[i][j] = 0.0;
}
}
matrix_array_mul()
{
int i,j,k,pe,ind;
float temp;
for (i=0; i < LEN; i++)
for (j=0; j < N; j++)
{
temp = 0.0;
for (k=0; k < P; k++)
{
ind = k/N;
pe = k%N;
temp += A[j][i][k] * X[pe][ind];
}
RB[j][i] = temp;
}
}
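In matrix_array_mul above, global vector index k is looked up on processor pe = k % N at local slot ind = k / N, i.e. the solution vector is distributed cyclically over the N processors. A minimal sketch of that index mapping and its inverse (NP is a demo constant standing in for the listing's N):

```c
#include <assert.h>

#define NP 4   /* number of processors; demo assumption for the listing's N */

/* Cyclic distribution used by matrix_array_mul: global element k lives on
 * processor k % NP at local index k / NP, so X[pe][ind] reassembles the
 * full vector element by element. */
static void cyclic_map(int k, int *pe, int *ind)
{
    *pe  = k % NP;
    *ind = k / NP;
}

/* Inverse mapping: (pe, ind) back to the global index. */
static int cyclic_unmap(int pe, int ind)
{
    return ind * NP + pe;
}
```

The round trip cyclic_unmap(cyclic_map(k)) = k for every k, which is what lets the sequential checker walk k from 0 to P-1 while reading distributed data.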
gauss(pn)
int pn;
{
int i,j,k,k1,l1,l2,tpn,length,src,cnt;
float temp,temp1;
cnt = 0;
tpn = pn;
_ADV_TMS(2,pn);
create("gauss");
SET_ROWACC(pn);
TSETCNT(cnt,N,pn);
_ADV_TM(5,pn);
for (i=0; i < LEN; i++)
_ADV_TM(5,pn);
/* initialization overhead */
/* initialization overhead */
for (j=0; j
{
temp1 = A[pn][l1][k]/temp;
_ADV_TM(16,pn);
_ADV_TM(5,pn);
for (I2=k; I2=0; i-)
{
_ADV_TM(5,pn);
for (j=N-1; j>=0; j--)
{
k = (i*N) + j;
_ADV_TM(5,pn);
_ADV_TM(4,pn);
if (tpn ==j)
_ADV_TM(4,pn);
if (k == P-1)
{
X[j][i] = B[j][i]/A[j][i][k];
temp = X[j][i];
_ADV_TM(36,pn);
}
else
{
temp1 = B[tpn][i] - PSUM[tpn][i];
temp = temp1 / A[tpn][i][k];
X[j][i] = temp;
_ADV_TM(43,pn);
}
BROADCAST(pn);
for (l1=0; l1
Asset Metadata
Creator
Panda, Dhabaleswar Kumar (author)
Core Title
Vectorized interprocessor communication and data movement in shared-memory multiprocessors
Contributor
Digitized by ProQuest
(provenance)
School
Graduate School
Degree
Doctor of Philosophy
Degree Program
Computer Engineering
Degree Conferral Date
1991-12
Publisher
University of Southern California
(original),
University of Southern California. Libraries
(digital)
Tag
oai:digitallibrary.usc.edu:usctheses,OAI-PMH Harvest
Format
theses
(aat)
Language
English
Permanent Link (DOI)
https://doi.org/10.25549/usctheses-oUC11255754
Unique identifier
UC11255754
Identifier
DP22828.pdf (filename)
Legacy Identifier
DP22828
Document Type
Dissertation
Internet Media Type
application/pdf
Type
texts
Source
University of Southern California Dissertations and Theses
(collection),
University of Southern California
(contributing entity)
Access Conditions
The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the author, as the original true and official version of the work, but does not grant the reader permission to use the work if the desired use is covered by copyright. It is the author, as rights holder, who must provide use permission if such use is covered by copyright.
Repository Name
University of Southern California Digital Library
Repository Location
USC Digital Library, University of Southern California, University Park Campus MC 2810, 3434 South Grand Avenue, 2nd Floor, Los Angeles, California 90089-2810, USA
Repository Email
uscdl@usc.edu