DATA STORAGE AND MOVEMENT IN SHARED AND DISTRIBUTED
MEMORY SYSTEMS

by

Kichul Kim

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

August 1991

Copyright 1991 Kichul Kim
UMI Number: DP22820

All rights reserved

INFORMATION TO ALL USERS
The quality of this reproduction is dependent upon the quality of the copy submitted.
In the unlikely event that the author did not send a complete manuscript
and there are missing pages, these will be noted. Also, if material had to be removed,
a note will indicate the deletion.

Published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author.

Dissertation Publishing
UMI DP22820
Microform Edition © ProQuest LLC.
All rights reserved. This work is protected against
unauthorized copying under Title 17, United States Code

ProQuest LLC.
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
This dissertation, written by

Kichul Kim

under the direction of his Dissertation
Committee, and approved by all its members,
has been presented to and accepted by The
Graduate School, in partial fulfillment of
requirements for the degree of

DOCTOR OF PHILOSOPHY

Dean of Graduate Studies

Date

DISSERTATION COMMITTEE

Chairperson

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90089-4015
Acknowledgements

I would like to thank Professor V. K. Prasanna Kumar, my dissertation chairman,
for his valuable help and support. I have been continuously influenced by his rigorous
attitude toward research. He has also been a continuous source of encouragement.
I also would like to thank other members of my dissertation committee: Professors
Herbert Taylor, Michel Dubois and Ming-Deh A. Huang. Their valuable advice
helped me a lot during my research.
CONTENTS

Acknowledgements

Abstract

1 Introduction
  1.1 Shared and Distributed Memory Systems
  1.2 Three-Stage Clos Networks and Benes Networks
      1.2.1 Three-Stage Clos Networks
      1.2.2 Benes Networks
      1.2.3 Self-Routing Benes Networks
  1.3 An Overview of the Dissertation

2 A Proof of the Rearrangeability of Five Stage Shuffle/Exchange Networks for N = 8
  2.1 Introduction
  2.2 An Overview of the Proof Technique
  2.3 Realizing Permutations on Five Stage Shuffle/Exchange Network
  2.4 Examples of the Routing Algorithm
  2.5 Conclusion

3 Latin Squares for Parallel Array Access
  3.1 Introduction
  3.2 Latin Squares for Parallel Array Access
  3.3 Construction of Perfect Latin Squares
  3.4 Address Generation
  3.5 Self-Routing for Perfect Latin Squares
  3.6 Efficient Access to Three-Dimensional Arrays
  3.7 Conclusion

4 An Efficient Mapping of Directed Graph Based Computations onto Hypercube Arrays
  4.1 Introduction
  4.2 An Overview of the Mapping
  4.3 Data Transport
      4.3.1 Data Routing
      4.3.2 Broadcast
      4.3.3 Summation
  4.4 Applications
      4.4.1 An Efficient Iterative Sparse Linear System Solver
      4.4.2 Neural Network Implementations
      4.4.3 Logic Simulations
  4.5 Implementations on Other Parallel Machines
  4.6 Conclusion

5 An Efficient Mapping of Directed Graph Based Computations onto Star Graphs
  5.1 Introduction
  5.2 Star Graphs
  5.3 Indexing Scheme and Mapping a Grid onto Star Graph
  5.4 An Overview of the Algorithm
  5.5 Data Transport
      5.5.1 Data Routing
      5.5.2 Simultaneous Broadcasting and Summation
  5.6 Conclusion

6 Concluding Remarks
LIST OF TABLES

5.1 A comparison of star graphs and n-cubes

LIST OF FIGURES

1.1 A shared memory system
1.2 A distributed memory system
1.3 A three-stage Clos network C(2,2,4) of size 8
1.4 Permutation P and its bipartite graph G
1.5 P_U, P_L and switch settings in the input and output stages
1.6 A Benes network B(N)
1.7 A Benes network of size 8 (B(8))
1.8 A self-routing Benes network
2.1 A five stage shuffle/exchange network for N = 8
2.2 Five stage shuffle/exchange network for N = 8
2.3 The effect of flipping the setting of a switch in the leftmost stage
2.4 Permutation P1 and its bipartite graph representation
2.5 The initial switch settings of Example 1
2.6 The final switch settings of Example 1
2.7 Permutation P2 and its bipartite graph representation
2.8 The initial switch settings of Example 2
2.9 The final switch settings of Example 2
2.10 Permutation P3 and its bipartite graph representation
2.11 The initial switch settings of Example 3
2.12 The final switch settings of Example 3
3.1 A shared memory system
3.2 Two skewing schemes
3.3 The address generation circuit for Ps
4.1 A directed graph model
4.2 An example of the initial data mapping
4.3 The routing of v_i to the leader of the ith column
4.4 v_i is broadcast within the ith column
4.5 The distribution after the transformation to row major order
4.6 The sum of products in the ith row is routed to y_i
4.7 The channel representation of a Benes network of size 8
4.8 The routing algorithm
4.9 The broadcast algorithm
4.10 An example of broadcast for 4-dimensional hypercube
4.11 The summation algorithm
4.12 An example of row summation for 4-dimensional hypercube
4.13 (a) A model of a neuron and (b) A neural network
4.14 A simple logic diagram
4.15 A directed graph for a logic circuit
5.1 A 3-star
5.2 A 4-star
5.3 Row Major Indexing Scheme for a 4-star
5.4 Row Mode for a 4-star
5.5 Column Mode for a 4-star
5.6 A sparse iteration matrix and the initial data mapping
5.7 The routing of x_k to the leader of column k
5.8 x_k is broadcast within column k of P
5.9 The distribution after the transformation to row major order
5.10 The sum of products in row k is routed to x_k
5.11 A Clos network of size n!, C(n, n, (n-1)!)
5.12 An example of grouping for simultaneous broadcasting and summation
Abstract

The main purpose of this dissertation is to develop basic techniques for efficient
data storage and data movement that can be used in a variety of algorithms in
shared and distributed memory systems. Techniques based on basic graph theory
and combinatorics that can be used in a variety of architectures and applications are
emphasized.

The first major result is a simple proof of the rearrangeability of five-stage
shuffle/exchange networks for N = 8. The proof is based on the well known proof
technique for the rearrangeability of three-stage Clos networks. The extended proof
technique gives a conceptually simple outlook. The proof is also a constructive one
leading to a routing algorithm.

The second major result is a new, powerful solution to the well known parallel
array access problem. New combinatorial objects called perfect latin squares are
introduced to solve the problem. By using perfect latin squares as skewing schemes,
we can provide conflict free access to various subsets of an N x N array including
rows, columns, diagonals and N^{1/2} x N^{1/2} subarrays. The memory utilization is
maximized for frequently used subsets. The address generation can be performed
in constant time using a small amount of circuitry. The perfect latin square is the first
skewing scheme that can provide constant time access to rows, columns, diagonals
and subarrays of an array. Furthermore, the permutations needed between processing
elements and memory modules can be realized by existing self-routing Benes
networks. The resulting memory system provides fast access to various subsets of
an array in a cost effective way.

The third major result is efficient data movement techniques for hypercubes and
star graphs. These data movement techniques exploit topological properties of
interconnection networks. Efficient implementations of several generic data movement
operations in hypercubes and star graphs are developed to be used in implementing
directed graph oriented computations. In particular, the routing method for three-stage
Clos networks and Benes networks is extended to provide efficient routing methods
for hypercubes and star graphs. The resulting implementation of the directed graph
oriented computations has many applications including sparse linear system solvers,
neural networks and logic simulators.
Chapter 1

Introduction

Parallel processing has been an active area of research for more than two decades and
has come out of its infancy. Many experimental parallel systems have been built and
several parallel systems are commercially available [6, 38, 104]. These systems are
attractive alternatives to conventional computing machines since they easily provide
unprecedented computing power [38, 83, 104]. It is expected that parallel processing
will be a main force in supercomputing in the near future.
It is well known that communication is the main bottleneck in parallel processing
[45, 105]. In parallel processing, tasks are divided into subtasks which are performed
on different processing elements. Hence, any nontrivial parallel processing task needs
communication for information interchange and synchronization. Extensive research
has been done to reduce the communication complexity in parallel processing. However,
many of the results are useful only for specific problems. It is of utmost
importance to develop techniques useful in reducing the communication complexity
of generic data movement operations which can be used in a variety of problems.

The main purpose of this dissertation is to develop basic techniques useful for
efficient data storage and data movement that can be used in a variety of algorithms
in shared and distributed memory systems. Techniques based on graph theory and
combinatorics that can be used in various architectures and applications are emphasized.
In particular, well known techniques on three-stage Clos networks and Benes
networks comprise a basis of this dissertation.

The rest of this chapter is organized as follows. Section 1.1 gives a brief introduction
to shared and distributed memory systems. Section 1.2 gives a detailed
introduction to three-stage Clos networks and Benes networks. Self-routing Benes
networks are also introduced in section 1.2. The techniques introduced in section 1.2
and their extensions are heavily used throughout this dissertation. Section 1.3 gives
an overview of the dissertation.
1.1 Shared and Distributed Memory Systems

There are many models for parallel computation and many ways to classify parallel
processing systems [28, 45, 85]. One useful classification method is a classification
according to memory organizations. This section briefly introduces shared memory
systems and distributed memory systems. Major problems to be solved for efficient
data movement in each system are also discussed.

Figure 1.1 shows a diagram for a shared memory system. In shared memory
systems, multiple processing elements are connected to multiple memory modules
through an interconnection network. The memory modules comprise the shared
memory of the system. Data are stored in the shared memory which is accessed by
processing elements. Main data movement occurs between the processing elements
and the shared memory through an interconnection network. The effective bandwidth
of the memory system, consisting of an interconnection network and memory
modules, should match the data processing rate of the processing elements. To
achieve this goal, normal data accesses should be done without memory conflicts
or interconnection path conflicts. Hence, efficient data movement involves memory
module assignment (data storage) and efficient data routing through the interconnection
network.
Figure 1.2 shows a diagram for a distributed memory system. In distributed
memory systems, multiple processing elements are connected through an interconnection
network. Each processing element has its own local memory. Data are
stored in local memories. Main data movement occurs between processing elements
through the interconnection network which usually has a static topology. Exploiting
the topological properties of interconnection networks is important to achieve
efficient data movements.

Figure 1.1: A shared memory system

Figure 1.2: A distributed memory system
1.2 Three-Stage Clos Networks and Benes Networks

Three-stage Clos networks and Benes networks are well known interconnection networks
which can be used as interconnection networks in shared memory systems
[20, 10, 45]. Two factors make these networks important in parallel processing.
First, they are viable candidates for interconnection networks in parallel processing
systems. Second, techniques developed for these networks can be used to solve various
problems in parallel processing as shown in this dissertation. This section gives a
detailed introduction to three-stage Clos networks and Benes networks. Self-routing
Benes networks are also introduced since they are used to make an efficient memory
system in chapter 3. The material introduced in this section will be heavily used
throughout this dissertation.
1.2.1 Three-Stage Clos Networks

A three-stage Clos network of size 8, C(2,2,4), is shown in Figure 1.3 [10]. It has
three stages. The input stage and the output stage consist of four 2 x 2 switches
each. Each 2 x 2 switch can realize a parallel connection or a crossed connection.
The middle stage consists of two 4 x 4 switches. Each 4 x 4 switch can realize all
4! permutations from its inputs to its outputs. Let the two 4 x 4 switches in the middle
stage be called switch 0 and switch 1. The connections between the input stage and the
middle stage and the connections between the middle stage and the output stage
satisfy the following property:

Property 1 Among the two outputs of each switch in the input stage, one is connected
to switch 0 in the middle stage and the other is connected to switch 1 in the middle
stage. Similarly, each switch in the output stage has one input from each of the two
switches in the middle stage.

In general, in a three-stage Clos network C(n, m, r), the input (output) stage consists
of r n x m (m x n) switches. The middle stage consists of m r x r switches.
The connections between stages have properties similar to Property 1. The switches
in each stage can realize any connection from their inputs to their outputs.

Routing permutations in multi-stage interconnection networks involves setting
switches to realize the required connections between inputs and outputs.

Definition 1 An interconnection network is said to be rearrangeable if it can realize
all possible N! permutations between inputs and outputs by rearranging its switch
settings.
4
switch 0
switch 1
Figure 1.3: A three-stage Clos network C(2,2,4) of size 8
Three-stage Clos networks are known to be rearrangeable [10]. The proof of
the rearrangeability of three-stage Clos networks C(2, 2, N/2) proceeds as follows
[11]. Given a permutation P : {0, 1, ..., N - 1} -> {0, 1, ..., N - 1} to be passed
from input to output, obtain an undirected bipartite graph G = (V1, V2, E), where
V1 = V2 = {0, 1, ..., N/2 - 1}. An edge (i, j) exists in G if and only if an input of
switch i in the input stage is to be connected to an output of switch j in the output
stage. The degree of each vertex is two and there may be multiple edges between
vertices. Thus the graph G, in general, is a bipartite multigraph. A matching M in
a bipartite graph G is a set of edges such that no two edges in M are incident on the
same vertex. The size of a matching M is the number of edges in M. A matching M
is said to be complete if the size of M is min(|V1|, |V2|), i.e., the largest possible
in G.

The basic idea of the proof of the rearrangeability of the three-stage Clos network
is to represent the permutation to be passed in terms of the bipartite graph G as
described above and obtain two disjoint complete matchings M0 and M1 of size N/2.
The connections in matching M0 will be routed through switch 0 in the middle stage
and the connections in matching M1 will be routed through switch 1. Once complete
matchings M0 and M1 are determined, the switch settings of the input and output
stages are obtained. Also, the permutations to be realized by the switches in the
middle stage are obtained. Since each of the switches in the middle stage can realize
any arbitrary permutation, the permutation from input to output can be realized
by the three-stage Clos network. As an example, Figure 1.4 shows a permutation to
be realized by the three-stage Clos network shown in Figure 1.3 and the resulting
bipartite graph G. The switch settings for the input and output stages and the
permutations to be realized by the switches in the middle stage are shown in Figure 1.5 when
we choose M0 = {(0,0), (1,1), (2,2), (3,3)} and M1 = {(0,0), (1,2), (2,3), (3,1)}.
P_U and P_L represent the permutations to be realized by switch 0 and switch 1
respectively.

Figure 1.4: The permutation P = (0 1 2 4 5 6 7 3) and its bipartite graph G

Figure 1.5: P_U, P_L and the switch settings in the input and output stages

The above argument completes a proof of the rearrangeability except that we
need to show that, given a bipartite graph G which represents the permutation
to be realized, it is always possible to choose two disjoint complete matchings M0
and M1. Existence of such matchings follows from Hall's theorem of distinct
representatives [12]. Since finding a matching of size N takes O(N) time, the time
complexity of the above routing algorithm for a three-stage Clos network of size N
is O(N). (Here f(n) = O(g(n)) if there exist positive constants c and n0 such that,
for all n > n0, f(n) <= cg(n).)
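The matching step above can be sketched in code. Since every vertex of G has degree two, G is a union of even-length cycles, and coloring the edges alternately around each cycle yields the two disjoint complete matchings. The following is an illustrative Python sketch (ours, not from the dissertation; the function name `clos_matchings` is an assumption) that reproduces the matchings of the example above.

```python
from collections import defaultdict

def clos_matchings(perm):
    """Two-color the edges of the bipartite multigraph G of a permutation on
    C(2,2,N/2): edge k joins input switch k//2 to output switch perm[k]//2.
    Every vertex has degree 2, so G is a union of even cycles; alternating
    colors around each cycle gives two disjoint complete matchings M0, M1
    (to be routed via middle switches 0 and 1)."""
    n = len(perm)
    edges = [(k // 2, perm[k] // 2) for k in range(n)]
    inc = defaultdict(list)                 # vertex -> ids of incident edges
    for k, (i, j) in enumerate(edges):
        inc[('in', i)].append(k)
        inc[('out', j)].append(k)
    color = [None] * n
    for start in range(n):
        if color[start] is not None:
            continue
        k, side, c = start, 'out', 0        # walk this cycle, alternating colors
        while color[k] is None:
            color[k] = c
            v = (side, edges[k][1] if side == 'out' else edges[k][0])
            k = next(e for e in inc[v] if e != k)   # the other edge at vertex v
            side = 'in' if side == 'out' else 'out'
            c ^= 1
    m0 = [edges[k] for k in range(n) if color[k] == 0]
    m1 = [edges[k] for k in range(n) if color[k] == 1]
    return m0, m1
```

For the permutation of Figure 1.4, `clos_matchings([0, 1, 2, 4, 5, 6, 7, 3])` recovers the matchings M0 = {(0,0), (1,1), (2,2), (3,3)} and M1 = {(0,0), (1,2), (2,3), (3,1)} chosen above.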
1.2.2 Benes Networks

The Benes network is a rearrangeable network which provides a good trade-off between
network delay and cost [11]. A Benes network of size N = 2^n, denoted by B(N),
can be defined as follows.

1. A Benes network of size 2, B(2), is a 2 x 2 switch.

2. A Benes network of size N consists of an input stage, an output stage and
two Benes networks of size N/2 in the middle. The input stage and output
stage consist of N/2 2 x 2 switches each. The connections between the input
stage and the smaller Benes networks and the connections between the smaller
Benes networks and the output stage satisfy the following property:

Property 2 Among the two outputs of each switch in the input stage, one is connected
to the upper smaller Benes network of size N/2 and the other is connected
to the lower smaller Benes network of size N/2. Similarly, each switch
in the output stage has one input from each of the two smaller Benes networks.

Figure 1.6 illustrates the above recursive definition of Benes networks. Figure 1.7
shows a Benes network of size 8. Notice that the 3 middle stages make two Benes
networks of size 4. From the definition, we can see that a Benes network of size
N has 2 log N - 1 stages. (Throughout this dissertation, if not otherwise mentioned,
all logarithms are to base 2.) Assuming that passing through a 2 x 2 switch takes
one unit of time, the delay of a Benes network of size N is O(log N). Each stage
of a Benes network has N/2 2 x 2 switches. Therefore, a Benes network of size N
has N log N - N/2 2 x 2 switches. This number is considerably smaller than the N^2
switches used in a crossbar network of size N. This is the main reason that Benes
networks are attractive alternatives to crossbar networks.

Figure 1.6: A Benes network B(N)

Figure 1.7: A Benes network of size 8 (B(8))

It is easy to see that Benes networks are rearrangeable. If we assume that a Benes
network of size N/2 is rearrangeable, a Benes network of size N is rearrangeable
for the same reason a three-stage Clos network is rearrangeable. We can also use
the routing method for three-stage Clos networks shown in the previous section
for routing Benes networks. By applying the routing method for three-stage Clos
networks of size N, we can get the switch settings for the switches in the input
and output stages and the permutations to be realized by the two smaller Benes
networks. The switch settings for the smaller Benes networks can be obtained by
applying the method recursively on Benes networks of smaller size. Since finding a
matching of size N takes O(N) time, the routing algorithm for a Benes network of
size N takes O(N log N).
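The stage and switch counts above can be checked with a few lines of arithmetic. The sketch below (illustrative Python, ours; the function name `benes_stats` is an assumption) computes both quantities and compares them against the N^2 crosspoints of a crossbar.

```python
import math

def benes_stats(N):
    """Stage and switch counts for a Benes network B(N), N = 2^n:
    2 log N - 1 stages, each containing N/2 2x2 switches."""
    n = int(math.log2(N))
    stages = 2 * n - 1                  # network depth in 2x2 switch stages
    switches = (N // 2) * stages        # = N log N - N/2
    return stages, switches

# B(8): 5 stages and 20 switches, versus 64 crosspoints in an 8x8 crossbar
assert benes_stats(8) == (5, 20)
```

For example, B(16) has 7 stages and 56 switches, while a 16 x 16 crossbar needs 256 crosspoints; the gap widens as N grows.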
1.2.3 Self-Routing Benes Networks

Benes networks are attractive since they provide rearrangeability at a reasonable
cost. However, the O(N log N) time complexity of the routing algorithm is considerably
larger than the O(log N) network delay time. To overcome this situation, many researchers
proposed self-routing schemes for Benes networks [13, 64, 73]. Self-routing
Benes networks do not perform a routing algorithm to set up the switches in the network.
Switch setting is done on the fly using only target address information, thus
eliminating the set-up time.

Figure 1.8 shows how a bit-reversal permutation can be routed on a self-routing
Benes network [73]. The state of a switch in stage b or stage 2n - 2 - b, 0 <= b <= n - 1,
is determined by bit b of the destination tag of its upper input. If bit b is 0, the
switch is set straight. Otherwise, the switch is set crossed.
Figure 1.8: A self-routing Benes network
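The switch-setting rule can be checked by simulation. The recursive sketch below is an illustrative Python rendering (ours, not from [73]); it assumes bit 0 is the least significant bit of the destination tag and a standard recursive wiring (port 0 to the upper subnetwork, port 1 to the lower). It pushes destination tags through B(N), setting each switch from bit b of its upper input's tag, and confirms that the bit-reversal permutation of Figure 1.8 is delivered correctly for N = 8. The full Nassimi-Sahni scheme supporting the class F(n) is more involved.

```python
def route(tags, bit):
    """Route destination tags through a Benes network of size len(tags).
    Each 2x2 switch in stage b (and its mirror stage 2n-2-b) is set by
    bit `bit` of the tag on its upper input: 0 = straight, 1 = crossed.
    Returns the tags in final output-line order."""
    n = len(tags)
    if n == 2:                          # middle stage: a single 2x2 switch
        if (tags[0] >> bit) & 1 == 0:
            return tags[:]              # straight
        return [tags[1], tags[0]]       # crossed
    upper, lower = [], []
    for i in range(n // 2):             # input stage: switch i gets lines 2i, 2i+1
        a, b = tags[2 * i], tags[2 * i + 1]
        if (a >> bit) & 1:              # crossed
            a, b = b, a
        upper.append(a)                 # port 0 -> upper B(N/2)
        lower.append(b)                 # port 1 -> lower B(N/2)
    up = route(upper, bit + 1)
    lo = route(lower, bit + 1)
    out = []
    for j in range(n // 2):             # output stage: switch j gets up[j], lo[j]
        a, b = up[j], lo[j]
        if (a >> bit) & 1:              # mirror stage uses the same bit
            a, b = b, a
        out += [a, b]
    return out

# bit-reversal permutation for N = 8: input i carries tag reverse(i)
tags = [int(format(i, '03b')[::-1], 2) for i in range(8)]
assert route(tags, 0) == list(range(8))   # every tag d arrives at line d
```

Note that this on-the-fly rule realizes the bit-reversal example, but, as discussed below, no known self-routing scheme realizes all N! permutations.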
Unfortunately, existing self-routing algorithms on Benes networks cannot realize
all N! permutations. Lenfant proposed self-routing Benes networks which could support
five families of Frequently Used Bijections (FUB) [64]. Nassimi and Sahni proposed
a self-routing Benes network which supports a class of permutations denoted
F(n) [73]. F(n) contains many permutations including Lenfant's FUB and Bit-Permute-Complement
permutations (BPC). Boppana and Raghavendra recently
proposed yet another self-routing Benes network which supports a large class of permutations
called Linear-Complement permutations (LC) [13]. Formal definitions of
BPC and LC are given in chapter 3.

In chapter 3, self-routing Benes networks are adopted as interconnection networks
between processing elements and memory modules. The resulting memory systems
provide efficient access to data.
1.3 An Overview of the Dissertation

Our research can be divided into two parts: work pertaining to shared memory
systems and work pertaining to distributed memory systems. Research on shared
memory systems focuses on implementing efficient memory systems including interconnection
networks. Research on distributed memory systems focuses on efficient
data movement techniques exploiting topological properties of interconnection networks.

In chapter 2, we present a new technique to prove the rearrangeability of 5-stage
shuffle/exchange networks. The technique is an extension of the proof technique for
three-stage Clos networks shown in section 1.2. As in section 1.2, bipartite graph
matching and Hall's theorem play key roles in the proof technique. The proof is
conceptually very simple and constructive, leading to a routing algorithm.
In chapter 3, we propose a novel solution to the classical problem of parallel array
storage and access. We introduce new combinatorial objects (perfect latin squares)
to be used as skewing schemes. We present simple construction methods for perfect
latin squares. The resulting perfect latin squares have extra properties useful for
parallel array access. We also propose to use self-routing Benes networks to realize
the required permutations between processing elements and memory modules. The
resulting memory system provides efficient access to various subsets of an array in
a cost effective way.
In chapter 4, we propose a simple, efficient way of mapping solutions to problems
that can be modeled as directed graphs onto fine grain hypercube arrays. The
mapping uses m + e processing elements to map a solution whose underlying directed
graph has m nodes and e edges. The data transport problems that arise in
the mapping are solved (asymptotically) optimally. The data movement techniques
exploit topological properties of the given interconnection network. One iteration
step can be performed in O(log(m + e)) time. The mapping technique can be applied
to many problems including iterative solutions to sparse linear systems, neural
network implementations and logic simulations to result in efficient parallel implementations.
The mapping method has very small constant factors and is well suited
for implementations on fine grain hypercube arrays.
In chapter 5, we propose a simple, efficient way of mapping solutions to problems
that can be modeled as directed graphs onto star graphs. The same mapping method
used in chapter 4 is used to map directed graphs onto a star graph with (m + e)
processing elements, where m is the number of nodes and e is the number of edges.
Each iteration of the computation can be done in O(n^2) time for a star graph with
n! = m + e nodes. To solve the data transport problem arising in the mapping, new
algorithms for star graphs are developed for routing, simultaneous broadcasting and
simultaneous summation. These algorithms are based on special multi-dimensional
grids which can be emulated by star graphs without penalty in time complexity.
In particular, the well known routing methods for three-stage Clos networks and Benes
networks are modified to provide efficient routing in star graphs.

In chapter 6, we briefly summarize the work described in this dissertation. We
also address open problems and possible future research.
C hapter 2
A P ro o f o f T he R earrangeability o f F ive Stage
Shuffle/E xchange N etw orks for N — 8
Shuffle/exchange networks have been widely used in parallel processing [45, 99]. Such interconnection networks have been well studied to provide communication between processing elements and memory modules [55, 74, 82, 86, 98, 99, 110]. In this chapter, we show a simple proof of the rearrangeability of five stage shuffle/exchange networks for N = 8. The number of stages needed in the network is the same as the lower bound of 2 log N − 1. Our technique uses an extension to the basic idea used in proving the rearrangeability of three-stage Clos networks. The proof is constructive and leads to a routing algorithm to realize arbitrary permutations in the network.
2.1 Introduction
Shuffle/exchange networks consist of several stages, each stage consisting of a perfect shuffle permutation followed by N/2 2 × 2 switches. Such a network is shown in Figure 2.1 for N = 8. This network has 5 shuffle/exchange stages. A shuffle/exchange network of size N with log N stages is called an omega network [59].

An interesting problem that arises in the design of such multistage networks is how many stages of shuffle/exchange are sufficient to pass all of the N! permutations from input to output. This problem has been studied by several researchers
Figure 2.1: A five stage shuffle/exchange network for N = 8
[43, 55, 66, 82, 86, 110]. It is easy to establish a lower bound of 2 log N − 1. It is not known if 2 log N − 1 is an upper bound for all N.
By simulating parallel sorting algorithms (such as the bitonic sort or the odd-even sort), a shuffle/exchange network with O(log² N) stages can pass any permutation from input to output. Actually, this leads to a sorting network. Siegel has shown an algorithm to route any permutation on a shuffle/exchange network using 2 log² N stages [98]. Apparently, the routing problem seems to be simpler than sorting, and fewer stages may be sufficient to perform any permutation of inputs. Parker has improved the number of stages needed for rearrangeability to 3 log N [82]. Wu and Feng showed that 3 log N − 1 stages are sufficient [110]. This bound was later improved to 3 log N − 4 in [66, 87].
Several attempts have been made to prove the rearrangeability of 2 log N − 1 stage shuffle/exchange networks for small values of N. Parker showed that the 5 stage shuffle/exchange network is rearrangeable for N = 8 by extensive search [82]. Kothari et al. have shown that 3 log N − 3 stages are sufficient for N = 16 and N = 32 [55]. Recently, rearrangeability of the 5 stage shuffle/exchange network for N = 8 and an algorithm for routing were proposed in [66, 86]. In [66], the problem is solved by introducing the notion of balanced matrices and using algebraic methods to study the routing in the network. Raghavendra and Varma [86] have derived a routing
algorithm to realize any permutation by decomposing the permutation into connection sets and assigning each connection set to the switches in the middle stages. This partitioning is done so that no conflicts occur in the first two and in the last two stages.
In this chapter, we show a simple proof of the rearrangeability of 5-stage shuffle/exchange networks of size 8. Our proof method is an extension to the technique used for proving the rearrangeability of three-stage Clos networks. We partition the 5 stage shuffle/exchange network into 3 stages, with the first and last stages each having a column of 4 switches. The middle stage consists of three stages of shuffle and exchange. We imagine that the middle stage consists of two switches with 4 inputs and 4 outputs each, such that each imaginary switch has exactly one input (output) from each of the leftmost (rightmost) switches. Note that such a network is rearrangeable if each of the two (imaginary) middle switches can realize all 4! permutations from their inputs to their outputs, as shown in chapter 1. However, the (imaginary) middle switches can be shown not to be able to realize all permutations. We show that this problem can be easily handled by flipping the state of certain input and output switches.
The rest of this chapter is organized as follows. In the next section, we give an overview of the proof technique, which is an extension of the proof technique for three-stage Clos networks. Section 2.3 has the complete details of the routing and the proof of correctness of our routing method to realize arbitrary permutations in the 5 stage shuffle/exchange network of size 8. Section 2.4 gives examples of our routing algorithm.
2.2 An Overview of The Proof Technique
It is tempting to apply the rearrangeability proof technique for three-stage Clos networks to 5 stage shuffle/exchange networks. However, it is known that the 5 stage shuffle/exchange network is not topologically equivalent to the Benes network of size 8 [86]. Thus, a direct application of the idea is not possible.
Figure 2.2: Five stage shuffle/exchange network for N = 8
We will look upon the 5 stage shuffle/exchange network as shown in Figure 2.2. Note that the input shuffle has been eliminated. Since we are proving rearrangeability, the elimination of the input shuffle will not affect the rearrangeability proof. Notice that the dotted box corresponds to an omega network. In this box, suppose we define a set of inputs of the box to correspond to those inputs to the upper switch (Switch 0) in the three-stage Clos network; then we can mimic the proof of rearrangeability of the 3 stage network to prove the rearrangeability of the 5 stage shuffle/exchange network as follows.
The inputs/outputs with bold lines correspond to the upper switch (Switch 0) while the other inputs and outputs correspond to the lower switch (Switch 1). Now, given a permutation to be passed from input to output, we can represent the permutation by a bipartite graph G as described in section 1.2. From this, complete matchings M_0 and M_1 can be obtained, which lead to permutations P_U and P_L to be realized by the (imaginary) upper and lower switches in the middle dotted box. It is easy to come up with permutations P_U and P_L which can not be realized by the middle switches. In the next section we will show how the settings of some input/output switches can
be flipped to lead to alternate permutations which can be realized through the omega network. We will show that using the alternate permutations which can be realized by the omega network, the permutation from input to output can be realized by the 5 stage shuffle/exchange network.

Notice that our technique first sets the state of the switches in the left and right most stages such that the resulting permutation to be realized by the middle stages is an omega permutation. Since stages 2, 3, 4 of the 5 stage shuffle/exchange network correspond to an omega network, the permutation from inputs to outputs can be realized by the 5 stage network. This strategy also leads to a routing algorithm.
2.3 Realizing Permutations on Five Stage Shuffle/Exchange Network
In this section, we show that our approach, outlined in the previous section, can be used to realize any permutation on the 5 stage shuffle/exchange network. Recall that the middle stages were grouped into an omega network with bold and plain lines corresponding to the upper and lower switches of a three-stage Clos network. The upper switch has input set I_U = {0, 2, 5, 7} and output set O_U = {0, 1, 2, 3}. The remaining inputs/outputs correspond to the lower switch. Thus, I_L = {1, 3, 4, 6} and O_L = {4, 5, 6, 7}. Notice that these numberings correspond to the output of switches in stage 1 and the output of switches in stage 4 of the 5 stage shuffle/exchange network shown in Figure 2.1. The network from the output of stage 1 to the output of stage 4 corresponds to an omega network for N = 8.
Now, given a permutation P to be realized from input I to output O by the network in Figure 2.2, let P_U and P_L be the permutations to be realized by the upper and lower switches. Thus P_U : {0, 2, 5, 7} → {0, 1, 2, 3} and P_L : {1, 3, 4, 6} → {4, 5, 6, 7}. Let [P_U] and [P_L] denote these permutations in matrix form, where
[P_U; P_L] =
i1 000
i2 001
i3 010
i4 011
i5 100
i6 101
i7 110
i8 111

where each input i_k is written as a 3-bit string and i5, i6, i7, i8 ∈ I_L. Notice that [P_U; P_L] is the permutation to be realized by the middle omega network. In order to check whether a permutation can be realized by an omega network, we will use the following well known fact about the realizability of permutations in an omega network [43, 59, 66].
Lemma 1 A permutation represented by an 8 × 6 binary matrix can be realized by an omega network with 8 inputs/outputs if and only if the 8 × 3 matrices formed by columns 2, 3, 4 (window w1) and by columns 3, 4, 5 (window w2) each contain all 2³ 3-bit strings.
Definition 2 Given rows p and q, we say p and q conflict in a window w_i, 1 ≤ i ≤ 2, if in window w_i the corresponding bits of rows p and q are the same.
For example, if we want to realize permutations P_U and P_L given by

[P_U; P_L] =
000 000
010 001
101 010
111 011
001 100
100 101
011 110
110 111

then rows 0 and 1 conflict in w2. Similarly, rows 2 and 3 conflict in w2. Thus, an alternate statement of lemma 1 is:
A permutation is passable by the omega network if and only if there is no conflict in window w1 and no conflict in window w2.
Notice that given any P_U : I_U → O_U and P_L : I_L → O_L, there can not be any conflict in w1 or in w2 between any two rows p and q where 0 ≤ p ≤ 3 and 4 ≤ q ≤ 7. However, conflicts can exist within the first four rows and within the last four rows.
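The alternate statement above is easy to mechanize. The sketch below is our own illustration (the function name is ours, not from the dissertation): it takes the 8 × 6 matrix as eight 6-bit strings and tests both windows for a repeated 3-bit string.

```python
def omega_realizable(rows):
    """rows: eight 6-bit strings (3 input bits followed by 3 output bits).
    True iff windows w1 (columns 2-4) and w2 (columns 3-5) are conflict free,
    i.e. each window contains all eight 3-bit strings."""
    for start in (1, 2):                 # 0-indexed starts of w1 and w2
        if len({r[start:start + 3] for r in rows}) < 8:
            return False                 # a repeated 3-bit string is a conflict
    return True
```

On the example matrix above, the function reports a conflict (rows 0 and 1 agree in w2), so the permutation is not passable as given.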
Let R1, R2 denote output sets {0, 1} and {2, 3} in O_U respectively. Similarly, let S1 and S2 denote output sets {4, 5} and {6, 7} in O_L. Let I(R1) denote the inputs in I_U which get mapped to elements of R1 by the permutation P_U. Similarly, I(R2), I(S1) and I(S2) can be defined. Now, we are ready for the main theorem.
Theorem 1 Given a permutation P to be realized by the 5 stage network shown in Figure 2.2, let P_U and P_L be the permutations to be realized by the upper and lower switches. There exist permutations Q_U and Q_L which can be realized by the omega network and which are sufficient to realize P through the 5 stage network.
Proof: If [P_U; P_L] can be realized by the omega network, then let [Q_U; Q_L] = [P_U; P_L] and we are done. If not, conflicts occur in a window. (Notice that the choice of I_U, O_U, I_L, O_L ensures that no conflicts occur in window w1.) Also, if rows p and q are in conflict then 0 ≤ p, q ≤ 3 or 4 ≤ p, q ≤ 7. Now, consider the case where there are conflicts within the upper 4 rows as well as within the lower 4 rows in [P_U; P_L]. Notice that there are conflicts in the upper four rows if and only if I(R1) = {0, 2} and I(R2) = {5, 7} or I(R1) = {5, 7} and I(R2) = {0, 2}. Similarly, there are conflicts in the lower four rows if and only if I(S1) = {1, 3} and I(S2) = {4, 6} or I(S1) = {4, 6} and I(S2) = {1, 3}. Let us flip the settings of switches 1 and 3 in the left most stage and exchange inputs 0 and 1, and 4 and 5, in [P_U; P_L]. Let [Q_U; Q_L] be the resulting permutation. It is easy to verify that [Q_U; Q_L] can be realized by the
Figure 2.3: The effect of flipping the setting of a switch in the left most stage
omega network by using lemma 1. Notice that by exchanging 0 and 1 in the permutation to be realized by the omega network and by flipping the setting of switch 1 (the topmost switch) in the left most stage, inputs 0 and 1 of switch 1 are routed to the desired terminals at the output of the omega network, as shown in Figure 2.3. In this example we have assumed that switch 1 in the left most stage is set straight to start with.
Now consider the case when there is a conflict in only one of the upper or lower four rows in [P_U; P_L]. Assume that the conflict is in [P_U]. Note that a conflict in the upper four rows exists if and only if I(R1) = {0, 2} and I(R2) = {5, 7} or I(R1) = {5, 7} and I(R2) = {0, 2}. Similarly, no conflicts exist in the lower four rows if and only if:

Case A: I(S1) = {1, 4} and I(S2) = {3, 6} or I(S1) = {3, 6} and I(S2) = {1, 4}, or

Case B: I(S1) = {1, 6} and I(S2) = {3, 4} or I(S1) = {3, 4} and I(S2) = {1, 6}.

In case A, exchange 0-1 and 4-5 in [P_U; P_L] to get [Q_U; Q_L], flip the settings of switches 1, 3 in the left most stage, and we are done. In case B, exchange inputs 0-1 and 6-7 and outputs 0-4 and 1-5 in [P_U; P_L], and flip the settings of switches 1, 4 in the left most stage and switches 1, 2 in the right most stage. Let [Q_U; Q_L] be the resulting permutation to be realized by the omega network. It is easy to verify (using lemma 1) that [Q_U; Q_L] can be realized by the omega network. ∎
Thus the complete algorithm to realize a permutation through the 5 stage shuffle/exchange network (shown in Figure 2.2) is as follows:

1. Construct a bipartite multigraph G with 4 vertices on each side, where an edge (i, j) exists if and only if an input of switch i in the left most stage is to be connected to an output of switch j in the right most stage.

2. Obtain two disjoint complete matchings M_0 and M_1 in G. This can be easily done by tracing cycles in G and alternately placing the edges of each cycle in M_0 and M_1 (see [74] for example). Set the switches in the left and right most stages to realize the connections in M_0 and M_1.

3. Using M_0 and M_1, obtain permutations P_U and P_L to be realized by the middle stages.

4. Obtain [Q_U; Q_L] so that it can be realized by the middle omega network, as in the proof of Theorem 1.

5. The settings of switches in the left and right most stages are determined by steps 2 and 4. Since the omega network is a self-routing network, [Q_U; Q_L] determines the settings of switches in stages 2, 3 and 4.
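Step 2 can be sketched in a few lines. The function below is our own illustration of the cycle-tracing idea (the name and data layout are ours, not from [74]): since every switch has exactly two terminals, G is a disjoint union of even cycles, so alternately assigning each cycle's edges to M_0 and M_1 gives two complete matchings.

```python
def two_matchings(perm):
    """perm: dict mapping each input terminal 0..7 to an output terminal.
    Returns (M0, M1): edge lists over the 4 left / 4 right switches such
    that each matching uses every switch on each side exactly once."""
    # one multigraph edge (left switch, right switch) per connection
    edges = [(x // 2, perm[x] // 2) for x in range(8)]
    left = {i: [] for i in range(4)}    # edge indices incident to each node
    right = {j: [] for j in range(4)}
    for idx, (i, j) in enumerate(edges):
        left[i].append(idx)
        right[j].append(idx)
    M = ([], [])
    used = [False] * 8
    for start in range(8):              # trace each cycle of the multigraph
        idx, pivot_right, m = start, True, 0
        while not used[idx]:
            used[idx] = True
            M[m].append(edges[idx])
            m ^= 1                      # alternate edges between M0 and M1
            i, j = edges[idx]
            inc = right[j] if pivot_right else left[i]
            idx = inc[0] if inc[1] == idx else inc[1]   # the other incident edge
            pivot_right = not pivot_right
    return M
```

For the permutation P1 of Example 1 below, this produces one valid pair of matchings; any pair obtained by swapping the roles of M_0 and M_1 on some cycles would serve equally well.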
2.4 Examples of the Routing Algorithm
In this section, we show three examples to illustrate our routing algorithm.
P1:
0 → 0, 1 → 2, 2 → 4, 3 → 5, 4 → 3, 5 → 6, 6 → 7, 7 → 1
Figure 2.4: Permutation P1 and its bipartite graph representation
Figure 2.5: The initial switch settings of Example 1
Example 1

Given a permutation P1 as shown in Figure 2.4, we construct a bipartite multigraph G1 as shown in Figure 2.4. If we choose M_0 = {(0,0), (1,2), (2,1), (3,3)} and M_1 = {(0,1), (1,2), (2,3), (3,0)}, then we get the initial switch settings for the left/right most stages as shown in Figure 2.5.

In order to realize P1, using M_0 and M_1, we get the required omega connection [P_U; P_L] as:
Figure 2.6: The final switch settings of Example 1
[P_U; P_L] =
000 000
101 001
010 010
111 011
110 100
001 101
011 110
100 111
Because there is no conflict in [P_U; P_L], we can pass it through the middle omega network. The final switch settings are shown in Figure 2.6.
Example 2

Given a permutation P2 as shown in Figure 2.7, we construct a bipartite multigraph G2 as shown in Figure 2.7. If we choose M_0 = {(0,0), (1,1), (2,2), (3,3)} and M_1 = {(0,0), (1,1), (2,2), (3,3)}, then we get the initial switch settings for the left/right most stages as shown in Figure 2.8.
P2:
0 → 0, 1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 5, 6 → 6, 7 → 7 (the identity permutation)
Figure 2.7: Permutation P2 and its bipartite graph representation
Figure 2.8: The initial switch settings of Example 2
In order to realize P2, using M_0 and M_1, we get the required omega connection [P_U; P_L] as:

000 000
010 001
101 010
111 011
001 100
011 101
100 110
110 111
Because there are conflicts both within the upper four rows and within the lower four rows, we flip the settings of switches 1, 3 in the left most stage and we exchange inputs 0-1 and 4-5 in [P_U; P_L]. This leads to,
[Q_U; Q_L] =
001 000
010 001
100 010
111 011
000 100
011 101
101 110
110 111

Since [Q_U; Q_L] has no conflicts, we can pass it through the middle omega network. The final switch settings are shown in Figure 2.9.
Example 3

Given a permutation P3 as shown in Figure 2.10, we construct a bipartite multigraph G3 as shown in Figure 2.10. If we choose M_0 = {(0,0), (1,1), (2,2), (3,3)} and
Figure 2.9: The final switch settings of Example 2
P3:
0 → 0, 1 → 1, 2 → 2, 3 → 4, 4 → 5, 5 → 6, 6 → 7, 7 → 3
Figure 2.10: Permutation P3 and its bipartite graph representation
M_1 = {(0,0), (1,2), (2,3), (3,1)}, then we get the initial switch settings for the left/right most stages as shown in Figure 2.11.

In order to realize P3, using M_0 and M_1, we get the required omega connection [P_U; P_L] as:
Figure 2.11: The initial switch settings of Example 3
[P_U; P_L] =
000 000
010 001
101 010
111 011
001 100
110 101
011 110
100 111
Because there are conflicts only within the upper four rows and I(S1) = {1, 6} and I(S2) = {3, 4}, we flip the settings of switches 1, 4 in the left most stage and the settings of switches 1, 2 in the right most stage. We also exchange inputs 0-1 and 6-7 and outputs 0-4 and 1-5 in [P_U; P_L]. This leads to,
Figure 2.12: The final switch settings of Example 3
[Q_U; Q_L] =
000 000
111 001
101 010
110 011
001 100
010 101
011 110
100 111

Since [Q_U; Q_L] has no conflicts, we can pass it through the middle omega network. The final switch settings are shown in Figure 2.12.
2.5 Conclusion

In this chapter, we presented a simple proof of the rearrangeability of five stage shuffle/exchange networks for N = 8. Our proof technique is an extension to the basic idea used in proving the rearrangeability of three-stage Clos networks. The proof is constructive and leads to a routing algorithm to realize arbitrary permutations in the network.
Chapter 3

Latin Squares for Parallel Array Access
We propose a new parallel memory system for efficient parallel array access. New latin squares called perfect latin squares are introduced to be used as skewing functions. Simple construction methods are shown for building perfect latin squares. The resulting skewing scheme provides conflict free access to several important subsets of an array. The address generation can be performed in constant time with simple circuitry. The skewing scheme is the first skewing scheme that can provide constant time access to rows, columns, diagonals, and N^{1/2} × N^{1/2} subarrays of an N × N array with maximum memory utilization. Self-routing Benes networks can be used to realize the permutations needed between processing elements and memory modules. We also propose two skewing schemes to provide conflict free access to three-dimensional arrays. Combined with self-routing Benes networks, these schemes provide efficient access to frequently used subsets of three-dimensional arrays.
3.1 Introduction
The parallel array access problem is how to store an N × N array in M memory modules such that no memory conflict occurs when various subsets of the array (rows, columns, diagonals, N^{1/2} × N^{1/2} subarrays, etc.) are accessed. A memory conflict is said to occur when more than one memory request is given to the same memory module. The importance of the problem is well understood in terms of the effective processor-memory bandwidth. High performance pipelined computers and parallel
Figure 3.1: A shared memory system
computers use multiple memory modules to overcome the effect of memory cycle time on the performance of the system. However, the effective bandwidth of the memory system depends not only on the speed and the number of memory modules but also on the occurrence of memory conflicts. Figure 3.1 shows a typical shared memory system where processing elements PE_0, PE_1, ..., PE_{N−1} are connected to memory modules M_0, M_1, ..., M_{M−1} via an interconnection network so that any processing element can access any memory module. At each pass through the interconnection network a memory module can be accessed by at most one processing element. In an extreme case, if every processing element wants to access data which reside in memory module M_0, then the effective bandwidth is the same as using only one memory module, irrespective of the number of memory modules in the memory system.
A skewing scheme is a mapping of the array elements into memory modules that provides conflict free memory access to various subsets of the array. Figure 3.2 shows two examples of skewing schemes in which array A = (a_{i,j}) of order 4 is to be mapped into four memory modules numbered 0, 1, 2, 3. In each example, array element a_{i,j} is stored in the memory module written below it. In Figure 3.2(a), it is easy to see that, if any column is to be accessed, four memory cycles are needed. On the other hand, Figure 3.2(b) shows another skewing scheme which provides conflict
free access to rows and columns. A skewing scheme S is called linear if it assigns the array element a_{i,j} to memory module s_{i,j} = pi + qj (mod M) for some fixed integers p, q and M. If a skewing scheme S is not linear, then it is called a nonlinear skewing scheme. The following issues are important in evaluating a skewing scheme.
• Set of subsets accessible without memory conflict: Rows, columns, diagonals and subarrays are the most important because of their frequent usage in scientific computations.

• Address generation: The computation of the memory module address and the local address within the module should be done fast using a small amount of circuitry. Thus, given i and j, there should be an efficient way of computing s_{i,j}.

• Memory utilization: The ratio of active memory modules (memory modules in which desired data resides) to the total number of memory modules in a memory access should be high. An N × N array can be stored in N² memory modules, leading to a trivial solution to the parallel array access problem. However, the utilization of the memory modules will be very low for most subsets.

• Interconnection network: Even if all the above criteria are satisfied, a memory system can not be efficient if there is no interconnection network that efficiently realizes the permutations needed between the processing elements and the memory modules. Such an interconnection network should provide efficient routing for the permutations arising in the mapping.
The parallel array access problem has been given much attention since the early stages of research in parallel processing [56]. Extensive research has been done to solve the problem [7, 8, 16, 47, 59, 62, 93, 109]. Budnik and Kuck introduced the linear skewing scheme [16], and many researchers have investigated linear skewing schemes [59, 107, 109]. In general, linear skewing schemes provide conflict free access to many important subsets including rows, columns, diagonals and subarrays
(a)                                  (b)
a_{0,0} a_{0,1} a_{0,2} a_{0,3}      a_{0,0} a_{0,1} a_{0,2} a_{0,3}
   0       1       2       3            0       1       2       3
a_{1,0} a_{1,1} a_{1,2} a_{1,3}      a_{1,0} a_{1,1} a_{1,2} a_{1,3}
   0       1       2       3            1       2       3       0
a_{2,0} a_{2,1} a_{2,2} a_{2,3}      a_{2,0} a_{2,1} a_{2,2} a_{2,3}
   0       1       2       3            2       3       0       1
a_{3,0} a_{3,1} a_{3,2} a_{3,3}      a_{3,0} a_{3,1} a_{3,2} a_{3,3}
   0       1       2       3            3       0       1       2
Figure 3.2: Two skewing schemes
using M (M ≥ N) memory modules to store an N × N array. However, linear skewing schemes have some serious drawbacks. Most linear skewing schemes need modulo operations with respect to a number which is not a power of two for address generation [16, 107]. The modulo operations can not be done in constant time with a "reasonable" amount of circuitry. Also, most linear skewing schemes need full crossbar networks for processor-memory interconnection, which are expensive for large N [60]. One exception is Lawrie's scheme, which uses an omega network and provides easy address generation [59]. However, the scheme uses 2N memory modules to store an N × N array, leading to low memory utilization.
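The contrast between the two schemes of Figure 3.2 can be reproduced in a few lines. The sketch below is our own illustration (helper names are ours): scheme (a) stores a_{i,j} in module j, so a whole column lands in one module, while scheme (b) is the linear skewing s_{i,j} = i + j (mod 4), which spreads every row and every column across all four modules.

```python
N = 4
plain = lambda i, j: j % N            # Figure 3.2(a): module = column index
skewed = lambda i, j: (i + j) % N     # Figure 3.2(b): linear, p = q = 1

col0 = [(i, 0) for i in range(N)]     # cells of column 0
row0 = [(0, j) for j in range(N)]     # cells of row 0

# (a): all of column 0 maps to module 0, so four memory cycles are needed
assert [plain(i, j) for i, j in col0] == [0, 0, 0, 0]
# (b): a row or a column touches all four modules, so one cycle suffices
assert sorted(skewed(i, j) for i, j in col0) == [0, 1, 2, 3]
assert sorted(skewed(i, j) for i, j in row0) == [0, 1, 2, 3]
```

Note that the same linear scheme conflicts on the main diagonal (modules 0, 2, 0, 2), which illustrates why richer skewing schemes are studied later in this chapter.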
Most nonlinear skewing schemes are based on bitwise XOR operations, which were first used in Batcher's scheme [8]. Even though the problem considered in [8] is somewhat different from the parallel array access problem (N memory chips with N bit locations in each chip were used to store N words, each of N bits), Batcher's scheme can provide conflict free access to rows and columns and some other subsets (stencils) of an N × N array using N memory modules. The memory module address generation can be done in constant time. Lee proposed a skewing scheme (scrambled storage scheme) based on the XOR scheme which provides conflict free access to rows, columns and subarrays using N memory modules [62]. This scheme offers constant time address generation and adopted a new interconnection network to realize the required permutations efficiently. However, the scrambled storage scheme doesn't
provide conflict free access to diagonals. Balakrishnan et al. proposed a novel approach leading to new nonlinear skewing schemes based on magic squares [7]. The resulting skewing schemes can provide conflict free access to rows, columns and diagonals using N memory modules. However, the skewing scheme doesn't provide fast address generation or conflict free access to subarrays. Frailong et al. proposed a generalized XOR scheme and derived conflict free conditions for crumbled rectangles and chessboards [29]. Boppana and Raghavendra refined the generalized XOR scheme using linear permutations [14, 89] and derived conflict free conditions for various subsets of an array. They also proposed to use self-routing networks to realize the required permutations between processing elements and memory modules. In the approaches of [14, 29, 89], conflict free conditions for each subset are developed independently and are not guaranteed to be compatible with each other. The resulting mapping of array elements to memory modules may change from subset to subset, leading to a dynamic skewing scheme, as compared to a static skewing scheme where the mapping remains unchanged. Dynamic skewing schemes can suffer from re-skewing overhead arising from changing the mapping of data elements onto memory modules.
To achieve conflict free access to various subsets of an array, we propose to use latin squares, which have been well known combinatorial objects for centuries [25]. We introduce new latin squares called perfect latin squares which have properties useful for parallel array access. We show detailed construction methods for several important classes of perfect latin squares. Using perfect latin squares, many interesting subsets of an array (rows, columns, diagonals, subsquares and same positions) can be accessed without memory conflicts. The address generation of memory modules and local addresses can be performed in constant time with a simple circuit when N is an even power of two. The efficient skewing scheme with constant time address generation leads to the first memory system that achieves constant time access to rows, columns, diagonals, and subarrays with maximum memory utilization, assuming an interconnection network with constant delay. Furthermore, the permutations needed between the processing elements and memory modules can be realized by
self-routing Benes networks. We also propose two new skewing schemes for three-dimensional arrays which can provide conflict free access to various subsets of an array such as rows, columns, files, subcubes and planes. The address generation for these schemes can be done in constant time and maximum memory utilization is achieved for the supported subsets. Furthermore, the permutations needed between processing elements and memory modules can be realized by self-routing Benes networks.
The next section introduces latin squares and perfect latin squares and shows their usefulness in parallel array access. Section 3.3 shows detailed construction methods for several important classes of perfect latin squares. Section 3.4 shows how the address generation for perfect latin squares can be performed in constant time with simple circuitry. Section 3.5 shows the self-routing capability of the perfect latin squares. Section 3.6 shows two skewing schemes for three-dimensional arrays. Section 3.7 concludes the chapter.
3.2 Latin Squares for Parallel Array Access
A latin square of order n is an n × n square composed of symbols from 0 to n − 1 such that no symbol appears more than once in any row or in any column [25]. The rows are numbered from 0 to n − 1, top to bottom. The columns are also numbered from 0 to n − 1, left to right. The squares A and B shown below are examples of latin squares of order 4.
A =          B =
0 1 2 3      0 1 2 3
1 2 3 0      2 3 0 1
2 3 0 1      3 2 1 0
3 0 1 2      1 0 3 2
A diagonal latin square of order n is a latin square of order n such that no symbol appears more than once in either of its two main diagonals. The square B shown above is an example of a diagonal latin square of order 4. A simple construction method is known for diagonal latin squares of order n when n is a power of two [25]. Gergely showed a general method to construct diagonal latin squares of any order n (n > 4) [31].
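These definitions translate directly into membership tests. The sketch below is our own illustration (function names are ours); square A above is latin but not diagonal latin, while square B is both.

```python
def is_latin(sq):
    """True iff no symbol repeats in any row or column of sq."""
    n = len(sq)
    full = set(range(n))
    return (all(set(row) == full for row in sq) and
            all({sq[i][j] for i in range(n)} == full for j in range(n)))

def is_diagonal_latin(sq):
    """True iff sq is latin and no symbol repeats on either main diagonal."""
    n = len(sq)
    full = set(range(n))
    return (is_latin(sq) and
            {sq[i][i] for i in range(n)} == full and
            {sq[i][n - 1 - i] for i in range(n)} == full)

A = [[0, 1, 2, 3], [1, 2, 3, 0], [2, 3, 0, 1], [3, 0, 1, 2]]
B = [[0, 1, 2, 3], [2, 3, 0, 1], [3, 2, 1, 0], [1, 0, 3, 2]]
```

For A, the main diagonal reads 0, 2, 0, 2, so the diagonal test fails; B passes both tests.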
Since rows and columns are the most important subsets of an array, the usefulness of latin squares in parallel array access is obvious. Suppose we use a latin square L of order N as a skewing scheme, i.e., array element a_{i,j} of the N × N matrix A = (a_{i,j}) is stored in the memory module l_{i,j}. Then:

• Rows and columns can be accessed without memory conflict.

• Only N memory modules are needed, resulting in 100% memory utilization.
Similarly, a diagonal latin square can provide conflict free access to rows, columns and diagonals. In the following, we introduce perfect latin squares, which are more useful than plain latin squares and diagonal latin squares for parallel array access.

We define a subsquare S_{i,j} of a latin square of order n² as an n × n square whose top left cell has the coordinate (i, j). When i ≡ 0 (mod n) and j ≡ 0 (mod n), subsquare S_{i,j} is called a main subsquare. Subsquare S_{0,1} of the square B is

1 2
3 0

and main subsquare S_{2,0} of the square B is

3 2
1 0

We define a perfect latin square of order n² as a diagonal latin square of order n² such that no symbol appears more than once in any main subsquare. Hence, in a perfect latin square, no symbol appears more than once in any row, in any column, in any main diagonal or in any main subsquare. The square E shown below is a perfect latin square of order 9.
E =
0 3 6 1 4 7 2 5 8
2 5 8 0 3 6 1 4 7
1 4 7 2 5 8 0 3 6
3 6 0 4 7 1 5 8 2
5 8 2 3 6 0 4 7 1
4 7 1 5 8 2 3 6 0
6 0 3 7 1 4 8 2 5
8 2 5 6 0 3 7 1 4
7 1 4 8 2 5 6 0 3
If a perfect latin square P = (p_{i,j}) is used as a skewing scheme, then rows, columns, diagonals and main subsquares can be accessed without memory conflict. The degree of memory conflict, the maximum number of memory requests given to a memory module, for an arbitrary subsquare will be at most four, since any subsquare can be partitioned into four parts, each of them being a part of a main subsquare.
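The perfect latin square property can also be checked mechanically. The following sketch (our illustration; the function name is ours) verifies every row, column, both main diagonals, and all n × n main subsquares of a square of order n²:

```python
def is_perfect_latin(sq):
    """True iff sq (order n^2, list of lists) is a perfect latin square:
    rows, columns, both main diagonals and all n x n main subsquares
    each contain every symbol exactly once."""
    N = len(sq)
    n = round(N ** 0.5)
    if n * n != N:
        return False
    full = set(range(N))
    groups = [set(row) for row in sq]                            # rows
    groups += [{sq[i][j] for i in range(N)} for j in range(N)]   # columns
    groups += [{sq[i][i] for i in range(N)},                     # main diagonal
               {sq[i][N - 1 - i] for i in range(N)}]             # antidiagonal
    groups += [{sq[I + di][J + dj] for di in range(n) for dj in range(n)}
               for I in range(0, N, n) for J in range(0, N, n)]  # main subsquares
    return all(g == full for g in groups)
```

Applied to the square E above, every one of the checked sets contains each of the symbols 0 through 8 exactly once.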
The next section shows detailed construction methods for several important classes of perfect latin squares. The resulting perfect latin squares have additional properties useful for parallel array access.
3.3 Construction of Perfect Latin Squares
In this section, we show simple construction methods for perfect latin squares of order n² where n is odd, n is a power of two, or n = 2^l m², l ∈ {2, 3, 4, ...}, m odd. We also show that the perfect latin squares built from these methods have several properties useful for parallel array access.
Two latin squares C and D of order n are orthogonal to each other if the set of ordered pairs CD = {(c_{i,j}, d_{i,j}), 0 ≤ i, j ≤ n − 1} is equal to the set of all possible ordered pairs (i, j), 0 ≤ i, j ≤ n − 1. The squares C and D shown below are orthogonal to each other.

C =          D =
0 1 2        0 1 2
1 2 0        2 0 1
2 0 1        1 2 0
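Orthogonality is simply a counting condition and can be tested directly. The sketch below (our illustration; the function name is ours) superimposes two squares and checks that every ordered pair of symbols appears exactly once:

```python
def orthogonal(C, D):
    """True iff superimposing C and D gives each ordered pair exactly once."""
    n = len(C)
    pairs = {(C[i][j], D[i][j]) for i in range(n) for j in range(n)}
    return len(pairs) == n * n

C = [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
D = [[0, 1, 2], [2, 0, 1], [1, 2, 0]]
```

For the C and D shown above, the nine superimposed cells produce all nine ordered pairs; superimposing a square with itself yields only the pairs (x, x), so no square of order n > 1 is orthogonal to itself.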
A self-orthogonal latin square is a latin square orthogonal to its transpose. We define a doubly self-orthogonal latin square as a latin square orthogonal to its transpose and to its antitranspose. Notice that a doubly self-orthogonal latin square is also a diagonal latin square. The square B in the previous section, also shown below for convenience, is an example of a doubly self-orthogonal latin square. A transversal of a latin square of order n is a set of n cells, no two in the same row, no two in the same column, and no two with the same symbol. The set {b_{0,0}, b_{1,1}, b_{2,2}, b_{3,3}} is a transversal of the square B.

B =
0 1 2 3
2 3 0 1
3 2 1 0
1 0 3 2
A sufficient condition for the existence of a perfect latin square is given in the following theorem.

Theorem 2 If there exist two latin squares A and B of order n such that A is orthogonal to the transpose of B and to the antitranspose of B, then there exists a perfect latin square P of order n^2. Furthermore, P is also a doubly self-orthogonal latin square.
Proof: From latin squares A and B, we construct a latin square C of order n^2 with the following rule:

c_{i,j} = n * a_{[i/n],[j/n]} + b_{i mod n, j mod n},   0 <= i,j <= n^2 - 1

where [i/n] represents the largest integer not exceeding i/n and i mod n represents the non-negative remainder of i/n. The construction of C can be viewed as follows:

1. Construct a square of order n^2 using n^2 B's. Let B_{i,j} be the square B whose position is (i,j) among the B's.

2. Add n * a_{i,j} to all members of B_{i,j}.

From C, we construct another latin square P of order n^2 using the following row exchange rule:

p_{i,j} = c_{n*(i mod n) + [i/n], j}
An example is shown below for n = 3. Notice that A is orthogonal to the transpose of B and to the antitranspose of B.

A =
0 1 2
1 2 0
2 0 1

B =
0 1 2
2 0 1
1 2 0
C =
0 1 2 3 4 5 6 7 8
2 0 1 5 3 4 8 6 7
1 2 0 4 5 3 7 8 6
3 4 5 6 7 8 0 1 2
5 3 4 8 6 7 2 0 1
4 5 3 7 8 6 1 2 0
6 7 8 0 1 2 3 4 5
8 6 7 2 0 1 5 3 4
7 8 6 1 2 0 4 5 3
P =
0 1 2 3 4 5 6 7 8
3 4 5 6 7 8 0 1 2
6 7 8 0 1 2 3 4 5
2 0 1 5 3 4 8 6 7
5 3 4 8 6 7 2 0 1
8 6 7 2 0 1 5 3 4
1 2 0 4 5 3 7 8 6
4 5 3 7 8 6 1 2 0
7 8 6 1 2 0 4 5 3
From the construction, it is clear that, in square P, no symbol appears more than once in any row, in any column or in any main subsquare. Suppose P is not orthogonal to its transpose. There should be i, j, k and l, i != k or j != l, such that p_{i,j} = p_{k,l} and p_{j,i} = p_{l,k}. From the construction, p_{i,j}, 0 <= i,j <= n^2 - 1, can be expressed as

p_{i,j} = n * a_{i mod n, [j/n]} + b_{[i/n], j mod n}

Since 0 <= a_{i,j}, b_{i,j} <= n - 1, 0 <= i,j <= n - 1, from p_{i,j} = p_{k,l}, we have

a_{i mod n, [j/n]} = a_{k mod n, [l/n]}  and  b_{[i/n], j mod n} = b_{[k/n], l mod n}   (3.1)

From p_{j,i} = p_{l,k}, we have

a_{j mod n, [i/n]} = a_{l mod n, [k/n]}  and  b_{[j/n], i mod n} = b_{[l/n], k mod n}   (3.2)

(3.1) and (3.2) contradict the assumption that A and the transpose of B are orthogonal. Hence P is orthogonal to its transpose. The same argument can be applied to P and its antitranspose. Hence, P is a doubly self-orthogonal latin square and no symbol appears more than once in either of its two main diagonals. ■
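The two rules in the proof translate directly into code. A sketch in Python (illustrative, not part of the original design; `A` and `B` are latin squares of order n satisfying the hypothesis of theorem 2):

```python
def build_perfect_latin_square(A, B):
    """Theorem 2 construction: c_{i,j} = n*a_{[i/n],[j/n]} + b_{i mod n, j mod n},
    followed by the row exchange p_{i,j} = c_{n*(i mod n) + [i/n], j}."""
    n = len(A)
    n2 = n * n
    # step 1: tile n^2 copies of B and add n * a_{i,j} to tile (i, j)
    C = [[n * A[i // n][j // n] + B[i % n][j % n] for j in range(n2)]
         for i in range(n2)]
    # step 2: row exchange
    return [[C[n * (i % n) + i // n][j] for j in range(n2)] for i in range(n2)]
```

Applied to the n = 3 squares A and B shown above, this reproduces the order-9 square P.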
Since a doubly self-orthogonal latin square is orthogonal to its transpose and to its antitranspose, a doubly self-orthogonal latin square can be used both as the square A and as the square B in the previous theorem. Hence, we get the following corollary.

Corollary 1 If there exists a doubly self-orthogonal latin square of order n, then there exists a perfect latin square of order n^2, which is also a doubly self-orthogonal latin square.

Since the perfect latin squares resulting from theorem 2 and corollary 1 are doubly self-orthogonal latin squares, by applying the construction method recursively, we have an infinite set of perfect latin squares. Thus, we have the following theorem.

Theorem 3 If there exist two latin squares A and B of order n such that A is orthogonal to the transpose of B and to the antitranspose of B, or there exists a doubly self-orthogonal latin square of order n, then there exists an infinite set S = {perfect latin squares of order n^(2^k) | k in {1,2,3,...}}.
The following theorem shows a construction method for perfect latin squares of order n^2, when n is odd.

Theorem 4 For all odd n, there exists a perfect latin square of order n^2, which is also a doubly self-orthogonal latin square.

Proof: A cyclic latin square C_n of odd order n is composed of n distinct transversals, one of them being a main diagonal and the others being broken diagonals. We fill each transversal with a single symbol, using distinct symbols for distinct transversals. The resulting square is a latin square D_n orthogonal to C_n. The transpose of D_n is a symbol permutation of D_n which is also orthogonal to C_n. The antitranspose of D_n is identical to D_n and therefore orthogonal to C_n. An example is shown below for n = 5. Hence, by theorem 2, there exists a perfect latin square of order n^2 for all odd n. ■
C_5 =
0 1 2 3 4
1 2 3 4 0
2 3 4 0 1
3 4 0 1 2
4 0 1 2 3

D_5 =
0 1 2 3 4
4 0 1 2 3
3 4 0 1 2
2 3 4 0 1
1 2 3 4 0
For the construction of perfect latin squares of order n^2, where n is a power of two, we start with the following lemma.

Lemma 2 If there exist two doubly self-orthogonal latin squares A of order m and B of order n, then there exists a doubly self-orthogonal latin square C of order mn.

Proof: We construct C with the following rule:

c_{i,j} = n * a_{[i/n],[j/n]} + b_{i mod n, j mod n},   0 <= i,j <= mn - 1
Notice that the construction method is similar to the construction method for C in theorem 2. From the construction, it is clear that C is a latin square. Suppose C is not orthogonal to its transpose. There should exist i, j, k and l, i != k or j != l, such that c_{i,j} = c_{k,l} and c_{j,i} = c_{l,k}. Since 0 <= a_{p,q} <= m - 1, 0 <= p,q <= m - 1, and 0 <= b_{r,s} <= n - 1, 0 <= r,s <= n - 1, from c_{i,j} = c_{k,l}, we have

a_{[i/n],[j/n]} = a_{[k/n],[l/n]}  and  b_{i mod n, j mod n} = b_{k mod n, l mod n}   (3.3)

From c_{j,i} = c_{l,k}, we have

a_{[j/n],[i/n]} = a_{[l/n],[k/n]}  and  b_{j mod n, i mod n} = b_{l mod n, k mod n}   (3.4)

(3.3) and (3.4) contradict the assumption that A and B are doubly self-orthogonal latin squares. Hence, C is orthogonal to its transpose. The same argument can be applied to C and its antitranspose. Therefore, C is a doubly self-orthogonal latin square. ■
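The product rule of lemma 2, sketched in Python (illustrative; `A` and `B` are doubly self-orthogonal squares of orders m and n):

```python
def compose_dsols(A, B):
    """Lemma 2: c_{i,j} = n*a_{[i/n],[j/n]} + b_{i mod n, j mod n}, order mn."""
    n = len(B)
    mn = len(A) * n
    return [[n * A[i // n][j // n] + B[i % n][j % n] for j in range(mn)]
            for i in range(mn)]

# Composing D^2 with itself gives the order-16 square of the induction step:
D2 = [[0, 1, 2, 3], [2, 3, 0, 1], [3, 2, 1, 0], [1, 0, 3, 2]]
C16 = compose_dsols(D2, D2)
assert all(sorted(row) == list(range(16)) for row in C16)
# orthogonal to its transpose: all 256 ordered pairs are distinct
assert len({(C16[i][j], C16[j][i]) for i in range(16) for j in range(16)}) == 256
```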
Now, we are ready for the following lemma.

Lemma 3 For all k in {2,3,4,...}, there exists a doubly self-orthogonal latin square D^k of order 2^k.

Proof: We use induction on k.

Bases: For k = 2 and k = 3, the squares D^2 and D^3 shown below are doubly self-orthogonal latin squares of order 2^2 and 2^3. D^3 with a symbol permutation can be found in [7].

D^3 =
0 1 2 3 4 5 6 7
2 3 0 1 6 7 4 5
5 4 7 6 1 0 3 2
7 6 5 4 3 2 1 0
1 0 3 2 5 4 7 6
3 2 1 0 7 6 5 4
4 5 6 7 0 1 2 3
6 7 4 5 2 3 0 1

Hypothesis: There exists a doubly self-orthogonal latin square D^k of order 2^k.

Induction Step: We construct a doubly self-orthogonal latin square D^{k+2} of order 2^{k+2} from D^2 and D^k, using the previous lemma. ■
Notice that D^2 in the above lemma is also a perfect latin square. From the above lemma and corollary 1, we have the following theorem.

Theorem 5 For all k in {1,2,3,...}, there exists a perfect latin square P^{2k} of order 2^{2k}, which is also a doubly self-orthogonal latin square.

Since a perfect latin square built from theorem 4 is also a doubly self-orthogonal latin square, we can build a new doubly self-orthogonal latin square of order 2^k m^2 using D^k and the square from theorem 4. We can then build a perfect latin square from this doubly self-orthogonal latin square.

Theorem 6 For all n = 2^k m^2, k in {2,3,4,...}, m odd, there exists a perfect latin square of order n^2, which is also a doubly self-orthogonal latin square.
D^2 =
0 1 2 3
2 3 0 1
3 2 1 0
1 0 3 2
As mentioned earlier, the usefulness of perfect latin squares as skewing schemes is obvious from the definition. Furthermore, the perfect latin squares built from the construction methods in this section have additional properties useful for parallel array access.

It is easy to check that the following lemma applies to a perfect latin square A of order n^2 built from theorem 2.

Lemma 4 No symbol appears more than once in a subsquare S_{i,j} such that i = 0 (mod n) or j = 0 (mod n).

The above lemma shows that the subsquares in the horizontal or vertical strips of width n are accessible without memory conflict, as well as the main subsquares. It is easy to see that there does not exist a skewing scheme to store an N x N array in N memory modules such that rows, columns, and all N^{1/2} x N^{1/2} subarrays can be accessed without memory conflict. However, we have the following lemma, which assures that the maximum degree of memory conflict is two when an arbitrary subsquare is accessed.

Lemma 5 No symbol appears more than twice in any subsquare S_{i,j}.

The above lemma is a direct result of lemma 4, because every subsquare can be partitioned into two parts such that each part belongs to a subsquare on a strip of width n.

Another set of interest is the set of elements whose coordinates are the same within each main subsquare. Let a same position SP_{i,j} of a perfect latin square A of order n^2 be defined as follows:

SP_{i,j} = {a_{k,l} | k = i (mod n), l = j (mod n)},   0 <= i,j <= n - 1

Now, we have the following lemma, which shows that no memory conflict occurs when a same position is accessed.

Lemma 6 No symbol appears more than once in any SP_{i,j}.
3.4 Address Generation

The previous two sections mainly showed skewing schemes that guarantee conflict-free access to various subsets of an array. Even if a skewing scheme provides conflict-free access to memory modules, the memory system cannot provide efficient memory access unless the address generation is simple and fast. That is, there should be an efficient way of computing p_{i,j}, given i and j, when P = (p_{i,j}) is used as the skewing scheme. In this section, we show that the address generation for the perfect latin square P^{2k} of order 2^{2k} can be done in constant time. The importance of the case when the number of memory modules is an even power of two is clear from the viewpoint of address generation, utilization of address space and interconnection network. Notice that we are storing an n^2 x n^2 array in n^2 memory modules.
From lemma 3, we can build a doubly self-orthogonal latin square D^k = (d^k_{i,j}) of order 2^k from D^2 and D^{k-2}. We can express d^k_{i,j} as follows:

d^k_{i,j} = 2^{k-2} * d^2_{[i/2^{k-2}],[j/2^{k-2}]} + d^{k-2}_{i mod 2^{k-2}, j mod 2^{k-2}}   (3.5)

Since 0 <= d^{k-2}_{p,q} <= 2^{k-2} - 1, 0 <= p,q <= 2^{k-2} - 1, the two most significant bits of d^k_{i,j} are determined by the two most significant bits of i ([i/2^{k-2}]) and the two most significant bits of j ([j/2^{k-2}]). The remaining bits of d^k_{i,j} can be determined using the above relation recursively.

Let d^k_{k-1} ... d^k_0 be the binary representation of d^k_{i,j}. Also, let i_{k-1} i_{k-2} ... i_0 and j_{k-1} j_{k-2} ... j_0 be the binary representations of i and j respectively. From the doubly self-orthogonal latin squares D^2 and D^3 in the previous section, we get the following relations:

d^2_0 = i_1 + j_0
d^2_1 = i_0 + i_1 + j_1

d^3_0 = i_1 + i_2 + j_0
d^3_1 = i_0 + j_1
d^3_2 = i_1 + j_2
From (3.5) and the above relations, we get:

1) when k is even, for all m, 0 <= m <= k - 1,

d^k_m = i_{m+1} + j_m               when m is even
d^k_m = i_{m-1} + i_m + j_m         otherwise

2) when k is odd, we have

d^k_0 = i_1 + i_2 + j_0
d^k_1 = i_0 + j_1
d^k_2 = i_1 + j_2

and, for all m, 3 <= m <= k - 1,

d^k_m = i_{m+1} + j_m               when m is odd
d^k_m = i_{m-1} + i_m + j_m         otherwise
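These closed-form bit relations can be checked against the squares themselves. A sketch for even k (Python, illustrative; `i` and `j` are k-bit indices into D^k):

```python
def d_bits(i, j, k):
    """Bits of d^k_{i,j} for even k: d_m = i_{m+1} xor j_m (m even),
    d_m = i_{m-1} xor i_m xor j_m (m odd). Each bit costs one or two XORs."""
    assert k % 2 == 0
    d = 0
    for m in range(k):
        if m % 2 == 0:
            bit = ((i >> (m + 1)) ^ (j >> m)) & 1
        else:
            bit = ((i >> (m - 1)) ^ (i >> m) ^ (j >> m)) & 1
        d |= bit << m
    return d

# Agrees with the square D^2 of lemma 3:
D2 = [[0, 1, 2, 3], [2, 3, 0, 1], [3, 2, 1, 0], [1, 0, 3, 2]]
assert all(d_bits(i, j, 2) == D2[i][j] for i in range(4) for j in range(4))
```

For k = 4 every row of the map j -> d_bits(i, j, 4) is a permutation of 0..15, i.e., the relations define a latin square of order 16, as (3.5) predicts.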
The address in the perfect latin square P^{2k} can easily be obtained from the doubly self-orthogonal latin square D^k. The perfect latin square P^{2k} is obtained from D^k using the following relation:

p^{2k}_{i,j} = 2^k * d^k_{i mod 2^k, [j/2^k]} + d^k_{[i/2^k], j mod 2^k}   (3.6)

From the above relation, we can see that the k most significant bits of p^{2k}_{i,j} are determined from the k least significant bits of i and the k most significant bits of j. The k least significant bits of p^{2k}_{i,j} are determined from the k most significant bits of i and the k least significant bits of j. In Figure 3.3, an example of memory module address generation is shown for the perfect latin square P^8. The only logic element used in the circuit has four inputs and two outputs, i.e., f = a + b + c and g = a + d. Note that memory module address generation can be performed in constant time.
For the local address generation within each memory module, we adopt a simple scheme: element a_{i,j} of array A = (a_{i,j}) is stored at local address i of the memory module p^{2k}_{i,j}. This way, there is no conflict in local addresses between array elements stored in the same memory module. Notice that we do not need any hardware for the local address generation.

Figure 3.3: The address generation circuit for P^8.
Suppose we are using a crossbar network as the interconnection network. Then the interconnection network delay is constant. Since the address generation can be performed in constant time, a subset of an array can be accessed in constant time if there is no memory conflict or the degree of memory conflict is constant. We summarize this in the following theorem.

Theorem 7 A parallel memory system with a perfect latin square as the skewing scheme can provide constant time access to rows, columns, diagonals, subarrays and same positions.

The hardware cost of a crossbar network is too high when N is large. In the next section, we show that perfect latin squares can be used with existing self-routing Benes networks, which are less expensive and provide fast routing for classes of permutations. The resulting parallel memory system provides fast access to data with relatively low hardware cost.
3.5 Self-Routing for Perfect Latin Squares

Even if a skewing scheme provides conflict-free access to many subsets of an array with constant time address generation, a parallel memory system cannot perform well if the required permutations between the processing elements and the memory modules cannot be realized efficiently by the interconnection network. In this section, we show that the permutations required by the perfect latin square P^{2k} can be realized by existing self-routing Benes networks, leading to a fast and cost effective parallel memory system.
Let I = (i_{n-1} ... i_0)^t and O = (o_{n-1} ... o_0)^t be the binary representations of an input port and an output port of a network of size 2^n respectively. Notice that I and O are column vectors. When n is an even number, we use I_u to represent the n/2 most significant bits of I, i.e., I_u = (i_{n-1} ... i_{n/2})^t. In a similar way, I_l = (i_{n/2-1} ... i_0)^t. The concatenation of I_u and I_l is represented by (I_u ; I_l), i.e., (I_u ; I_l) = I and (I_l ; I_u) = (i_{n/2-1} ... i_0 i_{n-1} ... i_{n/2})^t. A permutation matrix P is a binary matrix such that each row and each column has exactly one 1. From now on, all binary additions are modulo 2.
Definition 3 A permutation is a bit-permute-complement (BPC) permutation if there exist a permutation matrix P and a binary vector C which satisfy the following relation for all pairs of input I and output O, 0 <= I, O <= 2^n - 1 [73]:

O = P x I + C

Definition 4 A permutation is a linear-complement (LC) permutation if there exist a non-singular binary matrix L and a binary vector C which satisfy the following relation for all pairs of input I and output O, 0 <= I, O <= 2^n - 1 [13]:

O = L x I + C

It should be noted that LC contains BPC since every permutation matrix is a non-singular binary matrix.
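LC membership is also easy to test numerically: a permutation on 2^n ports is in LC iff the map I -> O(I) + O(0) is linear over GF(2), so it is fixed by its values on the unit vectors, and the permutation is a bijection (L non-singular). A sketch (Python, illustrative):

```python
def is_lc(perm):
    """perm[I] = O for a network of size 2**n. LC iff I -> perm[I] ^ perm[0]
    is GF(2)-linear (determined by its values on unit vectors) and perm is
    a bijection, which makes the matrix L non-singular."""
    size = len(perm)
    n = size.bit_length() - 1
    c = perm[0]                               # the complement vector C
    basis = [perm[1 << t] ^ c for t in range(n)]  # columns of L
    for I in range(size):
        o = c
        for t in range(n):
            if (I >> t) & 1:
                o ^= basis[t]
        if o != perm[I]:
            return False
    return len(set(perm)) == size
```

A pure complement (XOR by a constant) and a bit reversal both pass, while a permutation that swaps only two ports fails the linearity check.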
Self-routing Benes networks which can realize BPC permutations [73] and LC permutations [13] are known. These networks provide fast routing for classes of permutations by eliminating the time-consuming network set-up process. Permutations not suitable for self-routing can always be realized by conventional routing methods. A parallel memory system can provide fast access to data with relatively low hardware cost if the permutations needed for frequently used data can be realized by self-routing Benes networks.
Let P^{2k}_{m,n} = (p^{2k}_{2k-1} ... p^{2k}_0)^t and D^k_{m,n} = (d^k_{k-1} ... d^k_0)^t be the binary representations of p^{2k}_{m,n} and d^k_{m,n} respectively. Also, let M = (m_{2k-1} ... m_0)^t and N = (n_{2k-1} ... n_0)^t be the binary representations of m and n respectively. We assume the length of M and N is k when M and N are used with D^k_{m,n}. From D^2, D^3 and (3.5), the bits of D^k_{m,n} can be expressed as shown below.

When k is even, for all m', 0 <= m' <= k - 1,

d^k_{m'} = m_{m'+1} + n_{m'}                when m' is even
d^k_{m'} = m_{m'-1} + m_{m'} + n_{m'}       otherwise       (3.7)

When k is odd,

d^k_0 = m_1 + m_2 + n_0
d^k_1 = m_0 + n_1
d^k_2 = m_1 + n_2                                           (3.8)

and, for all m', 3 <= m' <= k - 1, d^k_{m'} = m_{m'+1} + n_{m'} when m' is odd and d^k_{m'} = m_{m'-1} + m_{m'} + n_{m'} otherwise.

Let A^k denote the binary matrix of the linear map in (3.7) and (3.8): when k is even, A^k is block diagonal with k/2 copies of the 2x2 block

( 1 1 )
( 1 0 )

along its diagonal (rows ordered from bit k-1 down to bit 0); when k is odd, the bottom rows are replaced by the 3x3 block corresponding to the three relations in (3.8). Then

D^k_{m,n} = A^k x M + N

Notice that A^k is a non-singular matrix; its determinant is |A^k| = ±1 != 0.

From (3.6), we can express P^{2k}_{m,n} as follows:

P^{2k}_u = A^k x M_l + N_u
P^{2k}_l = A^k x M_u + N_l

or,

( P^{2k}_u )   ( .    A^k ) ( M_u )   ( N_u )
( P^{2k}_l ) = ( A^k  .   ) ( M_l ) + ( N_l )       (3.9)

where a dot denotes an all-zero block. Recall that P^{2k}_u and P^{2k}_l represent the k most-significant bits and the k least-significant bits of p^{2k}_{m,n} respectively.
Suppose we are using the perfect latin square P^{2k} as the skewing scheme to store a 2^{2k} x 2^{2k} array A = (a_{m,n}) into 2^{2k} memory modules. The array element a_{m,n} is stored in the memory module p^{2k}_{m,n}. Now we investigate what kind of permutations between processing elements and memory modules (alternatively, between the input and output of the interconnection network) are needed to access various subsets of the array A (rows, columns, etc.). We show that these permutations can be realized by self-routing Benes networks by showing that they belong to LC.

The permutations required between processing elements and memory modules are determined not only by the skewing scheme, but also by the ordering used. Throughout this chapter, we assume row-major order. For any pair of elements a_{i,j} and a_{k,l}, a_{i,j} comes first (i.e., is accessed by the processing element with the smaller index) if k > i, or k = i and l > j.
Rows: The m0-th row of the array A is defined as R_{m0} = {a_{m0,n} | 0 <= n <= 2^{2k} - 1}. Since we are assuming row-major order, processing element PE_i should get a_{m0,i}, which is stored in memory module p^{2k}_{m0,i}. The required permutation between the input and output of the interconnection network for R_{m0} can be expressed as follows:

O = U x I + C0       (3.10)

where U is the identity matrix and

C0 = ( .    A^k ) x M0
     ( A^k  .   )

The identity matrix U is non-singular and C0 is a constant binary vector, hence the permutation represented by (3.10) is in LC. Hence, any permutation needed for row access can be realized by a self-routing Benes network.
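As a concrete check of (3.10): for k = 2, the permutation sending PE index I to the module holding a_{m0,I} is a fixed XOR of I, hence in LC (indeed in BPC). The sketch below (Python, illustrative) restates the address maps (3.5)/(3.6) so that the snippet is self-contained:

```python
def d_bits(i, j, k):
    # closed-form bits of D^k for even k (section 3.4)
    d = 0
    for m in range(k):
        if m % 2 == 0:
            bit = ((i >> (m + 1)) ^ (j >> m)) & 1
        else:
            bit = ((i >> (m - 1)) ^ (i >> m) ^ (j >> m)) & 1
        d |= bit << m
    return d

def p_address(i, j, k):
    # (3.6): high k bits from (i mod 2^k, j div 2^k), low k bits swapped
    n = 1 << k
    return n * d_bits(i % n, j // n, k) + d_bits(i // n, j % n, k)

# Row access, k = 2 (a 16 x 16 array in 16 modules): the module sequence
# seen by PE 0, 1, ..., 15 along row m0 is I xor C0 for a constant C0.
k, size = 2, 16
for m0 in range(size):
    c0 = p_address(m0, 0, k)
    assert all(p_address(m0, I, k) == I ^ c0 for I in range(size))
```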
Columns: The n0-th column of the array A is defined as C_{n0} = {a_{m,n0} | 0 <= m <= 2^{2k} - 1}. The required permutation between the input and output of the interconnection network for C_{n0} can be expressed as follows:

O = ( .    A^k ) x I + N0       (3.11)
    ( A^k  .   )

The binary matrix shown above is non-singular, since its determinant, ±|A^k|^2, is non-zero. Hence the permutations needed for column access are in LC and can be realized by a self-routing Benes network.
Diagonals: The diagonal of the array A is the set D = {a_{m,m} | 0 <= m <= 2^{2k} - 1}. The permutation needed for the diagonal can be expressed as follows:

O = ( .    A^k ) x I + I = ( U    A^k ) x I       (3.12)
    ( A^k  .   )           ( A^k  U   )

The binary matrix shown above is non-singular since its determinant is non-zero:

| U    A^k |
| A^k  U   | = |U| x |U - A^k U^{-1} A^k| = |U - (A^k)^2| = ±1 != 0

Therefore, the permutation is in LC and can be realized by a self-routing Benes network.
The anti-diagonal of the array A is the set D' = {a_{m, 2^{2k}-1-m} | 0 <= m <= 2^{2k} - 1}. For the anti-diagonal, the column index is the bitwise complement of the row index, so the permutation can be expressed as follows:

O = ( .    A^k ) x I + ( I + (1 ... 1)^t ) = ( U    A^k ) x I + (1 ... 1)^t       (3.13)
    ( A^k  .   )                             ( A^k  U   )

Since the permutation shown above is in LC, it can be realized by a self-routing Benes network.
Main subsquares: A main subsquare whose top left cell is (m0, n0), m0 = 0 mod 2^k and n0 = 0 mod 2^k, is defined as S_{m0,n0} = {a_{m0+p, n0+q} | 0 <= p, q <= 2^k - 1}. Notice that the k most-significant bits of M and N are fixed. Since row-major order is assumed, processing element PE_{2^k p + q} should access array element a_{m0+p, n0+q}, which is stored in p^{2k}_{m0+p, n0+q}. With I_u and I_l holding the binary representations of p and q, the permutation between the input and output of the interconnection network becomes:

O = ( A^k  .  ) x I + ( N0_u       )       (3.14)
    ( .    U  )       ( A^k x M0_u )

The second term on the right side of (3.14) is a constant binary vector. The matrix in the first term is non-singular since its determinant, ±|A^k| x |U|, is non-zero. Hence the permutation is in LC and can be realized by a self-routing Benes network.
Same positions: The same position SP_{m0,n0} is the set of cells whose relative position within each main subsquare is (m0, n0):

SP_{m0,n0} = {a_{m,n} | m = m0 mod 2^k, n = n0 mod 2^k},   0 <= m0, n0 <= 2^k - 1

Notice that the k least-significant bits of M and N are fixed. With I_u and I_l holding the binary representations of [m/2^k] and [n/2^k], the permutation needed for a same position can be expressed as follows:

O = ( .    U ) x I + ( A^k x M0_l )       (3.15)
    ( A^k  . )       ( N0_l       )

The second term on the right side of (3.15) is a constant vector. The matrix in the first term is non-singular. Hence the permutation is in LC and can be realized by a self-routing Benes network.
We summarize the analysis in this section with the following theorem.

Theorem 8 A self-routing Benes network can realize the permutations between processing elements and memory modules when a perfect latin square is used as the skewing scheme and rows, columns, diagonals, main subsquares or same positions are accessed.
3.6 Efficient Access to Three-Dimensional Arrays

Efficient access to various subsets of a 3-dimensional array is needed in many areas including medical imaging, fluid flow computations, and earth science [24, 68, 72, 90]. The access patterns needed for 3-dimensional arrays are slightly different from those needed in the case of 2-dimensional arrays. 1-dimensional vectors are still important. Diagonals are not needed as frequently as in the 2-dimensional case. Subcubes become important, as partitioned algorithms can be developed based on subcube data. Another important subset is the 2-dimensional planes of the 3-dimensional array, since they contain cross-sectional data as needed in computer tomography. In this section, we show two skewing schemes for three-dimensional arrays. The first scheme provides conflict-free access to 1-dimensional vectors and subcubes. The second scheme provides conflict-free access to 2-dimensional planes. Both schemes provide maximum memory utilization. The permutations needed by both skewing schemes can be realized by self-routing Benes networks.
Skewing scheme Φ

Given an N x N x N array A = (a_{x,y,z}), where N^{1/3} = n = 2^k, we define rows, columns, files and subcubes as follows:

row(y0, z0) = {a_{x,y0,z0} | 0 <= x <= N - 1}
column(x0, z0) = {a_{x0,y,z0} | 0 <= y <= N - 1}
file(x0, y0) = {a_{x0,y0,z} | 0 <= z <= N - 1}
subcube(x0, y0, z0) = {a_{x,y,z} | x0 <= x <= x0 + n - 1, y0 <= y <= y0 + n - 1, z0 <= z <= z0 + n - 1}

When x0 = 0 mod n, y0 = 0 mod n and z0 = 0 mod n, subcube(x0, y0, z0) is called a main subcube.

Let X = (x_{3k-1} ... x_0)^t be the binary representation of x. Also, let X_u, X_m and X_l represent (x_{3k-1} ... x_{2k})^t, (x_{2k-1} ... x_k)^t and (x_{k-1} ... x_0)^t respectively. Concatenations of binary vectors are represented in the same way as in the previous section. Our skewing scheme is to store array element a_{x,y,z} in memory module φ_{x,y,z}, where the 3-dimensional cube Φ = (φ_{x,y,z}) is defined as follows (recall that binary additions are modulo 2):

φ_{x,y,z} = (X_u ; X_m ; X_l) + (Y_l ; Y_u ; Y_m) + (Z_m ; Z_l ; Z_u)       (3.16)

Notice that 0 <= φ_{x,y,z} <= N - 1, i.e., we are using N memory modules to store an N x N x N array.

The following two theorems show that the skewing scheme provides conflict-free access to rows, columns, files and main subcubes.

Theorem 9 No memory conflict occurs when any row, any column or any file of the array A is accessed.
Proof: Suppose there is a memory conflict when row(y0, z0) is accessed. There should be two different elements a_{x1,y0,z0} and a_{x2,y0,z0} that get mapped to the same memory module. From φ_{x1,y0,z0} = φ_{x2,y0,z0}, we get

(X1_u ; X1_m ; X1_l) + (Y0_l ; Y0_u ; Y0_m) + (Z0_m ; Z0_l ; Z0_u) = (X2_u ; X2_m ; X2_l) + (Y0_l ; Y0_u ; Y0_m) + (Z0_m ; Z0_l ; Z0_u)

and hence

(X1_u ; X1_m ; X1_l) = (X2_u ; X2_m ; X2_l)       (3.17)

(3.17) contradicts x1 != x2, hence no memory conflict occurs when a row is accessed. Conflict-free access to columns and files can be proved in the same way. ■
Theorem 10 No memory conflict occurs when any main subcube of the array A is accessed.

Proof: Suppose there is a memory conflict when the main subcube(x0, y0, z0) is accessed. There should be two different elements a_{x1,y1,z1} and a_{x2,y2,z2} of the main subcube that get mapped to the same memory module. Since both of the elements are in the same main subcube, we have

(X1_u ; X1_m) = (X2_u ; X2_m),  (Y1_u ; Y1_m) = (Y2_u ; Y2_m),  (Z1_u ; Z1_m) = (Z2_u ; Z2_m)       (3.18)

From φ_{x1,y1,z1} = φ_{x2,y2,z2}, we have

(X1_u ; X1_m ; X1_l) + (Y1_l ; Y1_u ; Y1_m) + (Z1_m ; Z1_l ; Z1_u) = (X2_u ; X2_m ; X2_l) + (Y2_l ; Y2_u ; Y2_m) + (Z2_m ; Z2_l ; Z2_u)       (3.19)

From (3.18) and (3.19), we get

(Y1_l ; Z1_l ; X1_l) = (Y2_l ; Z2_l ; X2_l)       (3.20)

The relations (3.18) and (3.20) mean x1 = x2, y1 = y2 and z1 = z2, which contradicts that a_{x1,y1,z1} and a_{x2,y2,z2} are two different elements. Hence, no memory conflict occurs when a main subcube is accessed. ■
Note that, once any main subcube can be accessed in constant time, any non-main subcube can also be accessed in constant time: any non-main subcube can be divided into 8 parts such that each of them belongs to a main subcube.

The memory module address generation for Φ can be performed in constant time using 6k exclusive-or gates. Furthermore, it is easy to show that the permutations needed for the skewing scheme Φ are in BPC, assuming row-major order. Therefore, a parallel memory system with Φ and self-routing Benes networks can provide efficient access from processing elements to memory modules for rows, columns, files and subcubes.
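The scheme amounts to a few exclusive-ors on the k-bit fields of x, y and z, per (3.16). A sketch for k = 1, N = 8 (Python, illustrative), verifying conflict-free rows and main subcubes exhaustively:

```python
def phi(x, y, z, k):
    """(3.16): XOR the k-bit fields (X_u,X_m,X_l) + (Y_l,Y_u,Y_m) + (Z_m,Z_l,Z_u)."""
    mask = (1 << k) - 1
    xu, xm, xl = (x >> 2 * k) & mask, (x >> k) & mask, x & mask
    yu, ym, yl = (y >> 2 * k) & mask, (y >> k) & mask, y & mask
    zu, zm, zl = (z >> 2 * k) & mask, (z >> k) & mask, z & mask
    return ((xu ^ yl ^ zm) << 2 * k) | ((xm ^ yu ^ zl) << k) | (xl ^ ym ^ zu)

k, n = 1, 2
N = n ** 3                      # N = 2^{3k} = 8 modules for an 8x8x8 array
full = set(range(N))
# rows (theorem 9): one element per module
for y0 in range(N):
    for z0 in range(N):
        assert {phi(x, y0, z0, k) for x in range(N)} == full
# main subcubes (theorem 10): the n^3 = N elements hit all N modules
for x0 in range(0, N, n):
    for y0 in range(0, N, n):
        for z0 in range(0, N, n):
            cube = {phi(x0 + p, y0 + q, z0 + r, k)
                    for p in range(n) for q in range(n) for r in range(n)}
            assert cube == full
```

Columns and files check out the same way, since the Y and Z fields are cyclic shifts of the X fields in (3.16).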
Skewing scheme Ψ

Given an N x N x N array A, where N = 2^k, we define two-dimensional planes as follows:

xy-plane(z0) = {a_{x,y,z0} | 0 <= x, y <= 2^k - 1}
yz-plane(x0) = {a_{x0,y,z} | 0 <= y, z <= 2^k - 1}
zx-plane(y0) = {a_{x,y0,z} | 0 <= z, x <= 2^k - 1}

Our goal is to store the array A into N^2 memory modules such that any N x N two-dimensional plane can be accessed without memory conflict. Our skewing scheme
is to store the array element a_{x,y,z} in the memory module ψ_{x,y,z}, where Ψ = (ψ_{x,y,z}) is defined as follows:

ψ_{x,y,z} = (X ; Y) + (Z ; Z)       (3.21)

Now we have the following theorem for two-dimensional plane access.

Theorem 11 No memory conflict occurs when any two-dimensional plane is accessed.

Proof: Suppose there is a conflict when xy-plane(z0) is accessed. There should be two different elements a_{x1,y1,z0} and a_{x2,y2,z0} that get mapped into the same memory module. From ψ_{x1,y1,z0} = ψ_{x2,y2,z0}, we get

(X1 ; Y1) = (X2 ; Y2)       (3.22)

The above relation contradicts x1 != x2 or y1 != y2. Hence, an xy-plane can be accessed without memory conflict. Conflict-free access to the other planes can be proved in a similar way. ■
The memory module address generation for Ψ can be performed in constant time using 2k exclusive-or gates. The permutations needed for the skewing scheme Ψ are in LC, assuming row-major order. Hence a parallel memory system with Ψ and a self-routing Benes network can provide efficient access from processing elements to memory modules for two-dimensional planes.
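In bit terms, (3.21) says the 2k-bit module index is the concatenation of X + Z and Y + Z. A sketch verifying conflict-free access to all three plane families for k = 2, N = 4 (Python, illustrative):

```python
def psi(x, y, z, k):
    """(3.21): 2k-bit module index, high half x xor z, low half y xor z."""
    return ((x ^ z) << k) | (y ^ z)

k, N = 2, 4
full = set(range(N * N))            # N^2 = 16 memory modules
for c in range(N):
    # each N x N plane hits every module exactly once
    assert {psi(x, y, c, k) for x in range(N) for y in range(N)} == full  # xy-plane
    assert {psi(c, y, z, k) for y in range(N) for z in range(N)} == full  # yz-plane
    assert {psi(x, c, z, k) for x in range(N) for z in range(N)} == full  # zx-plane
```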
3.7 Conclusion

We have proposed a new parallel memory system to solve the parallel array access problem efficiently. We have introduced new combinatorial objects called perfect latin squares to be used as the skewing scheme of the memory system. We have also shown simple construction methods for perfect latin squares. The new skewing scheme provides conflict-free access to various subsets of an N x N array using N memory modules. When the number of memory modules is an even power of two, address generation can be performed in constant time using a simple circuit. Hence, this is the first skewing scheme that can provide constant time access to rows, columns, diagonals, and N^{1/2} x N^{1/2} subarrays of an N x N array using the minimum number of memory modules.

We have shown that the permutations between processing elements and memory modules required by the skewing scheme can be realized by self-routing Benes networks. The resulting parallel memory system with perfect latin squares and self-routing Benes networks provides efficient access from processing elements to memory modules for various subsets of an array.

We have also proposed new skewing schemes to provide efficient access to three-dimensional arrays. These skewing schemes provide conflict-free access to various subsets of three-dimensional arrays. The address generation can also be performed in constant time. Combined with self-routing interconnection networks, these schemes provide efficient access to frequently used subsets of three-dimensional arrays.

An interesting question not treated in this dissertation is the existence of perfect latin squares in the general case. The existence of perfect latin squares of order n^2 for all n > 0 is proven in [40].
Chapter 4

An Efficient Mapping of Directed Graph Based Computations onto Hypercube Arrays

In this chapter, we present a simple, efficient way of mapping solutions to problems that can be modeled as directed graphs onto fine grain hypercube arrays. The mapping uses m + e processing elements to map a solution whose underlying directed graph has m nodes and e edges. The data transport problems that arise in the mapping are solved (asymptotically) optimally. One iteration step can be performed in O(log(m + e)) time. The mapping technique can be applied to problems including iterative solutions to sparse linear systems, neural network implementations and logic simulations, resulting in efficient parallel implementations. The mapping method has very small constant factors and is well suited for implementations on fine grain hypercube arrays.
4.1 Introduction

Directed graphs have been used to formulate solutions to many scientific and engineering problems. In these problems, the computations occur in the nodes, using the values of neighbor nodes and the weights of the edges connecting the neighbors. Examples of such problems include neural networks and logic simulations. In these computations, the state of a node is determined by the states of neighboring nodes including itself, by the underlying topology and the weight function associated with the edges, and by the computing function of the node. Other examples are some iterative sparse linear system solvers, where the solution vector is determined by the previous solution vector and the iteration matrix. The non-zero elements of the iteration matrix determine the topology and the weight functions of the directed graph in this case.
An example of a directed graph is shown in Figure 4.1. There are m nodes and e edges (e <= m^2). The value of the k-th node v_k at time i + 1 is determined by the equation

v_k(i+1) = f_k( sum_{j=1}^{m} w_{kj} * v_j(i) )       (4.1)

where f_k represents the computing function of node v_k and w_{kj} represents the weight function of the edge from node v_j to node v_k. The computation of the above equation is called an iteration step. The matrix W = (w_{kj}) is called the weight matrix. For some problems, it is possible that a node does not use the weighted sum of its inputs but uses some other function of the inputs (e.g. the maximum or the minimum of the inputs). The mapping method in this chapter can be easily modified to cover any associative function of the inputs.
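One iteration step of (4.1) over a sparse directed graph, sketched with adjacency lists (Python, illustrative; the function and parameter names are hypothetical and not part of the mapping described later):

```python
def iterate(values, in_edges, f):
    """One step of (4.1). values[k]: state of node v_k at time i;
    in_edges[k]: list of (j, w_kj) pairs for the edges into v_k;
    f[k]: computing function of v_k. Returns the states at time i+1."""
    return [f[k](sum(w * values[j] for j, w in in_edges[k]))
            for k in range(len(values))]

# Two nodes feeding each other: v_0(i+1) = f_0(w_01 * v_1(i)), and so on.
step = iterate([1.0, 0.0], [[(1, 2.0)], [(0, 3.0)]],
               [lambda s: s, lambda s: s])
assert step == [0.0, 3.0]
```

Replacing `sum` by `max` or `min` gives the associative-function variant mentioned above; only the reduction changes, not the structure of the step.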
From equation (4.1), it is clear that the solution can be characterized by the
topology of the underlying graph, by the computing functions and by the weight
functions. In many scientific and engineering problems, the topology is determined
by the underlying problem (e.g., the logic diagram in the case of logic simulations)
and this topology is fixed and known in advance. In many problems, the underlying
graph does not have a regular structure and the number of edges is far less than
m^2. In some cases the underlying graph is planar. The irregularity and sparsity of
the graph offer a challenge to developing parallel techniques which can fully exploit
the structure of the directed graphs without introducing excessive communication
overheads.
There are several possible ways of implementing directed graph based computations.
At one extreme, special purpose hardware can be designed for a specific
problem or for a class of problems [21, 33, 58]. Many researchers have proposed
several ways of building special hardware for neural networks [58]. This approach
Figure 4.1: A directed graph model
has the merit of speed since special purpose hardware can be designed to lead to
high performance. However, special purpose hardware can be expensive and may
lack flexibility (with respect to minor changes in the network and weight functions),
which can be critical in many problems. At the other extreme, serial computer
implementations are very flexible. However, many practical problems formulated by
directed graphs involve enormous numbers of nodes and edges. A neural network can
easily have thousands of neurons and synapses, and a logic simulation can involve
thousands of logic elements and many more connections. Thus, parallel implementations
become a necessity. In this chapter, we consider fast parallel implementations
on general purpose parallel computers which are also flexible.
There have been many efforts to obtain flexible parallel implementations of
solutions to problems that can be formulated as directed graphs [9, 37, 84, 103, 104].
The majority of these works are based on the experience of the researchers with the target
parallel machines and the specific applications. Tomboulian proposed an algorithmic
method which can provide routing for directed graphs on SIMD arrays and its application
to neural networks [100, 101]. Tomboulian's method is based on the concept
of conflict-free space-time labelling, which can be thought of as a graph embedding on
a space-time grid. The problem is very similar to the traditional routing problem in
CAD [61]. One drawback of the method is that the edge traversal (i.e., an update
of the state of nodes) takes O(T) time, where T is empirically proportional to the
average path length times the average degree of the graph. Another drawback of
the method is the low utilization of the space-time grid points. It is well known that
traditional routing methods in CAD suffer from low utilization of grid points [1].
In this chapter, we present an algorithmic method to map general directed graph
based computations onto fine grain hypercube arrays. The mapping method uses
m + e processing elements, where m and e are the number of nodes and edges in the
directed graph respectively. The data transport problems are solved optimally by
preprocessing the directed graph. The preprocessing takes O((m + e) log(m + e)) time
on a serial computer and O(log^4 (m + e)) time on a hypercube with m + e processing
elements. Each iteration step can be performed in O(log(m + e)) running time. The
method is very flexible and can be applied to a variety of problems. We also present
several applications of the mapping technique, which include iterative sparse linear
system solvers, neural network implementations and logic simulations.
In the next section, we present an overview of the mapping. In section 4.3, the
data movement techniques for the mapping are explained in detail with an analysis of
time complexity. In section 4.4, several applications of the mapping are shown. In
section 4.5, the implementation of the mapping technique on other related parallel
machines is discussed. Section 4.6 concludes the chapter.
4.2 An Overview of the Mapping
A hypercube of d dimensions has N = 2^d processing elements whose indices range
from 0 to 2^d - 1. Two processing elements are connected by a communication channel
if and only if the binary representations of their indices differ in exactly one position.
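This neighbor condition can be checked with a power-of-two test on the XOR of the two indices (a small illustrative helper, not from the dissertation):

```python
def hypercube_neighbors(a, b):
    """True iff PEs a and b are connected in a hypercube,
    i.e. their binary indices differ in exactly one position."""
    x = a ^ b
    return x != 0 and (x & (x - 1)) == 0  # x is a power of two
```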
To compute equation (4.1), we should first assign the information associated
with nodes and edges to the processing elements. In our approach, every node
and every edge is assigned to a processing element. Assume that the hypercube is
viewed in two dimensions as shown in Figure 4.2. The initial values of the nodes
and the tables for the computing functions are stored in the first m processing elements,
where m is the number of nodes in the directed graph. The weight functions are
stored in the remaining e processing elements in column major order, where e is the
number of edges in the directed graph. Thus the array has N = m + e processing
elements. The column leaders and the row leaders of the weight matrix {w_{kj}} in the
above mapping are defined as follows: the processing element with the least i index
among processing elements with the same j index is called the leader of the jth column.
The leader of a row is defined similarly. Figure 4.2 shows the initial data mapping
of the graph shown in Figure 4.1. Each column leader is marked with '*'.
After obtaining the initial data mapping of the directed graph, one iteration step,
i.e., the computation of equation (4.1), consists of the following steps:
Figure 4.2: An example of the initial data mapping
Step 1: The value of the node v_i is routed to the leader of the ith column of the
weight matrix W. This step is shown in Figure 4.3 for v_1 and the leader of
the first column of W.
Step 2: Each column leader broadcasts the received value to the members of the
column, where the weight function values w_{ij}(v_j) are computed. This step is
illustrated in Figure 4.4.
Step 3: The weight function values w_{ij}(v_j) are routed to new processing elements
such that the new distribution forms a row major order. The resulting
distribution is shown in Figure 4.5, where each row leader is marked with '*'.
Step 4: The weight function values in each row are summed to provide the arguments
for the computing functions of equation (4.1).
Step 5: The sum S_i of the ith row is routed to the processing element containing
the node v_i, where the computing function f_i is applied to S_i to update v_i.
This step is explained in Figure 4.6 for S_1 and v_1.
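The five steps can be emulated serially to check the mapping's bookkeeping, with sorting standing in for the routing steps (column major for steps 1-2, row major for step 3). The representation below is our own sketch; only the step structure follows the text:

```python
def iteration_by_steps(values, edges, fs):
    """Emulate steps 1-5 of one iteration.  edges is a list of
    (k, j, w) triples: an edge v_j -> v_k with weight function w."""
    # column major order: the layout in which W is initially stored
    col_major = sorted(edges, key=lambda t: (t[1], t[0]))
    # steps 1-2: each column j receives v_j and computes w_kj(v_j)
    products = [(k, j, w(values[j])) for k, j, w in col_major]
    # step 3: route the products into row major order
    row_major = sorted(products, key=lambda t: (t[0], t[1]))
    # step 4: sum the weight function values within each row
    sums = {}
    for k, _, p in row_major:
        sums[k] = sums.get(k, 0) + p
    # step 5: route each row sum back to its node and apply f_k
    return [fs[k](sums.get(k, 0)) for k in range(len(values))]
```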
" l "2 " 3 "4
r
h r h / U
r~
"5
fs
V
A
w6,i
?-
ve
fe
*"5,2
*"l,l
V
A
*"3,3
*"4,1
*"5,3
*"2,4 \
*"1,5
V
*"5,6 \ *"6,6
Figure 4.3: The routing of v_i to the leader of the ith column
In the next section, we will show that the data transport problems can be solved
in O(log N) time with small constant factors by preprocessing the weight matrix.
4.3 Data Transport
All the steps described in the previous section involve data transport among the
processing elements in the array. Each of these data transport problems falls into one
of the following three problems: data routing, broadcast of a value to all processing
elements within a group simultaneously for all groups, and summation of values within
a group simultaneously for all groups. We show optimal solutions to these data
transport problems in this section.
4.3.1 Data Routing
During the iteration step, each value of the nodes should be routed to the leader
of an appropriate column of W in step 1. Then, in step 3, each w_{ij}(v_j) should be
Figure 4.4: v_i is broadcast within the ith column.
Figure 4.5: The distribution after the transformation to row major order
Figure 4.6: The sum of products in the ith row is routed to v_i
routed so that the resulting distribution forms a row major order. Finally, in step 5,
the sum of the ith row should be routed to the processing element containing v_i.
These routing problems can be solved easily using sorting algorithms, which take
O(log^2 N) time for each iteration step. Using a sorting algorithm is a viable choice
if the computation is to be performed only a few times. However, if the computation
is to be performed many times, a more efficient method can be designed which
takes O(log N) running time for each iteration step, by preprocessing the weight
matrix. Notice that the data routing among processing elements can be considered
as a realization of a particular permutation of elements in processing elements. We
solve this problem by having the hypercube simulate a well known interconnection
network.
A Benes network of size 8 is shown in Figure 4.7 in a channel representation
[54]. Horizontal lines are called channels. A vertical line between channels i and j
represents a 2 x 2 switch between the two channels. Notice that, if we associate the
hth channel with the hth processing element in the hypercube, whenever there exists
Figure 4.7: The channel representation of a Benes network of size 8
a switch between channels i and j in the Benes network, processing elements with
indices i and j are connected in the hypercube. Furthermore, switches in a stage
connect pairs of channels whose indices differ by the same number, which is a power
of 2.
A hypercube can simulate a Benes network of the same size in a direct manner.
Whenever channels i and j exchange data through a switch in the Benes network,
processing elements i and j in the hypercube exchange data in their routing registers.
At preprocessing time, a datum to be routed is given a routing tag of 2 log N - 1
bits. If the datum is to be exchanged between two channels in stage i, the ith bit of
the routing tag is set to 1; otherwise it is set to 0. The data routing algorithm uses
communication links in the sequence 2^0, 2^1, ..., 2^{d-2}, 2^{d-1}, 2^{d-2}, ..., 2^1, 2^0, where
d = log N. When the ith link of the above sequence is used, the data with the ith bit
of the routing tag set to 1 are exchanged between nodes connected by the ith link of
the above sequence.
There is a well known routing algorithm for Benes networks with O(N log N) time
complexity on a serial computer, as shown in chapter 1 [74]. There are also several
parallel algorithms for setting up Benes networks [65, 74]. Nassimi and Sahni's
parallel algorithm for setting up a Benes network of size N takes O(S_N log^2 N) time,
procedure ROUTE ( RR );
// Data to be routed are in routing registers with their routing tags //
Begin
  For i=0 to d-1
    Do in parallel in all PEs
    Begin
      If ( 0th bit of routing tag = 1 )
        Then Exchange data using 2^i link with its tag;
      Shift right the routing tag
    End;
  For i=d-2 to 0
    Do in parallel in all PEs
    Begin
      If ( 0th bit of routing tag = 1 )
        Then Exchange data using 2^i link with its tag;
      Shift right the routing tag
    End
End

Figure 4.8: The routing algorithm
where S_N is the time needed to sort on a computer with N processing elements [74].
Since sorting takes O(log^2 N) time, the setup takes O(log^4 N) time on a hypercube of size
N. The complete data routing algorithm is shown in Figure 4.8.
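The tag-driven routing can be prototyped end to end: the classical serial looping (2-coloring) method computes consistent switch settings for a channel Benes whose stages use dimensions 2^0, ..., 2^{d-1}, ..., 2^0, and a ROUTE-style sweep then realizes the permutation. This is our own serial sketch of the standard method, not the dissertation's parallel setup algorithm:

```python
def benes_tags(perm):
    """Routing tags (one bit per stage) realizing perm[src] = dst on a
    Benes network whose stages use dimensions 0, 1, ..., d-1, ..., 1, 0."""
    N = len(perm)
    d = N.bit_length() - 1
    n_stage = 2 * d - 1
    tags = [[0] * n_stage for _ in range(N)]

    def solve(items, b):
        # items: list of (item, src, dst); all agree on bits below b
        if b == d - 1:                      # middle stage: one exchange
            for it, src, dst in items:
                if src != dst:
                    tags[it][b] = 1
            return
        first, last = b, n_stage - 1 - b
        mask = 1 << b
        by_src = {src: it for it, src, _ in items}
        by_dst = {dst: it for it, _, dst in items}
        info = {it: (src, dst) for it, src, dst in items}
        side = {}                           # 2-color the constraint cycles
        for start, _, _ in items:
            if start in side:
                continue
            it, s = start, 0
            while it not in side:
                side[it] = s
                src, dst = info[it]
                mate = by_dst[dst ^ mask]   # output partner: other side
                if mate in side:
                    break
                side[mate] = 1 - s
                it = by_src[info[mate][0] ^ mask]  # its input partner
        sub = {0: [], 1: []}
        for it, src, dst in items:
            s = side[it]
            if ((src >> b) & 1) != s:       # swap at the entry stage
                tags[it][first] = 1
                src ^= mask
            if ((dst >> b) & 1) != s:       # swap at the exit stage
                tags[it][last] = 1
                dst ^= mask
            sub[s].append((it, src, dst))
        solve(sub[0], b + 1)
        solve(sub[1], b + 1)

    solve([(i, i, perm[i]) for i in range(N)], 0)
    return tags

def route(tags):
    """Apply the tag-driven exchanges; returns chan, where chan[c] is
    the item sitting at channel c after the sweep."""
    N = len(tags)
    d = N.bit_length() - 1
    dims = list(range(d)) + list(range(d - 2, -1, -1))
    chan = list(range(N))
    for s, dim in enumerate(dims):
        for c in range(N):
            p = c ^ (1 << dim)
            if c < p and tags[chan[c]][s]:  # consistent tags: both swap
                chan[c], chan[p] = chan[p], chan[c]
    return chan
```

The 2-coloring works because the input-pair and output-pair constraints form even cycles, so partners can always be split between the two inner subnetworks.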
4.3.2 Broadcast
During each iteration step, in step 2, each column leader of the weight matrix W
has to broadcast a value to the members of the column. When the number of
members in each column exactly fits a subhypercube, there is a trivial broadcast
algorithm. When the sizes of the columns do not fit subhypercubes, broadcasting is somewhat
nontrivial. We show a broadcast algorithm with O(log N) time complexity, which is
optimal. For broadcast, the routing register and two data registers in each processing
element are used. At the beginning of the broadcast algorithm, the leader of a
column has the value to be broadcast along with two tags. One of the tags is
the link tag, which gives the farthest link the datum should use in the
broadcast: if a k-dimensional hypercube is the smallest subhypercube which contains
all the members of a column in the given mapping, then the link tag of the column
is k - 1. The other tag is the group tag, which is set to the column index of the
datum in the processing element. The datum to be broadcast and the tags are
placed in registers RR, R1, and R2 of each leader processing element. Each element
of a column knows its group number, which is given to the processing element during
preprocessing.
Communication links are used in the sequence 2^0, 2^1, ..., 2^{d-1} in the broadcast
algorithm. During each iteration of the algorithm, data which have not yet been
broadcast to all of their target members are copied to larger subhypercubes. After
the completion of the algorithm, the data register R2 in each processing element
will have the broadcast value. The time complexity of the broadcast algorithm is
O(log N). The complete broadcast algorithm is shown in Figure 4.9.
An example of the broadcast is shown for a 4-dimensional hypercube in Figure 4.10.
Processing elements marked with '*' are the leaders, and the contents of the
registers are shown after each link usage. Notice that the leader 1* should have a
link tag of 1, since a 2-dimensional hypercube is the smallest subhypercube which
contains all the members 1, 2 and 3. In the same way, the leader 7*, whose members
are 7, 8, 9, 10 and 11, should have a link tag of 3.
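The register behaviour of Figures 4.9 and 4.10 can be replayed in a serial simulation. We read the discard test as "link tag <= i" (a datum is dropped in favour of R1 once its farthest link has been used), which is the reading that reproduces the trace of Figure 4.10; the data-structure choices below are ours:

```python
# each datum is (value, link_tag, group_tag); an absent datum carries
# link_tag = -infinity so it is always discarded in favour of R1
EMPTY = (None, float("-inf"), None)

def broadcast(n, leaders, group_of):
    """Simulate the subhypercube broadcast on n PEs.

    leaders  : dict PE -> (value, link_tag, group_tag)
    group_of : list, group_of[p] = group number of PE p (None if unused)
    """
    d = n.bit_length() - 1
    RR, R1, R2 = [EMPTY] * n, [EMPTY] * n, [None] * n
    for p, datum in leaders.items():
        RR[p] = R1[p] = datum
        R2[p] = datum[0]
    for i in range(d):
        RR = [RR[p ^ (1 << i)] for p in range(n)]  # exchange on 2^i links
        for p in range(n):
            val, link, group = RR[p]
            if (p >> i) & 1 and group is not None and group == group_of[p]:
                R2[p] = val                        # datum reaches a member
            if link <= i:                          # farthest link used up
                RR[p] = R1[p]
            else:                                  # keep spreading the datum
                R1[p] = RR[p]
    return R2

# the example of Figure 4.10: leaders 1, 4, 7, 12 and 15
groups = [None, 1, 1, 1, 4, 4, 4, 7, 7, 7, 7, 7, 12, 12, 12, 15]
leaders = {1: (1, 1, 1), 4: (4, 1, 4), 7: (7, 3, 7),
           12: (12, 1, 12), 15: (15, -1, 15)}
```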
4.3.3 Summation
During each iteration step, in step 4, the w_{ij}(v_j)'s in each row should be summed. The
algorithm for the summation of the w_{ij}(v_j)'s in each row is similar to the broadcast
algorithm and is shown in Figure 4.11. The summation algorithm also uses link tags
and group tags, which are similar to those used in the broadcast algorithm.
procedure BROADCAST ( RR, R1, R2 );
For i=0 to d-1
  Do in parallel in all PEs
  Begin
    Exchange data in RR with its tags using 2^i link;
    If ( ith bit of the index = 1 AND
         group tag of RR = group number )
      Then R2 := RR;
    If ( link tag of RR <= i )
      Then RR := R1
      Else R1 := RR
  End

Figure 4.9: The broadcast algorithm
PE          0   1*  2   3   4*  5   6   7*  8   9   10  11  12* 13  14  15*
initial RR      1           4           7               12          15
        R1      1           4           7               12          15
        R2      1           4           7               12          15
i = 0   RR  1   1           4   4   7   7               12  12      15
        R1  1   1           4   4   7   7               12  12      15
        R2      1           4   4       7               12  12      15
i = 1   RR  1   1           7   7   7   7               12  12      15
        R1  1   1           7   7   7   7               12  12      15
        R2      1   1   1   4   4   4   7               12  12  12  15
i = 2   RR  7   7   7   7   7   7   7   7               12  12      15
        R1  7   7   7   7   7   7   7   7               12  12      15
        R2      1   1   1   4   4   4   7               12  12  12  15
i = 3   RR  7   7   7   7   7   7   7   7               12  12      15
        R1  7   7   7   7   7   7   7   7               12  12      15
        R2      1   1   1   4   4   4   7   7   7   7   7   12  12  12  15

Figure 4.10: An example of broadcast for a 4-dimensional hypercube
Procedure ROWSUM ( RR, R1, R2 );
For i=0 to d-1
  Do in parallel in all PEs
  Begin
    Exchange data in RR with its tags using 2^i link;
    If ( group tag of RR = group number )
      Then
      Begin
        RR := RR + R2;
        R2 := RR;
        If ( group tag of R1 = group number )
          Then R1 := RR
      End;
    If ( ith bit of index = 0 AND link tag of R1 > i )
      Then RR := R1
      Else R1 := RR
  End

Figure 4.11: The summation algorithm
PE      0  1*  2  3  4  5*  6  7*  8  9  10 11 12* 13 14 15*
initial RR 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        R1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        R2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
i = 0   RR 1 1 5 5 4 4 6 6 17 17 21 21 25 25 14 14
        R1 1 1 5 5 4 4 6 6 17 17 21 21 25 25 14 14
        R2 1 5 5 4 5 6 7 17 17 21 21 25 25 14 15
i = 1   RR 1 6 6 6 4 4 4 4 38 38 38 38 39 39 39 25
        R1 1 6 6 6 4 4 4 4 38 38 38 38 39 39 39 25
        R2 6 6 6 4 11 6 7 38 38 38 38 39 39 39 15
i = 2   RR 4 10 10 10 5 6 6 6 38 38 38 38 38 38 38 38
        R1 4 10 10 10 5 6 6 6 38 38 38 38 38 38 38 38
        R2 10 10 10 5 11 6 7 38 38 38 38 39 39 39 15
i = 3   RR 38 38 38 38 38 38 38 45 4 10 10 10 5 6 6 6
        R1 38 38 38 38 38 38 38 45 4 10 10 10 5 6 6 6
        R2 10 10 10 5 11 6 45 38 38 38 38 39 39 39 15

Figure 4.12: An example of row summation for a 4-dimensional hypercube
At the beginning of the algorithm, each processing element has the value to be
summed along with the link tag and the group tag in registers RR, R1 and R2. If
not all the values of a row are completely summed after using the 2^i link, the local sums
of the values in (i + 1)-dimensional subhypercubes which do not contain the leader
are copied to larger subhypercubes. At the end of the algorithm, each leader has the
sum of its row in register R2. An example of the summation of rows is shown in
Figure 4.12, where the leaders are marked with '*' and the value to be summed in
each processing element is the same as the index of the processing element. It should
be noted that the summation algorithm can be easily modified to implement any
associative function.
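Sums over contiguous groups can also be recovered from an all-prefix-sums pass, which uses the same 2^0, ..., 2^{d-1} link sequence; the sketch below shows the standard hypercube scan as a functional stand-in for illustration, not the register scheme of Figure 4.11:

```python
def hypercube_prefix_sums(values):
    """Inclusive prefix sums on a hypercube in one sweep over links
    2^0, ..., 2^(d-1).  Each PE keeps the running prefix of its own
    index and the total of its current subcube."""
    n = len(values)
    d = n.bit_length() - 1
    prefix = list(values)
    total = list(values)
    for i in range(d):
        received = [total[p ^ (1 << i)] for p in range(n)]  # exchange totals
        for p in range(n):
            if (p >> i) & 1:          # upper half: the lower subcube precedes it
                prefix[p] += received[p]
            total[p] += received[p]   # subcube total doubles in size
    return prefix

# the sum of a contiguous group occupying PEs s..e is prefix[e] - prefix[s-1]
pre = hypercube_prefix_sums(list(range(16)))
```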
4.4 Applications
In this section we show some applications of the above mapping technique. The
applications include iterative sparse linear system solvers, neural network implementations
and logic simulations. All these applications can be described by the
same graph model discussed earlier. They differ only in the computing functions
and the weight functions.
4.4.1 An Efficient Iterative Sparse Linear System Solver
A system of linear equations can be written in matrix form as:

Ax = b    (4.2)

where A represents the invertible coefficient matrix of order m x m and x is the
solution vector of unknowns; x and b are both vectors of order m. The problem
of solving these equations is one of determining a solution vector x for which the
above equation holds. A sparse matrix is one that has enough zero entries to justify
a special method of solution. Systems of linear equations with sparse coefficient
matrices can be solved using direct methods, like Gaussian elimination. These
methods, however, have two main drawbacks [103]:

• Direct methods can introduce fill-ins, the replacement of zero entries by non-zero
entries, and so are not very efficient for parallel processing implementations
where one wants to map only the non-zero entries to the individual
processing elements.

• Direct methods are impractical for large sparse linear systems since they need
very large storage.

We will show a parallel implementation of Jacobi's method which exploits the
sparsity of matrices by not introducing extra fill-ins [79]. Assuming that the diagonal
elements of A are all non-zero, equation (4.2) can be transformed to the equivalent
linear one-point matrix iteration

x^{i+1} = D^{-1} B x^i + D^{-1} b = P x^i + q    (4.3)

where D is the diagonal matrix made with the diagonal elements of A and B = D - A.
The multiplication of P and x^i constitutes the matrix-vector multiplication step.
The kth component of x^{i+1} can be represented as

x_k^{i+1} = \sum_{j=1}^{m} p_{kj} x_j^i + q_k    (4.4)

where {p_{kj}} represent the elements of P. Equation (4.3), therefore, requires the
calculation of m such inner products with a total number of multiplications equal
to the number of non-zero entries in P.
Equation (4.4) can be obtained from equation (4.1) by substituting w_{ij}(v_j) with
p_{ij} v_j and by substituting f_k with a function that adds q_k to its argument. Hence,
solving a system of linear equations based on the matrix-vector multiplication
iteration can be easily implemented using the directed graph mapping technique.
Notice that our method can be applied to any iterative linear system solver based
on vector-matrix multiplication. The complete algorithm for the iterative sparse
linear system solver has a preprocessing part and an iteration part.
The preprocessing part is as follows:
1. Compute the components of q and store them in the first m processing elements
along with the initial components of x^0.
2. Compute the mapping of the non-zero entries of P onto the hypercube and
store them in column major order in the remaining processing elements.
3. Identify the leaders of each column and each row. Compute link tags, group
tags and routing tags and store them in the appropriate processing elements.
The iteration part is as follows:
1. Route the elements of x to the leaders of the appropriate columns using the
procedure ROUTE.
2. Broadcast the received values of the elements of x within each column using the
procedure BROADCAST. Compute the products of x_j and p_{kj}.
3. Convert the distribution of the products in the previous step to row major
order using the procedure ROUTE.
4. Perform summation over the product terms in each row using the procedure
ROWSUM.
5. Route the sum of each row to the appropriate element of x using the procedure
ROUTE. Update the elements of x.
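Serially, the Jacobi iteration of equations (4.3)-(4.4) with only the non-zero entries of P stored looks as follows (an illustrative sketch; the sparse triple format is our own choice):

```python
def jacobi(A, b, iters=50):
    """Jacobi's method x_{i+1} = P x_i + q, storing only the non-zero
    off-diagonal entries p_kj = -a_kj / a_kk (no fill-in is created)."""
    m = len(b)
    q = [b[k] / A[k][k] for k in range(m)]
    P = [(k, j, -A[k][j] / A[k][k])
         for k in range(m) for j in range(m)
         if j != k and A[k][j] != 0]
    x = [0.0] * m
    for _ in range(iters):
        nxt = list(q)
        for k, j, p in P:            # sparse matrix-vector product
            nxt[k] += p * x[j]
        x = nxt
    return x

# a small diagonally dominant system whose solution is (1, 2, 3)
A = [[4.0, 1.0, 0.0],
     [1.0, 5.0, 1.0],
     [0.0, 1.0, 3.0]]
b = [6.0, 14.0, 11.0]
```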
4.4.2 Neural Network Implementations
In recent years, artificial neural networks have become the subject of very dynamic
and extensive research. One very important issue for the progress of neural
networks is their implementation. In this subsection, we discuss an efficient parallel
implementation using our mapping technique.
We use the following general model for neural networks. A neural network consists
of interconnected simple neurons. The model of a neuron assumes a single output
and multiple inputs. The input signals are multiplied by weights, in general different
for different inputs, and added. The output is produced by applying some function
f, called the activation function, to the weighted sum; see Figure 4.13(a). The weights
of the input signals can be modified in a process referred to as learning. The
modifications are done according to a learning algorithm. An example of a neural
network with four neurons is shown in Figure 4.13(b). The signals entering the
network undergo computations in the neurons and are transferred through the
network via the interconnections, appearing at the output neurons after some time
delay. Such a forward pass of data, which does not involve changes of weights, is
Figure 4.13: (a) A model of a neuron and (b) A neural network
referred to as a recall operation. The learning may be executed during a forward
pass, by means of additional operations in the neurons, which determine the new
values of the weights, or it may require a separate pass of data in the direction
opposite to the forward pass (e.g., the back-propagation model), or even along a different path.
The update step of a neural network can be described as

x_k^{i+1} = f ( \sum_{j=1}^{m} w_{kj} x_j^i )    (4.5)

where x_k^i is the value stored in the kth neuron at the beginning of the ith iteration
and w_{kj} is the weight of the input to the kth neuron from the jth neuron. The recall and
learning process of a neural network can be viewed as a sequence of steps, each
of which consists of data transfer operations and computations.
Comparing equation (4.5) with equation (4.1), it is obvious that we can directly
use our mapping technique for parallel neural network implementations. The complete
algorithm for the neural network implementation is very similar to that of the
iterative sparse linear system solver.
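A recall step of equation (4.5) in the same sparse form can be sketched as follows (our illustration, with a simple threshold activation standing in for f):

```python
def recall_step(x, weights, f):
    """One update x_k <- f(sum_j w_kj x_j) over the non-zero synapses.
    weights: list of (k, j, w_kj) triples."""
    m = len(x)
    sums = [0.0] * m
    for k, j, w in weights:
        sums[k] += w * x[j]
    return [f(s) for s in sums]

threshold = lambda s: 1.0 if s >= 0.5 else 0.0
# neuron 2 fires only when both of its inputs fire (weights 0.3 each)
w = [(2, 0, 0.3), (2, 1, 0.3)]
```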
Figure 4.14: A simple logic diagram
4.4.3 Logic Simulations
A simple logic diagram with two AND gates and one OR gate is shown in Figure 4.14.
The purpose of a logic simulation is to compute the sequence of output vectors given
an initial state of the circuit and a sequence of input vectors. The output vector
refers to the vector formed by the binary output values of the logic elements. Our logic
circuits are based upon a unit delay model with the following assumptions [15]:
1. A logic element (gate, flip-flop, etc.) can have several inputs but only one
output.
2. Every logic element has the same unit delay Δt.
3. There is no delay associated with wires.
In the case of logic simulation, we do not have intrinsic weight functions. Every
input to a logic element is an output of another logic element. However, the inputs
to a logic element need to be distinguished from each other. We distinguish the
inputs by giving each input an exclusive weight which is a power of two. For example,
when a logic element node v_i has 3 input edges e_{ik}, e_{il} and e_{im}, e_{ik} is given the weight
1, while e_{il} is given the weight 2 and e_{im} is given the weight 4. Figure 4.15 shows
the directed graph representation of the circuit shown in Figure 4.14. The numbers
above each edge represent the weights. Another salient feature of logic simulations is
the need for the states of logic elements. A sequential logic element needs the state
Figure 4.15: A directed graph for a logic circuit
as an input to the output function. Therefore, each sequential logic element should
compute the state function besides the output function. Since the value of the state
is used in computing the output and the next state, we should give a weight to each
state. When the logic element has k inputs, its state is given the weight 2^{k+1}.
From the above discussion, the output v_k and the state s_k of a logic element
can be expressed as:

v_k^{i+1} = f_k ( \sum_{j=1}^{m} w_{kj} v_j^i + w_{ks} s_k^i )    (4.6)

s_k^{i+1} = g_k ( \sum_{j=1}^{m} w_{kj} v_j^i + w_{ks} s_k^i )    (4.7)

where g_k is the state function of the element and w_{ks} is the weight given to its state.
Our technique can be used to compute equations (4.6) and (4.7) with a slight
modification, since each node should compute the state function besides the output
function. The complete algorithm for logic simulations is very similar to that of the
iterative sparse linear system solver and will not be discussed in detail.
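The power-of-two weights make the weighted sum at a gate an integer that encodes the input vector, so each computing function can be a truth-table lookup indexed by that sum. A minimal unit-delay sketch (the gate tables and our reading of the circuit of Figure 4.14 are assumptions):

```python
def gate_table(op, k):
    """Truth table for a k-input gate, indexed by the weighted sum of
    its inputs (input l carries weight 2^l)."""
    return [op([(s >> l) & 1 for l in range(k)]) for s in range(1 << k)]

AND2 = gate_table(lambda bits: int(all(bits)), 2)
OR2 = gate_table(lambda bits: int(any(bits)), 2)

def step(outputs, gates):
    """One unit-delay step.  gates maps a node to (table, [(src, weight), ...]);
    outputs holds the current output value of every node."""
    new = list(outputs)
    for g, (table, ins) in gates.items():
        s = sum(w * outputs[src] for src, w in ins)  # encodes the input vector
        new[g] = table[s]
    return new

# nodes 0-3: primary inputs; node 4 = AND(0,1); 5 = AND(2,3); 6 = OR(4,5)
gates = {4: (AND2, [(0, 1), (1, 2)]),
         5: (AND2, [(2, 1), (3, 2)]),
         6: (OR2, [(4, 1), (5, 2)])}
```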
Several variations of the above mapping are possible. For example, if the logic
elements have different delays which are multiples of a unit delay Δt, we can associate
a FIFO delay queue with the output of each logic element. The length of the delay
queue is proportional to the delay of the logic element. After a node computes the
state function and the output function, the output value is placed in the delay queue.
The entry at the output end of the queue is used as the active output value of the
logic element.
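The FIFO delay queue variation can be sketched with a fixed-length deque per element (our illustration): the entry read from the output end of the queue is the element's active output.

```python
from collections import deque

class DelayedOutput:
    """A logic element whose computed output appears after `delay` steps."""
    def __init__(self, delay, initial=0):
        # the queue holds the outputs computed in the last `delay` steps
        self.queue = deque([initial] * delay, maxlen=delay)

    def push(self, new_value):
        """Advance one unit step; return the now-active output value."""
        active = self.queue[0]
        self.queue.append(new_value)   # maxlen drops the oldest entry
        return active
```

With delay 2, a value pushed at step t becomes the active output at step t + 2.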
4.5 Implementations on Other Parallel Machines
The communication links are used in two sequences by the data transport algorithms
in this chapter: 2^0, 2^1, ..., 2^{d-2}, 2^{d-1}, 2^{d-2}, ..., 2^1, 2^0 and 2^0, 2^1, ..., 2^{d-1}.
This makes it possible to implement the algorithms on other parallel machines in a
straightforward manner.
A hypercube with N processing elements can be considered as a butterfly network
with the (log N + 1) processing elements in each column collapsed into a single processing
element. Thus one data exchange step in our data transport algorithms can be
simulated by a data exchange step between two adjacent processing elements in a
column followed by a data exchange step between two processing elements in two
adjacent ranks forming a butterfly network. The number of processing elements
used in the butterfly network is (m + e)(log(m + e) + 1) and the time complexity
remains the same. The same simulation method can be applied to a cube-connected-cycles
network having (m + e) log(m + e) processing elements with the same time
complexity.
The algorithms used in the butterfly network are normalized algorithms and can
be simulated by a shuffle-exchange network with (m + e) processing elements with
the same time complexity [105].
In the next chapter, we will show that the same mapping method can be used in
star graphs, leading to efficient parallel solutions to directed graph oriented problems.
4.6 Conclusion
We have proposed a simple and efficient way of mapping solutions to problems that
can be modeled as directed graphs onto fine grain hypercube arrays. The mapping
uses (m + e) processing elements to map a solution to a problem whose underlying
directed graph has m nodes and e edges. We have shown optimal solutions to the
data transport problems that arise in the mapping. Preprocessing of the underlying
directed graph is performed to obtain the required data transfer information. The
preprocessing can be done in O((m + e) log(m + e)) time on a serial computer and in
O(log^4 (m + e)) time on a hypercube with (m + e) processing elements. One iteration
step can be performed in O(log(m + e)) running time, while a direct implementation
using sort based approaches takes O(log^2 (m + e)) time.
We have shown several applications of the mapping technique, which include
iterative sparse linear system solvers, parallel implementations of neural networks
and parallel logic simulations. All these applications utilize the same data transport
techniques and differ only in realizing the computing functions and weight functions.
The mapping method has very small constant factors and is well suited for
implementation on fine grain hypercube arrays.
Chapter 5
An Efficient Mapping of Directed Graph Based
Computations onto Star Graphs
In this chapter, we present an efficient implementation of directed graph based
computations on star graphs. The same mapping method used in chapter 4 is used to
map directed graphs onto a star graph with (m + e) processing elements, where m is
the number of nodes and e is the number of edges. Each iteration of the computation
can be done in O(n^2) time for a star graph with n! = m + e nodes. To solve the data
transport problems arising in the mapping, new algorithms for star graphs are developed
for routing, simultaneous broadcasting and simultaneous summation. These
algorithms are based on special multi-dimensional grids which can be emulated by
star graphs without penalty in time complexity. In particular, the well known routing
method for three-stage Clos networks and Benes networks is modified to provide
efficient routing in star graphs. As shown in chapter 4, the implementation can be
easily modified to solve many directed graph based computations.
5.1 Introduction
Star graphs were introduced by Akers and Krishnamurthy as an alternative to the
well known hypercubes [2]. Being a special case of Cayley graphs, star graphs are very
rich in symmetry and have very useful hierarchical structures. Furthermore, star
graphs have significant advantages over hypercubes in degree per node, diameter,
average diameter, and fault-tolerance, as shown in Table 5.1 [2, 3].
In spite of their superiority over hypercubes in many graph theoretic properties,
only a few results on star graphs have been reported [4, 5, 69, 76, 80]. In particular, parallel
algorithms for star graphs have been developed only for basic problems including
routing and sorting [3, 4, 69, 80]. Palis et al. developed an optimal randomized
routing algorithm on star graphs [80]. Their algorithm can route data in O(D)
steps in the worst case with high probability, where D is the diameter of the star
graph. Annexstein and Baumslag developed a deterministic routing algorithm for
star graphs [4]. Their algorithm runs in O(n) time on star graphs with n! processing
elements with powerful communication capability. In [4], each processing element
can route n messages through all its n ports in one unit of time. If a processing
element can route only a constant number of messages at a time, their algorithm
results in O(n^2) time complexity. Using a modified Shearsort, Menn and Somani
succeeded in developing a sorting algorithm on star graphs which is comparable in
performance to the best known sorting algorithm for hypercubes [69]. They also
showed the possibility of emulating multi-dimensional grids with star graphs.
In this chapter, we present an efficient mapping of directed graph oriented
computations onto star graphs. As in chapter 4, a directed graph is mapped onto a
star graph with N = m + e processing elements, where m is the number of nodes
and e is the number of edges. To solve the data transport problems arising in the
mapping, new algorithms for star graphs are developed for routing, simultaneous
broadcasting and simultaneous summation. These algorithms are based on special
multi-dimensional grids which can be emulated by star graphs without penalty in
time complexity. In particular, the well known routing method for three-stage Clos networks
and Benes networks is modified to provide efficient routing in star graphs.
Each iteration of the computation can be done in O(n^2) time for a star graph with
n! = N = m + e nodes. Preprocessing takes O(N^2 n) time. As shown in chapter 4,
this mapping has various practical applications including sparse linear system solvers
and neural networks.
          # of Nodes   Degree   Diameter   Average Distance   Fault Diameter
5-star        120         4        6             3.7              <= 9
7-cube        128         7        7             3.5                 8
7-star       5040         6        9             5.9              <= 12
12-cube      4096        12       12             6                  13
9-star     362880         8       12             8.1              <= 15
18-cube    262144        18       18             9                  19

Table 5.1: A comparison of star graphs and n-cubes
The rest of the chapter is organized as follows: Section 5.2 presents a brief description of star graphs. Section 5.3 shows how star graphs can emulate special multi-dimensional grids without penalty in time complexity. Section 5.4 gives an overview of the algorithm, followed by details of data transport in section 5.5. Section 5.6 concludes the chapter.
5.2 Star Graphs

This section briefly introduces star graphs. More details on star graphs can be found in [2, 3].
Definition 5 An n-star graph has n! nodes. The nodes are labelled by the n! permutations on n different symbols. Two nodes u and v in a star graph are connected to each other if and only if the permutations of u and v differ in exactly two positions, including the first position. A star graph with n! nodes is denoted by S_n.
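Definition 5 translates directly into code. The following sketch (an illustration, not part of the dissertation) enumerates a node's neighbors and checks the adjacency condition:

```python
from itertools import permutations

def star_neighbors(u):
    """Neighbors of node u in the n-star: swap the first symbol
    with the symbol in position i, for each i = 2, ..., n."""
    u = list(u)
    result = []
    for i in range(1, len(u)):
        v = u[:]
        v[0], v[i] = v[i], v[0]
        result.append(tuple(v))
    return result

def is_star_edge(u, v):
    """Adjacent iff the permutations differ in exactly two positions,
    one of them being the first."""
    diff = [i for i in range(len(u)) if u[i] != v[i]]
    return len(diff) == 2 and 0 in diff

# A 4-star has 4! = 24 nodes, each of degree n - 1 = 3.
nodes = list(permutations((1, 2, 3, 4)))
assert len(nodes) == 24
assert all(len(star_neighbors(u)) == 3 for u in nodes)
assert all(is_star_edge(u, v) for u in nodes for v in star_neighbors(u))
```

Every swap involves the first position, which is why the degree of an n-star is n - 1 rather than n.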
Figure 5.1 and Figure 5.2 show examples of S_3 and S_4, respectively. For simplicity, we will use n-star to denote an n-star graph. In the rest of this chapter, we use the numbers 1 to n to denote the n different symbols, and N represents n!.
A graph is vertex symmetric if, for any two nodes in the graph, there exists an automorphism of the graph that maps one node into the other. Edge symmetry is defined similarly. Since star graphs are a special case of Cayley graphs, star graphs
Figure 5.1: A 3-star
Figure 5.2: A 4-star
are vertex symmetric. Furthermore, star graphs are also edge symmetric [2]. The following lemma summarizes the symmetry of star graphs.

Lemma 7 Star graphs are both vertex and edge symmetric. [3]
Besides symmetry, star graphs also have a very useful hierarchical structure. Let r_i, 1 ≤ r, i ≤ n, denote the induced subgraph of a star graph consisting of all the permutations that contain the symbol r in the i-th position. For example, when n = 4, 4_4 is the subgraph of S_4 whose nodes are all the permutations that have the symbol 4 in the fourth position, i.e., {1234, 2134, 1324, 3124, 2314, 3214}. Note that 4_4 is a star graph by itself, as shown in Figure 5.2. In general, for 2 ≤ i ≤ n, the subgraph r_i is isomorphic to S_{n-1}. Since any of the n symbols can be placed in the i-th position, we can partition an n-star into n (n-1)-stars, i.e., 1_i, 2_i, ..., n_i. Since the n-1 positions other than the first can be used for fixing symbols, there are n-1 ways to partition an n-star, as summarized in the following lemma.

Lemma 8 There are n-1 ways to partition an n-star into n (n-1)-stars. [3]
We can also partition a star graph into 1_1, 2_1, ..., n_1, i.e., by fixing symbols in the first position. These subgraphs are not isomorphic to S_{n-1}: it is easy to see that two nodes in r_1 are not connected to each other by an edge of the star graph. The following lemma shows an interesting property relating r_1 and r_i, 2 ≤ i ≤ n.

Lemma 9 In an n-star, each node in r_i, 2 ≤ i ≤ n, is connected to exactly one node in r_1. [3]
5.3 Indexing Scheme and Mapping a Grid onto a Star Graph
To perform various algorithms based on multi-dimensional grids, we need an indexing scheme and a mapping of star graphs onto multi-dimensional grids. In this section, we briefly describe an indexing scheme and a mapping of a grid onto a star graph used by Menn and Somani [69]. Interested readers can refer to [69] for more details.

Menn and Somani defined a row major indexing as follows.
 1 4321    2 3421    3 4231    4 2431    5 3241    6 2341
 7 4312    8 3412    9 4132   10 1432   11 3142   12 1342
13 4213   14 2413   15 4123   16 1423   17 2143   18 1243
19 3214   20 2314   21 3124   22 1324   23 2134   24 1234

Figure 5.3: Row Major Indexing Scheme for a 4-star
Definition 6 Let f : V → [0, |V|-1] be a bijection between a vertex set V and the range of integers [0, |V|-1]. Assume that V is partitioned into R sets of equal size. These sets are called rows, and they are numbered from 1 to R. The map f is a row major indexing scheme if, for all r, 1 ≤ r ≤ R-1, a node u in row r and a node v in row r+1 imply f(u) < f(v). [69]
Let an n-star S_n be partitioned into n (n-1)-stars: 1_n, 2_n, ..., n_n. The substars are of the same size, and we can consider each of them as a row. If we map the nodes of substar r_n, 1 ≤ r ≤ n, to the integers [(r-1)(n-1)!, r(n-1)!-1], this indexing scheme is a row major indexing scheme by definition 6. We can renumber the symbols 1, ..., r-1, r+1, ..., n of r_n to 1, 2, ..., n-1 and apply the same indexing scheme. By applying the indexing scheme recursively to all substars, we have a completely defined row major indexing scheme. Menn and Somani called this scheme the Row Major Indexing Scheme (RMIS) [69]. Figure 5.3 shows the RMIS for a 4-star. It is noteworthy that obtaining the permutation from an index is as easy as obtaining the index from a permutation.
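Both directions of the index/permutation correspondence amount to arithmetic in the factorial number system. The following sketch (illustrative; 0-based indices, whereas Figure 5.3 numbers the nodes from 1) fixes the symbol in the last position, renumbers the remaining symbols, and recurses:

```python
from math import factorial

def rmis_rank(perm):
    """0-based RMIS index of a permutation: the symbol in the last
    position selects the substar (row); renumber the remaining symbols
    and recurse on the shorter permutation."""
    perm, index = list(perm), 0
    symbols = sorted(perm)                   # current symbol alphabet
    for pos in range(len(perm) - 1, 0, -1):  # fix positions n, n-1, ..., 2
        r = symbols.index(perm[pos])         # rank of the fixed symbol
        index += r * factorial(pos)          # skip the first r substars
        symbols.pop(r)
        perm.pop(pos)
    return index

def rmis_unrank(index, n):
    """Inverse map: recover the permutation from its 0-based index."""
    symbols = list(range(1, n + 1))
    perm = [0] * n
    for pos in range(n - 1, 0, -1):
        r, index = divmod(index, factorial(pos))
        perm[pos] = symbols.pop(r)
    perm[0] = symbols.pop()
    return tuple(perm)

# Matches Figure 5.3 (which is 1-based): 4321 -> index 1, 1234 -> index 24.
assert rmis_rank((4, 3, 2, 1)) == 0 and rmis_rank((1, 2, 3, 4)) == 23
assert all(rmis_unrank(rmis_rank(rmis_unrank(i, 4)), 4) == rmis_unrank(i, 4)
           for i in range(24))
```

Each conversion touches every position once, so both directions cost O(n) arithmetic operations per node.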
The following definitions show two mappings of an n × (n-1)! grid onto an n-star.
Definition 7 RM (Row Mode) maps an n × (n-1)! grid onto an n-star. The rows of the grid are the n independent (n-1)-stars: 1_n, 2_n, ..., n_n. Two nodes in a pair of consecutive rows are in the same column if and only if their permutations differ exclusively in the last and any (one) other position. [69]
Definition 8 CM (Column Mode) also maps an n × (n-1)! grid onto an n-star. The rows of the grid are the n independent sets: 1_1, 2_1, ..., n_1. Two nodes in a pair of consecutive rows are in the same column if and only if their permutations differ exclusively in the first and any (one) other position. By the definition of the star graph, two such nodes are connected. Therefore, the columns in CM are connected as linear arrays. [69]
The mappings RM and CM are not unique, since if we apply any column permutation to an RM (CM), the resulting mapping is also an RM (CM). As shown in lemma 9, each node in r_n is connected to exactly one node in r_1. Thus each node in the r-th row of RM is connected to exactly one node in the r-th row of CM. Therefore, transformation between any CM and any RM can be done in a single step. In other words, for all x and y, a datum in the processing element at the x-th row and y-th column in RM can be sent to the processing element at the x-th row and y-th column in CM in one parallel step. Suppose we map node u of S_n to the grid point whose row major index is RMIS(u). The resulting mapping is shown in Figure 5.4 for a 4-star. It is easy to see that the mapping satisfies the definition of RM. Throughout this chapter, this mapping will be called RM for simplicity. If we exchange the first symbol and the last symbol of each permutation in RM, we obtain another mapping, as shown in Figure 5.5. It is easy to see that this mapping satisfies the definition of CM and that each column of the mapping is connected as a linear array. As with RM, this mapping will be called CM for simplicity. Readers can check that a node mapped to (x, y), the x-th row and y-th column, in RM is connected to the node mapped to (x, y) in CM.
Since the transformation between RM and CM is possible in one parallel step and the columns of CM are connected as linear arrays, RM can easily emulate (n-1)! linear arrays of size n as follows.

1. Transform RM into CM.

2. Perform communications using the linear arrays of size n in CM.

3. Transform CM back into RM.
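The RM-to-CM transformation in steps 1 and 3 can be modelled at the node level. In this small sketch (illustrative only), each node forwards its datum along the star-graph edge that exchanges the symbols in the first and last positions of its permutation; since that exchange is its own inverse, the same step implements both directions:

```python
from itertools import permutations

def rm_cm_step(data):
    """One parallel communication step between RM and CM: each node
    forwards its datum along the star-graph edge that swaps the symbols
    in the first and last positions of its permutation."""
    out = {}
    for node, value in data.items():
        v = list(node)
        v[0], v[-1] = v[-1], v[0]       # a legal star-graph edge
        out[tuple(v)] = value
    return out

# The step is an involution: applying it twice restores the distribution.
data = {p: i for i, p in enumerate(permutations((1, 2, 3, 4)))}
assert rm_cm_step(rm_cm_step(data)) == data
```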
4321 3421 4231 2431 3241 2341
4312 3412 4132 1432 3142 1342
4213 2413 4123 1423 2143 1243
3214 2314 3124 1324 2134 1234

Figure 5.4: Row Mode for a 4-star
1324 1423 1234 1432 1243 1342
2314 2413 2134 2431 2143 2341
3214 3412 3124 3421 3142 3241
4213 4312 4123 4321 4132 4231

Figure 5.5: Column Mode for a 4-star
The only difference between real linear arrays and the columns of RM is that we need two extra steps for the transformation between RM and CM. Since each row of RM is a star graph of smaller size, we can consider each row of RM as (n-2)! linear arrays of size n-1. By applying this idea recursively, we can consider RM as an n-dimensional grid of size n × (n-1) × (n-2) × ... × 1. The differences between RM and a real n-dimensional grid are that we cannot use two different kinds of links at the same time, e.g., some links in the 4-th dimension and some links in the 5-th dimension simultaneously, and that communication in each dimension takes two extra steps for the transformations.
X 0 0 0 X 0 0 0
0 0 0 X 0 0 X 0
0 0 X 0 0 0 0 0
X 0 0 0 0 0 0 0
0 X X 0 0 X 0 X
X 0 0 0 0 X 0 X
X 0 0 0 X 0 0 0
0 0 X 0 0 0 0 0

x1,q1  x2,q2  x3,q3  x4,q4  x5,q5  x6,q6
x7,q7  x8,q8  (1,1)  (4,1)  (6,1)  (7,1)
(5,2)  (3,3)  (5,3)  (8,3)  (2,4)  (1,5)
(7,5)  (5,6)  (6,6)  (2,7)  (5,8)  (6,8)

Figure 5.6: A sparse iteration matrix and the initial data mapping
5.4 An Overview of the Algorithm

In this section, we briefly review the mapping algorithm shown in chapter 4 for convenience. For simplicity, we will explain the algorithm for iterative linear system solvers. The other problems mentioned in chapter 4 can be solved on star graphs in similar ways.

The elements of the initial solution vector x^0 are placed in the first m processing elements. The elements of the right hand side vector q are also placed in the first m processing elements. The non-zero entries of the sparse iteration matrix are placed in the remaining processing elements in column major order. Thus the array has N = n! = m + e processing elements. The processing element with the least index within a column is called the leader of the column. An 8 × 8 sparse iteration matrix and its initial data mapping are shown in Figure 5.6, where (i, j) represents p_ij.
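The column-major placement can be sketched as follows (an illustration, not the dissertation's code; processing elements are numbered from 0 here):

```python
def initial_mapping(m, nonzeros):
    """Place x_k and q_k in PE k-1 (0-based) and the nonzero positions
    of P in the remaining PEs in column major order; the first PE of
    each column is its leader."""
    layout = {k - 1: ("x,q", k) for k in range(1, m + 1)}
    leaders, pe = {}, m
    for col in range(1, m + 1):
        for row in sorted(r for r, c in nonzeros if c == col):
            leaders.setdefault(col, pe)      # least index in the column
            layout[pe] = ("P", (row, col))
            pe += 1
    return layout, leaders

# The 8 x 8 example of Figure 5.6: m = 8 nodes, e = 16 nonzeros,
# so N = m + e = 24 = 4! processing elements (a 4-star).
nz = [(1, 1), (4, 1), (6, 1), (7, 1), (5, 2), (3, 3), (5, 3), (8, 3),
      (2, 4), (1, 5), (7, 5), (5, 6), (6, 6), (2, 7), (5, 8), (6, 8)]
layout, leaders = initial_mapping(8, nz)
assert len(layout) == 24 and leaders[1] == 8   # PE 8 holds p_11
```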
In each iteration step, the algorithm computes equation (4.4), which is shown below for convenience.

x_k^(t+1) = sum_{j=1}^{m} p_kj x_j^(t) + q_k        (5.1)
The computation of equation (5.1) consists of the following steps.

Step 1: Elements of the solution vector x are routed to the leaders of the appropriate columns of the iteration matrix P. This step is shown in Figure 5.7 for x_1 and the leader of the first column of P.

Step 2: Each column leader broadcasts the received value to the elements of the column. This step is illustrated in Figure 5.8.

Step 3: The products of p_kj and x_j are routed to processing elements such that the new distribution of the products forms a row major order. This step is shown in Figure 5.9, where p_kj and x_j in a processing element represent the product of the two items.

Step 4: The products in a row are summed to complete the inner product of equation (5.1).

Step 5: The sum s_k of row k is routed to the processing element containing x_k and q_k, where the sum is added to q_k to update x_k. This step is illustrated in Figure 5.10 for s_1 and x_1.
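A sequential reference for what one parallel iteration computes (a sketch for checking correctness, not the parallel algorithm itself) stores P sparsely as a dictionary of nonzeros, mirroring the storage on the star graph:

```python
def iterate(P, q, x):
    """Sequential reference for equation (5.1):
    x_k(t+1) = sum_j p_kj * x_j(t) + q_k, with P given as a dict
    mapping (k, j) -> p_kj for the nonzero entries only."""
    x_new = list(q)                      # q_k contributes once per row
    for (k, j), p in P.items():
        x_new[k - 1] += p * x[j - 1]     # steps 3-4: accumulate p_kj * x_j
    return x_new

# One sweep on a tiny 2 x 2 system.
P = {(1, 2): 0.5, (2, 1): 0.25}
x = iterate(P, q=[1.0, 2.0], x=[0.0, 0.0])
assert x == [1.0, 2.0]
```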
The number of iterations to be used depends on the particular iterative scheme being used and on the desired accuracy of the final result [79, 106].

In our analysis, we use the standard assumption that a communication step as well as a computation step takes one unit of time. As shown in the next section, all the data transport problems involved in the algorithm can be solved in O(n^2) time. Hence, one iteration of the algorithm takes O(n^2) time.
Figure 5.7: The routing of x_k to the leader of column k
Figure 5.8: x_k is broadcast within column k of P
Figure 5.9: The distribution after the transformation to row major order
Figure 5.10: The sum of products in row k is routed to x_k
5.5 Data Transport

All the steps described in the previous section involve data transport among the processing elements of the star graph. Each of these data transport problems falls into one of the following three categories: data routing, simultaneous broadcasting within each group, and simultaneous summation within each group. We show efficient solutions to these data transport problems in this section.
5.5.1 Data Routing

During each iteration, each element of the solution vector must be routed to the leader of an appropriate column of P in step 1. Then, in step 3, each product of an element of the solution vector and a member of a column must be routed so that the resulting distribution forms a row major order. Finally, in step 5, the sum of row k must be routed to the processing element containing x_k.

As mentioned in the introduction, two routing algorithms are known for star graphs. Palis's randomized routing algorithm can route data in O(D) time for the worst input with high probability, where D is the diameter of the star graph [80]. Annexstein and Baumslag's routing algorithm can route data in O(n) time using processing elements with powerful communication capability [4]. In the following, we show a routing algorithm which can route data in O(n^2) time using simple processing elements. The algorithm is based on the well known routing algorithm for three stage Clos networks. A similar technique for two-dimensional meshes can be found in [86].
Suppose we know how to route data in the (n-1)-star, S_{n-1}. Recall that we can regard the columns of RM as linear arrays of size n and each row of RM as an S_{n-1}. This means that we can perform row-wise routing and column-wise routing in RM. Using these partial routings, we can perform routing in S_n as follows.

1. Construct a bipartite graph G = (V_1, V_2, E) such that V_1 = V_2 = {1, 2, ..., (n-1)!} and, whenever a datum in column i should be routed to column j, an edge (i, j) is added to E. Hence, each node has n edges and G is a bipartite multigraph with n! edges.

2. Find disjoint complete matchings M_1, M_2, ..., M_n in G. The existence of such complete matchings is guaranteed by Hall's theorem.

3. Perform column-wise routing such that a datum in matching M_i is routed to the i-th row. This can be done easily by transforming RM to CM, as explained in section 5.3.

4. Route data to their destination columns using row-wise routing on the S_{n-1} of each row.

5. Route data to their final destination processing elements using column-wise routing.
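The combinatorial core is step 2: an n-regular bipartite multigraph always splits into n complete matchings. A small sketch of that decomposition (illustrative only; it peels matchings off with a plain augmenting-path search rather than an optimized edge-coloring algorithm):

```python
def peel_matchings(edges, k, d):
    """Split a d-regular bipartite multigraph on {0..k-1} x {0..k-1}
    into d perfect matchings; Hall's theorem guarantees each exists."""
    remaining = list(edges)
    matchings = []
    for _ in range(d):
        adj = {i: [] for i in range(k)}
        for i, j in remaining:
            adj[i].append(j)
        match = {}                        # right vertex -> left vertex

        def augment(i, seen):             # Kuhn's augmenting-path search
            for j in adj[i]:
                if j not in seen:
                    seen.add(j)
                    if j not in match or augment(match[j], seen):
                        match[j] = i
                        return True
            return False

        for i in range(k):
            augment(i, set())
        matching = [(i, j) for j, i in match.items()]
        matchings.append(matching)
        for e in matching:                # peel this matching off
            remaining.remove(e)
    return matchings

# A 2-regular multigraph on 2 + 2 vertices splits into two matchings.
ms = peel_matchings([(0, 0), (0, 1), (1, 0), (1, 1)], k=2, d=2)
assert all(len(m) == 2 for m in ms)
```

Because the residual graph stays regular after each matching is removed, Hall's condition keeps holding and the peeling never gets stuck.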
It is easier to understand the above algorithm by visualizing the star graph (RM) as a three stage network, as shown in Figure 5.11. Each n × n switch in the first and last stages represents a linear array of size n, and each (n-1)! × (n-1)! switch in the middle stage represents an (n-1)-star. There are (n-1)! switches in the first and last stages and n switches in the middle stage.

We can get a complete routing algorithm for an n-star by recursively applying the above algorithm to smaller star graphs. For actual routing on star graphs, we can precompute routing tags for each column-wise routing. Since finding the complete matchings for an n-star takes O(N^2) time, the whole preprocessing takes O(N^2 n)
Figure 5.11: A star graph viewed as a three stage network
time. When we use a swapping operation between adjacent processing elements, routing in a linear array of size n takes n-1 time. Routing in the columns of RM takes n+1 time because of the transformation between RM and CM. Routing for an n-star takes R(n) time, where

R(n) = (n + 1) + R(n-1) + (n + 1)        (5.2)
R(2) = 1                                  (5.3)

Hence the routing algorithm runs in O(n^2) time. In [4], a processing element can route n messages through all n ports in unit time. When we allow routing through only two ports at a time, Annexstein and Baumslag's routing algorithm also results in O(n^2) time complexity. The main difference between the routing algorithms in [4] and this chapter is that spoke structures are used for the basic routing in [4], while linear arrays are used for the basic routing in this chapter.
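Unrolling the recurrence confirms the bound; the closed form n^2 + 3n - 9 below is a quick check, not stated in the original:

```python
def R(n):
    """Recurrence (5.2)-(5.3): two column-wise routings of n+1 steps
    each, wrapped around a recursive routing on the (n-1)-star."""
    return 1 if n == 2 else (n + 1) + R(n - 1) + (n + 1)

# Unrolling gives R(n) = n^2 + 3n - 9, i.e. Theta(n^2).
assert all(R(n) == n * n + 3 * n - 9 for n in range(2, 30))
```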
5.5.2 Simultaneous Broadcasting and Summation

During each iteration, in step 2, the leader of a column of the iteration matrix P has to broadcast the value of the appropriate element of the solution vector to the elements of the column. During each iteration, in step 4, the product terms in each row must be summed to complete the inner product of equation (5.1). In this subsection, we show a simultaneous broadcasting algorithm and a simultaneous summation algorithm, both with O(n^2) time complexity.

In simultaneous broadcasting and summation, the nodes of a star graph are divided into several groups of different sizes. The RMIS indices of the members of a group are contiguous. The processing element with the least index in a group is called the leader of the group. Each member of a group knows the size of the group and the position of the group, i.e., the index of the leader. Figure 5.12 shows an example of grouping for simultaneous broadcasting and summation. The processing elements marked with 'X' are the leaders of the groups. The problem of simultaneous broadcasting is how to send the datum in a group leader to all
Figure 5.12: An example of grouping for simultaneous broadcasting and summation. The 24 processing elements are divided into five contiguous groups; the leader of each group (marked 'X') is the member with the least index.
the members of the same group, for all groups simultaneously. The problem of simultaneous summation is how to sum the values of the group members into the group leader, for all groups simultaneously.

Our approach to simultaneous broadcasting is that a datum is broadcast to successively larger substars until it has been sent to all the group members. The actual broadcasting is performed by the linear arrays of CM. In this approach, a datum to be broadcast to a larger substar must first be broadcast to the entire current substar. Note that each substar has only one datum to be broadcast to the larger substar, since there is at most one leader in a substar which must broadcast to other substars. The broadcasting algorithm has n-1 stages. Broadcasting on the i-star is performed in the (i-1)-th stage. Each datum from a leader carries the size and position of its group. The following outlines the (i-1)-th stage of the broadcasting.
1. Transform the RM of each i-star to CM.

2. Perform column broadcasting in each column of CM. To perform column broadcasting, each processing element in a (column) linear array does the following i-1 times.

(a) Get a datum from the upper processing element, i.e., the adjacent processing element with the smaller index.

(b) If the datum is from the right leader, take it.

(c) If the datum is from the right leader and it should be sent to the next (i-1)-substar, send it to the lower processing element in the next iteration. This decision can be made easily, since each processing element knows the size and position of the group.

(d) Get a datum from the lower processing element, i.e., the adjacent processing element with the larger index.

(e) If the datum should be broadcast to the larger substar, i.e., the (i+1)-substar, store it and send it to the upper processing element in the next iteration. Otherwise discard the datum.

3. Transform CM back to RM.
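The downward flow of step 2 can be modelled on a single linear array. This is a simplified sketch that ignores the RM/CM transforms and the multi-stage bookkeeping, and assumes index 0 is a leader:

```python
def segmented_broadcast(values, leaders):
    """One-dimensional model of column broadcasting: in every step each
    non-leader cell copies the value of its upper (smaller-index)
    neighbour, so each cell ends up with the value of the nearest
    leader at or above it."""
    n = len(values)
    out = list(values)
    for _ in range(n - 1):
        out = [out[0]] + [out[i] if i in leaders else out[i - 1]
                          for i in range(1, n)]
    return out

# Three groups with leaders at indices 0, 2 and 5.
assert segmented_broadcast(list("abcdef"), {0, 2, 5}) == \
       ["a", "a", "c", "c", "c", "f"]
```

Since leaders keep their own values, a leader's datum never crosses a group boundary, which is the invariant the full algorithm maintains across substars.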
Assume that, at the start of the (i-1)-th stage, every (i-1)-substar has the right data to broadcast to the other (i-1)-substars in the same i-star. At the end of the (i-1)-th stage of the broadcasting, we can observe the following.

1. Every processing element has received a datum from the right leader if the leader is in the same i-substar.

2. Every processing element has a datum to be broadcast to the larger substar (the (i+1)-star). This datum originates from the leader with the largest index among the leaders in the i-star. Therefore, at the beginning of the i-th stage, all the processing elements in a row (i-star) of the (i+1)-star have the same datum to be broadcast.

At the beginning of the broadcasting, every 1-star has the right datum to broadcast to the other 1-stars. Hence, the simultaneous broadcasting algorithm gives the right result.
The i-th stage of the simultaneous broadcasting takes O(i) time. Hence the whole simultaneous broadcasting algorithm takes O(n^2) time.

The simultaneous summation algorithm works in a way similar to the simultaneous broadcasting. Summation is done in smaller substars, and the partial sums of smaller substars are broadcast to larger substars until the leader of a group obtains the complete sum. The direction of data movement is opposite to that of simultaneous broadcasting, i.e., partial sums are sent to leaders from processing elements with larger indices. At any stage, each substar has only one partial sum to be broadcast to the larger substar, since there is at most one group whose leader is in another substar. Since the simultaneous summation algorithm is very similar to the simultaneous broadcasting algorithm, an outline of the algorithm is not presented here. The simultaneous summation algorithm also has O(n^2) time complexity.
5.6 Conclusion

We presented an efficient implementation of directed graph oriented computations on star graphs. The same mapping method used in chapter 4 is used to map directed graphs onto a star graph with m + e processing elements, where m is the number of nodes and e is the number of edges. To solve the data transport problems arising in the mapping, new algorithms for star graphs were developed for routing, simultaneous broadcasting and simultaneous summation. These algorithms are based on special multi-dimensional grids which can be emulated by star graphs without penalty in time complexity. In particular, the well known routing methods for three-stage Clos networks and Benes networks were modified to provide efficient routing in star graphs. Each iteration of the computation can be done in O(n^2) time for a star graph with n! = m + e nodes. The preprocessing takes O(N^2 n) time on a serial computer. As shown in chapter 4, the implementation can be easily modified to solve many directed graph oriented computations.
Chapter 6

Concluding Remarks
In this chapter, we briefly summarize the work described in this dissertation. We also address open problems and possible future research.

Chapter 2 presents a conceptually simple proof of the rearrangeability of five stage shuffle/exchange networks for N = 8. The proof is an extension of the well known proof technique for three-stage Clos networks. Obviously, the most interesting question regarding shuffle/exchange networks is the following:

Is it possible to realize an arbitrary permutation on shuffle/exchange networks with 2 log N - 1 stages?

Answering the above general question is very challenging. One possible approach is to investigate the question when N is small. The answer to the general question will be negative if shuffle/exchange networks with 2 log N - 1 stages for N = 16, 32 are not rearrangeable. On the other hand, if it turns out that shuffle/exchange networks with 2 log N - 1 stages for N = 16, 32 are rearrangeable, the proof technique might be useful for settling the general question. However, the rearrangeability problem of seven stage shuffle/exchange networks for N = 16 is itself very challenging: the number of cases to be considered for N = 16 is much larger than for N = 8. Approaches combining analytical methods with computer search seem more promising than either analytical methods or computer search alone.
Chapter 3 presents a new solution to the classical parallel array access problem. One interesting aspect of our solution is the combinatorial approach we have taken to solve the parallel array access problem. This should be contrasted with traditional approaches, which assume an address generation scheme (linear skewing scheme, XOR scheme, etc.) and try to find out what types of subsets can be supported by it, or try to find conflict-free conditions for each subset of interest. We first define a combinatorial object (the perfect latin square) which can provide conflict-free access to the subsets we consider important. Then we show how to build these objects and how to perform address generation efficiently. The resulting memory system with a self-routing Benes network is a quite efficient complete memory system. However, one could devise a less expensive network that realizes the permutations needed between processing elements and memory modules. One could also define new subsets of interest and try to provide conflict-free access to them.
Chapter 4 presents an efficient way of mapping solutions to problems that can be modeled as directed graphs onto fine grain hypercube arrays. Applications of the mapping include iterative sparse linear system solvers, parallel implementations of neural networks and parallel logic simulation. One can try to find new applications of the mapping technique. Another important piece of future work is developing a mapping technique for the case in which the number of nodes and edges is larger than the size of the processor array. The mapping technique shown in chapter 4 cannot provide efficient routing for this case.
Chapter 5 presents an efficient way of mapping solutions to problems that can be modeled as directed graphs onto star graphs. Since the same mapping is used as in chapter 4, the same problems should be addressed. In addition, there might be more efficient data movement techniques for star graphs. Currently, the data movement algorithms are devised on a special multi-dimensional grid which can be emulated without loss in time complexity. Finding efficient data movement techniques which exploit the topological properties of star graphs remains open.
Bibliography

[1] P. Agrawal and M. A. Breuer, "A Probabilistic Model for the Analysis of the Routing Process for Circuits," Networks, Vol. 10, pp. 111-128, 1980.

[2] S. B. Akers and B. Krishnamurthy, "A Group Theoretic Model for Symmetric Interconnection Networks," International Conference on Parallel Processing, pp. 216-223, 1986.

[3] S. B. Akers, D. Harel and B. Krishnamurthy, "The Star Graph: An Attractive Alternative to the n-Cube," International Conference on Parallel Processing, pp. 393-400, 1987.

[4] F. Annexstein and M. Baumslag, "A Unified Approach to Off-Line Permutation Routing on Parallel Networks," ACM Symposium on Parallel Algorithms and Architectures, pp. 398-406, 1990.

[5] F. Annexstein, M. Baumslag and A. L. Rosenberg, "Group Action Graphs and Parallel Architectures," SIAM Journal on Computing, Vol. 19, No. 3, pp. 544-569, 1990.

[6] R. Arlauskas, "iPSC/2 System: A Second Generation Hypercube," Hypercube Concurrent Computers and Applications, Vol. 1, pp. 38-43, 1988.

[7] M. Balakrishnan, R. Jain and C. S. Raghavendra, "On Array Storage For Conflict-Free Memory Access For Parallel Processors," International Conference on Parallel Processing, Vol. 1, pp. 103-107, 1988.

[8] K. E. Batcher, "The Multidimensional Access Memory in STARAN," IEEE Transactions on Computers, Vol. C-26, pp. 174-177, 1977.

[9] G. Blelloch and C. Rosenberg, "Network Learning on the Connection Machine," 10th International Joint Conference on Artificial Intelligence, 1987.

[10] V. E. Benes, "On Rearrangeable Three-Stage Connecting Networks," Bell System Technical Journal, Vol. 41, pp. 117-125, September 1962.

[11] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York, 1965.
[12] J. A. Bondy and U. S. R. Murty, Graph Theory with Applications, American Elsevier Publishing Company, New York, 1976.

[13] R. Boppana and C. S. Raghavendra, "On Self Routing in Benes and Shuffle Exchange Networks," International Conference on Parallel Processing, Vol. 1, pp. 196-200, 1988.

[14] R. Boppana and C. S. Raghavendra, "Self-routing Schemes in Parallel Memory Access," Manuscript, Department of Electrical Engineering-Systems, University of Southern California, November 1989.

[15] M. A. Breuer and A. D. Friedman, Diagnosis & Reliable Design of Digital Systems, Computer Science Press, 1976.

[16] P. Budnik and D. J. Kuck, "The Organization and Use of Parallel Memories," IEEE Transactions on Computers, Vol. C-20, pp. 1566-1569, 1971.

[17] H. Cam and J. A. B. Fortes, "Rearrangeability of Shuffle-Exchange Networks," Frontiers of Massively Parallel Computation, pp. 303-314, 1990.

[18] A. C. Chen and C. Wu, "Optimum Solution to Dense Linear Systems of Equations," International Conference on Parallel Processing, pp. 417-424, 1984.

[19] M. Chern and T. Murata, "A Fast Algorithm for Concurrent LU Decomposition and Matrix Inversions," International Conference on Parallel Processing, pp. 79-86, 1983.

[20] C. Clos, "A Study of Non-Blocking Switching Networks," Bell System Technical Journal, Vol. 32, pp. 406-424, 1953.

[21] B. Codenotti and F. Romani, "A Compact and Modular VLSI Design for the Solution of General Sparse Linear Systems," Integration, The VLSI Journal, Vol. 5, pp. 77-86, 1987.

[22] A. Deb, "Conflict-free Access of Arrays - A Counter Example," Information Processing Letters, Vol. 10, No. 1, pp. 20, 1980.

[23] A. Deb, "Multiskewing - A Novel Technique for Optimal Parallel Memory Access," Technical Report #9004, Department of Computer Science, Memorial University of Newfoundland, 1990.

[24] T. A. Defanti, M. D. Brown and B. H. McCormick, "Visualization: Expanding Scientific and Engineering Research Opportunities," IEEE Computer, pp. 12-25, August 1989.

[25] J. Denes and A. D. Keedwell, Latin Squares and Their Applications, Academic Press, New York, 1974.
[26] I. S. Duff, A. M. Erisman and J. K. Reid, Direct Methods for Sparse Matrices,
Oxford Science Publications, 1986.
[27] D. J. Evans, “Iterative Methods for Sparse Linear Equations,” In D. J. Evans j
Ed., Sparsity and its Applications, Cambridge University Press, pp. 45-111, 1984. j
[28] M. J. Flynn, “Very High-Speed Computing Systems,” IEEE Proceedings,
Vol. 54, pp. 1901-1909, 1966. i
[29] J. M. Frailong, W. Jalby and J. Lenfant, “XOR-Schemes : A Flexible Data
Organization in Parallel Memories,” International Conference on Parallel Pro
cessing, pp. 276-283, 1985.
[30] D. Gale, “A Theorem on Flow in Networks, ” Pacific Journal o f Mathematics,
pp. 1073-1082, 1957.
[31] E. Gergely, “A Simple Method for Constructing Doubly Diagonalizecl Latinj
Squares,” Journal of Combinatorial Theory, A 16, pp. 266-272, 1974.
i
[32] P. A. Gilmore, Matrix Computations on an Associative Processor, In Lecture
Notes in Computer Science, Vol. 24, Parallel Processing, Springer-Verlag, Newj
York, 1974.
[33] J. Gotze and U. Schwiegelshohn, “Sparse-Matrix-Vector Multiplication on a
Systolic Array,” International Conference on Acoustics, Speech, and Signal Pro
cessing, 1988.
[34] M. Hall, Combinatorial Theory, 2nd Edition, John Wiley & Sons, 1986.
I
[35] D. T. Harper III and J. R. Jump, “Vector Access Performance in Parallel Mem-|
ories Using a Skewed Storage Scheme,” IEEE Transactions on Computers, Vol. C-!
36, pp. 1440-1449, 1987. !
[36] D. T. Harper III and D. A. Linebarger, “A Dynamic Storage Scheme for Conflict
Free Vector Access,” International Symposium on Computer Architecture, pp. 72-
77, 1989.
[37] H. M. Hastings and S. Waner, “Neural Nets on the M PP,” Frontiers of Massively
Parallel Scientific Computation, James R. Fisher, Editor, NASA, 1987. |
i
[38] J. P. Hayes, T. N. Mudge, Q. F. Stout, S. Colley and J. Palmer “Architecture of
a Hypercube Supercomputer,” International Conference on Parallel Processing}
pp. 653-660, 1986.
[39] A. Hedayat, “A Complete Solution to the Existence and Nonexistence of Knut
Vik Designs and Orthogonal Knut Vik Designs,” Journal of Combinatorial
Theory, A 22, pp. 331-337, 1977.
[40] K. Heinrich, K. Kim and V. K. Prasanna Kumar, “Perfect Latin Squares,” to
appear in Discrete Applied Mathematics.
[41] Daniel Hillis, The Connection Machine, The MIT Press, Cambridge, Mass.,
1985.
[42] C. D. Howe and B. Moxon, “How to Program Parallel Processors,” IEEE
Spectrum, Vol. 24, No. 9, pp. 36-41, 1987.
[43] S. Huang and S. K. Tripathi, “Finite State Model and Compatibility Theory:
New Analysis Tools for Permutation Networks,” IEEE Transactions on
Computers, Vol. C-35, No. 7, pp. 591-601, July 1986.
[44] F. K. Hwang, “Crisscross Latin Squares,” Journal of Combinatorial Theory,
A 27, pp. 371-375, 1979.
[45] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing,
McGraw-Hill, 1984.
[46] K. Kim and V. K. Prasanna Kumar, “A Simple Proof of Rearrangeability of
Five Stage Shuffle/Exchange Network for N = 8,” to appear in IEEE Transactions
on Computers.
[47] K. Kim and V. K. Prasanna Kumar, “Perfect Latin Squares and Parallel Array
Access,” International Symposium on Computer Architecture, pp. 372-379, 1989.
[48] K. Kim and V. K. Prasanna Kumar, “Parallel Memory Systems for Image
Processing,” IEEE Conference on Computer Vision and Pattern Recognition,
pp. 654-659, 1989.
[49] K. Kim and V. K. Prasanna Kumar, “Efficient Implementation of Neural
Networks on Hypercube SIMD Arrays,” International Joint Conference on Neural
Networks, 1989.
[50] K. Kim and V. K. Prasanna Kumar, “An Efficient Mapping of Directed Graph
Based Computations onto SIMD Hypercube Arrays and Applications,”
International Conference on Parallel Processing, Vol. II, pp. 296-297, 1990.
[51] K. Kim and C. S. Raghavendra, “A Simple Algorithm to Route Arbitrary
Permutations on 8-Input 5-stage Shuffle/Exchange Network,” International
Parallel Processing Symposium, 1991.
[52] K. Kim and V. K. Prasanna Kumar, “On Efficient Parallel Memory Systems,”
unpublished.
[53] K. Kim and V. K. Prasanna Kumar, “A Sparse Linear System Solver on Star
Graphs,” accepted for publication in International Conference on Parallel
Processing, 1991.
[54] D. E. Knuth, The Art of Computer Programming, Vol. 3/Sorting and Searching,
Addison-Wesley Publishing Company, 1973.
[55] C. K. Kothari, S. Lakshmivarahan and H. Peyravi, “A Note on Rearrangeable
Networks,” Technical Report, School of Engineering and Computer Science,
University of Oklahoma, November 1983.
[56] D. J. Kuck, “ILLIAC IV Software and Application Programming,” IEEE
Transactions on Computers, Vol. C-17, pp. 758-770, 1968.
[57] S. Y. Kung, VLSI Array Processors, Prentice Hall, 1988.
[58] S. Y. Kung and J. N. Hwang, “Systolic Architectures for Artificial Neural Nets,”
International Conference on Neural Networks, Vol. 2, pp. 165-172, 1988.
[59] D. H. Lawrie, “Access and Alignment of Data in an Array Processor,” IEEE
Transactions on Computers, Vol. C-24, pp. 1145-1155, 1975.
[60] D. H. Lawrie and C. R. Vora, “The Prime Memory System for Array Access,”
IEEE Transactions on Computers, Vol. C-31, pp. 435-442, 1982.
[61] C. Y. Lee, “An Algorithm for Path Connection and its Applications,” IRE
Transactions on Electronic Computers, Vol. EC-10, pp. 346-365, 1961.
[62] D. Lee, “Scrambled Storage for Parallel Memory Systems,” International
Symposium on Computer Architecture, pp. 232-239, 1988.
[63] D. Lee and Y. H. Wang, “Conflict-free Access of Arrays in a Parallel Processor,”
ACM Symposium on Parallel Algorithms and Architectures, 1989.
[64] J. Lenfant, “Parallel Permutations of Data: A Benes Network Control
Algorithm for Frequently Used Permutations,” IEEE Transactions on Computers,
Vol. C-27, pp. 637-647, 1978.
[65] G. F. Lev, N. Pippenger and L. G. Valiant, “A Fast Parallel Algorithm for
Routing in Permutation Networks,” IEEE Transactions on Computers, Vol. C-30,
No. 2, pp. 93-100, 1981.
[66] N. Linial and M. Tarsi, “Efficient Generation of Permutations with the
Shuffle/Exchange Network,” manuscript, Department of Computer Science,
UCLA, 1983.
[67] N. Linial and M. Tarsi, “Interpolation Between Bases and the Shuffle Exchange
Network,” European Journal of Combinatorics, Vol. 10, pp. 29-39, 1989.
[68] M. B. Long, K. Lyons and J. K. Lam, “Acquisition and Representation of 2D
and 3D Data from Turbulent Flows and Flames,” IEEE Computer, pp. 39-45,
August 1989.
[69] A. Menn and A. Somani, “An Efficient Sorting Algorithm for the Star Graph
Interconnection Network,” International Conference on Parallel Processing,
Vol. III, pp. 1-8, 1990.
[70] D. I. Moldovan, W. Lee and C. Lin, “SNAP: A Marker-Propagation
Architecture for Knowledge Processing,” Technical Report CENG 89-10,
Department of Electrical Engineering-Systems, University of Southern California,
1989.
[71] D. I. Moldovan, W. Lee, C. Lin and S. Chung, “Parallel Knowledge Processing
on SNAP,” International Conference on Parallel Processing, Vol. I, pp. 474-481,
1990.
[72] O. Monga and R. Deriche, “3D Edge Detection Using Recursive Filtering:
Application to Scanner Images,” IEEE Conference on Computer Vision and
Pattern Recognition, pp. 28-35, 1989.
[73] D. Nassimi and S. Sahni, “A Self-Routing Benes Network and Parallel
Permutation Algorithms,” IEEE Transactions on Computers, Vol. C-30,
pp. 332-340, 1981.
[74] D. Nassimi and S. Sahni, “Parallel Algorithms to Set Up the Benes
Permutation Network,” IEEE Transactions on Computers, Vol. C-31, No. 2,
pp. 148-154, February 1982.
[75] J. N. Navarro, J. M. Llaberia and M. Valero, “Partitioning: An Essential Step in
Mapping Algorithms into Systolic Array Processors,” IEEE Computer, pp. 77-89,
July 1987.
[76] M. Nigam, S. Sahni and B. Krishnamurthy, “Embedding Hamiltonians and
Hypercubes in Star Interconnection Graphs,” International Conference on
Parallel Processing, Vol. III, pp. 340-343, 1990.
[77] W. Oed and O. Lange, “On the Effective Bandwidth of Interleaved Memories
in Vector Processor Systems,” IEEE Transactions on Computers, Vol. C-34,
pp. 949-957, October 1985.
[78] D. C. Opferman and N. T. Tsao-Wu, “On a Class of Rearrangeable Switching
Networks,” Bell System Technical Journal, Vol. 50, pp. 1579-1618, May-June
1971.
[79] J. M. Ortega, Introduction to Parallel and Vector Solution of Linear Systems,
Plenum Press, New York, 1988.
[80] M. A. Palis, S. Rajasekaran and D. S. L. Wei, “General Routing Algorithms for
Star Graphs,” International Parallel Processing Symposium, pp. 597-611, 1990.
[81] J. W. Park, “An Efficient Memory System for Image Processing,” IEEE
Transactions on Computers, Vol. C-35, pp. 669-674, 1986.
[82] D. S. Parker, “Notes on Shuffle/Exchange-Type Switching Networks,” IEEE
Transactions on Computers, Vol. C-29, No. 3, pp. 213-222, March 1980.
[83] G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder,
K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss, “The IBM Research
Parallel Processor Prototype (RP3): Introduction and Architecture,”
International Conference on Parallel Processing, pp. 764-771, 1985.
[84] D. A. Pomerleau, G. L. Gusciora, D. S. Touretzky and H. T. Kung, “Neural
Network Simulation at Warp Speed: How We Got 17 Million Connections Per
Second,” International Conference on Neural Networks, Vol. 2, pp. 143-150, 1988.
[85] M. J. Quinn, Designing Efficient Algorithms for Parallel Computers, McGraw-
Hill, 1987.
[86] C. S. Raghavendra and V. K. Prasanna Kumar, “Permutations on Illiac
IV-Type Networks,” IEEE Transactions on Computers, Vol. C-35, No. 7,
pp. 662-669, July 1986.
[87] C. S. Raghavendra and A. Varma, “Rearrangeability of the 5 Stage
Shuffle/Exchange Network for N = 8,” International Conference on Parallel
Processing, 1986.
[88] C. S. Raghavendra and A. Varma, “Rearrangeability of Multistage
Shuffle/Exchange Networks,” International Symposium on Computer
Architecture, pp. 154-162, 1987.
[89] C. S. Raghavendra and R. Boppana, “On Methods for Fast and Efficient
Parallel Memory Access,” International Conference on Parallel Processing,
Vol. 1, pp. 76-83, 1990.
[90] S. J. Riederer, “Recent Advances in Magnetic Resonance Imaging,” IEEE
Proceedings, pp. 1095-1105, September 1988.
[91] H. J. Ryser, Combinatorial Mathematics, The Carus Mathematical
Monographs, No. 14, The Mathematical Association of America, 1963.
[92] U. Schwiegelshohn, “A Short Periodic Two-Dimensional Systolic Sorting
Algorithm,” International Conference on Systolic Arrays, pp. 257-264, 1988.
[93] H. D. Shapiro, “Theoretical Limitations on the Efficient Use of Parallel
Memories,” IEEE Transactions on Computers, Vol. C-27, pp. 421-428, 1978.
[94] H. Shirakawa and T. Kumagai, “An Organization of a Three-Dimensional
Access Memory,” International Conference on Parallel Processing, pp. 137-138,
1980.
[95] D. B. Shu, J. G. Nash, M. M. Eshaghian and K. Kim, “Straight-Line Detection
on a Gated-Connection VLSI Network,” International Conference on Pattern
Recognition, Vol. II, pp. 456-461, June 1990.
[96] D. B. Shu, J. G. Nash and K. Kim, “Parallel Implementation of Image
Understanding Tasks on Gated-Connection Networks,” International Parallel
Processing Symposium, 1991.
[97] D. B. Shu, J. G. Nash, M. M. Eshaghian and K. Kim, “Implementation and
Application of A Gated-Connection Network in Image Understanding,” to be
published as a chapter in Reconfigurable Massively Parallel Computers, H. Li and
Q. F. Stout, Eds., Prentice Hall, 1991.
[98] H. J. Siegel, “The Universality of Various Types of SIMD Machine
Interconnection Networks,” International Symposium on Computer Architecture,
pp. 70-79, 1977.
[99] H. S. Stone, “Parallel Processing with the Perfect Shuffle,” IEEE Transactions
on Computers, Vol. C-20, No. 2, pp. 153-161, February 1971.
[100] S. Tomboulian, “A System for Routing Arbitrary Directed Graphs on SIMD
Architectures,” ICASE Report No. 87-14, Institute for Computer Applications in
Science and Engineering, NASA Langley Research Center, March 1987.
[101] S. Tomboulian, “Introduction to a System for Implementing Neural Net
Connections on SIMD Architectures,” ICASE Report No. 88-3, Institute for
Computer Applications in Science and Engineering, NASA Langley Research
Center, January 1988.
[102] S. Tomboulian, “Overview and Extensions of a System for Routing Directed
Graphs on SIMD Architectures,” Frontiers of Parallel Processing, 1988.
[103] P. S. Tseng, “Iterative Sparse Linear System Solvers on Warp,” International
Conference on Parallel Processing, pp. 32-37, 1988.
[104] L. W. Tucker and G. G. Robertson, “Architecture and Applications of the
Connection Machine,” IEEE Computer, pp. 26-38, August 1988.
[105] J. D. Ullman, Computational Aspects of VLSI, Computer Science Press, 1984.
[106] R. S. Varga, Matrix Iterative Analysis, Prentice Hall, Englewood Cliffs, 1962.
[107] D. C. V. Voorhis and T. H. Morrin, “Memory Systems for Image Processing,”
IEEE Transactions on Computers, Vol. C-27, pp. 113-125, 1978.
[108] H. A. G. Wijshoff and J. V. Leeuwen, “The Structure of Periodic Storage
Schemes for Parallel Memories,” IEEE Transactions on Computers, Vol. C-34,
pp. 501-505, 1985.
[109] H. A. G. Wijshoff and J. V. Leeuwen, “On Linear Skewing Schemes and
d-Ordered Vectors,” IEEE Transactions on Computers, Vol. C-36, pp. 233-239,
1987.
[110] C. L. Wu and T. Y. Feng, “The Universality of the Shuffle-Exchange
Network,” IEEE Transactions on Computers, Vol. C-30, No. 5, pp. 324-332,
May 1981.