ALGORITHMIC APPROACHES FOR
REDUCING COMMUNICATION COSTS IN
DISTRIBUTED MEMORY MACHINES

by

Cho-chin Lin

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

August 1995

Copyright 1995 Cho-chin Lin
UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by
Cho-chin Lin
under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of
DOCTOR OF PHILOSOPHY

Dean of Graduate Studies
Date
DISSERTATION COMMITTEE
Chairperson
To my wife
Hui-ling
Acknowledgements

It is a pleasure to express my deep gratitude to Professor Viktor K. Prasanna, the chairman of my dissertation committee, for his consistent assistance, encouragement, and guidance throughout my graduate studies at the University of Southern California. This dissertation would not have been completed without the proper motivational support he offered when it was most needed. Numerous ideas on my research have been given to me through his extensive research experience.

Thanks to Professor Jean-Luc Gaudiot and Professor Kai Hwang for guiding me through the Ph.D. qualifier. My deep appreciation also goes to Professor Douglas Ierardi and Professor Timothy M. Pinkston for serving on my dissertation committee and for many invaluable suggestions.

I appreciate the help of Yongwha Chung in writing the code for my algorithm on a parallel machine. In addition, thanks to my colleagues Wei-ming Lin, Ashfaq A. Khokhar, Cho-li Wang, Jongwoo Bae, Wen-heng Liu, Prashanth B. Bhat, Suriyaprakash Natarajan, and Young-won Lim for their support and help in dissertation-related matters, and otherwise.

Last but not least, I would like to express my most profound gratitude to my parents and my wife, who have shown their continued endurance, sacrifice, and love throughout my doctoral study.
Contents

Dedication
Acknowledgements
List of Figures
Abstract

1 Premise
1.1 Introduction
1.2 Motivation and Approaches
1.3 A Summary of Results

2 A Computational Model
2.1 Background
2.2 A Realistic Computational Model
2.3 Comparison with Other Models
2.3.1 The PRAM Model
2.3.2 The Network Models
2.3.3 The LogP Model
2.3.4 The Postal Model
2.3.5 The BSP Model
2.4 Summary

3 Initial Data-mapping
3.1 Background
3.2 Mapping Data to Modules with Bounded Memory Sizes
3.2.1 Definitions and Notations
3.2.2 Scalability Analysis with Bounded Memory Sizes
3.2.3 Observations and Discussions
3.3 Mapping Data to Memory Locations
3.4 Summary

4 Communication Latency Reducing
4.1 Background
4.2 Communication Activity Hiding
4.2.1 One-to-one
4.2.2 One-to-many
4.2.3 Many-to-one
4.3 Message-grain Balancing
4.3.1 Data Distribution
4.3.2 On the Overhead of Balancing Message-grains
4.4 Summary

5 Data-remapping
5.1 Background
5.2 Fixed Data-remapping
5.2.1 On the Complexity of Performing Data-remapping
5.2.2 On the Message Sizes for Performing Data-remapping
5.3 Dynamic Data-remapping
5.3.1 Definitions and Notations
5.3.2 In-Place Computations
5.3.3 Computations with Data-Remapping
5.3.4 Implementation Details and Experimental Results
5.4 Summary

6 Conclusion
6.1 Contributions of this Dissertation
6.2 Future Directions

Bibliography
List of Figures

1.1 A diagram for a centralized memory machine.
1.2 A diagram for a distributed memory machine.
1.3 A diagram for a current distributed memory machine.
1.4 An approach to achieve large speed-ups on parallel machines.
1.5 Issues of mapping data to distributed memory machines.
2.1 An overview of our computational model.
2.2 An illustration for the data receptions at a message handler.
3.1 Sequential Floyd Algorithm.
3.2 Data dependency in the checkerboard partition.
3.3 Expansion ranges of various parallel systems.
3.4 Module-mapping for window operations.
3.5 Eight sets of data groups.
3.6 Memory-mapping A.
3.7 Memory-mapping B.
4.1 An illustration for one-to-one communication.
4.2 Overlapping in performing one-to-one communication.
4.3 An illustration for one-to-many communication.
4.4 An illustration for a module-partition approach.
4.5 Trade-off in the module-partition approach.
4.6 An illustration for many-to-one communication.
4.7 Overlapping in performing many-to-one communication.
4.8 An illustration for module-access contention.
4.9 A communication pattern for distributing data among 20 modules.
4.10 Algorithm_1 for data distribution.
4.11 Algorithm_2 for data distribution.
4.12 Notations used in the closed form.
4.13 An illustration for the amount of data that needs to be balanced.
5.1 An illustration for a 16-point FFT butterfly graph.
5.2 An illustration for cyclic layout.
5.3 An illustration for hybrid layout.
5.4 A 128 by 128 raw image.
5.5 A 128 by 128 image with marked pixels.
5.6 Major steps in performing data-remapping.
5.7 Major steps in performing cluster-labeling.
5.8 An illustration for the divide-and-conquer approach.
5.9 An example for labeling clusters.
5.10 Communication times for moving data using various approaches.
5.11 Execution times for a 128 x 128 image.
5.12 Execution times for a 256 x 256 image.
5.13 Execution times for a 512 x 512 image.
5.14 Execution times for a 1024 x 1024 image.
Abstract

Many applications which concern issues of human welfare have enormous computational requirements. However, today's sequential computers cannot meet the requirements needed to implement these applications. Currently, many vendors are offering distributed memory parallel machines. In these machines, the time for a processor to access data from remote modules is much longer than that for the processor to access data from its own module.

Substantial overhead exists in current distributed memory machines for initializing message delivery and scheduling message reception. To achieve large speed-ups, algorithms should be designed in such a way as to minimize the overhead of accessing remote modules. In this dissertation, a realistic model is proposed to serve as a bridge between algorithm designers and distributed memory machines. Based on the model, our approaches for achieving large speed-ups focus on localizing most of the data-access and reducing the overhead of accessing remote data. These approaches are developed in three aspects.

First, a methodology is developed for evaluating various strategies for mapping data to the modules; the effect of using various mapping strategies on a parallel machine with bounded memory size can then be evaluated. A relationship between the size of a problem and the size of a machine exists if the machine is to maintain its efficiency at a desired level. The relationship depends on the mapping strategy, in which the memory requirement may increase non-linearly with the problem size. We have derived a range on the number of modules over which the efficiency of the parallel machine can be maintained at a desired level.

Second, techniques for reducing latency in moving data among the modules are considered. These techniques are developed for performing window operations, for performing fundamental communication patterns, and for distributing data among the modules. Our results show that, for a given module-mapping of window operations, the time for constructing messages can be reduced by storing the data at appropriate memory locations; by hiding partial communication activities, the communication latency in performing fundamental communication patterns can be reduced; and the module-access contention problem in performing data distribution can be solved by balancing message-grains.

Finally, techniques are proposed for performing data-remapping for image processing applications with arbitrary data dependency. For the data-remapping, we show that an algorithm designed to reduce the amount of remote data-access may degrade the performance on some types of parallel machines. Then, a technique for dynamic data-remapping is developed to eliminate inter-module data dependency as well as load imbalance at run-time. We also report results of experiments conducted on an IBM SP-2.
Chapter 1

Premise

Parallel processing has been an active area for a couple of decades. Success in parallel processing requires advances in both hardware technology and software technology. The goal of hardware design in a parallel machine is to provide enormous computing power for large-scale applications. Based on the nature of an application, parallel software should be tailored to effectively utilize the computing power of the parallel machine to suit the computational requirement of the application.

In this chapter, the need for parallel machines and the trend in developing parallel machines are presented in Section 1.1. In Section 1.2, the motivation of this research and the approaches used in this research are given. A summary of our results is listed in Section 1.3.
1.1 Introduction

When we look at human history, we find that human beings have relied mainly on their brains to perform various activities, such as remembering and calculating. In order to reduce the burden on human brains, a variety of tools and rules have been developed to ease these activities. Examples are performing calculations with the aid of the abacus and the slide rule, and keeping records with the aid of ropes and knot-tying rules. However, the tools and rules did not totally relieve human brains of the burdens. As the size and complexity of the activities to be carried out increase, the limitations of human beings in performing such activities become more and more obvious. For example,

• the tools and rules for carrying out the activities generally need the cooperation of human hands and brains, which leads to a slow processing speed,

• human beings are notoriously prone to error, so that complex activities performed by hand are unreliable, and

• the memory capacity of human beings is limited; memories become obscure after a period of time.

Thus, computing machines (computers) have been developed and employed to assist human beings in handling activities of increasing complexity. The major components of today's electronic sequential computer are a microprocessor, a memory unit, and input/output equipment.
Microprocessor performance is advancing at a rate of 50 to 100% per year [27]. State-of-the-art microprocessors can achieve computing speeds of up to hundreds of MFLOPS. Memory capacity is increasing at a rate comparable to the increase in capacity of DRAM chips: quadrupling in size every three years [24]. Today's personal computers and workstations use tens or hundreds of MBytes of memory. It seems that substantial progress has been achieved in sequential computer technology. However, the performance still cannot meet the computational requirements of the grand challenge problems identified in the U.S. High-Performance Computing and Communication (HPCC) program. These problems concern the following applications:

• Climate Modeling
• Fluid Turbulence
• Pollution Dispersion
• Human Genome
• Ocean Circulation
• Quantum Chromodynamics
• Semiconductor Modeling
• Superconductor Modeling
• Combustion Systems
• Vision and Cognition

These applications concern issues of human welfare and science that may lead to a better living environment, and they have enormous computational requirements. Consider, for example, the problem of modeling the weather. In five years' time, data collection facilities will be in place to define detailed atmospheric structures and permit significant advances in forecasting capabilities. The goal of improving atmospheric modeling resolution to a 5-km scale and providing timely results is believed to require 20 TFLOPS of performance [54]. However, today's most powerful sequential computers cannot meet the computational requirement needed to implement this approach. Thus, it is obvious that a serious attack on this application will require high-performance parallel machines.

Many parallel machines with various architectures have appeared in pursuit of the goal of high-performance computing. According to the memory organization of MIMD (multiple instruction and multiple data) parallel machines, the machines can be classified into two categories: centralized memory machines and distributed memory machines [48]. Centralized memory machines put all the processors on one side and the memories on the other side. The structure of a centralized memory machine is illustrated in Figure 1.1. The time for accessing data from any memory location is the same for all the processors, since every data-access must go over the interconnection network. Examples of these machines are the Cray Y/MP, SGI 4/360, IBM ES9000, and Sequent Symmetry.

Figure 1.1: A diagram for a centralized memory machine.

Figure 1.2: A diagram for a distributed memory machine.
For a distributed memory machine, the memory units are distributed across the machine. Each processor has some memory units close to it. We refer to those memory units as the local memory of the processor. In contrast to the local memory, the other memory units are called the remote memory of the processor. The structure of a distributed memory machine is illustrated in Figure 1.2. In such a memory organization, the time for accessing data from local memory is shorter than that for accessing data from remote memory. Due to their cost/performance advantage, current distributed memory parallel machines are formed by a collection of complete computers (processor/memory modules) combined by an interconnection network. Figure 1.3 illustrates the current trend in building a distributed memory machine. In these machines, data can be accessed from local memory without the aid of an interconnection network. Examples of such machines include the Thinking Machines CM-5, Intel Paragon, clusters of workstations, IBM SP1/SP2, Stanford DASH/FLASH, and Cray T3D.
Figure 1.3: A diagram for a current distributed memory machine.
Basically, these parallel machines are built from a small number of basic components, with no single bottleneck component. Thus, the parallel machines can be incrementally expanded over their designed scaling range and can potentially deliver linear incremental performance for a well-defined set of applications. Such parallel machines are considered to be scalable [6].

According to the remote data-access capability of the modules in distributed memory machines, the machines have either multiple disjoint addressing-spaces or a global addressing-space. For machines having multiple disjoint addressing-spaces, message-passing is used to access data from remote memory. For machines having a global addressing-space, distributed-shared-memory (DSM) is used to access data from remote memory. The principal difference between message-passing and DSM is in their protocols [32]. A major problem encountered in DSM is that a module needs to use a sequence of loads and stores to perform remote data-access using messages of fixed size. This results in high communication overhead when transferring a block of data in DSM. A major criticism of most commercial message-passing machines is the high software overhead associated with user-level message-passing. This can be inefficient if a pair of modules frequently communicate short messages. Recent work [16] has shown that, to suit the timing requirements of various applications, communication among the modules should perform well with either very short messages or very long messages, as well as with a mixture of very short and very long messages. Thus, the architectures of current parallel machines tend to support both types of communication. Message-passing machines are moving towards efficient support for communicating short messages of fixed size (e.g., active messages in the CM-5), a feature normally associated with DSM machines. Similarly, DSM machines are beginning to provide efficient message-like block transfers (e.g., the block transfer engine in the T3D and the Magic chip in FLASH). Thus, the users of these parallel machines should choose the appropriate communication mechanism according to the requirements of their computations or redesign their algorithms to exploit the communication mechanisms.
Although various commercial and experimental distributed memory machines are available or under construction, they share common inherent characteristics. Those characteristics, which cannot be overlooked by algorithm designers and application developers seeking high-performance computing, are listed as follows.

• Since a module that accesses data from remote memory must go over the interconnection network, the time for performing remote data-access is much longer than that for performing local data-access. The steps for a module to access data from remote memory in a distributed memory machine include message construction, network startup processing, and data transportation.

• Various communication mechanisms may be provided by parallel machines; each mechanism is designed to suit the requirements of a specific class of communications. Thus, users of a parallel machine should carefully choose an appropriate mechanism according to the communication requirement of an application.

• The interconnection network supports point-to-point communication. Thus, a module can access data from remote memory without disturbing the computational activities performed in the other modules.

• Each module has limited communication bandwidth; that is, each module can deliver or receive a fixed amount of data in a constant period of time. Thus, multiple modules attempting to access a specified module can cause the problem of module-access contention.

• The network capacity is finite. A source module will suspend its data delivery activities if the network is saturated by the data.

• The importance of communication protocols is increasing. The protocols define how the communication activity of a pair of communicating modules is operated. In general, accessing data from remote memory using different communication protocols can lead to very different communication times.

Recently, several models [5, 14, 28, 58] have been proposed. These models capture the characteristics of current distributed memory machines so that algorithm designers can tailor their algorithms to suit the nature of an application. Several mechanisms have also been built into distributed memory machines for hiding the long latency in accessing data from remote memory. These mechanisms include data pre-fetching, shared virtual memory [39], coherent caches [38], scalable coherence interface [29], and relaxed memory consistency [17]. In this dissertation, we propose a realistic model serving as a bridge between the algorithm designers and the distributed memory machines. Then, techniques for reducing communication costs in distributed memory machines using algorithmic approaches are proposed. The rest of this chapter gives the motivation of this research, describes the approaches used, and summarizes our results.
1.2 Motivation and Approaches

Our research is motivated by the technological trends in designing state-of-the-art commercial and experimental parallel machines. Today, most MIMD parallel machines are essentially formed by a collection of processor/memory modules connected by an interconnection network. These machines are named "distributed memory machines". In distributed memory machines, each module accesses data from other modules (remote data-access) over an interconnection network. This common architectural feature encourages users to exploit the locality of data-access in solving application problems on the machines. Thus, several techniques for achieving or restoring data locality are proposed and investigated.

The hardware of many current MIMD distributed memory machines is designed to deliver computing power in the GFLOPS range. TFLOPS performance is the next goal to be pursued for parallel machines. A general approach for achieving large speed-ups on these machines is to partition a task into several subtasks. Each of these subtasks is then mapped to a module of the machine. Figure 1.4 illustrates this approach. Remote data-access may be necessary due to possible data dependency among the subtasks. The extra time for remote data-access adds to the total execution time. The overhead of accessing data from remote memory using message-passing or distributed-shared-memory can lead to underutilization of the computing power of the parallel machines. This is a major overhead incurred in parallelizing a task. In order to reduce the adverse effects of inter-module data-access, novel approaches for reducing communication costs need to be investigated. Compiler approaches to minimize the communication time have been proposed (see, for example, [47]). However, for many large-scale applications, the computations generally involve arbitrary data distribution, dependencies among computations performed in various processors, and frequent irregular interprocessor communication. Due to the large computational complexity of these applications, algorithmic approaches for reducing the time spent accessing data from remote memory are also needed.

Figure 1.4: An approach to achieve large speed-ups on parallel machines.
To investigate the techniques developed for achieving high-performance computing, a model of parallel computation which bridges parallel machines and parallel software is very important. In this dissertation, we first propose a realistic computational model. A realistic model should reflect the users' system viewpoint of the parallel machines. The parameters of the model must capture those features that are fundamental to predicting the performance of parallel software running on distributed memory machines. The model should be concise enough that algorithms can be developed and analyzed easily. The model should also provide the stage for developing portable algorithms, which can run on different architectures with little modification.

Since the common feature of distributed memory machines is their large remote data-access time, data locality is a key issue in achieving high performance on such machines. This suggests that developing effective data-mapping techniques should be a useful approach for reducing the total execution time. In this dissertation, based on our model, several techniques for achieving effective data-mapping are proposed, and the limitations and usefulness of the techniques are investigated. In general, the issues in data-mapping consist of initial data-mapping and data-remapping, as shown in Figure 1.5. Initial data-mapping considers how to assign data to each of the modules to minimize the communication costs for the entire execution or parts of it. In general, initial data-mapping captures the data locality of the computation at the beginning; thus, most of the data-accesses can be localized. For some applications, the localized data-access may become obscure after a period of computation, and the number of remote data-accesses increases. Each remote data-access experiences a high communication latency. Thus, an approach of data-remapping to restore the data locality of the computations is necessary. Data-remapping needs to consider the data re-layout and the data transportation among the modules. These are the overheads incurred in data-remapping. For those applications which have arbitrary data dependency and imbalanced load among the modules, gathering global information is necessary for determining the policy of data re-layout. Global information can be gathered into a module by accessing data from remote memory. It is obvious that data transportation among modules plays an important part in data-remapping.

Figure 1.5: Issues of mapping data to distributed memory machines.

Due to technological constraints, the limited bandwidth between any pair of communicating modules may lead to an intolerable communication time if an inappropriate communication scheduling strategy is applied. For example, data transportation among a group of modules may be poorly scheduled such that many modules access the same small group of modules at the same time. A straightforward way to avoid this is the linear permutation algorithm [8], sketched below. In this approach, data transportation among the modules is performed in several permutation steps; in each permutation step, each module in the group directly sends (receives) one message to (from) one other module. However, the approach cannot be directly applied to applications which require communicating messages with high variance in sizes. This suggests that novel approaches are required for data transportation among modules, to ensure that the gains from data-remapping are larger than the overheads paid for it. As illustrated in Figure 1.5, techniques such as communication scheduling, message-grain packing, and overlapping communication activities should be considered in developing fast data transportation algorithms.
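To make the schedule concrete, the following sketch gives one common round-based realization of a linear permutation in C with MPI; the fixed message size, the buffer layout, and the (rank + step) mod p pairing are illustrative assumptions, not necessarily the exact formulation of [8].

/* A minimal sketch of the linear permutation schedule for all-to-all
 * data transportation among p modules, one fixed-size message per pair. */
#include <mpi.h>
#include <string.h>

#define MSG_BYTES 1024  /* illustrative message size */

void linear_permutation(char *sendbuf, char *recvbuf, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);

    /* Step 0: keep the local block. */
    memcpy(recvbuf + (size_t)rank * MSG_BYTES,
           sendbuf + (size_t)rank * MSG_BYTES, MSG_BYTES);

    /* Steps 1..p-1: in step s, module j sends to (j+s) mod p and
     * receives from (j-s+p) mod p, so no module is accessed by more
     * than one sender in the same step. */
    for (int s = 1; s < p; s++) {
        int dst = (rank + s) % p;
        int src = (rank - s + p) % p;
        MPI_Sendrecv(sendbuf + (size_t)dst * MSG_BYTES, MSG_BYTES, MPI_CHAR,
                     dst, 0,
                     recvbuf + (size_t)src * MSG_BYTES, MSG_BYTES, MPI_CHAR,
                     src, 0, comm, MPI_STATUS_IGNORE);
    }
}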
1.3 A Summary of Results

The contributions of this dissertation are twofold: we propose a computational model for bridging parallel software and distributed memory architectures, and we investigate the impact of our data-mapping techniques based on the model.

In Chapter 2, a realistic structure-independent model is proposed to capture the common architectural features of state-of-the-art distributed memory machines. To illustrate the uniqueness of our model, the model is compared with several recently proposed models (the Postal model [5], the LogP model [18], and the BSP model [58]). The proposed functional elements of our model represent an algorithm designer's perspective. To model the remote data-access in distributed memory machines, we abstract the communication facilities into point-to-point duplex channels and duplex message handlers ("duplex" refers to a communication facility which can perform data delivery and data reception at the same time). A channel is dedicated to a pair of modules for transporting data, and a message handler is attached to each module for delivering (and receiving) data to (from) the channel. If multiple messages arrive at a target module in the same time interval, the message handler at the target module chooses a message to receive according to its reception policy. A message handler can be a hardware device (such as the support circuit in the T3D), a combination of software and hardware (such as the communication processor in the SP-2), or a software routine executed by a processor (such as in the CM-5). Intuitively, the role of the message handlers in our model is to capture the communication protocol overheads incurred in remote data-access. A pair of message handlers cooperate to transport data from consecutive memory locations in the source module to consecutive memory locations in the target module. In general, the steps for communicating a message between a pair of modules include constructing a message (in which machine users copy data from local memory locations into consecutive local memory locations), starting up the network, delivering data to the network, and receiving data from the network. The major differences between our model and others are that we capture the message construction activity and the message reception policy at the target module.
In Chapter 3, the suitability of applying various initial data-mapping strategies on distributed memory machines to achieve scalability is evaluated. The evaluation considers the situation in which the number of processor/memory modules is increased to suit the computational requirement of an increasing problem size. Furthermore, an approach for reducing communication costs by totally or partially eliminating the message construction activity is proposed.

• Current scalable parallel machines can be scaled up by adding more processor/memory modules. Thus, the available computing power and available physical memory scale up in proportion to the number of modules. It is well known that a parallel system of increasing size may maintain constant efficiency by increasing the amount of input data. A parallel system refers to a machine-algorithm pair. Based on this, the isoefficiency function has been defined to measure the scalability of a parallel system in [33]. In general, as the problem size increases, the memory requirement of an initial data-mapping technique may increase at a different rate. However, isoefficiency analysis ignores the available memory of the parallel machine. In this chapter, the scalability of parallel systems with limited available memory is discussed. Assume that the total memory of a distributed memory machine is linearly proportional to the number of modules and that the efficiency of the parallel system needs to be maintained at a desired level. Then, our result shows that although the isoefficiency function exists for a parallel system, the number of modules of the distributed memory machine may be bounded. We call this upper bound the 'expansion range' of the parallel machine. This implies that the available physical memory of a distributed memory machine also affects the scalability of the parallel system. Our results also show that a parallel system with a better isoefficiency measure may have a smaller expansion range than one with a poor isoefficiency measure. Thus, the expansion range of a parallel machine should also be taken into consideration when measuring the scalability of a parallel system.
• In general, data items need to be placed in consecutive memory locations in a source module to be delivered to consecutive memory locations in a target module. To place the data in consecutive memory locations, memory-to-memory copy may be necessary. The time for copying data from one memory location to another is not negligible. For example, the memory-to-memory copy time is 0.01 µsec/byte on the SP-2; it is more than one third of the unit data transmission time (0.028 µsec/byte) on that machine. In this chapter, we propose an approach for partially or totally eliminating memory-to-memory copy activities. This is achieved by either (1) delivering data of the same destination using multiple messages of smaller size or (2) placing data of the same destination as close as possible in the local memory (a small sketch of this trade-off follows this list). The importance of reducing memory-to-memory copy activities is increasing as the technology of new communication media (such as optical fiber) advances, leading to even faster data transmission rates. In this case, the memory-to-memory copy time will dominate the data transportation time.
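To make the trade-off concrete, the sketch below contrasts the two options in C with MPI for items scattered in local memory; the buffer bound, item layout, and function names are illustrative assumptions rather than code from this dissertation.

/* A minimal sketch of the two alternatives for sending n scattered
 * items to one destination: (1) pack into one contiguous buffer and
 * send one message (pays the copy cost t_c per byte), or (2) send each
 * item directly with its own message (pays one startup t_0 per item).
 * Which is faster depends on t_0, t_c, and the item sizes. */
#include <mpi.h>
#include <string.h>

void send_packed(char **items, int *len, int n, int dst, MPI_Comm comm)
{
    char buf[65536];              /* illustrative bound on total size */
    int off = 0;
    for (int i = 0; i < n; i++) { /* memory-to-memory copy of every byte */
        memcpy(buf + off, items[i], len[i]);
        off += len[i];
    }
    MPI_Send(buf, off, MPI_CHAR, dst, 0, comm);  /* one startup */
}

void send_unpacked(char **items, int *len, int n, int dst, MPI_Comm comm)
{
    for (int i = 0; i < n; i++)   /* no copy, but n startups */
        MPI_Send(items[i], len[i], MPI_CHAR, dst, 0, comm);
}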
In Chapter 4, techniques are proposed to reduce the communication latency. Users of parallel machines can write their codes using different programming paradigms. For example, programmers can write their codes such that the computation phases of some modules overlap the communication phases of the other modules, or such that all the modules synchronously alternate between computation phases and communication phases. The advantage of the latter is that the codes can be easily written and debugged. To achieve large speed-ups in solving an application problem on a distributed memory machine, the time spent in a communication phase should be minimized. In this chapter, techniques are proposed to reduce the time required for performing basic communications among the modules.
• A straightforward approach for communicating a message takes the following steps: constructing the message, starting up the communication, and data transportation. In general, a processor needs to cooperate with the communication facility to start up a communication. However, there is an opportunity for overlapping the activity of message construction with that of data transportation. We show that by choosing appropriate message sizes and overlapping message construction with data transportation, the communication latency can be significantly reduced. The size of the message should be chosen according to the architectural features of the parallel machine and the amount of data to be transported. For example, the time for communicating m bytes between a pair of modules can be up to t_0 + m(t_d + 2t_c), where t_0 is the startup time, t_d is the data transportation rate, and t_c is the memory-to-memory copy rate. However, using a sequence of messages of size α, where α = max{t_0/(2t_c + t_d), t_c}, and overlapping the data transportation with message packing-and-unpacking (construction), one-to-one communication can be efficiently performed in approximately (m − α)(2t_c + t_d) time, for m ≥ t_0/(2t_c + t_d) bytes.
• Data redistribution among the modules can be achieved by performing all-to-all communication. To avoid congestion due to many modules communicating with one module, the linear permutation algorithm [8] can be applied. However, when the amounts of data delivered to or received from other modules are very different, a delay effect which causes the communication activities of some modules to be suspended may occur, due to (1) the limited bandwidth between any pair of communicating modules and (2) the finite capacity
of the network. If this delay effect happens at each of the p − 1 permutation steps, then the time for performing data-redistribution can be up to p·t_0 + p·L·t_d, where L is the maximum amount of data to be delivered or received by a module. Our technique is proposed to eliminate this adverse effect. Our data redistribution technique consists of two phases. During the first phase, our technique attempts to balance, among the modules, the amount of data which is to be delivered to other modules. During the second phase, the data is sent to its final destination. Using our technique, the time for performing data redistribution can be reduced to 2p·t_0 + 2L(t_d + 2t_c) + t_bs, where t_bs is the time for performing a barrier synchronization. To benefit from the two-phase technique, the gain from employing the balancing phase should be larger than its overhead. A method for estimating the gain and overhead of the technique is also given (a small cost comparison follows this list). Based on this, the necessity of performing data redistribution using the two-phase technique for a given communication pattern can be judged.
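As a rough illustration, the sketch below evaluates the two bounds above using the SP-2 constants quoted elsewhere in this dissertation (t_0 = 40 µsec, t_d = 0.028 µsec/byte, t_c = 0.01 µsec/byte); the barrier cost t_bs, the choice of p, and the reconstructed one-phase bound are assumptions made only for illustration.

/* A minimal sketch comparing the worst-case one-phase redistribution
 * time p*t0 + p*L*td against the two-phase bound
 * 2*p*t0 + 2*L*(td + 2*tc) + tbs, as reconstructed from the text. */
#include <stdio.h>

int main(void)
{
    const double t0  = 40.0;    /* startup time, usec (SP-2, from Ch. 2) */
    const double td  = 0.028;   /* transmission rate, usec/byte (Ch. 3)  */
    const double tc  = 0.01;    /* memory copy rate, usec/byte (Ch. 3)   */
    const double tbs = 100.0;   /* assumed barrier cost, usec            */
    const int    p   = 64;      /* assumed number of modules             */

    for (long L = 1024; L <= 1 << 20; L *= 4) {  /* max bytes per module */
        double one_phase = p * t0 + p * L * td;
        double two_phase = 2 * p * t0 + 2 * L * (td + 2 * tc) + tbs;
        printf("L=%8ld  one-phase=%12.1f usec  two-phase=%12.1f usec\n",
               L, one_phase, two_phase);
    }
    return 0;
}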
In Chapter 5, approaches for data-remapping are investigated. In general, data-remapping can be used to balance the load among the modules as well as to restore the locality of data-access during computation. For applications in which the communication patterns and load distribution are known beforehand, the approach used to remap data to the modules can be pre-determined. For the other cases, dynamic data-remapping techniques are required. Two important problems in image processing (FFT and contour tracing), which have different communication requirements, have been chosen as examples for demonstrating the suitability of applying data-remapping techniques.

• The fast Fourier transform (FFT) computes the discrete Fourier transform (DFT) of an n-dimensional complex vector (x_0, x_1, ..., x_{n-1}). The butterfly algorithm [12] can be used to compute the FFT. The initial data-mapping, which maps the butterfly algorithm to the processor/memory modules of a distributed memory machine, can be a cyclic layout or a blocked layout (a small sketch of these layouts follows this list). A hybrid layout is a sequence of various data-mappings, which are applied in alternation while computing the FFT. Since the FFT has fixed communication patterns, fixed data-remapping can be applied to remap a cyclic (blocked) layout to a blocked (cyclic) layout. Computing the FFT using a hybrid layout has been considered to have smaller communication time compared with a blocked (or cyclic) layout [14]. However, our results show that a blocked (or cyclic) layout can lead to faster communication in some cases. This is due to the large startup time dominating the data transportation time in the data-remapping process.
• In this dissertation, a technique for dynamic data-remapping is proposed to eliminate inter-module data dependencies as well as balance the load among the modules. Our data-remapping technique consists of two major steps: (1) decentralized cluster-labeling and (2) data movement. The cluster-labeling assigns a label to each of the data clusters based on information about the global load and the data dependency among the modules. This approach can be applied to many intermediate-level IU tasks for load redistribution. These include, for example, finding convex hulls, approximating contours using linear segments, perceptual grouping, etc. To illustrate our ideas, we use the contour tracing problem as an example. A contour is a cluster of marked pixels which denotes the boundary of an object in an image. Based on the contours, many operations can be performed to extract the features of the objects in the image. To show the usefulness of our methodology, experiments were conducted on an IBM SP-2 with a dedicated pool of 64 modules. Our experimental results indicate that inter-module data dependency is the major factor that limits the speed-ups that can be achieved in tracing contours. In addition, an unbalanced work load on the modules can further reduce the speed-ups.
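To make the two initial layouts concrete, the sketch below shows how an element index maps to a module under the blocked and cyclic layouts discussed above; it assumes n divisible by p and is a minimal illustration, not the dissertation's implementation.

/* A minimal sketch of the two initial data-mappings for an n-point
 * FFT on p modules (n assumed divisible by p):
 *   blocked layout: element i lives on module i / (n/p)
 *   cyclic  layout: element i lives on module i mod p
 * A hybrid layout remaps between the two partway through the
 * log2(n) butterfly stages to keep butterflies local. */
#include <stdio.h>

int blocked_owner(long i, long n, int p) { return (int)(i / (n / p)); }
int cyclic_owner (long i, long n, int p) { (void)n; return (int)(i % p); }

int main(void)
{
    long n = 16;      /* illustrative 16-point FFT */
    int  p = 4;       /* illustrative 4 modules    */
    for (long i = 0; i < n; i++)
        printf("x_%-2ld  blocked -> module %d   cyclic -> module %d\n",
               i, blocked_owner(i, n, p), cyclic_owner(i, n, p));
    return 0;
}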
Chapter 2

A Computational Model

The development of a parallel model that bridges parallel software and parallel machines is fundamental to the success of massively parallel computation. In Section 2.1, a classification of computer models is given and the current trend in developing models for parallel computation is discussed. The nature of our model and the parameters used in our model are given in Section 2.2. In Section 2.3, to illustrate the uniqueness and realism of our model, our model is compared with several well-known models.

2.1 Background

According to the purposes of the proposed models for parallel computation, the models can be classified into four basic types: the programming model, the machine model, the architectural model, and the computational model [25]. The programming model is used to verify a program in terms of the semantics of a particular implementation of a programming language. The other types are more related to performance evaluation of a parallel system (a software-machine pair). The machine model consists of a detailed description of the hardware and operating system. The architectural model is at the next higher level of abstraction over the machine model. The architectural model describes the topology of the interconnection network of a parallel machine and the functions of the fundamental resources, but not their implementation details. The computational model is at the next higher level of abstraction. Ideally, a model of computation provides an abstract view of a class of computers. The basic requirements of a computational model of a parallel machine are described as follows.

• The model must be capable of providing enough information about the relative cost of parallel computations and communications.

• The nature of the model must provide the basis for easily developing an efficient algorithm as well as analyzing the algorithm.

• So that machine users can focus on the most important factors affecting algorithm design, a computational model should not use too many parameters for describing the performance of a parallel system. In our opinion, the number of parameters should be no more than five.

A number of models [5, 14, 28, 58] have recently been proposed to satisfy these requirements. The common feature of these models is that they do not describe the topology of the interconnection network for the class of parallel machines which the models attempt to capture.
Based on current technology, the startup time for delivering a message is large compared with the network latency. The small network latency is due to fast network switches. For example, the network switch latency of the IBM SP-2 is 125 nsec (the startup time is 40 µsec), and the network latency of the TMC CM-5 is 200 nsec (the startup time is 86 µsec). The network topologies of the IBM SP-2 and TMC CM-5 are a multistage network and a fat-tree, respectively. These types of interconnection networks have very low diameter. For machines of reasonable size (less than 16K processors), the communication time is insensitive to the distance between any pair of communicating processors, because the effect of the topology of the interconnection network is masked by the large startup time. Currently, strategies for fault-tolerant routing receive much attention in the design of interconnection networks. An example is the 3D torus network structure used in the Cray T3D, in which data will be routed along alternative paths if a broken link or a faulty switch is detected on the regular path. The data-routing strategy guarantees that the parallel machine can work in the presence of broken links or faulty switches. It also implies that the physical distance between a pair of communicating processors is not necessarily the same as the length of the routes used for transporting data between them. Since (1) the network latency is small, (2) the size of current MIMD parallel machines is reasonable, and (3) multiple routes can be used for transporting data, a computational model for current MIMD parallel machines which does not imply any structural information about the topology of the network is realistic. However, the computational model should capture the relative costs of parallel computations and communications. We refer to this type of model as a structure-independent model. A major advantage of a structure-independent model is that algorithms designed on the model have greater portability and can operate in the presence of broken links and broken network switches.
Based on the above reasons, our model uses a structure-independent approach. Currently, many parallel machines provide different mechanisms for the various requirements in performing communications. Examples are the Cray T3D, TMC CM-5, and Stanford FLASH. In these machines, mechanisms are provided for delivering data using long messages of various sizes or using short messages of fixed size. For a module which delivers data using long messages of various sizes, memory-to-memory copy operations may be required for the module to construct a message. The message construction moves the data to be delivered into consecutive memory locations. The parameters of our model should be able to distinguish the costs of transporting data using the various communication mechanisms. In addition, our model can distinguish the cost of performing communications using long messages for transporting data items stored at consecutive memory locations from that for transporting data stored at scattered memory locations. Currently, the communication protocol of a parallel machine plays an important role in effectively accessing data from remote memory. The protocol is a part of the parallel machine. Thus, we abstract the communication protocol into a functional element in our model. The functional element schedules the data receptions of the incoming messages.
2.2 A Realistic Computational Model

Our development of a structure-independent model for distributed memory machines is motivated by the technological trend in constructing parallel machines. The nature and parameters of our model absorb the factors which affect algorithm design and application development. Current scalable parallel machines are formed by a collection of essentially complete computers (processor/memory modules) combined by an interconnection network. In these machines, the memory is physically distributed across the machine. We refer to this class of parallel machines as distributed memory machines. In these machines, accessing data from remote memory is much slower than accessing data stored in local memory. To develop effective parallel algorithms on current distributed memory machines, several factors related to remote data-access need to be carefully taken into consideration. These factors include: (1) the large startup time for delivering a message, (2) the limited communication bandwidth available to a module, (3) the requirement of memory-to-memory copy operations for constructing a message, (4) the finite network capacity, and (5) the policies for scheduling message receptions at the target modules. To model the remote data-access from the algorithm designer's perspective, we abstract the communication facilities (including hardware and software supports) of a distributed memory machine into point-to-point channels and message handlers. Each of the channels and message handlers can perform data delivery and data reception at the same time. An overview of our model is illustrated in Figure 2.1. A message delivery is initiated by the processor in a source module, which starts up the message handler to push a sequence of data items into the channel. Without additional intervention by the processors, the message handlers at the source module and at the target module cooperate to transport data from consecutive memory locations in the source module to consecutive memory locations in the target module. At the same time, the processors in both modules can either be idle or perform other computations.

Detailed descriptions of the channels and the message handlers are given as follows:
Figure 2.1: An overview of our computational model. (Two processor/memory modules, each with a message handler, are connected by a channel (data path); data in consecutive memory locations are moved by the message handlers, which the processors drive through control lines.)
Point-to-point channels: channels are used to model a distributed memory machine with p modules. Each pair of modules is connected by a channel. Data exchange between a pair of communicating modules goes through the channel which connects them. Data are delivered into and received from the ends of the channels. The function of the point-to-point channels is to capture the feature of current distributed memory machines in which a pair of modules communicate without disturbing the computation or communication activities in other modules.

Message handlers: p message handlers are used to model a distributed memory machine with p modules. Each message handler is attached to a module for delivering (receiving) data from consecutive memory locations in the module to (from) another module. The message handler of a module switches over the channels connected to the module for message deliveries and message receptions. Messages can be of various sizes. If the message handler of a module is busy delivering a pending message, it will not respond to any request for message delivery from the processor in the same module; in this case, the processor in the module waits idly until the message handler can respond to its request. When receiving a message from a channel, the message handler cannot receive any new message before the pending message has been completely received. If there are multiple messages queued at the channels connected to a module, one of the queued messages is chosen according to the data reception policy of the message handlers. In a distributed memory machine, a message handler can be a hardware device (such as the support circuit in the T3D), a combination of software and hardware (such as the communication processor in the SP-2), or software executed by a processor (such as the case in the CM-5). The function of the message handlers is to capture (1) the limited communication bandwidth available to a module, and (2) the communication protocol for effectively accessing remote data.
Basic assumptions about our model are as follows:

• Each channel has a capacity of at most one byte. The activity of a message handler in pushing data into a channel is suspended if the channel is saturated. The suspended message handler temporarily ignores any new request for message delivery. If the processor of the same module makes a new request for message delivery, the processor stays idle until the pending message has been completely delivered. Thus, our model encourages machine users to develop a good scheduling policy for delivering messages in their programs. Note that the scheduling policy for receiving messages is captured by the message handlers.

• The message handler at a target module switches among the p − 1 channels connected to the module for receiving messages. This is illustrated in Figure 2.2.
Figure 2.2: An illustration of the data receptions at a message handler. (On the source side, a message handler pushes data from memory into the channel; on the target side, a message handler receives the data into memory.)
The message reception policy of the message handler at a target module can affect the total communication time. The analyses made in the following chapters assume that the message handlers receive messages using a first-come-first-served message reception policy.
In our model, we use three parameters to capture the communication cost.

T_d (μsec/byte): the rate at which the message handler transports data from consecutive memory locations in its source module to consecutive memory locations in its target module. For the SP-2, T_d was measured to be 0.028 μsec/byte. This was measured by computing the ratio (message transport time)/(number of bytes delivered) for large messages on the SP-2 using MPL.
T_o (μsec/message): the startup time for delivering a message. It is the interval from the time the message handler responds to a request for message delivery to the time the first data byte has been pushed into the channel. For the SP-2, T_o was observed to be 40 μsec/message; this is the minimum time for delivering a small message on the SP-2 using MPL. Since a startup process is needed for delivering a message, the costs of transporting data among modules using a long message or using a sequence of short messages can be distinguished. Note that the startup time for long messages and that for short messages may not be the same.
T_c (μsec/byte): the time for a user's program to copy data from one memory location to another. For the SP-2, T_c was measured to be 0.01 μsec/byte. This was measured by copying a large amount of data between two 1D arrays; the copy rate is the ratio of the total copy time to the total number of copied bytes. Since data items need to be moved to consecutive memory locations for delivery using a long message, the cost of transporting data stored at consecutive memory locations can be distinguished from that of transporting data stored at non-consecutive memory locations.
In [40, 43], these parameters have been shown to have significant impact on the strategies used to reduce remote data-access time, to map initial data to the modules, and to remap intermediate output to the modules. Our parameters do not attempt to capture (1) possible data contention in an interconnection network and (2) network latency. The rationale is based on the following observations.
• In the construction of the current generation of distributed memory machines, much attention has been paid to avoiding data contention at the network switches. The interconnection network has been designed such that the total network bandwidth scales up as the machine size increases. This reduces the possible data contention in the network.

• In addition, due to current network switch technology, the network latency is a very small fraction of the communication time if there is no data contention at the network switches.
Assume T_o = t_o, T_d = t_d, and T_c = t_c. In our model, the time for a pair of modules to communicate a message consisting of m bytes (stored in consecutive memory locations) is t_o + m·t_d. Note that copying the data into consecutive memory locations takes m·t_c time; copying the data is performed by the user program. In this dissertation, computation time denotes the time taken by the processors to perform useful computations, and communication time denotes the sum of message construction time, startup time, and the time for transporting messages among modules.
Our model does not attempt to capture classes of parallel machines in which the network latency is large or a prefetch mechanism is provided. The rationale is as follows. A large network latency in a parallel machine may be due to (1) insufficient network bandwidth or (2) slow network switches. Current scalable parallel machines are designed with no single bottleneck component; thus, the problems of insufficient network bandwidth and slow network switches are usually solved before the machines are constructed. However, without capturing the network latency in our model, techniques for hiding the network latency cannot be investigated. For parallel machines which support prefetch mechanisms, a message delivery from a module does not delay any computation activity to be performed in the module. In our model, a module cannot deliver a new message until the previous message has been completely pushed into the network; thus, a module which tries to deliver a new message before the old message has been completely pushed into the network stays idle. This implies that our model encourages algorithm designers to schedule the communication activities in their algorithms.
2.3 Comparison with Other Models

Many models of parallel machines have been proposed. In this section, several models of wide acceptance or current interest are discussed. The PRAM model and the network models have been used widely in analyzing algorithms; those models are compared with our model first. Due to technological trends, several models for distributed memory machines have recently been proposed, including the LogP model, the Postal model, and the BSP model. These models do not consider the topology of the interconnection networks, and they encourage algorithm designers to reduce the number of remote data-accesses. Comparisons between our model and those structure-independent models are also made in this section.
2.3.1 The PRAM Model

The PRAM model [19] is the most popular model for representing and analyzing the complexity of parallel algorithms. The key assumption regarding algorithmic performance in the PRAM model is that the running time can be measured as the number of parallel memory accesses [13]. It can serve as a good model for representing the logical structure of a parallel algorithm. However, when the PRAM model is used to capture the communication activity in a distributed memory machine, the cost of performing remote data-access cannot be distinguished from that of performing local data-access. That is, the model ignores the extra cost of accessing data from remote memory. Thus, this model does not encourage algorithm designers to reduce the amount of remote data-access when using distributed memory parallel machines. In our model, remote data-access and local data-access are distinguished by including functional elements (the point-to-point channels and the message handlers); an extra cost is incurred for accessing data from remote memory compared with local memory. Parameters T_o, T_d, and T_c are used to represent the time complexity of remote data-access.

The p processors in the PRAM model can access p memory locations concurrently without causing any access contention. However, remote data-access in current distributed memory machines may encounter contention in accessing data items from p different memory locations; this can happen if some of the data items to be accessed are stored at the same module. Thus, the limited communication bandwidth available to a module in a distributed memory machine cannot be captured if the PRAM model is used. In our model, we use message handlers to capture the limited communication bandwidth available to a module: a message handler can only deliver data to and receive data from the channel at a fixed rate.
2.3.2 The Network Models

In network models, the key assumption is that communication is allowed only between pairs of processor/memory modules directly connected by an interconnection network. Many algorithms [2, 41, 42, 49] have been developed for parallel machines with specific networks; the communication structures of these algorithms exploit the topology of the network to achieve high performance computing. One of the common features of current distributed memory machines is that they allow data to be pipelined into high bandwidth networks, and data can be transported between a pair of communicating modules without disturbing the computation or communication activities in other modules. Unlike transporting data using a store-and-forward routing scheme [57], the communication cost is not sensitive to the number of hops between any pair of communicating modules. This suggests that the topology of the interconnection network in current distributed machines is not a key issue in designing effective algorithms. In our model, instead of capturing the network using its topology, we use point-to-point channels and message handlers to capture the features of an interconnection network.

The syntax of the MPL provided by current distributed memory machines supports point-to-point communications to increase portability. With a little modification, programs written using MPL can be transformed into programs using MPI [44]. However, for algorithms developed on a network model for a specific architecture, significant effort is required to modify the algorithms to run on different architectures.
2.3.3 The LogP Model

The LogP model [14] is based on four parameters that abstractly specify the network latency, the efficiency of coupling computation to communication, the communication bandwidth, and the computing bandwidth. In the LogP model, the size of a message is fixed. The motivation for proposing the LogP model is based on the technological trends in constructing distributed memory machines. The parameters of the LogP model are described as follows.

L: an upper bound on the latency, or delay, incurred in communicating a message containing a word (or small number of words) from its source module to its target module.

o: the overhead, defined as the length of time that a processor is engaged in the delivery or reception of each message; during this time interval, the processor cannot perform other operations.

g: the gap, defined as the minimum time interval between consecutive message transmissions or consecutive message receptions at a processor. The reciprocal of g corresponds to the available per-processor communication bandwidth.

P: the number of processor/memory modules.

Further, it is assumed that the interconnection network has a finite capacity, such that at most ⌈L/g⌉ messages can be in transit from any processor or to any processor at any time. The model is asynchronous; that is, processors work asynchronously, and the latency experienced by any message is unpredictable but is bounded above by L in the absence of stalls.
Our model is different from the LogP model in three aspects: the cost associated with delivering a message, the cost associated with preparing a message, and the message reception policy.

• In the LogP model, L seems to capture the switch latency of an interconnection network; the time for delivering a message includes L, o, and the time for data transportation. Due to advances in network technology, the switch latency in the network is just a small fraction of the startup time for a parallel machine of reasonable size. In our model, we do not attempt to capture the network switch latency. The T_o in our model captures the software overhead in delivering a message, where software overhead refers to the period of time during which a processor cannot perform any useful computation.
• In the LogP model, the message to be delivered is of fixed size; thus, the overhead o is in linear proportion to the amount of data to be transported. The syntax of the message-passing library (MPL) provided in current distributed memory machines supports data delivery using messages of various sizes; the data in each message are delivered following a startup process. Although the data to be delivered can be packed into packets of fixed size, algorithm designers are usually not allowed to control this process. In our model, a startup cost (which is a software overhead) is associated with a message delivery, regardless of the size of the message. In general, a message is the data stored at consecutive memory locations in a source module, to be delivered to consecutive memory locations at a target module. To construct a message, memory-to-memory copy operations are required; in our model, the operations for constructing a message are captured using the parameter T_c. Since the data in a message cannot be used until the message has been completely received, users of the parallel machines are encouraged to choose an appropriate size of smaller messages so that data are delivered as soon as possible; part of the data can then be used by the processor in the target module for performing useful computations. However, delivering multiple messages increases the total software overhead. Thus, the suitability of delivering the data of an application problem using a sequence of short messages or a single long message should be judged.
• The communication protocol can affect the total communication time. Different machines may provide different protocols; furthermore, a machine can provide various communication protocols to suit the requirements of performing efficient communications for various communication patterns. The LogP model leaves it to algorithm designers to decide the message reception policy. However, programming languages and most MPLs do not provide any primitive for users to tailor their message reception policies. In our model, the message reception policy is captured by the message handlers, which are a part of the parallel machine.
2.3.4 The Postal Model

The Postal model [5] is designed to model message-passing parallel machines. The model focuses on three aspects of communication features in these machines: full connectivity, simultaneous I/O, and network latency. In the Postal model, the size of messages is fixed. The nature of the Postal model is described as follows.

Full connectivity: each processor P_i in the machine can send a point-to-point message to any other processor P_j in the machine.

Simultaneous I/O: each processor P_i can simultaneously send one atomic piece of data to processor P_j and receive another atomic piece of data from processor P_k.

Communication latency: if at time t processor P_i sends an atomic piece of data to a processor P_j, then processor P_i is busy sending the data during the time interval [t, t + 1], and processor P_j is busy receiving the data during the time interval [t + λ − 1, t + λ].

The Postal model operates in asynchronous mode. After a module has delivered a message, it is free to perform other operations, including other message deliveries. Thus, this model captures the send-and-forget nature of communicating messages in a message-passing parallel machine.
Our model differs from the Postal model in five aspects: the cost associated with delivering a message, the cost associated with preparing a message, the message reception policy, the capacity of the network, and the limited communication bandwidth available to a module. The reasons for the first three are similar to those given for the LogP model; only the reasons for the last two are explained here.
• In the Postal model, the limited capacity of an interconnection network is not captured. With a source module connected to an interconnection network of infinite capacity, the source module can push any amount of data into the network without considering the data reception rate at its target module. However, the network capacity in current distributed memory machines is limited; in this case, the communication activity of the source modules can be blocked due to a saturated network. Since the Postal model does not consider the limited capacity of the network, this blocking effect can be overlooked. In our model, the network capacity is captured by specifying the maximum allowable amount of data in the point-to-point channels.
• The Postal model seems to use the software overhead paid for each message to capture the limited communication bandwidth available to a module. For a machine which can communicate messages of various sizes over a finite-capacity network, the distinction between software overhead and limited communication bandwidth is very important. For example, for a machine in which each module has infinite communication bandwidth, messages delivered at the same time will arrive at their target modules at the same time; the message reception policy is less important for such a machine. However, for a machine with finite communication bandwidth, the message reception policy will affect the communication or computation activities in the source modules.
2.3.5 The BSP Model

The performance of the BSP model [58] is described in terms of three types of functional elements:

Processor/memory modules: each module performs arithmetic and/or memory functions.

Facility for synchronization: it synchronizes all or a subset of the modules at regular intervals of T time units. The execution of a program consists of a sequence of super-steps. In each super-step, each module is allocated a task consisting of some combination of local computations, message deliveries to, and message receptions from other modules. After each period of T time units, a global check is made to determine whether the super-step has been completed by all the components; if it has, the machine proceeds to the next super-step.

A router: it delivers point-to-point atomic information between pairs of processor/memory modules. The basic task of the router is to realize arbitrary h-relations, or, in other words, super-steps in which each module sends and receives at most h pieces of atomic information.
The BSP model is designed from the viewpoint of parallel machine users. The major differences between our model and the BSP model can be found in two aspects: the cost associated with constructing a message and the functional elements for describing an interconnection network. The reason for the first is similar to that given for the LogP model; the latter is explained as follows. During a super-step, a processor may send at most h pieces of atomic information and receive at most h pieces of atomic information; such a communication pattern is called an h-relation. In the BSP model, either the network should be powerful enough to realize any h-relation in constant time, or the super-step should be adjustable to accommodate any h-relation. The capacity of the network may need to be infinite to realize any h-relation through a network in constant time; this is not realistic under current technology. To adjust a super-step to accommodate any h-relation on a network with finite capacity, the network should have an intelligent data transportation strategy which can change dynamically to suit the requirements of performing a given communication pattern. In our model, we use message handlers and point-to-point channels to capture the features of the interconnection network in current distributed memory machines: finite network capacity, limited communication bandwidth available to a module, and message reception policy. Based on these functional elements, users of these parallel machines should tailor their algorithms or application programs to achieve high performance computing on the parallel machines.
2.4 Summary

Current scalable distributed memory machines are formed by a collection of processor/memory modules combined by an interconnection network. In this chapter, a realistic model is proposed to bridge algorithm designers and parallel machines. The model captures the common features of distributed memory machines to which an algorithm designer should attend in order to achieve large speed-ups. The model does not try to capture all the details of a specific parallel machine; its attributes describe the integral behavior of the system software and machine hardware. Thus, by setting the attributes of our model according to a certain class of parallel machines, the model can capture the common behavior of that class. The communication facilities in current distributed parallel machines incur substantial overhead in initializing message delivery and scheduling message reception. The integral behavior of performing communication is captured by using point-to-point channels and message handlers. The time complexity of performing communication is described using three parameters (T_o, T_d, T_c). The attributes of our model encourage machine users (1) to localize most of the data-accesses, (2) to schedule data transportation based on the message reception policy of the message handlers, and (3) to partially overlap the communication activities with computations.
Chapter 3

Initial Data-mapping

In this chapter, data-mapping strategies for reducing communication time are investigated in two aspects: (1) a methodology is proposed for evaluating techniques for mapping input data items to the modules (module-mapping) of a distributed memory machine with bounded memory sizes, and (2) approaches are proposed for mapping input data items to memory locations (memory-mapping). Data dependencies during computations force the modules to access data from other modules. The goal of applying a module-mapping strategy is to localize most of the remote data-accesses in solving an application problem; thus, the total communication time in solving the application problem can be reduced. In general, the data items in a message are stored at consecutive memory locations in its source module and will be delivered to consecutive memory locations in its target module. Thus, memory-to-memory copy operations are needed for the source or target modules to copy the data items to appropriate memory locations; we call this process message construction. The goal of applying a memory-mapping strategy is to totally or partially eliminate the memory-to-memory copy operations. By eliminating these operations, the time for communicating messages among the modules can be reduced. In Section 3.1, a meter for measuring the scalability of a parallel system is given, and the syntax of the commands for transporting data among the modules is discussed. The suitability of applying various module-mapping strategies for achieving large speed-ups on a distributed memory machine is investigated in Section 3.2; the suitability of the module-mapping techniques is measured using a scalability meter. In Section 3.3, approaches are proposed to reduce the time for accessing data from other modules; this is achieved either by storing the intermediate output of each module at appropriate memory locations (thus reducing the time for constructing a message) or by choosing an appropriate communication mechanism.
3.1 Background

Current distributed memory machines are formed by a collection of processor/memory modules connected by an interconnection network. Each module accesses data from another module over the interconnection network. The computing power of a distributed memory machine can be scaled up by increasing the number of processor/memory modules. All the modules of the parallel machine cooperate to execute a task. During the execution of a task, communication and synchronization may be needed to exchange data among the modules. Communication and synchronization are the major overheads in performing a parallel task, and these overheads have an adverse effect on the performance of a parallel system. The term system refers to a machine-algorithm pair. Thus, minimizing the overheads is important for a parallel system to achieve large speed-ups.
Scalability analysis can help machine users evaluate the performance of a parallel system. Efficiency is an important performance metric for a parallel system: it indicates the percentage of the available computing power of a parallel machine that has been dedicated to the useful computations of a task. It is well known that to maintain the efficiency of a parallel machine at a desired level, there exists a relationship between the problem size of an application and the number of processors in the parallel machine on which the application problem is solved. In [34], the authors state that the scalability of a parallel algorithm on a scalable architecture is a measure of its capability to effectively utilize an increasing number of processors. Thus, they developed a scalability meter called isoefficiency [33]. It relates the problem size of an application to the number of processors necessary for an increase in speed-up in proportion to the number of processors used. The relationship between the isoefficiency function of a system and some performance measures has been discussed in [22]. Scalability analyses have been made on several models which address the topologies of the interconnection networks; these analyses consider the suitability of using various module-mapping strategies for solving problems on parallel machines. However, due to advances in network technology, it is reasonable for the users of a distributed memory machine to ignore the topology of the interconnection network. Thus, the scalability of a parallel system needs to be evaluated based on a model which does not use the network topology to describe an interconnection network.
Even when an optimal module-mapping is applied to map input data to the modules, communications may be needed to exchange intermediate output among the modules. Communication tools, such as a message-passing library (MPL), are provided by machine vendors for users to access data from other modules. The syntax of many commands of these communication tools supports data delivery using messages of various sizes. Using this type of command, each source module transports data items stored at consecutive memory locations to consecutive memory locations in the target modules. Thus, memory-to-memory copy operations may be necessary for a source module to move data stored at different locations to consecutive memory locations. Similarly, such operations may also be necessary for the target modules to move data items stored at consecutive memory locations to their final memory locations. The time for the IBM SP-2 to copy one byte from a memory location to another memory location is 0.01 μsec, which is more than one third of the data delivery rate (0.028 μsec/byte). For processors with a slower local memory access rate, compared with the processor (RS 6000) used in the IBM SP-2, the local memory accesses needed for moving data items lead to a large communication time. Thus, the intermediate output should be stored at appropriate memory locations in each module to reduce the total communication time.
3.2 Mapping Data to Modules with Bounded Memory Sizes

Communication overhead in a distributed memory machine includes remote data-access and synchronization. In distributed memory machines, memory units are physically distributed across the modules. As the number of processor/memory modules of a distributed memory machine increases to suit the timing requirement of a large-scale application, the total memory also increases in proportion to the number of modules. To maintain the efficiency of a parallel system at a desired level, there is an upper bound on the number of modules which can be applied to a given application of a specified problem size. The maximum number of modules depends on the application, the initial module-mapping strategy, and the problem size. For a given application of a specified problem size, different module-mapping strategies may have different memory requirements. Furthermore, as the size of an application problem increases, the memory requirements can increase at different rates for different module-mapping strategies. In this case, the memory requirement of an application of a given problem size may exceed the memory space available in the parallel machine. This implies that the scalability issue should consider the impact of available memory size.
The isoefficiency function has been used to analyze the scalability of a parallel system. The isoefficiency function of a parallel system depends on the execution time for useful work and the communication overhead (extra work) performed in the parallel system. The isoefficiency analysis does not consider the memory constraint in a distributed memory machine. In this section, the scalability of a parallel system is evaluated by considering the suitability of using various module-mapping strategies on distributed memory machines with bounded memory sizes. The scalability analysis is conducted on our structure-independent model.
3.2.1 Definitions and Notations

The isoefficiency function [22] relates the problem size to a function f(p) of the number of processors. In general, the isoefficiency can be computed by using the following equation:

E = T_e(n) / (p × T(p, n))    (3.1)

where

• E: the desired efficiency of a parallel system. It is a constant.

• T_e(n): the problem size; the amount of computation performed by an optimal sequential algorithm with input size equal to n.
• T(p, n): the parallel computation time, where p is the number of processor/memory modules in the parallel system.

In spirit, Equation (3.1) evaluates the capability of a parallel machine to solve an application problem using a processor-time optimality approach. The total work (including useful work and communication overhead) performed by a parallel system is p × T(p, n). For a non-processor-time-optimal parallel algorithm employed on a parallel machine, the order of magnitude of p × T(p, n) can be larger than that of T_e(n). In this case, the efficiency approaches zero as the problem size increases, and the isoefficiency function of this parallel system does not exist. This implies that the scalability of some machine-algorithm pairs cannot be evaluated using the current definition of efficiency.
In some applications, an optimal sequential algorithm for an application problem may not be chosen for parallelization because of the difficulty in parallelizing it or its impracticality for practical use. The impracticality of an optimal sequential algorithm may be due to a large constant factor in its time complexity. Another reason for users to choose a non-optimal sequential algorithm as the basis of a parallel version is that current technology supports high performance processors, yet substantial time is lost in performing remote data-access. This may force users to consider choosing practical sequential algorithms or applying appropriate module-mapping techniques, such as replicating data items among the modules, to achieve large speed-ups.
Let T(1, n) be the time taken by a single-module system to execute the algorithm. Accessing data from remote memory and synchronization among modules are not required for performing computations in a single-module system; thus, the computations performed in T(1, n) time units exclude this type of operation. In this section, we give an alternate definition of the efficiency of a parallel system for deriving its isoefficiency function. The definition, which uses T(1, n) and T(p, n), is as follows:

E = T(1, n) / (p × T(p, n))    (3.2)

The alternate definition of efficiency uses T(1, n) as a base. The isoefficiency function derived using Equation (3.2) shows how well the useful work of an algorithm can be parallelized. In spirit, Equation (3.2) evaluates the suitability of a parallel algorithm for solving an application problem of increasing size on a scalable parallel machine. Since the scalability is evaluated based on a sequential algorithm and its parallel version, we call the isoefficiency function derived from this definition the absolute isoefficiency function; in this section, we use the absolute isoefficiency function to measure the scalability of a parallel system.
The advantage of taking the absolute isoefficiency function as a scalability measure is as follows. For a parallel algorithm which can solve a problem in a small period of time by using a larger number of operations (than those performed by an optimal sequential algorithm), the isoefficiency function is still defined. Thus, users of a parallel machine on which a non-optimal algorithm is implemented can evaluate the possibility of increasing the number of processors to achieve linear speed-ups. In contrast to the absolute isoefficiency function, the isoefficiency function derived using the definition in Equation (3.1) is called the relative isoefficiency function. In the following analysis, we use the absolute isoefficiency function instead of the relative isoefficiency function to measure the scalability of a parallel system. In [34], the authors also define the Memory Overhead Factor (MOF) for a machine-algorithm pair: the ratio of the total memory required by a parallel algorithm to solve an application problem of a given size to the amount of memory required by an optimal sequential algorithm to solve the same problem of the same size.
3.2.2 Scalability Analysis with Bounded Memory Sizes

In the design of a parallel algorithm on a distributed memory machine, effectively mapping data to the processor/memory modules of the parallel machine is a key issue for achieving high performance on the parallel system. Various module-mapping strategies [2, 41, 42, 49, 35] have been developed for network models. These mapping techniques partition data into several groups and map each group to the modules based on the data dependency between computations and the topologies of the interconnection networks. For some applications, the number of remote data-accesses and the number of synchronizations among modules can be reduced by distributing several copies of the same algorithm or data to each module and performing the computations concurrently. The effects of module-mapping strategies in which the input data items are replicated among the modules have been investigated in [35, 55, 56]. By replicating the data, the isoefficiency function of such a system may improve. However, the total memory in the distributed memory machine may be used up quickly, because multiple copies of data or programs need to be stored. The concept of virtual memory can be used to compensate for insufficient physical memory; in this case, the large space of I/O devices may be used as temporary storage, and data are transferred between the I/O devices and the memory as needed by the modules. Under current technology, processors are getting much faster, but I/O devices are improving mostly in capacity, not performance. This implies that the gap between fast processors and I/O devices is becoming wide, and the time for transferring data between I/O devices and physical memory then dominates the total execution time. The advantage of virtual memory becomes obscure, and virtual memory is not suitable for applications in which fast execution is the first consideration. This implies that we need to consider the constraint imposed by the memory size of a parallel machine in evaluating the scalability of a parallel system.
The basic requirement for solving an application problem of a given problem size is that the amount of memory space needed by the application problem run on a distributed memory machine should be less than the total memory size provided by the machine. The following inequality conveys this concept:

M_alg(n) < M_arch(p, m)    (3.3)

where

• M_alg(n): the memory size needed to solve an application problem with the amount of input data equal to n, where the application problem is solved using Algorithm alg. It is a function of the input size n.
Algorithm All-pairs-shortest-path
begin
  for k = 1 to n do
    for i = 1 to n do
      for j = 1 to n do
        d^(k)[i,j] = min( d^(k-1)[i,j], d^(k-1)[i,k] + d^(k-1)[k,j] )
      endfor
    endfor
  endfor
end {All-pairs-shortest-path}

Figure 3.1: Sequential Floyd Algorithm.
• M_arch(p, m): the total memory in a distributed memory machine arch on which Algorithm alg is run. It is a function of the number of processor/memory modules p in the machine and the local memory size m in each module.

From Equation (3.2), the isoefficiency function f(p) can be derived. The isoefficiency function sets an upper bound on the number of modules that can be applied to an application of a given size. The inequality in (3.3) sets a lower bound on the number of modules that should be used to store the input data. As the problem size of an application increases, the lower bound and the upper bound may approach a common value. This implies that the distributed memory machine may only be extended to a certain size for solving an application problem of increasing size. If the number of modules exceeds this bound, then it is impossible for the parallel system to maintain its efficiency at a desired level E. We call this range the expansion range of a machine-algorithm pair.

In the following, we use the all-pairs-shortest-path problem as an example to illustrate our concept. We assume that the all-pairs-shortest-path problem is solved by the Floyd Algorithm [1], which is shown in Figure 3.1. Using checkerboard partitioning, the n × n cost matrix is divided into p sub-matrices, each of size (n/√p) × (n/√p). These p sub-matrices are mapped to the p modules. Assume that the entries of each sub-matrix are stored at local memory locations in row-major order. The communication pattern used in the checkerboard partition is illustrated in Figure 3.2.
Figure 3.2: Data dependency in the checkerboard partition. (The entries of the k-th row and k-th column of the (k−1)-th iterate are needed to update the entries in iteration k.)
The total communication time of the parallel checkerboard version of the Floyd Algorithm is

2n × (t_o + (n/√p)(t_c + t_d)) × log √p + n·t_bs,

where t_bs is the time for synchronizing all the modules of the parallel machine. The term t_bs can be O(1) for a parallel machine such as the TMC CM-5, or O(log p) for a parallel machine such as the IBM SP-2. The total computation time of the parallel version of the Floyd Algorithm is O(n³/p). The total overhead due to performing remote data-accesses and synchronizations is

(t_o + (n/√p)(t_c + t_d)) × n·p·log p + n·p·t_bs.
The worst case of the sequential Floyd Algorithm has time complexity equal to

Θ(n³).    (3.4)

According to Equation (3.2), to keep the efficiency of the parallel system at a desired level E, we must have the isoefficiency function

f(p) = Θ(p^1.5 (log p)³).    (3.5)

Assume each module in the distributed memory machine contains a local memory of size m; the total memory of the machine is then pm. According to Equations (3.5) and (3.4), we have

n = Ω(p^0.5 log p).    (3.6)

Since the total memory size provided in a parallel machine should be larger than the memory requirement of an algorithm running on it, we have

p × m = Ω(n²).    (3.7)

Combining Equations (3.6) and (3.7), we have

p = O(2^√m).

This equation suggests that to solve a large-scale application problem, linear speed-ups can be achieved by expanding the distributed memory machine from a single module to p modules, where p = O(2^√m) and each module has a local memory of size m. If more than p modules are used, then efficiency cannot be maintained at the desired level E (i.e., linear speed-up is not possible) due to the large overhead.
Base alg.   Parallel Variant         Isoefficiency      MOF   Expansion Range
Dijkstra    Source-Partition         p³                 p     [1, m^0.5]
Dijkstra    Source-Parallel          (p log p)^1.5      n     [1, m²/(log m)³]
Floyd       Stripe                   (p log p)³         1     [1, m/(log m)²]
Floyd       Checkerboard             p^1.5 (log p)³     1     [1, 2^√m]
Floyd       Pipelined Checkerboard   p^1.5              1     ∞

Figure 3.3: Expansion ranges of various parallel systems.
3.2.3 Observations and Discussions

Based on the above analysis, we consider the expansion ranges of the parallel systems partially listed in [35]. The results are shown in Figure 3.3; the terminology used in Figure 3.3 for describing module-mapping strategies is the same as in [35]. In Figure 3.3, [x, y] means that the parallel system can maintain its efficiency at a desired level for a number of modules between x and Θ(y); ∞ means that for any number of modules used in the parallel system, the system can still maintain efficiency at a desired level. From Figure 3.3, we can observe some interesting results. Some parallel systems with good isoefficiency functions may have a smaller expansion range, while some systems with a poor isoefficiency function have a larger expansion range. For example,

• Floyd Checkerboard (System A):
  - f(p) = Θ(p^1.5 (log p)³).
  - Expansion range is [1, 2^√m].

• Dijkstra Source-Parallel (System B):
  - f(p) = Θ((p log p)^1.5).
  - Expansion range is [1, m²/(log m)³].

The isoefficiency function of System A is poorer than that of System B. However, the expansion range of System A is larger than that of System B.
The speed-up of a parallel system is equal to efficiency × number of processors. Assume Systems A and B need to maintain efficiency at some desired level. For a problem size which can be run on both systems, the speed-up of System B is higher than that of System A (due to the lower order of magnitude of the isoefficiency function of System B). However, for problem sizes up to a certain value, only System A can be used to solve the large-scale problem without degrading the efficiency of the parallel system; this is because System A has the larger expansion range. According to the definition of expansion range, both System A and System B exhibit linear speed-ups within their expansion ranges. Since System A has the larger expansion range, larger speed-ups (compared with System B) may be observed for a given application which requires a large amount of input data.
3.3 Mapping Data to Memory Locations

Many distributed memory machines provide several mechanisms for effectively communicating long messages of various sizes and short messages of fixed size. Data exchange among the modules of a distributed memory machine can be achieved either by using a long message or by using a sequence of short messages of a fixed size. A message consists of data items stored at consecutive memory locations; thus, data items need to be moved to consecutive memory locations before they can be delivered using a long message. The operations of moving data between memory locations may not be necessary if a sequence of short messages is used to deliver the data; however, as the number of messages used to deliver the data increases, the total startup time also increases. Thus, a decision should clearly be made in choosing an appropriate communication mechanism for delivering data among the modules.
To capture the communication activities in current distributed memory machines, two types of communication mechanisms (C_s and C_b) are defined. Assume the data delivery rate T_d = t_d (μsec/byte) is the same for both types of communication mechanisms.

• C_s: a communication mechanism which delivers data using a sequence of small messages of k bytes. According to the definition of our model, communicating data of M bytes between a pair of modules takes at most

(M/k)·T'_o + (1 − 1/k)·M·t_d + t_d,    (3.8)

where T'_o is the startup time for delivering a short message. Equation (3.8) can be simplified to M·t'_o + t_d, where

t'_o = T'_o/k + (1 − 1/k)·t_d.    (3.9)

According to the definition of the startup time in our model, the first byte of the message is in the channel after the startup process. Thus, if data of M bytes are delivered using M short messages of one byte, the time for delivering the data is

M·t'_o + t_d,    (3.10)

where t'_o is the startup time for delivering a message of one byte. Comparing Equation (3.9) with Equation (3.10), the time for delivering M bytes using messages of k bytes with startup time T'_o is equivalent to that for delivering the same amount of data using messages of one byte with startup time T'_o/k + (1 − 1/k)·t_d. In this case, we can virtually consider that C_s delivers data using a sequence of messages of one byte with startup time t'_o, where t'_o = T'_o/k + (1 − 1/k)·t_d.

• C_b: a communication mechanism which delivers data using a long message. The size of a long message is in proportion to the amount of data to be delivered. The data items need to be moved to consecutive memory locations before they can be delivered. Denote the startup time for delivering a message using C_b by t_o.
In general, after some computations have been performed, exchanging data among the modules may be required for the parallel machine to continue its computations. Assume the data items in module i to be delivered to other modules are partitioned into c groups: D_0, D_1, ..., D_{c−1}. Denote the set of data groups in module i which will be delivered to module j by S_ij, where S_ij ⊆ {D_0, D_1, ..., D_{c−1}}. To reduce the communication time, strategies should be carefully employed for mapping data items to memory locations and for choosing an appropriate communication mechanism for a given memory-mapping. By carefully mapping data to the modules, the memory-to-memory copy operations can be partially or totally eliminated. By choosing an appropriate communication mechanism, either the memory-to-memory copy operations can be eliminated or the total time for startup processes can be reduced.
In [43], the time for copying data from one memory location to another has been considered in deciding an optimal module-mapping strategy. In this section, strategies are proposed for reducing the communication time of performing window operations. The development of the strategies covers two aspects: (1) for a given memory-mapping, the strategies for applying C_b and C_s are investigated, and (2) a memory-mapping strategy is proposed for communication using C_b.
Window operations are widely used in performing low-level vision tasks. In performing window operations, the output for a pixel e in an image depends on the information of the pixels enclosed in a rectangular area of the image; in general, the pixel e is located at the center of the rectangular area. Examples of window operations are edge detection [45] and image component labeling [2, 15]. To parallelize window operations, an image consisting of n × n bytes is partitioned into p subimages. The p subimages are mapped to the p modules of a distributed memory machine; this procedure is called "module-mapping". A module-mapping strategy for performing window operations is given in Figure 3.4. In Figure 3.4, the image has been partitioned into subimages I_0, I_1, ..., I_15 (each subimage consists of n²/16 bytes). Subimage I_i is assigned to module i, for 0 ≤ i ≤ 15. During communication, the boundary data items of I_i, for 0 ≤ i ≤ 15, need to be delivered to the other eight modules which store the neighboring subimages of I_i.
Figure 3.4: Module-mapping for window operations. (An n × n image is partitioned into a 4 × 4 grid of subimages, p = 16.)
We first investigate the suitability of various communication mechanisms for performing window operations when the module-mapping strategy in Figure 3.4 is used. To illustrate our idea, we analyze the time needed to deliver data using C_s and C_b. We assume that each processor/memory module uses a two-dimensional array to store a subimage, and that the entries of each two-dimensional array are stored in local memory in row-major order. In window operations, the boundary output data items of I_i, for 0 ≤ i ≤ 15, need to be used by its neighboring subimages for computing results. If a mask of 2h × 2h bytes is used, then the boundary intermediate output items of a subimage at distance (to the boundary of the subimage) less than h need to be delivered to other modules. Assume the data items in a module to be delivered to other modules are partitioned into eight groups A, B, C, D, F, G, H, I, as shown in Figure 3.5. According to the locations of the data in the subimage, the groups can be formed into eight sets of data groups G0, G1, ..., G7, as shown in Figure 3.5; each set is sent to one of the eight modules which contain the neighboring subimages. The following result considers the time for performing a communication using C_s and C_b for a given memory-mapping A (as shown in Figure 3.6). The time for performing a communication for window operations is given by

(1) 4h(n/√p + h)·t'_o + 8·t_d, if C_s is used;

(2) 8·t_o + 2h(n/√p + 2h)·(t_d + 2t_c) + 2h(n/√p)·t_d, if C_b is used.
G0 = {A, B, C}   G1 = {C, F, I}
G2 = {I, H, G}   G3 = {G, D, A}
G4 = {A}         G5 = {C}
G6 = {I}         G7 = {G}

Figure 3.5: Eight sets of data groups. (A subimage of side n/√p is divided into a center region E and boundary groups of width h: corner groups A, C, G, I and edge groups B, D, F, H.)
Figure 3.6: Memory-mapping A. (The subimage occupies consecutive memory locations in row-major order: the interleaving of {A, B, C}, then the interleaving of {D, E, F}, then the interleaving of {G, H, I}.)
From the above results, we know that if the startup time t'_o is large compared with t_d and t_c, then C_b is the desirable communication mechanism for performing window operations. However, if the startup time t'_o is close to the data delivery rate t_d, then using C_s can lead to fast communication. (In a systolic array, for example, a small amount of data can be efficiently communicated between the processing elements during each communication phase due to the small startup time.) Another communication strategy can be employed on memory-mapping A: delivering data using multiple various-sized messages without any memory-to-memory copy activity. Each of these messages delivers data which have been initially stored in consecutive memory locations; thus, more than one message may need to be communicated between a pair of neighboring modules. Using this strategy, the time for performing a communication for window operations is

2h(n/√p + 2h + 2)·t_o + 2hn·t_d/√p.

It is easy to see that by using this strategy, the memory-to-memory copy operations can be totally eliminated.
For delivering data using a long message, the data to be delivered should be stored in consecutive memory locations. Thus, for a given module-mapping, appropriately mapping data into the local memory locations may further reduce the time of performing communication using C_b. A memory-mapping to achieve this goal is illustrated in Figure 3.7. Memory-mapping B can partially eliminate the memory to memory copy operations in performing communication using C_b. Using memory-mapping B on the module-mapping shown in Figure 3.4, only the data items in group A need to be copied to the reserved area. The time for delivering data using C_b in performing window operations is

8t_0 + 4h(n/√p + h)t_d + 2h²t_c.

It is easy to see that the time for performing memory to memory copy operations is only 2h²t_c using the memory-mapping approach.
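A minimal sketch of memory-mapping B is given below, assuming the layout order shown in Figure 3.7; the structure and function names are ours, not part of the thesis code. It shows that, under this layout, preparing the one set that needs the extra copy of A takes only a single memcpy of h² bytes:

#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned char *base;                       /* start of the subimage buffer */
    size_t off_A, off_G, off_D, off_reserveA;  /* offsets of A, G, D, reserve  */
    size_t h;                                  /* boundary strip width         */
} mapB_t;

/* In memory-mapping B, G, D and the reserved area are consecutive, so after
   duplicating A into the reserve, the set G3 = {G,D,A} is one contiguous run
   and can be sent as a single long message without further copying. */
static void prepare_G3(mapB_t *m) {
    memcpy(m->base + m->off_reserveA, m->base + m->off_A, m->h * m->h);
}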
Figure 3.7: Memory-mapping B. The groups are stored at increasing memory locations in the order E, A, B, C, F, I, H, G, D, followed by an area reserved for a copy of A.
3.4 Summary

The size of a scalable parallel machine can be scaled up by increasing the number of processor/memory modules in the machine. The available computing power and available physical memory also increase proportionally to the number of modules. As the problem size increases, the memory requirement for a given module-mapping may increase at a different rate. This implies that for a given module-mapping of an application problem, the requirements in memory size and computational power can be very different. In this chapter, a methodology has been proposed to evaluate the suitability of various module-mappings applied to a distributed memory machine with bounded memory size. The evaluation is based on a scalability metric, the isoefficiency function of a parallel system. We have shown that although the isoefficiency function of a parallel system exists, it may not maintain efficiency at a desired level for numbers of modules exceeding its expansion range. Our result is useful for a programmer choosing an appropriate module-mapping strategy for solving a problem. In addition, based on the requirements of memory size and computational power for solving a certain class of applications of typical problem sizes, machine manufacturers can decide on an appropriate memory size to be installed in a machine.
Many of the current distributed memory machines provide several communication mechanisms to suit the requirements of performing various communications. To reduce communication time, it is important for the machine users to choose an appropriate communication mechanism based on a given memory-mapping strategy. In this chapter, approaches to reduce the time for performing communications on machines with various mechanisms are studied. Furthermore, we also propose a memory-mapping strategy for performing window operations. The strategy reduces the communication time by placing the intermediate output with the same target module at consecutive memory locations.
Chapter 4

Reducing Communication Latency

Accessing data from remote modules is a major overhead in exploiting parallel processing on distributed memory machines. The time required for data to be transported from its source module to its target module is defined to be the communication latency. The communication latency includes the time for moving data items to consecutive memory locations (if necessary), the startup time, the data transportation time, and the time for moving data items from consecutive memory locations to their final memory locations (if necessary). Based on our model, the processor in a module will not be able to access the data in a message until the message has been completely received by the module. Thus, the computation activity of a processor may be suspended if the required data cannot be accessed by the processor. To achieve high performance computing on distributed memory machines, it is important to reduce the time for a module to access data from other modules. In Section 4.1, the importance of the problems considered in this chapter is discussed. Techniques for reducing the communication time for several fundamental communication patterns are investigated in Section 4.2. In Section 4.3, a message-grain balancing technique is proposed to avoid the module-access contention which may happen in distributing data among the modules, and a closed form is derived for evaluating the necessary condition for applying the message-grain balancing process.
4.1 Background

Several programming paradigms are used by programmers to design their codes for solving application problems. The programming paradigms can be classified into two categories:

• the computation activities of some modules overlap with the communication activities of the other modules, or

• all the modules synchronously alternate between the computation phases and the communication phases.

The latter programming paradigm makes it easier for programmers to develop their codes and to debug the errors in their programs. Using this programming paradigm, the activities of the modules in a machine can be described as follows. During a computation phase, each of the modules either keeps idle, or accesses data from its local memory, performs computations based on the data, and stores the computed results back to its local memory. During a communication phase, each of the modules either keeps idle, or delivers data to and (or) receives data from remote modules. For this type of programming paradigm, techniques are needed for globally accelerating the data delivery activities in any of the communication phases to reduce the communication latency.
In Section 4.2, techniques are proposed for performing several fundamental communication patterns. These communication patterns include one-to-one, one-to-many, and many-to-one. We only consider the case in which each of the source module(s) transports an equal amount of data to the target module(s). For the fundamental communication patterns considered in this section, module-access contention only happens for the many-to-one communication pattern. Since each of the source modules contains data items which target the same module, the rate of data reception at the target module is the only factor which affects the communication time. Many-to-many communication which delivers a fixed amount of data among the modules can be performed using multiple one-to-one communications. Thus, we do not consider the case in which a fixed amount of data is transported among the modules for the many-to-many communication pattern.
Distributing data among the modules has practical uses in many application problems. Examples are balancing the load among the modules of a parallel machine and performing parallel histogramming on a parallel machine. In performing load balancing and parallel histogramming, varying amounts of data in each of the modules are transported to the other modules based on the load information or on predetermined data movement rules. In Section 4.3, a technique for distributing varying amounts of data among the modules is proposed. During the data distribution process, each of the modules delivers data to the other modules, and the amounts of data to be delivered to the other modules can be very different. In general, data distribution is a many-to-many communication. Based on our model, many-to-many communication may suspend the communication activities in some modules for a period of time. This is due to module-access contention. To avoid module-access contention, many-to-many communication can be performed using linear permutation [8]. The linear permutation consists of p − 1 steps; during step j, 1 ≤ j < p, module i delivers data to module (i + j) mod p. However, this approach can be applied directly only to the case where each of the modules distributes a fixed amount of data to the other modules. For distributing varying amounts of data among the modules, module-access contention still exists. This suggests that a novel approach is required for data distribution in which each of the modules distributes varying amounts of data to the other modules. Our technique is proposed to handle this scenario. Our data distribution algorithm solves this problem by balancing the message-grains first. We also derive a closed form for judging the necessary condition for applying the message-grain balancing technique to the data distribution process.
4.2 Communication Activity Hiding

In this section, techniques for the modules to access data from other modules are proposed for reducing the communication latency. Several fundamental communication patterns are considered. These include one-to-one, one-to-many, and many-to-one. We only consider the case of delivering a fixed amount of data to other modules. In this section, the term scattered data items (in bytes) refers to data items which are not stored in consecutive memory locations. Those scattered data items can be delivered using a sequence of short messages of fixed size or longer messages of various sizes. Delivering data using a sequence of short messages of fixed size can be performed without moving the data items to consecutive memory locations. A sequence of short messages is delivered by issuing a sequence of commands, such as read or write, from the source processor. For delivering data using a long message, memory to memory copy operations may be required to move the scattered data items to consecutive memory locations before the message can be sent. Based on our model, the copy operations can overlap with the data transportation. Thus, there is an opportunity to reduce the communication latency by overlapping the communication activities. By overlapping the communication activities, the communication costs can be partially hidden. For comparison purposes, we also derive the time for delivering data using a sequence of messages of fixed size. The notations C_s and C_b have been defined in Section 3.3 to capture the features of different communication mechanisms provided by a distributed memory machine. For clarity, we repeat the definitions of these notations.
• C_s: a communication mechanism which delivers data using a sequence of small messages of k bytes. According to the definition of our model, communicating data of M bytes between a pair of modules takes at most

(M/k)(T'_0 + (k − 1)t_d) + t_d.   (4.1)

where T'_0 is the startup time for delivering a short message. Equation (4.1) can be simplified to Mt'_0 + t_d, where

t'_0 = T'_0/k + (1 − 1/k)t_d.   (4.2)

According to the definition of the startup time in our model, the first byte of a message is in the channel after the startup process. Thus, if data of M bytes are delivered using M short messages of one byte, the time for delivering the data is

Mt_0 + t_d,   (4.3)

where t_0 is the startup time for delivering a message of one byte. Comparing Equation (4.2) with Equation (4.3), the time for delivering M bytes using messages of k bytes with startup time T'_0 is equivalent to that for delivering the same amount of data using messages of one byte with startup time T'_0/k + (1 − 1/k)t_d. In this case, we can virtually consider that C_s delivers data using a sequence of messages of one byte and that the startup time for delivering a message is t'_0, where t'_0 = T'_0/k + (1 − 1/k)t_d.

• C_b: a communication mechanism which delivers data using a long message. The size of a long message is in proportion to the amount of data to be delivered. The data items need to be moved to consecutive memory locations before they can be delivered. Denote the startup time for delivering a message using C_b by t_0.
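These definitions can be summarized as two small C helpers that estimate the delivery time of M scattered bytes under each mechanism (a sketch; the parameters are assumed to be measured constants of the machine):

#include <stddef.h>

/* Cs: virtually a sequence of one-byte messages with effective
   startup t'_0 = T'_0/k + (1 - 1/k)t_d. */
double time_Cs(size_t M, double t0_eff, double td) {
    return (double)M * t0_eff + td;
}

/* Cb: one long message; the scattered items are copied to (and from)
   consecutive locations, costing tc per byte at each end. */
double time_Cb(size_t M, double t0, double td, double tc) {
    return t0 + (double)M * (td + 2.0 * tc);
}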
4.2.1 One-to-one

In one-to-one communication, a source module delivers scattered data items of m bytes to their target module, and the data items are stored in scattered locations at the target module. One-to-one communication is illustrated in Figure 4.1.

Figure 4.1: An illustration for one-to-one communication. Module i holds m scattered data items (filled circles); open circles denote sent data items. The items are delivered to scattered locations in module j.
Figure 4.2: Overlapping in performing one-to-one communication. The timeline for module i, the channel, and module j shows how the startup time, memory copy time, and data transportation time overlap.
The following results show the time for performing one-to-one communication using either C_s or C_b. If C_s is used to perform one-to-one communication, then the total startup time required for delivering the data is in linear proportion to the amount of the data. This leads to the following result: the time for performing one-to-one communication using C_s is

mt'_0 + t_d.

It is easy to see that the total startup time dominates the time for performing one-to-one communication.
A straightforward approach for performing one-to-one communication using C_b is to use a long message for delivering the data. The approach consists of four non-overlapping steps:

1. Move the scattered data items in the source module to consecutive memory locations.

2. Start up a communication.

3. Transport the data.

4. Copy the data items stored at consecutive memory locations in the target module to their final memory locations.

The time for performing one-to-one communication using this approach is:

t_0 + m × (t_d + 2t_c).
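The four steps can be sketched in C as follows; send_long() and recv_long() are hypothetical stand-ins for whatever long-message primitives the machine provides, and the two loops model the gather and scatter copy steps:

#include <stddef.h>

void send_long(int dest, const void *buf, size_t len);  /* assumed primitive */
void recv_long(int src, void *buf, size_t len);         /* assumed primitive */

/* Source side: steps 1-3. */
void one_to_one_send(int dest, unsigned char **items, size_t m,
                     unsigned char *stage) {
    for (size_t i = 0; i < m; i++)      /* step 1: gather scattered bytes */
        stage[i] = *items[i];
    send_long(dest, stage, m);          /* steps 2-3: startup + transport */
}

/* Target side: steps 3-4. */
void one_to_one_recv(int src, unsigned char **slots, size_t m,
                     unsigned char *stage) {
    recv_long(src, stage, m);           /* wait for the long message      */
    for (size_t i = 0; i < m; i++)      /* step 4: scatter to final slots */
        *slots[i] = stage[i];
}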
Some of the distributed memory machines provide different communication mechanisms [18, 31] for different communication requirements. For example, a lightweight protocol is used by active messages [18] to communicate a short message of fixed size. The startup time for delivering data using such a mechanism is usually smaller than the startup time for delivering data of various sizes. In general, the large startup time in delivering data using messages of various sizes is partially due to the time required for buffer management in the communicating modules. Thus, the users should choose an appropriate communication mechanism to transfer scattered data items based on the number of data items to be delivered. Based on the times derived above for performing one-to-one communication using C_s and C_b on our model, C_s should be employed when the number of data items to be delivered is less than w, where w ≤ (t_0 − t_d)/(t'_0 − t_d − 2t_c) and t'_0 > t_d + 2t_c; this follows from comparing mt'_0 + t_d with t_0 + m(t_d + 2t_c). If t'_0 ≤ t_d + 2t_c, then C_s always leads to a faster communication.
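The rule above can be packaged as a small helper (a sketch of the decision rule, not thesis code):

/* Returns 1 if Cs should be used for m scattered bytes, 0 for Cb. */
int choose_Cs(double m, double t0, double t0_eff, double td, double tc) {
    if (t0_eff <= td + 2.0 * tc)    /* Cs is never slower in this regime */
        return 1;
    double w = (t0 - td) / (t0_eff - td - 2.0 * tc);  /* break-even count */
    return m < w;
}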
For a large amount of data, C_b should be used to deliver the data. In this case, the time for performing one-to-one communication can be reduced by delivering the scattered data items using a sequence of messages of an appropriate size, and overlapping the message construction operations with the data transportations. This strategy is illustrated in Figure 4.2. The size of the messages chosen to deliver the data depends on the features of the communication mechanisms and the total amount of data to be delivered. Denote h to be the size of the messages chosen for delivering the m scattered data items, and α to be max{...}.
4.2.2 One-to-many

In performing one-to-many communication, module 0 delivers scattered data items of pm bytes to the other p modules; each of the p modules receives m bytes. An illustration of one-to-many communication is given in Figure 4.3. In our model, the first byte of a message has been delivered to the channel after the startup process. According to the definition of delivering data using C_s, the startup time t'_0 is associated with each byte delivered. Thus, based on the definition of C_s and the nature of our model, one-to-many communication using C_s can be performed in

pmt'_0 + t_d.
Next, we consider performing one-to-many communication using C_b. A four-step approach (which is similar to that for performing one-to-one communication) can be applied. The four steps are listed as follows.

1. Move all the scattered data items (pm bytes) in the source module to consecutive memory locations.

2. Start up the communications.

3. Transport the data through the channels using p messages of size equal to m bytes.

4. As soon as a message is completely received by any of the target modules, the module moves the arrived data items from consecutive memory locations to their final locations.

Figure 4.3: An illustration for one-to-many communication. Module 0 holds pm scattered data items; each of modules 1 through p receives m of them.
Figure 4.4: An illustration for a module-partition approach. Module 0 at the first level delivers data to the module-groups at the second level.
The activities in Steps 2 and 3 are alternately executed by the source module p times. The activities in Step 3 overlap with the activities in Step 4. The time for performing one-to-many communication using this approach is:

pt_0 + (pm − p + 1)t_d + (p + 1)mt_c.

This approach is inefficient if the number of modules and the startup time are large. An approach which partitions the modules into √p groups, as employed in [23], can be used to perform one-to-many communication. However, our approach also considers how to overlap the communication activities to reduce the communication latency. Figure 4.4 shows the module-partition approach. Using the module-partition approach, the p target modules are partitioned into √p groups; each group contains √p modules. Module 0 sends a message of size √p·m to a specified module in each of the groups. After receiving the message, the specified module in a group immediately sends messages to the other modules in the group; each of these messages contains data items of m bytes. By overlapping memory copy operations with data transportation, the time for performing one-to-many communication is at most

2√p·t_0 + 2√p·mt_d + (√p + 1)mt_c + pmα.
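The two-level module-partition approach can be sketched as follows, assuming modules are numbered 0 through p−1 and the leader of each group is the module whose rank is a multiple of √p; send_long()/recv_long() are hypothetical primitives:

#include <stddef.h>

void send_long(int dest, const void *buf, size_t len);  /* assumed primitive */
void recv_long(int src, void *buf, size_t len);         /* assumed primitive */

void one_to_many(int rank, int sp /* sqrt(p) */, unsigned char *buf, size_t m) {
    if (rank == 0) {
        for (int g = 1; g < sp; g++)     /* first level: one long message    */
            send_long(g * sp, buf + (size_t)g * sp * m, (size_t)sp * m);
        for (int i = 1; i < sp; i++)     /* module 0 serves its own group    */
            send_long(i, buf + (size_t)i * m, m);
    } else if (rank % sp == 0) {         /* a group leader                   */
        recv_long(0, buf, (size_t)sp * m);
        for (int i = 1; i < sp; i++)     /* second level: forward m bytes    */
            send_long(rank + i, buf + (size_t)i * m, m);
    } else {                             /* an ordinary group member         */
        recv_long((rank / sp) * sp, buf, m);
    }
}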
Figure 4.5: Trade-off in the module-partition approach. The time spent at the first level increases with the number of groups, the time spent at the second level decreases, and their sum attains a minimum at an intermediate number of groups.
However, dividing the modules into √p groups for performing one-to-many communication may not be an optimal partition. As shown in Figure 4.5, an optimal number of groups exists for achieving minimal communication time in performing one-to-many communication using the two-level module-partition approach. The number of groups depends on the number of modules in a distributed memory machine, the amount of data to be delivered, and the communication parameters of the machine. Denote g to be the number of module-groups into which the modules of a parallel machine are partitioned, and T(g) to be the time for performing one-to-many communication. Thus, we have

T(g) = (p/g)t_0 + (pm/g)t_d + (pm/g)t_c + gt_0 + mt_c + pmα.

Taking the derivative with respect to g and equating it to zero, we have

t_0 − (pt_0 + pmt_d + pmt_c)/g² = 0.

This implies that

g = √(p(t_0 + mt_c + mt_d)/t_0).   (4.4)
Two observations can be made based on Equation (4.4):

• The number of groups approaches √p if t_0 is very large compared with the term m(t_d + t_c).

• The number of groups increases as m(t_d + t_c) grows and t_0 becomes small compared with m(t_d + t_c). A large m(t_d + t_c) implies that the amount of data transported among the modules is the dominant factor in performing one-to-many communication. Thus, if we choose the number of module-groups g to be a large value, then the total amount of data transported among the modules decreases. This can lead to a faster communication.
Thus, the minimum of T(g) is

2√(pt_0(t_0 + mt_c + mt_d)) + m(pα + t_c).
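Equation (4.4) translates directly into a small helper for choosing the number of groups (a sketch; the result is clamped to the feasible range [1, p]):

#include <math.h>

int optimal_groups(double p, double m, double t0, double td, double tc) {
    double g = sqrt(p * (t0 + m * (tc + td)) / t0);  /* Equation (4.4) */
    if (g < 1.0) g = 1.0;
    if (g > p)   g = p;
    return (int)(g + 0.5);
}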
4.2.3 Many-to-one

In many-to-one communication, one target module receives data items from the other p source modules; each of the modules delivers scattered data items of m bytes to the target module. Many-to-one communication is illustrated in Figure 4.6.
Figure 4.6: An illustration for many-to-one communication. Each of the p source modules holds m scattered data items (filled circles) which are delivered to module 0; open circles denote sent data items.
If many-to-one communication is performed using C_s, the p source modules can send the first byte of their data concurrently; however, the target module takes the data one by one from the channel. Thus, the communication time depends on the number of communicating modules, the startup time t'_0, and the data transportation rate t_d. Based on these, the time for performing many-to-one communication using C_s is

t'_0 + m × max{pt_d, t'_0}.   (4.5)
Two interesting conclusions can be made based on Equation (4.5).

• If p > ..., then the time for performing many-to-one communication using C_s is smaller than that using C_b. Note that performing communications using C_b needs memory to memory copy operations and, in general, t_0 > t'_0.

• If p ...
Figure 4.12: Notations used in the closed form: the minimum and maximum message grains, m_a and M_a, over the source-target pairs.
Figure 4.13: An illustration for the amount of data that needs to be balanced: at source j, the amount to be balanced is d_{i,j} − m_j.
users of such a class of parallel machines can compute the values of pM_a − pm_a and δ first. Based on the computed values, the user of a parallel machine can judge the worth of performing message-grain balancing for a given communication pattern. In some cases, using the information accumulated from previous activities among the modules, the upper bounds on pM_a − pm_a and δ can be calculated locally.

In [59], a variant of our data distribution algorithm is proposed for performing data distribution on a machine with large startup time. The algorithm employs the message-grain balancing process three times. The total communication time is 5√p·t_0 + 5L × (t_d + t_c). Using the closed form we derived, it is obvious that the third message-grain balancing process is not necessary. This is because pM_a − pm_a and δ have the same upper bound. If we perform the balancing process, the communication time may not be reduced. Thus, performing data distribution on a parallel machine with large startup time can be achieved in at most

4√p·t_0 + 4L × (t_d + t_c).

Compared with the time for performing data distribution in [59], the time saved by using fewer message-grain balancing processes is √p·t_0 + L × (t_d + t_c).
4.4 Summary

In this chapter, we show that using an appropriate message size to deliver data items among modules can reduce the communication time. The choice of the message size is based on the communication attributes of a parallel machine and the communication patterns to be performed on the machine. Once the message size has been decided, the data items are delivered using a sequence of messages of that size. By overlapping the message construction operations with the message transmission operations, part of the communication activities is hidden. Thus, the total communication time can be reduced.
Data distribution is widely used in many applications. The amounts of data in a module to be distributed to other modules can be very different. Without dynamically scheduling the message delivery operations or employing smart message handlers, module-access contention can happen. In this chapter, a message-grain balancing process is investigated. The process is employed before a straightforward process for distributing data is performed. The balancing process needs extra memory to memory copy operations and extra data transmission operations. Balancing the message grains may not be necessary for some communication patterns. To avoid a situation in which the gain from the balancing process is less than the overhead paid for the process, a closed form is given for judging the worth of performing the balancing process for a given communication pattern.
Chapter 5

Data-remapping

An integrated parallel system which is developed for solving a class of application problems on a parallel machine may consist of several major steps. In each of the major steps, the components of the parallel machine cooperate to execute a set of tasks. In the integrated parallel system, the major steps need to be executed one by one, and the outputs of previous steps may need to be used as the inputs of the next steps. As the parallel system is implemented on a distributed memory machine, the outputs of the previous steps may be stored in such a way that most of the data-accesses in the next step need to go over the interconnection network. That is, the data layout of the output causes inter-module data dependency while performing the required computations in the next step. The inter-module data dependency can be eliminated by partitioning the data in each of the modules into several data-groups such that there is little data dependency among the data-groups. Then, each of the data-groups is assigned to a module of the distributed memory machine. In this way, most of the data items which cause data dependency can be localized in the local memory of each of the modules. Thus, the overhead due to accessing data from other modules can be minimized. In Section 5.1, previous work related to remapping data to minimize the parallel overhead is discussed and a classification of data-remapping strategies is given. The time complexity of applying data-remapping to an application problem with fixed communication patterns is investigated in Section 5.2. In Section 5.3, a dynamic data-remapping technique is developed to solve the class of application problems in which the pattern of inter-module data dependency can only be known at run-time.
5.1 Background

The performance of a running large-scale parallel system may degrade due to (1) an increasing degree of inter-module data dependency, or (2) increasingly uneven computational load among the modules. Several research efforts concerning these issues can be found in [9, 10, 14, 20, 40, 61]. Their approaches focus on either eliminating the adverse effect of inter-module data dependency or balancing the computational load among the modules, but not both. In [10], a heuristic was used in their algorithm to reduce the adverse effect of inter-module data dependency in a major step of a parallel vision recognition system. In [61], a data-remapping technique was developed to reduce the degree of inter-module data dependency in a major step of a parallel system for detecting buildings in an aerial picture. The suitability of applying fixed data-remapping techniques to the class of application problems with fixed communication patterns is investigated in [14, 40]. In [9, 20], data-remapping techniques for image processing are proposed to balance the load among the modules.
Data-remapping can be classified into two categories: fixed (or static) data-remapping and dynamic data-remapping. In fixed data-remapping, the target modules to which the data should be moved during the remapping process can be determined at the beginning of the execution. It is suitable for the class of application problems in which all the communication patterns are known after an initial module-mapping strategy is applied. This class of application problems includes computing the Fast Fourier Transform (FFT), performing columnsort on a set of numbers, and computing all_pairs_shortest_paths. In contrast to fixed data-remapping, dynamic data-remapping may need information about the global data layout to determine the target modules to which the data should be moved. The information needs to be gathered at run-time. To gather the information, each of the modules may need to access data from the other modules. Thus, fast data movement and an effective data re-layout policy are necessary for dynamically remapping data items among the modules.
Figure 5.1: An illustration for a 16-point FFT butterfly graph (columns 0 through 4).
5.2 Fixed Data-remapping

In this section, the time complexities of using several approaches for computing the FFT are investigated to highlight several important factors. These factors determine the effectiveness of employing a data-remapping strategy.

The Fast Fourier Transform (FFT) computes the Discrete Fourier Transform (DFT) of an n-dimensional complex vector (x_0, x_1, ..., x_{n−1}). We assume p is a power of 2 and p² ≤ n. The butterfly algorithm [12] can be used to compute the FFT. Figure 5.1 shows a 16-point butterfly graph.
Figure 5.2: An illustration for cyclic layout.
The initial module-mapping for the butterfly algorithm can be cyclic layout or blocked layout. For cyclic layout on a parallel machine with p modules, the FFT point f_i, 0 ≤ i < n, is stored at module (i mod p). The inter-module data dependency of the cyclic layout for computing the 16-point FFT butterfly is shown in Figure 5.2. Assume the size of a point is one byte. For blocked layout, the FFT point f_i, 0 ≤ i < n, is stored at module ⌊i/p⌋. In [14], a combination of the cyclic layout and blocked layout (denoted hybrid layout) has been proposed to reduce the amount of remote data-accesses. The hybrid layout uses cyclic layout for a period of time, then the intermediate output is remapped to form the blocked layout. The inter-module data dependency of the hybrid layout for computing the 16-point FFT butterfly is illustrated in Figure 5.3.
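The two initial layouts can be expressed as owner functions (a sketch; note that the text's ⌊i/p⌋ for the blocked layout corresponds to the case n = p²):

int owner_cyclic(int i, int p) { return i % p; }   /* point f_i under cyclic */

int owner_blocked(int i, int n, int p) {
    return i / (n / p);       /* point f_i under blocked; equals i/p when n = p*p */
}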
Figure 5.3: An illustration for hybrid layout (the communication step remaps the cyclic layout to the blocked layout).
5.2.1 On the Complexity of Performing Data-remapping

In this section, the factors that affect the time complexity of performing data-remapping are investigated. Assume parallel systems A and B are developed to solve an application problem. System B employs data-remapping but System A does not. In System B, the amount of remote data-access is less than that of System A. An intuition about these parallel systems is that the execution time of System B is shorter than that of System A. The intuition is only correct for some classes of parallel machines but is not true for all parallel machines. Computing the FFT will be used to illustrate this. We choose hybrid layout to be our data-remapping strategy in computing the FFT. We will show that for a given problem size, a unique answer may not exist to the question of whether the data-remapping strategy should be applied, without looking into the communication features of a parallel machine. To illustrate that different communication features of the parallel machine may lead to different decisions, we consider two types of communication mechanisms, C_s and C_b, which are defined in Section 3.3.
If cyclic layout or blocked layout is used, then each of the modules accesses data of (n log₂ p)/p bytes from another module. If hybrid layout is used, then each of the p modules accesses data of n/p bytes from the other p − 1 modules. Since the time for performing communication using C_s is in proportion to the amount of accessed remote data, it is obvious that the communication time required by the hybrid layout is less than that required by the blocked layout. Thus, if C_s is used to compute an n-point FFT, where n ≥ p², then hybrid layout is more desirable compared with the others.
However, if C_b is used for performing communication, then the situation is very different. Using C_b, data items are delivered using messages of various sizes. There is a startup time associated with launching a message into the network. Thus, the time for delivering data using C_b depends on the number of different source-target pairs as well as on the amount of accessed remote data. Assume each of the modules directly delivers the data with the same target to its target module using one message. Then, the maximal number of different source-target pairs gives a lower bound on the number of messages to be sent. During computing the FFT, there are log₂ p source-target pairs if cyclic layout is used, and there are p source-target pairs if
hybrid layout is used. Based on the above analysis, we derive that the communication time for computing the FFT using cyclic layout is

log₂ p (t_0 + (n/p)(t_d + 2t_c)).   (5.1)

In the hybrid layout, the communication pattern used in performing data-remapping is a p-to-p communication. Although the amount of accessed remote data is reduced, the number of different source-target pairs increases. Based on this, we derive that the communication time for computing the FFT using hybrid layout is

pt_0 + (n/p)(t_d + 2t_c).   (5.2)

Based on Equations (5.1) and (5.2), the hybrid layout is not desirable if the input size is less than p(p − log₂ p)t_0 / ((log₂ p − 1)(t_d + 2t_c)) bytes.
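Equations (5.1) and (5.2) can be compared directly; the following sketch returns whether the hybrid layout is the faster choice under C_b for given parameters:

#include <math.h>

/* Returns 1 if the hybrid layout is faster than the cyclic layout under Cb. */
int hybrid_pays_off(double n, double p, double t0, double td, double tc) {
    double lg     = log2(p);
    double cyclic = lg * (t0 + (n / p) * (td + 2.0 * tc));  /* Eq. (5.1) */
    double hybrid = p * t0 + (n / p) * (td + 2.0 * tc);     /* Eq. (5.2) */
    return hybrid < cyclic;
}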
From the analyses made in this section, we know that the necessity of applying data-remapping depends on the features of the communication mechanism provided by a parallel machine. The simplest case is when only C_s is provided by the parallel machine. In this case, the total amount of remote data-access reflects the communication time. In the other case, in which only C_b is used on the parallel machine, the amount of remote data-access and the complexity of the communication pattern in performing data-remapping together affect the effectiveness of the data-remapping.
5.2.2 On the Message Sizes for Performing Data-remapping

The time complexity derived in Section 5.2.1 implies that if the FFT problem has a large input size, then employing hybrid layout always leads to a smaller communication time. Thus, if an FFT problem of large input size runs on a parallel machine which provides various types of communication mechanisms, then an appropriate mechanism may need to be chosen carefully for performing the data-remapping process. Two types of communication mechanisms (C_s and C_b) are investigated in this section. In the following, we show that the choice of the communication mechanism should be based on the input size of the problem.

Using hybrid layout for computing an n-point FFT, each of the p modules accesses remote data of n/p bytes. Thus, the time for remapping data using C_s in computing the FFT is given by:

(n/p)t'_0 + t_d.   (5.3)

The time for remapping data using C_b for computing an n-point FFT on a parallel machine of p modules is given in Equation (5.2). The factor p associated with t_0 in Equation (5.2) can be reduced to √p if the module-partition strategy of Section 4.2.2 is used. However, this strategy increases the amount of remote data-access. As the results in Section 5.2.1 show, the data-remapping process is required only for large input sizes. A large input size implies that the amount of data transported among the modules is large. Thus, the module-partition strategy is not applied in this section when comparing with the time for performing data-remapping using C_s. Based on Equations (5.2) and (5.3), it follows that performing data-remapping using C_b leads to a smaller communication time compared with that using C_s if

n > p(pt_0 − t_d) / (t'_0 − t_d − 2t_c).
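This inequality gives a simple decision rule for the remapping step (a sketch; it also covers the degenerate case t'_0 ≤ t_d + 2t_c, in which C_s always wins):

/* Returns 1 if Cb should be used for the remapping step, 0 for Cs. */
int remap_with_Cb(double n, double p, double t0, double t0_eff,
                  double td, double tc) {
    if (t0_eff <= td + 2.0 * tc)   /* Cs is never slower in this regime */
        return 0;
    return n > p * (p * t0 - td) / (t0_eff - td - 2.0 * tc);
}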
Several observations can be made from the above equations. As we mentioned, t_0 is usually larger than t'_0 due to the overhead of buffer management. In addition, under current technology, some communication media have very high data transmission rates. To compute the FFT in this computational environment (that is, t_0 > t'_0 ≫ t_d and t_0 > t'_0 ≫ t_c), we have the following result: the mechanism to be applied is

C_b if n > p²t_0/t'_0, and C_s if p² ≤ n ≤ p²t_0/t'_0.

Another interesting observation can be made for a parallel machine without enough physical memory space, which also implies that the input size is very large. In this case, some of the data items must be stored on a slow device such as disks, and the measured t_c can be large due to the data swapping between the disk and physical memory. If the measured t_c is larger than (t'_0 − t_d)/2, then C_s is the preferable communication mechanism for performing data-remapping. Based on the above analyses, if an application is run on a parallel machine with bounded memory size, then mechanism C_b may be good for a range of problem sizes. Beyond or below that range, mechanism C_s can lead to a shorter communication time.
5.3 Dynamic Data-remapping

In general, the intermediate level image understanding (IU) tasks in a vision system operate on clusters of marked pixels (which are the output of low level analysis) to extract information for high level analysis [46]. As the system is implemented on a distributed memory machine, the intermediate level IU tasks may involve arbitrary data distribution, dependency among computations to be performed in various modules, and frequent irregular inter-module communications. An example of these is scanning the elements of the contours in an image in parallel. This scanning process is defined to be contour-tracing. The parallel contour-tracing problem captures the common computational features of the intermediate level IU tasks in a parallel vision system. In this section, a technique for performing data-remapping is developed for the contour-tracing problem. Our data-remapping technique is developed for eliminating the inter-module data dependency and for balancing the load among the modules. After the data items have been remapped, the contours can be traced locally in each of the modules. For comparing the performance of our data-remapping technique, a parallel algorithm for in-place contour-tracing is also described. A worst case is considered for analyzing the time complexity of tracing contours with data-remapping and without data-remapping. In general, the worst case generates a larger amount of data to be communicated and a higher degree of inter-module data dependency than the other cases. Since C_b usually performs well for delivering large amounts of data, it is used in analyzing the time complexity of performing communications.
5.3.1 Definitions and Notations

In general, an n × n image is partitioned into p subimages (of size n/√p × n/√p) for mapping data to the parallel machine with p modules. An image of size 128 × 128 is shown in Figure 5.4. Assume the subimages are numbered using the row major order.

Figure 5.4: A 128 by 128 raw image.

Then, in our module-mapping, subimage i is assigned to module i
for performing low level operations. Each of the p modules is responsible for processing one of the subimages. Using this module-mapping, each of the subimages is surrounded by at most eight subimages (which are denoted as the neighboring subimages). In each module, contour-elements are detected by marking pixels. Contour-segments are formed by linking the marked pixels in the modules. The image of size 128 × 128 in which the marked pixels are linked is shown in Figure 5.5. Information such as the length of a segment and the module which contains the immediate successor of the segment can easily be computed during the linking step. All this information is useful in performing data-remapping.

A contour i can be represented by an ordered set S_i of contour pixels. At each contour pixel, properties of the contour-pixel (such as its coordinates, intensity, etc.) are stored. In performing contour-tracing, the contour pixels are scanned one by one following the order in S_i. The length of contour S_i, denoted |S_i|, is defined as the number of elements in the set. A global contour denotes a contour which is embedded in at least two modules. A local contour refers to a contour which is completely embedded in one module.
Figure 5.5: A 128 by 128 image with marked pixels.
The input to the contour-tracing problem is a collection of contours. The output of tracing S_i is {u_i(0), u_i(1), ..., u_i(|S_i| − 1)}. At any current element, the output is computed based on the output already extracted for the element immediately preceding the current element and the attributes of the current element. That is, u_i(j) depends on u_i(j − 1) and the data stored at position j in S_i, for 1 ≤ j ≤ |S_i| − 1. We assume that

• it takes t time to compute the output for any contour-element,

• a global contour is segmented into at most √p segments, and

• one byte is enough to store any information regarding a pixel.

The last assumption will be used in calculating the communication time.
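A minimal C representation of a contour and its scan is sketched below; the field names and the placeholder output function are our assumptions, since the thesis does not fix a specific extraction:

typedef struct {
    short x, y;               /* coordinates of the contour pixel */
    unsigned char intensity;  /* a per-pixel attribute            */
} elem_t;

typedef struct {
    elem_t *e;                /* the ordered set S_i              */
    int len;                  /* |S_i|                            */
} contour_t;

/* u[j] depends only on u[j-1] and the attributes at position j;
   the accumulation below is a placeholder for the real extraction. */
void trace(const contour_t *s, int *u) {
    u[0] = s->e[0].intensity;                 /* u_i(0)                 */
    for (int j = 1; j < s->len; j++)          /* each step costs time t */
        u[j] = u[j - 1] + s->e[j].intensity;
}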
5.3.2 In-Place Computations

Contour-tracing can be performed directly without redistributing the contour-elements. We refer to this approach as in-place contour-tracing. The in-place algorithm can be performed in a synchronous or an asynchronous fashion. In the synchronous algorithm, the modules synchronously alternate between computation and communication. Inserting a barrier synchronization between the computation and communication phases can delay the computational activity of some modules even though the activity could proceed. This delay effect can be eliminated if no barrier synchronization point is inserted between the computation and the communication phases. Using this asynchronous algorithm, each of the modules communicates with its neighboring modules as soon as the module finishes its computation phase.
Our in-place contour-tracing is developed to reduce the computational delay effect. Let a ready contour-segment denote a segment having the information needed to start the tracing process. The asynchronous contour-tracing algorithm operates as follows: each of the p modules alternates between its own computation and communication phases. That is, different modules may be in different phases at any instant. During the computation phase, module i, 0 ≤ i < p, scans the ready contour-segments in its module. After all of its ready contour-segments have been scanned, module i communicates with its eight neighboring modules. Then, module i proceeds to scan the contour-segments which have recently become ready. The total number of messages will not increase too much, since the communication is performed after all the ready contour-segments have been completely traced. Our algorithm is developed to eliminate the delay effect caused by irrelevant modules. By eliminating this effect, the contours entirely embedded in one module-group are traced independently of the other contours entirely embedded in another module-group. This feature is very suitable for performing contour-tracing on clusters of workstations for some types of images. An example is an image in which the major objects are far from each other. For clusters of workstations, resource sharing is very important. Thus, the module-group which finishes tracing its contours can release the computing power to other users without waiting until all the other contours have been traced in the other modules. The time complexity of performing contour-tracing depends on the input image. In the worst case, the algorithm takes O(n²) computation time and 8t_0√p + 4n(2t_c + t_d) communication time.
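The asynchronous loop executed by each module can be sketched as follows; trace_ready_segments(), exchange_with_neighbors() and more_segments_anywhere() are hypothetical helpers standing for the module-local scan, the eight-neighbor exchange, and the termination test:

void trace_ready_segments(void);     /* scan every currently ready segment   */
void exchange_with_neighbors(void);  /* send/receive boundary results to/from
                                        the eight neighboring modules        */
int  more_segments_anywhere(void);   /* global termination test              */

void in_place_tracing(void) {
    /* No barrier separates the two phases: each module moves on as soon
       as its own ready segments are exhausted, so newly arrived results
       can turn pending segments into ready ones on the next iteration. */
    while (more_segments_anywhere()) {
        trace_ready_segments();      /* computation phase   */
        exchange_with_neighbors();   /* communication phase */
    }
}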
The advantage of this algorithm is that the contour-tracing task can take the output directly from the previous task without redistributing the data. The disadvantage of the algorithm is that the numbers of contour-elements assigned to the modules may be unbalanced. Even if equal numbers of contour-elements are stored in the modules, inter-module data dependencies can still severely degrade the performance of the parallel machine.

Algorithm Contour-tracing
begin
  Label Clusters;
  Move Data;
  Trace Localized Contours;
end {Contour-tracing}

Figure 5.6: Major steps in performing data-remapping.
5.3.3 Computations with Data-Remapping

In this section, a parallel contour-tracing algorithm which employs a data-remapping process is developed. After the data-remapping process has been performed, the elements of a contour are localized in one module. Thus, remote data-access is not required while tracing the localized contours. Denote S = Σ_i |S_i|. The strategy of almost even load is used in the data-remapping process. Almost even load means that

either a module stores at most ⌈S/p⌉ contour-elements, or a module stores exactly one contour which consists of more than S/p contour-elements.

The data-remapping process has three major steps, as shown in Figure 5.6. Details of the major steps are discussed below.

The step 'label clusters' assigns a label l, 0 ≤ l < p, to each of the contours. Based on the assigned labels, the contours are redistributed among the modules during the step 'move data'. The output of module i in the 'label clusters' step is p sets of contour-elements, where set j, for 0 ≤ j < p, stores the contour-elements to be delivered from module i to module j. Denote the weight of a cluster (or contour) to be the amount of computation required to trace the cluster (or contour). The 'label clusters' step includes (1) calculating the weight for each of the clusters, and (2) determining the target modules for all the contours. Figure 5.7 shows the major steps in 'label clusters'.

Algorithm Label Clusters
begin
  Calculate the weight for each of the contours;
  Assign a label to each of the contours;
  Propagate the label of a contour to all
  the segments of the contour;
end {Label Clusters}

Figure 5.7: Major steps in performing cluster-labeling.

During the weight calculation (label propagation), each of the modules needs to gather (multicast) data from (to) the other modules. The communication patterns depend on the approaches used to gather data and to multicast information. Two approaches can be used for these purposes: divide-and-conquer and parallel segment-tracing.
Using divide-and-conquer, the weight calculation consists of (log₂ p)/2 steps. During step j, 0 ≤ j ≤ (log₂ p)/2, the blocks of modules cooperate to get the weights or partial weights of the contours. During step j, there are p/4^j blocks; a block consists of 2^j × 2^j modules. Denote the module at the left-bottom corner of a block as the header of the block. Accordingly, the blocks and their headers have different configurations at different steps. The weight calculation operates as follows. During step j, for 1 ≤ j ≤ (log₂ p)/2, the header of each of the current blocks gathers partial weights from the four headers which are defined for step j − 1 and enclosed in the current block. This approach is illustrated in Figure 5.8. It is obvious that this approach does not depend on the number of segments into which a contour is cut. The communication time for performing this process is 4t_0 log p + 4n(t_d + 2t_c). The divide-and-conquer process can operate in reverse to propagate labels, and the communication time is the same as that for performing the weight calculation.
Another approach is parallel segment-tracing. In this approach, the segments of each of the contours are scanned one by one. Note that the weights of all the segments have been calculated before the segment-tracing is performed. Thus, the communication time depends only on the maximum number of segments into which a contour is cut, and not on the number of contour-elements.
Figure 5.8: An illustration for the divide-and-conquer approach (steps 0 through 3).
If the maximum number of segments is small, then employing parallel segment-tracing is a good choice. To illustrate our idea, an example of data-remapping which employs parallel segment-tracing is shown in Figure 5.9. In addition, the almost even load strategy is also explained using the same figure. In the example, contours are embedded in a parallel machine with four modules. After parallel segment-tracing has been performed, the weights of all the contours are stored at the modules which contain the tails of the contours (see (B) in Figure 5.9). After the weights of all the clusters have been calculated, the sum of the weights in each of the modules is calculated (see (C) in Figure 5.9). To label the clusters, the sum in each of the modules is broadcast to all the other modules for determining the target modules to which the contours will be redistributed. Assume the sum of the weights in module i is W_i, and the weight of cluster j in module i is w_{ij}. The sums in the modules can be broadcast using all-to-all communication (see (D) in Figure 5.9). Module i determines the labels for the clusters S_{i0}, S_{i1}, ..., S_{ij} which have their tails in module i based on the weights
Figure 5.9: An example for labeling clusters: (A) initial state; (B) weight calculation; (C) sum of local weights; (D) weight collection; (E) labeling operations in the module located at the left-bottom corner; (F) label propagation; (G) after moving data.
W_0, W_1, ..., w_{i0}, w_{i1}, ..., w_{ij}, ..., W_{p−1}. This can operate concurrently at the modules. To achieve balanced load, we use the almost even load strategy to determine the target modules for all the contours. In this strategy, each of the modules partitions W_0, W_1, ..., w_{i0}, w_{i1}, ..., w_{ij}, ..., W_{p−1} into p groups such that the amount of computation to be performed on each group is almost equal. Then, the inter-module data dependencies are eliminated by assigning all the elements of a contour to one target module. The assignment operates as follows: if the ⌊|S_i|/2⌋th pixel of S_i is in group j, then the label for contour S_i is assigned j (see (E) in Figure 5.9). The label for a global contour needs to be propagated to the modules which store the segments of the global contour (see (F) in Figure 5.9). Parallel segment-tracing is also applied to propagate the labels in reverse. After the label propagation process has been performed, all-to-all communication is performed to move the data among the modules (see (G) in Figure 5.9).
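The almost even load labeling can be sketched with prefix sums over the global weight sequence (an illustration under our reading of the midpoint rule above; not the thesis code):

void label_clusters(const double *w, int nclusters, int p, int *label) {
    double total = 0.0;
    for (int i = 0; i < nclusters; i++) total += w[i];
    double prefix = 0.0;
    for (int i = 0; i < nclusters; i++) {
        double mid = prefix + w[i] / 2.0;   /* cumulative midpoint of cluster i */
        int j = (int)(mid * p / total);     /* its equal-weight slice           */
        if (j >= p) j = p - 1;
        label[i] = j;                       /* all elements of the contour get
                                               the same target module           */
        prefix += w[i];
    }
}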
The label assignment is performed in a decentralized way. In the worst case, each module can contain segments of global contours. The weights of the clusters can be calculated in at most O(n) computation time and 8t_0√p + 4n(t_d + 2t_c) communication time. Since there are at most n²/p clusters in a module, the sum of the weights in a module can be computed in O(n²/p) computation time. Collecting the sums from all the other modules can be performed in pt_0 + pt_d communication time. Label assignment can be performed locally in O(p + n²/p) computation time. The communication time for performing label propagation is at most 8t_0√p + 4n(t_d + 2t_c). After the label of each of the contours has been propagated to the modules which contain the segments of the contour, the data movement can be performed in at most 2pt_0 + (n²/p)(2t_c + t_d) using the technique developed in Section 4.3.1. Therefore, the total execution time for p ≤ n can be asymptotically expressed as follows: the computation time is O(n + n²/p + p) = O(n²/p) and the communication time is O(p + n) = O(n). Since the parallel in-place algorithm has O(n²) computation time and O(n) communication time, our data-remapping strategy can lead to a time complexity of lower order if the length of the longest contour is O(n²/p).
According to the definition of the contour-tracing problem, the elements of all the contours need to be scanned one by one. Thus, the lower bound on the time for tracing contours on a parallel machine is equal to the time for tracing the longest contour S_0 on a serial machine. We show that after redistributing the contours, the contours can be traced in minimal time or with asymptotic processor-time optimality. Our data-remapping technique is designed to eliminate interprocessor data dependency as well as to achieve load balance. After data-remapping has been performed, each of the modules has at most ⌈S/p⌉ contour-elements or one long contour with length larger than S/p. Thus, the time for performing contour-tracing is at most max{⌈S/p⌉, |S_0|} × t. It is easy to see that tracing the contours can be performed so as to result in (1) processor-time optimality, if S/p ≥ |S_0|, or (2) minimal running time, if S/p < |S_0|.
The merit of contour-tracing with data-remapping is that no communication is needed among the modules after the data-remapping process has been performed. Thus, each of the modules can perform contour-tracing independently. The shortcoming of this scheme is that there may be a high variance in the lengths of the contours. Thus, the numbers of contour-elements assigned to the modules can be very large. In the worst case, some of the modules may not be able to store the elements assigned to them due to limited local memory space. This shortcoming can be addressed by distributing contour-elements equally among the modules. To avoid excessive communication, the contour-elements then need to be carefully assigned to the modules. In addition, scheduling of the computations is needed to minimize the adverse effect of the possible inter-module data dependencies.
5.3.4 Implementation Details and Experimental Results

The algorithms were implemented on an SP-2 with a dedicated pool of 64 modules at the Maui High Performance Computing Center. We compare our algorithms using various sizes of images (128 × 128, 256 × 256, 512 × 512, 1K × 1K). Each of the images was run on configurations of 4, 8, 16, 32 and 64 modules. The codes were written using C and the MPL message passing library. For MPL, the startup time for communicating a message is 40 μsec and the peak data transmission rate is 35.54 Mbytes/sec.

We did not target any specific linear approximation algorithm. Thus, a dummy loop was used to simulate the computation time t for comparing our algorithms. Note that t denotes the amount of computation needed by a module to extract information from a contour-element. By changing the number of iterations of the loop, t will be different. t (in μsec) can be calibrated using the following formula: t = 14 + 0.104 × number of loop iterations. At the time of this writing, data items are redistributed without employing message-grain balancing. We denote this as one-phase data redistribution. Data movement that employs message-grain balancing is denoted two-phase data redistribution. To make a decision on the choice between one-phase and two-phase, experiments were conducted on large image sizes. Our results show that the one-phase algorithm is better than the two-phase algorithm. Figure 5.10 illustrates this. There are several reasons why one-phase is the better choice:

• a small number (≤ 64) of dedicated modules were available during the experiments, and

• a small variance in the sizes of the messages was observed, due to the features of the tested images.
Denote the algorithm which traces contours without redistributing data as Algorithm A, and the algorithm which traces contours employing data-remapping as Algorithm B. Comparisons of the performance of Algorithm A and Algorithm B are shown in Figures 5.11, 5.12, 5.13, and 5.14.
For image sizes 128 × 128 (Figure 5.11), 256 × 256 (Figure 5.12), and 512 × 512 (Figure 5.13), Algorithm B always takes less time than Algorithm A, regardless of the size of the parallel machine and the number of loop iterations. However, there is a cross-over point between the two algorithms when the 1K × 1K image is used on the SP-2 with 4 modules; Figure 5.14 illustrates this. The cross-over point between the algorithms was observed at 200 iterations of the dummy loop. Compared with Algorithm A, Algorithm B is faster if the number of iterations is large. That is, if t is large, the gain from the data-remapping exceeds the overhead incurred by the data-remapping. This implies that the computational load of a module can amplify the adverse effect of inter-module data dependencies. We believe that the total execution time of Algorithm A can be reduced if load balancing is performed based on the local contours. The reason is that redistributing the local contours introduces no extra communication while reducing the maximal computational load among the modules.
Figure 5.10: Communication times for moving data using various approaches (one-phase vs. two-phase; panels: 128 × 128, 256 × 256, 512 × 512, and 1K × 1K images; axes: number of processors vs. execution time in msec).
Figure 5.11: Execution times for a 128 × 128 image (Algorithm A and Algorithm B on 4 and 64 PNs; x-axis: number of loop iterations; y-axis: execution time in msec).
In our experiments, as the number of modules increases, the speed-up of Algorithm A decreases more rapidly than that of Algorithm B. In Algorithm A, communication is performed after all the current ready-segments (of local contours and global contours) have been scanned. This approach delays the processing of the successor segments of the global contours. The delay accumulates and becomes severe as the amount of inter-module data dependency increases. As the number of modules increases, the inter-module data dependencies also increase. Thus, Algorithm A performs worse than Algorithm B when a large number of modules is used. This also explains why Algorithm A performs worse than Algorithm B for smaller images (of size 128 × 128, 256 × 256, and 512 × 512).
In our experiments, we found that the data dependency between modules is the most important factor affecting the speed-ups for this kind of problem. This can be illustrated using our experimental data. Consider the case of tracing contours in a 512 × 512 image using Algorithm B on the SP-2 with 4 modules. After the data-remapping process has been performed, the observed numbers of contour-elements in the modules were 9638, 9657, 9559, and 9610. The loads of the modules before remapping the data were 9288, 10245, 9002, and 9929. The loads among the modules are thus not significantly different. However, the execution time of Algorithm B is always smaller than that of Algorithm A (see Figure 5.13). This is because the elements of the global contours (which were observed to be 8% of the total contour-elements) have been localized by the data-remapping process. This implies that balanced loads among the modules alone cannot guarantee large speed-ups.
Figure 5.12: Execution times for a 256 × 256 image (Algorithm A and Algorithm B on 4 and 64 PNs; x-axis: number of loop iterations; y-axis: execution time in msec).

Figure 5.13: Execution times for a 512 × 512 image (Algorithm A and Algorithm B on 4 and 64 PNs; x-axis: number of loop iterations; y-axis: execution time in msec).

Figure 5.14: Execution times for a 1024 × 1024 image (Algorithm A and Algorithm B on 4 and 64 PNs; x-axis: number of loop iterations; y-axis: execution time in msec).
Consider another case: tracing contours in a 1K × 1K image using Algorithm B on the SP-2 with 4 modules. In this test image, it was observed that only 1% of the contour-elements belong to global contours, a small fraction compared with the previous case. After the data remapping, the numbers of contour-elements in the modules were observed to be 6057, 5842, 5892, and 5927. The load is balanced compared with the initial distribution, which was 11277, 12441, 0, and 0. In this case, Algorithm A provides faster execution than Algorithm B only if the amount of computation performed at a contour-element is very small (see Figure 5.14). Thus, data remapping may not be justified by load redistribution alone. From the two cases, it appears that unbalanced load is not the major issue in obtaining large speed-ups for this application; rather, the inter-module data dependency is the major adverse factor. Moreover, this adverse factor is amplified if the loads of the modules become large or the loads are imbalanced.
5.4 Summary

In this chapter, methodologies for performing fixed data-remapping and dynamic data-remapping are investigated. For fixed data-remapping, we show that not only the amount of data to be redistributed but also the complexity of the communication patterns can affect the effectiveness of a data-remapping strategy. For dynamic data-remapping, two strategies for performing contour-tracing are investigated. The first strategy traces contours without performing the data-remapping process. The second strategy traces contours after data items have been remapped. The data-remapping process is applied to eliminate inter-module data dependencies as well as balance the loads among the modules. Our results show that data dependency is the major adverse factor in tracing contours. Even if the number of global contours is small, the adverse effect becomes large if the computational loads of the modules become heavy or the loads among the modules are imbalanced. Thus, for performing intermediate-level IU tasks on distributed memory machines, we suggest first eliminating the inter-module data dependency and then balancing the loads among the modules.
Chapter 6

Conclusion
The computing power of conventional serial computers has steadily increased to match the computational requirements of applications. However, the computing power of a serial computer is ultimately limited by the speed of light. Due to the increasing complexity of emerging applications, the performance of serial computers may not satisfy the time constraints imposed by the applications. Thus, it is natural to employ an ensemble of processors, i.e., a parallel machine, to meet the computational requirements. Parallel processing has made a tremendous impact on many applications. To achieve large speed-ups in solving application problems on a parallel machine, novel approaches to effectively utilize the computing power of parallel machines are needed. In this chapter, the contributions of the dissertation are identified and some plausible directions for future research are outlined.
6.1 Contributions of this Dissertation
Currently, many vendors are offering distributed memory parallel machines. These machines are formed by a collection of processor/memory modules connected by an interconnection network. Accessing data in a remote memory is performed over the interconnection network. The time for a module to access data in a remote memory is much longer than the time to access data in its local memory. In general, remote data-access is needed for the modules in a parallel machine to exchange data. To reduce the overhead of data exchange, the amount of inter-module data-access should be minimized.
In this dissertation, we have proposed a structure-independent model to capture the features of current distributed memory machines. These features are important to the users of these machines for designing fast algorithms or developing applications which meet their timing requirements. Our structure-independent model is realistic and robust. In such parallel machines, high performance computing can be achieved by localizing most of the data-access. In this dissertation, several techniques are proposed to map data onto the parallel machines such that most of the data-access is satisfied from local memory. The impacts of these techniques are also investigated based on our model. The techniques we developed for performing high performance computing on distributed memory machines include (1) strategies for mapping input data items to modules and to memory locations, (2) approaches for reducing communication latency by partially overlapping communication activities and by balancing message-grains, and (3) techniques for fixed data-remapping and dynamic data-remapping. The technique developed for dynamic data-remapping is applied to applications which have arbitrary data distributions and dependencies among computations performed in various modules. To show the usefulness of our approach, experiments were conducted on an IBM SP-2 with a dedicated pool of 64 modules. Section 1.3 provides a detailed summary of the research contributions made in this dissertation.
6.2 Future Directions
Parallelism for many large-scale applications has generated significant interest in the recent past. Although we have addressed several key issues in reducing communication costs when designing algorithms or developing applications on distributed memory machines, a thorough study is required to understand the inherent problems in implementing an integrated parallel system for large-scale applications. Various future research avenues have emerged from this dissertation.
• In our model, we abstract the communication facility of a parallel machine into point-to-point channels and message handlers. One of the important roles of a message handler is to schedule the reception of incoming messages. The message handler switches among the channels according to its message reception policy to receive the data items delivered from the memory of other modules. Currently, the message reception policy used in our model to analyze the communication time is first-come-first-served. However, we believe that for certain communication patterns, different message reception policies can lead to very different communication costs. For example, compared with first-come-first-served, scheduling message reception using shortest-message-first or round-robin may be a better policy for transporting data with high variance in message size (a toy illustration is sketched after this list). By investigating various message reception policies for the message handlers, machine designers can decide what kinds of policies should be provided in a distributed memory machine to suit the requirements of various applications. This research is also useful for algorithm designers and compiler developers in choosing an appropriate reception policy (protocol) from those provided by a distributed memory machine.
• Due to their lower price/performance ratio compared with MPPs, clusters of workstations will become a major resource for performing parallel and distributed computations in the future. For evaluating the performance of a given algorithm on clusters of workstations, the architectural features of network computing need to be investigated. In general, the number of links and switches used by a cluster of workstations may be smaller than in an MPP with the same number of processor/memory modules. Thus, the total communication bandwidth provided by the network may not be large enough for all the modules to perform remote data-access at the same time, even when the communication pattern does not cause any module-access contention at the target modules. Examples are workstations connected by Ethernet or FDDI, in which all data traffic shares a single physical medium. In addition, larger packets can be used in the network to transport data between any pair of processor/memory modules to reduce the communication time. An example is the packet format used by Myrinet, in which the payload of a Myrinet packet is of arbitrary length [7]. Intuitively, transporting data using large packets is more prone to data contention in the network than using small packets. Thus, the users of clusters of workstations may need to pay attention to data contention in the network when developing their applications. For a communication pattern that is bad for a cluster of workstations, the time for performing the pattern can be very long due to data contention in the network. To reduce the communication time, users of such machines may need to decompose the bad communication pattern into several efficient communication patterns (one standard decomposition is sketched after this list).
• An analytical tool is required for designing algorithms which effectively overlap computations and communications. The purpose of the tool is to ease the development and analysis of such algorithms. The methodologies and notations used by the tool should be concise and consistent. Currently, algorithm designers use asymptotic expressions to represent computation time and exact expressions to represent communication time. Due to this inconsistent representation, it is very difficult to see to what degree the computation activities can overlap with the communication activities (the overlap pattern itself is sketched after this list).
• Our data-remapping technique has been developed to handle inter-module data dependency and imbalanced load among the modules. We have investigated linear data dependency among the subtasks. For some applications, the data dependency among the subtasks can be more complex. In this case, a graph can be used to describe the data dependency among the subtasks. To investigate the effectiveness of our data-remapping technique under complex data dependency, we will develop a program which randomly generates the loads for the modules and the data dependency among the modules, and will apply our technique to the generated task (a first-cut generator is sketched after this list). Furthermore, various data-remapping techniques will be proposed for running applications on distributed memory machines. The effectiveness of applying these techniques on distributed memory machines will be discussed, and the suitability of these techniques will be classified according to the inherent nature of the applications.
• Message passing is a paradigm used widely on certain classes of parallel machines. MPI (Message Passing Interface) is the new standard for multicomputers introduced by the Message-Passing Interface Forum (MPIF) in April 1994 [44]. The goal of MPIF is to develop a widely used standard for writing message-passing programs. The main advantages of establishing a message-passing standard are portability and ease-of-use. The performance of MPI depends highly on its implementation; currently, the performance of the MPI implementations available on several commercial machines is not good. To suit the requirements of high performance computation, the communication primitives of MPI need to be carefully implemented. In the future, we will investigate the MPI communication primitives and explore the possibility of designing fast communication primitives for MPI (a simple microbenchmark for such an investigation is sketched after this list).
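The first direction above concerns message reception policies. The following toy model (our own construction, not part of the dissertation's model) shows why shortest-message-first can beat first-come-first-served when message sizes have high variance: it is the classic shortest-processing-time rule, which minimizes the mean completion time of the pending messages.

#include <stdio.h>
#include <stdlib.h>

/* Service time is proportional to message size; one message per channel. */
static int cmp_asc(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Mean completion time when messages are received in the given order. */
static double mean_completion(const int sizes[], int n) {
    double clock = 0.0, total = 0.0;
    int i;
    for (i = 0; i < n; i++) { clock += sizes[i]; total += clock; }
    return total / n;
}

int main(void) {
    int fcfs[] = {900, 10, 20, 15};   /* arrival order, high variance */
    int smf[]  = {900, 10, 20, 15};
    int n = 4;
    qsort(smf, n, sizeof smf[0], cmp_asc);   /* shortest-message-first */
    printf("FCFS mean completion: %.2f\n", mean_completion(fcfs, n));
    printf("SMF  mean completion: %.2f\n", mean_completion(smf, n));
    return 0;
}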
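For the second direction, one standard way to decompose a contention-heavy pattern is to carry out a complete exchange in p shifted phases, so that each module communicates with exactly one partner per phase. The sketch below (ours, not the dissertation's algorithm; the buffer layout and BLOCK size are hypothetical) expresses this with MPI, whose MPI_Sendrecv primitive pairs the send and receive of each phase:

#include <mpi.h>

#define BLOCK 1024  /* ints per block; hypothetical layout: block j of
                       sendbuf holds the data destined for module j */

void phased_all_to_all(int *sendbuf, int *recvbuf, MPI_Comm comm) {
    int rank, p, i;
    MPI_Status status;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &p);
    for (i = 0; i < p; i++) {
        int dest = (rank + i) % p;       /* partner we send to in phase i */
        int src  = (rank - i + p) % p;   /* partner we receive from */
        MPI_Sendrecv(sendbuf + dest * BLOCK, BLOCK, MPI_INT, dest, 0,
                     recvbuf + src * BLOCK,  BLOCK, MPI_INT, src,  0,
                     comm, &status);
    }
}

Each phase is a permutation of the modules, so no module is the target of more than one message at a time.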
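For the third direction, the overlap pattern whose analysis such a tool would ease can at least be written down with MPI's nonblocking primitives; a minimal sketch (ours; all names are hypothetical):

#include <mpi.h>

/* Start a nonblocking exchange with a partner, compute on data that is
 * already local, then wait before touching the received data. */
void overlap_step(double *halo_out, double *halo_in, int n, int partner,
                  double *local, int m, MPI_Comm comm) {
    MPI_Request reqs[2];
    MPI_Status  stats[2];
    int i;
    MPI_Isend(halo_out, n, MPI_DOUBLE, partner, 0, comm, &reqs[0]);
    MPI_Irecv(halo_in,  n, MPI_DOUBLE, partner, 0, comm, &reqs[1]);
    for (i = 0; i < m; i++)        /* work independent of halo_in */
        local[i] *= 0.5;
    MPI_Waitall(2, reqs, stats);   /* overlap ends here */
    /* computation that depends on halo_in may now proceed */
}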
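For the fourth direction, a first cut at the proposed generator (purely illustrative; all parameters are ours) can produce a random dependency structure by keeping each edge (i, j) with i < j with fixed probability, which guarantees acyclicity, together with a random load per subtask:

#include <stdio.h>
#include <stdlib.h>

#define SUBTASKS  8
#define EDGE_PROB 0.25
#define MAX_LOAD  1000

int main(void) {
    int load[SUBTASKS];
    int i, j;
    srand(1995);                       /* fixed seed for repeatability */
    for (i = 0; i < SUBTASKS; i++)
        load[i] = 1 + rand() % MAX_LOAD;
    for (i = 0; i < SUBTASKS; i++) {
        printf("subtask %d (load %4d) ->", i, load[i]);
        for (j = i + 1; j < SUBTASKS; j++)
            if ((double)rand() / RAND_MAX < EDGE_PROB)
                printf(" %d", j);      /* data dependency i -> j */
        printf("\n");
    }
    return 0;
}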
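For the last direction, a standard starting point is a ping-pong microbenchmark that exposes the startup time and bandwidth an MPI implementation actually delivers; a minimal sketch (ours), to be run with two ranks:

#include <mpi.h>
#include <stdio.h>

#define REPS  1000
#define BYTES 4096

int main(int argc, char **argv) {
    static char buf[BYTES];
    int rank, i;
    MPI_Status st;
    double t0, dt;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < REPS; i++) {
        if (rank == 0) {
            MPI_Send(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, BYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, BYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    dt = MPI_Wtime() - t0;
    if (rank == 0)   /* a round trip is two one-way messages */
        printf("one-way time for %d bytes: %.1f usec\n",
               BYTES, dt / (2.0 * REPS) * 1e6);
    MPI_Finalize();
    return 0;
}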
Bibliography

[1] A. V. Aho, J. E. Hopcroft, and J. D. Ullman, Data Structures and Algorithms, Computer Science and Information Processing, Addison-Wesley, Reading, MA, 1983.

[2] H. M. Alnuweiri and V. K. Prasanna, "Parallel Architectures and Algorithms for Image Component Labeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 14, No. 10, pp. 1014-1034, 1992.

[3] H. Aoyama and M. Kawagoe, "A Piecewise Linear Approximation Method Preserving Visual Feature Points of Original Figures," CVGIP: Graphical Models and Image Processing, Vol. 53, No. 5, pp. 435-446, 1991.

[4] D. Ballard, "Generalizing the Hough Transform to Detect Arbitrary Shapes," Pattern Recognition, Vol. 13, pp. 111-122, 1981.

[5] A. Bar-Noy and S. Kipnis, "Designing Broadcasting Algorithms in the Postal Model for Message-Passing Systems," in Proceedings of the 4th ACM Symposium on Parallel Algorithms and Architectures, pp. 13-22, 1992.

[6] Gordon Bell, "Ultracomputers: A Teraflop Before Its Time," Communications of the ACM, Vol. 35, No. 8, August 1992.

[7] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su, "Myrinet: A Gigabit-per-Second Local Area Network," IEEE Micro, pp. 29-39, February 1995.

[8] S. H. Bokhari, "Complete Exchange on the iPSC/860," ICASE Technical Report No. 91-4, NASA Langley Research Center, January 1991.

[9] A. N. Choudhary, B. Narahari, and R. Krishnamurti, "An Efficient Heuristic Scheme for Dynamic Remapping of Parallel Computations," Parallel Computing, 19, pp. 621-632, 1993.

[10] Y. Chung, V. K. Prasanna, and C.-L. Wang, "A Fast Asynchronous Algorithm for Linear Feature Extraction on IBM SP-2," in Proceedings of the Workshop on Computer Architectures for Machine Perception, 1995.

[11] Y. Chung and V. K. Prasanna, "Timing Measurement for Performing Basic Communication Patterns on IBM SP-2," Technical Report, Department of EE-Systems, University of Southern California, 1995.

[12] J. M. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comp., 19, pp. 297-301, 1965.

[13] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press, 1989.

[14] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, "LogP: Towards a Realistic Model of Parallel Computation," in Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practices of Parallel Programming, pp. 1-12, 1993.

[15] R. Cypher, J. L. C. Sanz, and L. Snyder, "Algorithms for Image Component Labeling on SIMD Mesh Connected Computers," IEEE Transactions on Computers, Vol. 39, No. 2, pp. 276-281, 1990.

[16] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina, "Architectural Requirements of Parallel Scientific Applications with Explicit Communication," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 2-13, 1993.

[17] M. Dubois, C. Scheurich, and F. A. Briggs, "Memory Access Buffering in Multiprocessors," in Proceedings of the 13th Annual International Symposium on Computer Architecture, pp. 434-442, 1986.

[18] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, "Active Messages: A Mechanism for Integrated Communication and Computation," in Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 256-266, 1992.

[19] S. Fortune and J. Wyllie, "Parallelism in Random Access Machines," in Proceedings of the 10th Annual Symposium on Theory of Computing, pp. 114-118, 1978.

[20] D. Gerogiannis and S. C. Orphanoudakis, "Load Balancing Requirements in Parallel Implementations of Image Feature Extraction Tasks," IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 9, September 1993.

[21] Grand Challenges: High Performance Computing and Communications, Committee on Physical, Mathematical, and Engineering Sciences, National Science Foundation, 1991.

[22] A. Gupta and V. Kumar, "Analyzing Performance of Large Scale Parallel Systems," Technical Report 92-32, Department of Computer Science, University of Minnesota, Minneapolis, 1992.

[23] S. E. Hambrusch and A. A. Khokhar, "C3: An Architecture-Independent Model for Coarse-Grained Parallel Machines," in Proceedings of the Sixth IEEE Symposium on Parallel and Distributed Processing, pp. 544-551, 1994.

[24] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1990.

[25] T. Heywood and S. Ranka, "A Practical Hierarchical Model of Parallel Computation: I. The Model," Journal of Parallel and Distributed Computing, Vol. 16, pp. 212-232, 1992.

[26] A. Huertas, C. Lin, and R. Nevatia, "Detection of Buildings from Monocular Views of Aerial Scenes using Perceptual Groupings and Shadows," in Proceedings of the ARPA Image Understanding Workshop, 1993.

[27] IEEE, Symposium Record - Hot Chips IV, August 1992.

[28] J. JaJa and K. W. Ryu, "The Block Distributed Memory Model for Shared Memory Multiprocessors," in Proceedings of the International Parallel Processing Symposium, pp. 752-756, 1994.

[29] D. V. James, A. T. Laundrie, S. Gjessing, and G. S. Sohni, "Distributed-Directory Scheme: Scalable Coherence Interface," IEEE Computer, 23(6), pp. 74-77, 1990.

[30] E. T. Kalns and L. M. Ni, "Processor Mapping Techniques Toward Efficient Data Redistribution," in Proceedings of the International Parallel Processing Symposium, pp. 469-476, 1994.

[31] R. K. Koeninger, M. Furtney, and M. Walker, "A Shared Memory MPP from Cray Research," Digital Technical Journal, Vol. 6, No. 2, Spring 1994.

[32] J. Kuskin et al., "The Stanford FLASH Multiprocessor," in Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 302-313, 1994.

[33] V. Kumar and V. N. Rao, "Parallel Depth-First Search, Part II: Analysis," International Journal of Parallel Programming, 16(6), pp. 501-519, 1987.

[34] V. Kumar and A. Gupta, "Analyzing Scalability of Parallel Algorithms and Architectures," Technical Report 91-18, Department of Computer Science, University of Minnesota, Minneapolis, 1991.

[35] V. Kumar and V. Singh, "Scalability of Parallel Algorithms for the All-Pairs Shortest-Path Problem," Journal of Parallel and Distributed Computing, Vol. 13, pp. 124-138, 1991.

[36] T. T. Kwan, B. K. Totty, and D. A. Reed, "Communication and Computation Performance of the CM-5," in Proceedings of Supercomputing '93, pp. 192-201, 1993.

[37] T. Leighton, "Tight Bounds on the Complexity of Parallel Sorting," IEEE Transactions on Computers, C-34(4), pp. 344-354, April 1985.

[38] D. Lenoski, J. Laudon, K. Gharachorloo, W. D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam, "The Stanford DASH Multiprocessor," IEEE Computer, pp. 63-79, March 1992.

[39] K. Li, "Scalability Issues of Shared Virtual Memory for Multiprocessors," in Dubois and Thakkar (eds.), Scalable Shared-Memory Multiprocessors, Kluwer Academic Publishers, Boston, MA, 1992.

[40] C.-C. Lin and V. K. Prasanna, "Analysis of Cost of Performing Communications Using Various Communication Mechanisms," in Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, pp. 290-297, 1995.

[41] C.-C. Lin and V. K. Prasanna, "Scalable Parallel Extraction of Linear Features on MP-2," in Proceedings of the Workshop on Computer Architectures for Machine Perception, pp. 352-362, 1993.

[42] W. M. Lin and V. K. Prasanna, "Parallel Algorithms and Architectures for Consistent Labeling," in Parallel Processing for Artificial Intelligence, L. Kanal, V. Kumar, H. Kitano, and C. Suttner (eds.), Elsevier Science Publishers B.V., 1994.

[43] S. Miguet and V. Poty, "Revisiting the Allocation of Regular Data Arrays to a Mesh of Processors," in Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, pp. 62-69, 1992.

[44] Message Passing Interface Forum, "MPI: A Message-Passing Standard," Technical Report CS-94-230, Computer Science Department, University of Tennessee, Knoxville, 1994.

[45] R. Nevatia and K. R. Babu, "Linear Feature Extraction and Description," Computer Graphics and Image Processing, 13, pp. 257-269, 1980.

[46] R. Nevatia, Machine Perception, Prentice-Hall Inc., 1982.

[47] D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee, "Communication Optimizations used in the Paradigm Compiler for Distributed-Memory Multicomputers," in Proceedings of the International Conference on Parallel Processing, 1994.

[48] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann Publishers, San Mateo, California, 1994.

[49] V. K. Prasanna and C. S. Raghavendra, "Array Processor with Multiple Broadcasting," Journal of Parallel and Distributed Computing, 4, pp. 173-190, 1987.

[50] A. Rosenfeld, R. A. Hummel, and S. W. Zucker, "Scene Labeling by Relaxation Operations," IEEE Transactions on Systems, Man, and Cybernetics, SMC-6, pp. 420-423, 1976.

[51] S. Ranka, R. Shankar, and K. Alsabti, "Many-to-Many Personalized Communication with Bounded Traffic," in Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, February 1995.

[52] A. P. Reeves, "Parallel Algorithms for Real-Time Image Processing," in Multicomputers and Image Processing, K. Preston and L. Uhr (eds.), Academic Press, pp. 7-18, 1982.

[53] C. Reinhart and R. Nevatia, "Efficient Parallel Processing in High Level Vision," in Proceedings of the ARPA Image Understanding Workshop, 1990.

[54] Howard J. Siegel et al., "Report of the Purdue Workshop on Grand Challenges in Computer Architecture for the Support of High Performance Computing," Journal of Parallel and Distributed Computing, 16, pp. 199-211, 1992.

[55] Xian-He Sun and Lionel M. Ni, "Another View on Parallel Speedup," in Proceedings of Supercomputing '90, pp. 324-333, 1990.

[56] Xian-He Sun, "Parallel Computation Models for Scientific Computing on Multicomputers," Ph.D. Dissertation, Computer Science Department, Michigan State University, 1990.

[57] A. S. Tanenbaum, Computer Networks, Prentice-Hall Inc., New Jersey, 1981.

[58] L. G. Valiant, "A Bridging Model for Parallel Computation," Communications of the ACM, Vol. 33, No. 8, pp. 103-111, 1990.

[59] C.-L. Wang and V. K. Prasanna, "Parallelization of Perceptual Grouping on Distributed Memory Machines," in Proceedings of the Workshop on Computer Architectures for Machine Perception, University of Southern California, January 1995.

[60] C.-L. Wang, V. K. Prasanna, H. Kim, and A. A. Khokhar, "Scalable Data Parallel Implementations of Object Recognition using Geometric Hashing," Journal of Parallel and Distributed Computing, pp. 96-109, March 1994.

[61] C.-L. Wang and V. K. Prasanna, "Low Level Vision Processing on Connection Machine CM-5," in Proceedings of the Workshop on Computer Architectures for Machine Perception, pp. 352-362, 1993.