INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer.

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book.

Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
313/761-4700 800/521-0600

ALGORITHMIC APPROACHES FOR REDUCING COMMUNICATION COSTS IN DISTRIBUTED MEMORY MACHINES

by Cho-chin Lin

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY (Computer Engineering)

August 1995

Copyright 1995 Cho-chin Lin

UMI Number: 9621717. UMI Microform 9621717. Copyright 1996, by UMI Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. UMI, 300 North Zeeb Road, Ann Arbor, MI 48103.

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by Cho-chin Lin under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

To my wife Hui-ling

Acknowledgements

It is a pleasure to express my deep gratitude to Professor Viktor K. Prasanna, the chairman of my dissertation committee, for his consistent assistance, encouragement, and guidance throughout my graduate studies at the University of Southern California. This dissertation would not have been completed without the proper motivational support he offered when it was most needed. Numerous ideas on my research have been given to me through his extensive research experience. Thanks to Professor Jean-Luc Gaudiot and Professor Kai Hwang for guiding me through the PhD qualifier. My deep appreciation is also given to Professor Douglas Ierardi and Professor Timothy M. Pinkston for being on my dissertation committee and for many invaluable suggestions.

I appreciate the help of Yongwha Chung in writing the code for my algorithm on a parallel machine. In addition, thanks to my colleagues
Wei-ming Lin, Ashfaq A. Khokhar, Cho-li Wang, Jongwoo Bae, Wen-heng Liu, Prashanth B. Bhat, Suriyaprakash Natarajan, and Young-won Lim for their support and help in dissertation related matters, or otherwise. Finally, but not least, I would like to express my most profound gratitude to my parents and my wife, who have shown their continued endurance, sacrifice, and love throughout my doctorate study.

Contents

Dedication
Acknowledgements
List Of Figures
Abstract

1 Premise
1.1 Introduction
1.2 Motivation and Approaches
1.3 A Summary of Results

2 A Computational Model
2.1 Background
2.2 A Realistic Computational Model
2.3 Comparison with Other Models
2.3.1 The PRAM Model
2.3.2 The Network Models
2.3.3 The LogP Model
2.3.4 The Postal Model
2.3.5 The BSP Model
2.4 Summary

3 Initial Data-mapping
3.1 Background
3.2 Mapping Data to Modules with Bounded Memory Sizes
3.2.1 Definitions and Notations
3.2.2 Scalability Analysis with Bounded Memory Sizes
3.2.3 Observations and Discussions
3.3 Mapping Data to Memory Locations
3.4 Summary

4 Communication Latency Reducing
4.1 Background
4.2 Communication Activity Hiding
4.2.1 One-to-one
4.2.2 One-to-many
4.2.3 Many-to-one
4.3 Message-grain Balancing
4.3.1 Data Distribution
4.3.2 On the Overhead of Balancing Message-grains
4.4 Summary

5 Data-remapping
5.1 Background
5.2 Fixed Data-remapping
5.2.1 On the Complexity of Performing Data-remapping
5.2.2 On the Message Sizes for Performing Data-remapping
5.3 Dynamic Data-remapping
5.3.1 Definitions and Notations
5.3.2 In-Place Computations
5.3.3 Computations with Data-Remapping
5.3.4 Implementation Details and Experimental Results
5.4 Summary

6 Conclusion
6.1 Contributions of this Dissertation
6.2 Future Directions

Bibliography

List Of Figures

1.1 A diagram for a centralized memory machine.
1.2 A diagram for a distributed memory machine.
1.3 A diagram for a current distributed memory machine.
1.4 An approach to achieve large speed-ups on parallel machines.
1.5 Issues of mapping data to distributed memory machines.
2.1 An overview of our computational model.
2.2 An illustration for the data receptions at a message handler.
3.1 Sequential Floyd Algorithm.
3.2 Data dependency in the checkerboard partition.
3.3 Expansion ranges of various parallel systems.
3.4 Module-mapping for window operations.
3.5 Eight sets of data groups.
3.6 Memory-mapping A.
3.7 Memory-mapping B.
4.1 An illustration for one-to-one communication.
4.2 Overlapping in performing one-to-one communication.
4.3 An illustration for one-to-many communication.
4.4 An illustration for a module-partition approach.
4.5 Trade-off in the module-partition approach.
4.6 An illustration for many-to-one communication.
4.7 Overlapping in performing many-to-one communication.
4.8 An illustration for module-access contention.
4.9 A communication pattern for distributing data among 20 modules.
4.10 Algorithm_1 for data distribution.
4.11 Algorithm_2 for data distribution.
4.12 Notations used in the closed form.
4.13 An illustration for the amount of data that needs to be balanced.
5.1 An illustration for a 16-point FFT butterfly graph.
5.2 An illustration for cyclic layout.
5.3 An illustration for hybrid layout.
5.4 A 128 by 128 raw image.
5.5 A 128 by 128 image with marked pixels.
5.6 Major steps in performing data-remapping.
5.7 Major steps in performing cluster-labeling.
5.8 An illustration for the divide-and-conquer approach.
5.9 An example for labeling clusters.
5.10 Communication times for moving data using various approaches.
5.11 Execution times for a 128 x 128 image.
5.12 Execution times for a 256 x 256 image.
5.13 Execution times for a 512 x 512 image.
5.14 Execution times for a 1024 x 1024 image.

Abstract

Many applications which concern issues of human welfare have enormous computational requirements. However, today's sequential computers cannot meet the requirements needed to implement these applications. Currently, many vendors are offering distributed memory parallel machines. In these machines, the time for a processor to access data from remote modules is much longer than that for the processor to access data from its own module.

Substantial overhead exists in current distributed memory machines to initialize message delivery and schedule message reception. To achieve large speed-ups, algorithms should be designed in such a way as to minimize the overhead in accessing remote modules. In this dissertation, a realistic model is proposed to serve as a bridge between algorithm designers and distributed memory machines. Based on the model, our approaches for achieving large speed-ups focus on localizing most of the data-access and reducing the overhead in accessing remote data. These approaches are developed in three aspects.
First, a methodology is developed for evaluating various strategies for mapping data to the modules; the effect of using various mapping strategies on a parallel machine with bounded memory size can then be evaluated. For a machine to maintain its efficiency at a desired level, a relationship must hold between the size of a problem and the size of the machine. The relationship depends on the mapping strategy, in which the memory requirement can increase non-linearly with the problem size. We have derived a range on the number of modules over which the efficiency of the parallel machine can be maintained at a desired level.

Second, techniques for reducing latency in moving data among the modules are considered. These techniques are developed for performing window operations, for performing fundamental communication patterns, and for distributing data among the modules. Our results show that, for a given module-mapping of window operations, the time for constructing messages can be reduced by storing the data at appropriate memory locations; that by hiding partial communication activities, the communication latency in performing fundamental communication patterns can be reduced; and that the module-access contention problem in performing data distribution can be solved by balancing message-grains.

Finally, techniques are proposed for performing data-remapping for image processing applications with arbitrary data dependency. For the data-remapping, we show that an algorithm designed for a smaller amount of remote data-access may degrade the performance of some types of parallel machines. Then, a technique for dynamic data-remapping is developed to eliminate inter-module data dependency as well as load imbalance during run-time. We also report results of experiments conducted on an IBM SP-2.

Chapter 1

Premise

Parallel processing has been an active area for a couple of decades. Success in parallel processing requires advances in both hardware technology and software technology. The goal of the hardware design in a parallel machine is to provide enormous computing power for large-scale applications. Based on the nature of an application, parallel software should be tailored to effectively utilize the computing power of the parallel machine to suit the computational requirement of the application. In this chapter, the need for parallel machines and the trend in developing parallel machines are presented in Section 1.1. In Section 1.2, the motivation of this research and the approaches used in this research are given. A summary of our results is listed in Section 1.3.

1.1 Introduction

When we look at human history, we find that human beings have relied mainly on their brains to perform various activities, such as remembrance or calculation. In order to reduce the burden on human brains, varieties of tools and rules have been developed to ease these activities. Examples are: performing calculations with the aid of the abacus and the slide rule, and keeping remembrance with the aid of ropes and knot-tieing rules. However, the tools and rules did not totally relieve human brains of these burdens. As the size and complexity of the activities to be carried out increase, the limitations of human beings in performing such activities are becoming more and more obvious.
For example,

• the tools and rules for carrying out the activities generally need the cooperation of human hands and brains; this leads to a slow processing speed,

• human beings are notoriously prone to error, so that complex activities performed by hand are unreliable, and

• the remembrance capacity of human beings is limited; memory becomes obscure after a period of time.

Thus, computing machines (computers) were developed and employed to assist human beings in handling activities of increasing complexity.

The major components of today's electronic sequential computer consist of a microprocessor, a memory unit, and input/output equipment. Microprocessor performance is advancing at a rate of 50 to 100% per year [27]. State-of-the-art microprocessors can have computing speeds of up to hundreds of MFLOPS. Memory capacity is increasing at a rate comparable to the increase in capacity of DRAM chips: quadrupling in size every three years [24]. Today's personal computers and workstations use tens or hundreds of MBytes. It seems that substantial progress has been achieved in sequential computer technology. However, the performance still cannot meet the computational requirements of the grand challenge problems, which are identified in the U.S. High-Performance Computing and Communication (HPCC) program. These problems consider the following applications:

Climate Modeling, Fluid Turbulence, Pollution Dispersion, Human Genome, Ocean Circulation, Quantum Chromodynamics, Semiconductor Modeling, Superconductor Modeling, Combustion Systems, Vision and Cognition.

These applications concern issues of human welfare and science that may lead to a better living environment. The applications have enormous computational requirements. Consider, for example, the problem of modeling the weather. In 5 years' time, data collection facilities will be in place to define detailed atmospheric structures and permit significant advances in forecasting capabilities. The goal of improving atmospheric modeling resolution to a 5-km scale and providing timely results is believed to require 20 TFLOPS of performance [54]. However, today's most powerful sequential computers cannot meet the computational requirement needed to implement this approach. Thus, it is obvious that a serious attack on this application will require high performance parallel machines.

[Figure 1.1: A diagram for a centralized memory machine.]

Many parallel machines with various architectures have appeared in pursuit of the goal of high performance computing. According to the memory organization of MIMD (multiple instruction and multiple data) parallel machines, the machines can be classified into two categories: centralized memory machines and distributed memory machines [48]. Centralized memory machines put the processors all on one side and the memories on the other side. The structure of a centralized memory machine is illustrated in Figure 1.1. The time for accessing data from any memory location is the same for all the processors, because every data-access must go over the interconnection network. Examples of these machines are:

• Cray Y/MP
• SGI 4/360
• IBM ES9000
• Sequent Symmetry

[Figure 1.2: A diagram for a distributed memory machine.]

For a distributed memory machine, the memory units are distributed across the machine. Each of the processors has some memory units close to it. We refer to those memory units as the local memory of the processor. In contrast to the local memory, the other memory units are called the remote memory of the processor. The structure of a distributed memory machine is illustrated in Figure 1.2. In such a memory organization, the time for accessing data from local memory is shorter than that for accessing data from remote memory. Due to their cost/performance advantage, current distributed memory parallel machines are formed by a collection of complete computers (processor/memory modules) combined by an interconnection network. Figure 1.3 illustrates the current trend in building a distributed memory machine. In these machines, data can be accessed from local memory without the aid of an interconnection network. Examples of such machines include:

• Thinking Machines CM-5
• Intel Paragon
• Cluster of Workstations
• IBM SP1/SP2
• Stanford DASH/FLASH
• Cray T3D

[Figure 1.3: A diagram for a current distributed memory machine.]

Basically, these parallel machines are built from a small number of basic components, with no single bottleneck component. Thus, the parallel machines can be incrementally expanded over their designed scaling range and can potentially deliver linear incremental performance for a well-defined set of applications. Such parallel machines are considered to be scalable [6].

According to the remote data-access capability of the modules in distributed memory machines, the machines have either multiple disjoint addressing-spaces or a global addressing-space. For those machines having multiple disjoint addressing-spaces, message-passing is used to access data from remote memory. For those machines having a global addressing-space, distributed-shared-memory (DSM) is used to access data from remote memory. The principal difference between message-passing and DSM is in their protocols [32]. A major problem encountered in DSM is that a module needs to use a sequence of loads and stores to perform remote data-access using messages of fixed size. This results in high communication overhead in transferring a block of data in DSM. A major criticism of most commercial message-passing machines is their high software overhead associated with user-level message-passing. This can be inefficient if a pair of modules frequently communicate short messages. Recent work [16] has shown that, to suit the timing requirements of various applications, communication among the modules should perform well with either very short messages or very long messages, as well as with a mixture of the two. Thus, the architectures of current parallel machines tend to support both types of communication. Message-passing machines are moving towards efficient support for communicating short messages of fixed size (e.g., active messages in CM-5), a feature normally associated with DSM machines.
Similarly, DSM machines are beginning to provide efficient message-like block transfers (e.g., the block transfer engine in T3D and the Magic chip in FLASH). Thus, the users of these parallel machines should choose the appropriate communication mechanism according to the requirements of their computations, or redesign their algorithms to exploit the available communication mechanisms.

Although various commercial and experimental distributed memory machines are available or under construction, they share common inherent characteristics. Those characteristics which cannot be overlooked by algorithm designers and application developers seeking high performance computing are listed as follows.

• Since a module that accesses data from remote memory must go over the interconnection network, the time for performing remote data-access is much longer than that for performing local data-access. The steps for a module to access data from remote memory in a distributed memory machine include message construction, network startup processing, and data transportation.

• Various communication mechanisms may be provided by parallel machines; each of the mechanisms is designed to suit the requirements of a specific class of communications. Thus, users of the parallel machine should carefully choose an appropriate mechanism according to the communication requirement of an application.

• The interconnection network supports point-to-point communication. Thus, a module can access data from remote memory without disturbing the computational activities performed in the other modules.

• Each module has limited communication bandwidth; that is, each module can deliver or receive a fixed amount of data in a constant period of time. Thus, multiple modules attempting to access a specified module can cause the problem of module-access contention.

• The network capacity is finite. A source module will suspend its data delivery activities if the network is saturated with data.

• The importance of communication protocols is increasing. The protocols define how the communication activity of a pair of communicating modules is operated. In general, accessing data from remote memory using different communication protocols can lead to very different communication times.

Recently, several models [5, 14, 28, 58] have been proposed. These models aim to capture the characteristics of current distributed memory machines for algorithm designers tailoring their algorithms to suit the nature of an application. Several mechanisms have also been built into distributed memory machines for hiding the long latency in accessing data from remote memory. These mechanisms include data pre-fetching, shared virtual memory [39], coherent caches [38], scalable coherence interface [29], and relaxed memory consistency [17]. In this dissertation, we propose a realistic model serving as a bridge between algorithm designers and distributed memory machines. Then, techniques for reducing communication costs in distributed memory machines using algorithmic approaches are proposed. The rest of this chapter gives the motivation of this research and the approaches used in it, and summarizes our results.
1.2 Motivation and Approaches

Our research is motivated by the technological trends in designing state-of-the-art commercial and experimental parallel machines. Today, most MIMD parallel machines are essentially formed by a collection of processor/memory modules connected by an interconnection network. These machines are named "distributed memory machines". In distributed memory machines, each module accesses data from other modules (remote data-access) over an interconnection network. This common architectural feature encourages users to exploit the locality of data-access in solving application problems on the machines. Thus, several techniques for achieving or restoring data locality are proposed and investigated.

The hardware of many current MIMD distributed memory machines is designed to deliver computing power in the GFLOPS range. TFLOPS performance is the next goal to be pursued. A general approach for achieving large speed-ups on these machines is to partition a task into several subtasks. Each of these subtasks is then mapped to the modules of the machine. Figure 1.4 illustrates this approach.

[Figure 1.4: An approach to achieve large speed-ups on parallel machines.]

Remote data-access may be necessary due to possible data dependency among the subtasks. The extra time for remote data-access adds to the total execution time. The overhead of accessing data from remote memory using message-passing or distributed-shared-memory can lead to under-utilization of the computing power of the parallel machines. This is a major overhead incurred in parallelizing a task. In order to reduce the adverse effects of inter-module data-access, novel approaches for reducing communication costs need to be investigated. Compiler approaches to minimize the communication time have been proposed (see, for example, [47]). However, for many large-scale applications, the computations generally involve arbitrary data distribution, dependencies among computations to be performed in various processors, and frequent irregular interprocessor communication. Due to the large computational complexity of these applications, algorithmic approaches for reducing the time in accessing data from remote memory are also needed.
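To see why this overhead is decisive, consider a back-of-the-envelope bound (an illustration added here for exposition; the notation is ours, not the dissertation's): if a task with sequential time $T_s$ is split evenly over $p$ modules and each module spends an additional $T_{\mathrm{comm}}$ on remote data-access, the achievable speedup is

```latex
% Illustrative bound: T_s = sequential time, p = number of modules,
% T_comm = per-module remote data-access overhead.
S(p) \;=\; \frac{T_s}{T_s/p + T_{\mathrm{comm}}}
     \;\le\; \min\!\left(p,\; \frac{T_s}{T_{\mathrm{comm}}}\right)
```

However many modules are added, the speedup saturates at $T_s/T_{\mathrm{comm}}$; shrinking the communication term is therefore as important as adding modules.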
To investigate the techniques developed for achieving high performance computing, a model of parallel computation which bridges parallel machines and parallel software is very important. In this dissertation, we first propose a realistic computational model. A realistic model should reflect the users' system viewpoint of the parallel machines. The parameters of the model must capture those features that are fundamental to predicting the performance of parallel software running on distributed memory machines. The model should be concise enough to develop and analyze algorithms easily. The model should also provide the stage for developing portable algorithms, which can be run on different architectures with little modification.

Since the common feature of distributed memory machines is their large remote data-access time, data locality is a key issue in achieving high performance on such machines. This suggests that developing effective data-mapping techniques should be a useful approach for reducing the total execution time. In this dissertation, based on our model, several techniques for achieving effective data-mapping are proposed, and the limitations and usefulness of the techniques are investigated. In general, the issues in data-mapping consist of initial data-mapping and data-remapping, as shown in Figure 1.5. Initial data-mapping considers how to assign data to each of the modules to minimize the communication costs for the entire execution or parts of it. In general, initial data-mapping captures the data locality of the computation at the beginning; thus, most of the data-accesses can be localized. For some applications, the localized data-access may become obscure after a period of computation, and the number of remote data-accesses increases. Each remote data-access experiences a high communication latency. Thus, an approach of data-remapping to restore the data locality of the computations is necessary. Data-remapping needs to consider the data re-layout and the data transportation among the modules. These are the overheads incurred in data-remapping. For those applications which have arbitrary data dependency and imbalanced load among the modules, gathering global information is necessary for determining the policy of data re-layout. Global information can be gathered into a module by accessing data from remote memory. It is obvious that data transportation among modules plays an important part in data-remapping.

[Figure 1.5: Issues of mapping data to distributed memory machines.]

Due to technological constraints, the limited bandwidth between any pair of communicating modules may lead to an intolerable communication time if an inappropriate communication scheduling strategy is applied. For example, data transportation among a group of modules may be poorly scheduled such that many modules access the same small group of modules at the same time. A straightforward approach to avoid this is the linear permutation algorithm [8], sketched below. In this approach, data transportation among the modules is performed in several permutation steps; in each permutation step, each module in the group directly sends (receives) one message to (from) one other module. However, the approach cannot be directly applied to applications which require communicating messages with high variance in sizes. This suggests that novel approaches are required for data transportation among modules, to ensure that the gains from data-remapping are larger than the overheads paid for it. As illustrated in Figure 1.5, techniques such as communication scheduling, message-grain packing, and overlapping of communication activities should be considered for developing fast data transportation algorithms.
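The linear permutation schedule [8] referenced above is easy to state precisely. A minimal sketch in C, assuming p modules ranked 0..p-1; send_msg and recv_msg are hypothetical placeholders for the machine's own point-to-point calls (e.g., MPL on SP-2), and a real implementation would use a nonblocking or combined send-receive so the paired calls cannot deadlock:

```c
#include <stddef.h>

/* Hypothetical point-to-point primitives (placeholders for the
 * machine's own calls). A real implementation would use a
 * nonblocking or combined send-receive here. */
void send_msg(int target, const void *buf, size_t bytes);
void recv_msg(int source, void *buf, size_t bytes);

/* Linear permutation schedule: in step s (1 <= s < p), module
 * `me` sends to (me + s) mod p and receives from (me - s + p)
 * mod p. Every module is the target of exactly one message per
 * step, so no module is accessed by several senders at once. */
void linear_permutation(int me, int p,
                        char *out[], size_t out_len[],
                        char *in[],  size_t in_len[])
{
    for (int s = 1; s < p; s++) {
        int to   = (me + s) % p;
        int from = (me - s + p) % p;
        send_msg(to, out[to], out_len[to]);
        recv_msg(from, in[from], in_len[from]);
    }
}
```

This is exactly the schedule whose weakness the text identifies: each step lasts as long as its largest message, so widely varying message sizes leave most modules waiting.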
1.3 A Summary of Results

The contributions of this dissertation are two-fold: we propose a computational model for bridging parallel software and distributed memory architectures, and we investigate the impact of our data-mapping techniques based on the model.

In Chapter 2, a realistic structure-independent model is proposed to capture the common architectural features of state-of-the-art distributed memory machines. To illustrate the uniqueness of our model, the model is compared with several recently proposed models (the Postal model [5], the LogP model [18], and the BSP model [58]). The proposed functional elements of our model represent an algorithm designer's perspective. To model the remote data-access in distributed memory machines, we abstract the communication facilities into point-to-point duplex channels and duplex message handlers ("duplex" refers to a communication facility which can perform data delivery and data reception at the same time). A channel is dedicated to a pair of modules for transporting data, and a message handler is attached to each module for delivering (and receiving) data to (from) the channel. If multiple messages arrive at a target module in the same time interval, the message handler at the target module chooses a message to receive according to its reception policy. A message handler can be a hardware device (such as the support circuit in T3D), a combination of software and hardware (such as the communication processor in SP-2), or a software routine executed by a processor (such as in CM-5). Intuitively, the role of the message handlers in our model is to capture the communication protocol overheads incurred in remote data-access. A pair of message handlers cooperate to transport data from consecutive memory locations in the source module to consecutive memory locations in the target module. In general, the steps for communicating a message between a pair of modules include constructing the message (in which machine users copy data from local memory locations into consecutive local memory locations), starting up the network, delivering data to the network, and receiving data from the network. The major differences between our model and others are that we capture the message construction activity and the message reception policy at the target module.

In Chapter 3, the suitability of applying various initial data-mapping strategies on distributed memory machines to achieve scalability is evaluated. The evaluation considers the situation in which the number of processor/memory modules is increased to suit the computational requirement of increasing problem size. Furthermore, an approach for reducing communication costs by totally or partially eliminating the message construction activity is proposed.

• Current scalable parallel machines can be scaled up by adding more processor/memory modules. Thus, the available computing power and available physical memory scale up in proportion to the number of modules. It is well known that a parallel system of increasing size may keep its efficiency constant by increasing the amount of input data. (A parallel system refers to a machine-algorithm pair.) Based on this, the isoefficiency function has been defined to measure the scalability of a parallel system in [33]. In general, as the problem size increases, the memory requirement for an initial data-mapping technique may increase at different rates.
B ased on this, isoefficiency function has been defined to measure th e scalability of a parallel system in [33]. In general, as th e problem size increases, the m em ory requirem ent for an initial d ata- m apping technique may increase in different rates. However, isoefficiency 13 analysis ignores the available m em ory in the parallel m achines. In this chap ter, the scalability of th e parallel system s w ith lim ited available m em ory is discussed. Assum e th a t th e to tal m em ory of a d istributed m em ory m achine is linearly proportional to th e num ber of m odules and th e efficiency of the parallel system needs to be m aintained at a desired level. Then, our result shows th a t although the isoefficiency function exists for a parallel system , th e num ber of m odules of th e distributed m em ory m achines m ay be bounded. We call the u pper bound of the num ber as ‘expansion range’ of th e parallel m achine. This implies th a t the available physical m em ory of a d istrib u ted m em ory m achine also affects the scalability of th e parallel system . O ur re sult also show th a t a parallel system with a b e tte r isoefficiency m easure m ay possibly have a sm aller expansion range th an th a t w ith poor isoefficiency m easure. T hus, the expansion range of a parallel m achines should also be taken into consideration for m easuring the scalability of a parallel system . • fn general, d a ta item s need to be placed in consecutive m em ory locations in a source m odule to be delivered to consecutive m em ory locations in a target m odule. To place the d a ta in consecutive m em ory locations, m em ory to m em ory copy m ay be necessary. T he tim e for copying d a ta from a m em ory location to another m em ory location is not negligible. For exam ple, the m em ory to m em ory copy tim e is 0.01 /zsec/byte in SP-2. ft is m ore th an one th ird of th e u n it d ata transm ission tim e (0.028 ^isec/byte) in th e m achine, fn this chapter, we propose an approach for partially or to tally elim inating m em ory to m em ory copy activities. This is achieved by eith er (1) delivering d a ta of the sam e destination using m ultiple messages of sm aller size or (2) placing d ata of the sam e destination as close as possible in th e local memory. T he im portance of reducing m em ory to m em ory copy activities is increasing as th e technology of new com m unication m edia (such as optical fiber) is advancing which leads to even faster d ata transm ission rate. In this case, th e m em ory to m em ory copy tim e will dom inate the d a ta tran sp o rtatio n tim e. In C hapter 4, techniques are proposed to reduce th e com m unication latency. Users of parallel m achines can w rite th eir codes using different program m ing 14 paradigm s. For exam ple, program m ers can write th eir codes such th at com pu tation phases of som e m odules can overlap the com m unication phases of the o ther modules; or all th e m odules synchronously alternate between com putation phases and com m unication phases. T he advantages of the la tte r is th a t th e codes can be easily w ritten and debugged. To achieve large speed-ups for solving an application problem on a d istributed m em ory m achine, the tim e spent in a com m unication phase should be m inim ized. In this chapter, techniques are proposed to reduce the tim e required for perform ing basic com m unications am ong the m odules. 
• A straightforw ard approach for com m unicating a message takes the follow ing steps: constructing th e message, starting up th e com m unication, and d a ta transportation. In general, a processor needs to cooperate w ith th e com m unication facility to start up a com m unication. However, there is an o p p o rtu n ity for overlapping the activity of m essage construction w ith th a t of d a ta transportation. We show th a t by choosing app ro p riate message sizes and overlapping message construction w ith d a ta tran sp o rtatio n , the com m u nication latency can be significantly reduced. T h e size of th e message should be chosen according to th e architectural features of the parallel m achines and the am ount of d ata to be transported. For exam ple, th e tim e for com m unicating m bytes am ong a pair of modules can be up to t0 + m{td + 2tc), w here t0 is th e startu p tim e, t is th e d ata tran sp o rtatio n ra te , and tc is the m em ory to m em ory copy rate. However, using a sequence of messages of size eclual to M es, where a = m ax ^f^,tc}, and overlapping th e d a ta tran sp o rtatio n with message packing-and-unpacking (construction), one-to- one com m unication can be efficiently perform ed in — a.){2tc + td)—a, for m > / g, T7 ~ bytes. — 2 tc+tj j • D a ta redistribution am ong the m odules can be achieved by perform ing all- to-all com m unication. To avoid congestion due to m any m odules com m uni cating with a m odule, th e linear perm utation algorithm [8] can be applied. However, as th e am ount of d ata delivered to or received from o th er m odules are very different, the delay effect which causes th e com m unication activities of som e m odules to be suspended m ay happen due to (1) th e lim ited band w id th betw een any pair of com m unicating m odules and (2) finite capacity 15 of the network. If this delay effect happens a t each of th e p-1 perm u tatio n steps, then th e tim e for perform ing data-redistribution can be up to p ta-\-E Y d, w here L is th e m axim um of th e am ount of data to be delivered or received in a module. O ur technique is proposed to elim inate this adverse effect. O ur d a ta redistribution technique consists of tw o phases. D uring the first phase, o u r technique attem p ts to balance the am ount of d a ta am ong th e m odules which is to be delivered to o ther modules. During th e second phase, th e d ata is sent to their final destination. Using our technique, th e tim e for perform ing d a ta redistribution can be reduced to 2pt0 + 2L(td + 2C ) + tbs, where ff,s is the tim e for perform ing a barrier synchronization. To get th e benefit from the two-phase technique, the gain from em ploying the balancing phase should be larger than th e overhead of employing th e balancing phase. A m ethod for estim ating th e gain and overhead in using th e technique is also given. Based on this, the necessity of perform ing d a ta redistribution using the tw o-phase technique for a com m unication p attern can be judged. In C hapter 5, approaches for data-rem apping are investigated. In general, d ata rem apping can be used to balance th e load am ong the m odules as well as restore the locality of data-access during com putation. For th e applications in which the com m unication p attern s and load distribution can be known beforehand, the approach used to rem ap d ata to th e modules can be pre-determ ined. For th e other cases, dynam ic-data rem apping techniques are required. 
In Chapter 5, approaches for data-remapping are investigated. In general, data-remapping can be used to balance the load among the modules as well as to restore the locality of data-access during computation. For applications in which the communication patterns and load distribution are known beforehand, the approach used to remap data to the modules can be pre-determined. For the other cases, dynamic data-remapping techniques are required. Two important problems in image processing (FFT and contour tracing), which have different communication requirements, have been chosen as examples for demonstrating the suitability of applying data-remapping techniques.

• The fast Fourier Transform (FFT) computes the Discrete Fourier Transform (DFT) of an n-dimensional complex vector $(x_0, x_1, \ldots, x_{n-1})$. The butterfly algorithm [12] can be used to compute the FFT. The initial data-mapping, which maps the butterfly algorithm [12] to the processor/memory modules of a distributed memory machine, can be a cyclic layout or a blocked layout. A hybrid layout is a sequence of various data-mappings, which are applied alternately while computing the FFT. Since the FFT has fixed communication patterns, fixed data-remapping can be applied to remap a cyclic (blocked) layout to a blocked (cyclic) layout. Computing the FFT using a hybrid layout has been considered to have smaller communication time compared with a blocked (or cyclic) layout [14]. However, our results show that a blocked (or cyclic) layout can lead to faster communication in some cases. This is due to the large startup time dominating the data transportation time in the data-remapping process. (A sketch of the two layouts appears at the end of this section.)

• In this dissertation, a technique for dynamic data-remapping is proposed to eliminate inter-module data dependencies as well as balance the load among the modules. Our data-remapping technique consists of two major steps: (1) decentralized cluster-labeling and (2) data movement. The cluster-labeling step assigns a label to each of the data clusters based on information about the global load and the data dependency among the modules. This approach can be applied to many intermediate-level image understanding (IU) tasks for load redistribution. These include, for example, finding convex hulls, approximating contours using linear segments, perceptual grouping, etc. To illustrate our ideas, we use the contour tracing problem as an example. A contour is a cluster of marked pixels which denotes the boundary of an object in an image. Based on the contours, many operations can be performed to extract the features of the objects in the image. To show the usefulness of our methodology, experiments were conducted on an IBM SP-2 with a dedicated pool of 64 modules. Our experimental results indicate that inter-module data dependency is the major factor that limits the speed-ups that can be achieved in tracing contours. In addition, unbalanced work load on the modules can further reduce the speed-ups.
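For reference, the cyclic and blocked layouts named in the FFT discussion above differ only in which module owns element i of the n-point vector. A minimal sketch (our illustration, assuming n and p are powers of two with p dividing n, not code from the dissertation):

```c
/* Owner of element i of an n-point vector on p modules.
 * Blocked: module k holds elements k*(n/p) .. (k+1)*(n/p)-1, so
 *   butterfly stages pairing elements that differ only in the
 *   low-order log2(n/p) index bits need no communication.
 * Cyclic: element i lives on module i mod p, so stages pairing
 *   elements that differ only in bits at or above log2(p) are
 *   local. A hybrid layout remaps between the two part-way
 *   through so that every butterfly stage can be done locally. */
int owner_blocked(long i, long n, int p) { return (int)(i / (n / p)); }
int owner_cyclic (long i, long n, int p) { (void)n; return (int)(i % p); }
```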
Chapter 2

A Computational Model

The development of a parallel model that bridges parallel software and parallel machines is fundamental to the success of massively parallel computation. In Section 2.1, a classification of computer models is given and the current trend in developing models for parallel computation is discussed. The nature of our model and the parameters used in it are given in Section 2.2. In Section 2.3, to illustrate the uniqueness and realism of our model, it is compared with several well-known models.

2.1 Background

According to the purposes of the models proposed for parallel computation, the models can be classified into four basic types: the programming model, the machine model, the architectural model, and the computational model [25]. The programming model is used to verify a program in terms of the semantics of a particular implementation of a programming language. The other three are more related to the performance evaluation of a parallel system (a software-machine pair). The machine model consists of the detailed description of the hardware and operating system. The architectural model is at the next higher level of abstraction over the machine model. The architectural model describes the topology of the interconnection network of a parallel machine and the functions of the fundamental resources, but not their implementation details. The computational model is at the next higher level of abstraction. Ideally, a model of computation provides an abstract view of a class of computers. The basic requirements of the computational model of a parallel machine are described as follows.

• The model must be capable of providing enough information about the relative cost of parallel computations and communications.

• The nature of the model must provide the basis for easily developing an efficient algorithm as well as analyzing the algorithm.

• For machine users to focus on the most important factors which affect algorithm design, a computational model should not use too many parameters for describing the performance of a parallel system. In our opinion, the number of parameters should be no more than five.

A number of models [5, 14, 28, 58] have recently been proposed to satisfy these requirements. The common feature of these models is that they do not describe the topology of the interconnection network for the class of parallel machines which the models attempt to capture. Based on current technology, the startup time for delivering a message is large compared with the network latency. The small network latency is due to fast network switches. For example, the network switch latency of IBM SP-2 is 125 nsec (the startup time is 40 µsec), and the network latency of TMC CM-5 is 200 nsec (the startup time is 86 µsec). The network topologies of IBM SP-2 and TMC CM-5 are a multistage network and a fat-tree, respectively. These types of interconnection networks have very low diameter. For machines of reasonable sizes (fewer than 16K processors), the communication time is insensitive to the distance between any pair of communicating processors, because the effect of the topology of an interconnection network is masked by the large startup time. Currently, strategies for fault tolerant routing receive much attention in the design of interconnection networks. An example is the 3D torus network structure used in the Cray T3D, in which data will be routed along alternative paths if a broken link or a faulty switch has been detected on the regular path. The data-routing strategy guarantees that the parallel machine can work in the presence of broken links or faulty switches.
Since (1) th e network latency is sm all, (2) th e size of of current M IM D parallel m a chines is reasonable, and (3) m ultiple routes can be used for tran sp o rtin g d ata; a com putational model for current MIMD parallel m achines which do not im ply any stru ctu ral inform ation about th e topology of their network is realistic. However, the com putational m odel should capture th e relative costs of parallel com puta tions and com m unications. We refer this ty p e of m odel as structure-independent model. A m ajor advantage comes w ith a structure-independent m odel is th a t the algorithm s designed on the m odel have larger portab ility and can operate in the presence of broken links and broken netw ork switches. Based on the above reasons, our model uses a structure-independent approach. C urrently, m any parallel m achines provide different m echanism s for various re quirem ents in perform ing com m unications. Exam ples are Cray T3D, TM C CM-5, and Stanford FLASH. In these m achines, m echanism s are provided for delivering d ata using long messages of various sizes or using short messages of fixed size. For a m odule which delivers d a ta using long messages of various sizes, m em ory to mem ory copy operations may be required for the m odule to construct a message. The m essage construction is operated to m ove the d a ta which will be delivered to consecutive m em ory locations. T he param eters of our m odel should be able to dis tinguish th e costs for tran sp o rtin g d ata using various com m unication m echanism s. In addition, our m odel can distinguish th e costs in perform ing com m unications using long messages for transporting d a ta item s stored at consecutive m em ory locations from th a t for transporting d ata stored at scattered m em ory locations. C urrently, th e com m unication protocol in a parallel m achine plays an im p o rtan t role in effectively accessing d ata from rem ote memory. T h e protocol is a p art of the parallel m achine. T hus, we ab stract the com m unication protocol into a functional elem ent in our model. T he functional elem ent schedules th e d a ta receptions of the incom ing messages. 20 2.2 A R ealistic C om putational M odel O ur developm ent of a structure-independent m odel for distributed m em ory m a chines is m otivated by the technological trend in constructing parallel m achines. T he n atu re and param eters of our m odel absorb th e factors which affect th e algorithm designs and the application developm ents. C urrent scalable parallel m achines are form ed by a collection of essentially com plete com puters (proces sor/m em ory m odules) com bined by an interconnection network. In these m achines, the m em ory is physically distributed across the parallel machines. We refer this class of parallel m achines as distributed m em ory m achines. In these m achines, accessing d ata from rem ote m em ory is m uch slower th an accessing d a ta stored a t local memory. For developing effective parallel algorithm s on current d istrib u ted m em ory m achines, several factors related to rem ote data-access need to be care fully taken into consideration. These factors include: (1) large s ta rtu p tim e for delivering a message, (2) lim ited com m unication bandw idth available to a m od ule, (3) th e requirem ent of m em ory to m em ory copy operations for constructing a message, (4) finite netw ork capacity, and (5) the policies for scheduling message receptions a t the targ et modules. 
To m odel the rem ote data-access from algorithm designers’ perspective, we abstract the communication facilities (including hard ware and software supports) of a distributed m em ory m achine into point-to-point channels and message handlers. Each of th e channels and message handlers can perform d ata delivery and d ata reception at the sam e tim e. T he overview of our m odel is illustrated in Figure 2.1. A message delivery is initiated by th e processor in a source m odule which starts up the message handler to push a sequence of d a ta item s into the channel. W ithout additional intervention by the processors, th e message handlers at a source m odule and at a targ et m odule cooperate to tran s port d a ta from consecutive m em ory locations in th e source m odule to consecutive m em ory locations in the target module. At th e sam e tim e, the processors in both of th e m odules can either be idle or perform other com putations. T he detail descriptions of th e channels and th e message handlers are given as follows: 21 ■ — z - data in consecutive memory locations I control line channel or data path 1 : data processor] .processor) memory memory data message handler message handler data Figure 2.1: An overview of our com putational m odel. Point-to-point channel : channels are used to model a d istrib u ted m em ory m achine w ith p modules. Any pair of m odules is connected using a chan nel. D a ta exchange among a p air of com m unicating m odules is through the channel which connects the com m unicating m odules. D ata are delivered and received from the ends of th e channels. T h e function of th e point-to-point channels is to cap tu re the featu re of current distributed m em ory m achines in which a pair of m odules com m unicate w ithout disturbing th e com putation or com m unication activities in o th er modules. M essage handlers : p message handlers are used to model a d istrib u ted m em ory m achines with p modules. E ach of the m essage handlers is attached a t a 22 m odule for delivering (receiving) d a ta from consecutive m em ory locations in the m odule to (from) another m odule. T he message handler of a m odule switches over th e channels connected to the m odule for message deliveries and message receptions. Messages can be of various sizes. If th e message handler of a m odule is busy in delivering a pending message, the message handler will not respond to any request of message delivery from the processor in th e same m odule. In this case, the processor in the m odule waits idly until th e message handler can respond to its request. For receiving a message from a channel, th e message handler cannot receive any new message before th e pending message has been com pletely received. If there are m ultiple messages queued a t the channels connected to a m odule, a decision of choosing a m essage from th e queued messages will be m ade. T he decision is m ade according to the d a ta reception policy of the message handlers. In a d istrib u ted m em ory m achine, a message handler can be a hardw are device (such as th e support circuit in T 3D ), it can be a com bination of softw are and hardw are (such as th e com m unication processor in SP-2), or it can be a software executed by a processor (such as the case in CM-5). The function of the message handlers is to capture (1) th e lim ited com m unication bandw idth available to a m odule, and (2) the com m unication protocol for effectively accessing rem ote d ata. 
Basic assumptions about our model are made as follows:

• Each of the channels has a capacity of at most one byte. The activity of a message handler in pushing data into a channel is suspended if the channel is saturated. A suspended message handler temporarily ignores any new request for message delivery. If the processor of the same module makes a new request for message delivery, the processor stays idle until the pending message has been completely delivered. Thus, our model encourages machine users to develop a good scheduling policy for delivering messages in their programs. Note that the scheduling policy for receiving messages is captured by the message handlers.

• The message handler at a target module switches among the p − 1 channels which connect to the module for receiving messages. This is illustrated in Figure 2.2. The message reception policy of the message handler at a target module can affect the total communication time. The analyses made in the following chapters assume that the message handlers receive messages using a first-come-first-served message reception policy.

Figure 2.2: An illustration of the data receptions at a message handler. (The figure shows a source-side message handler pushing data from memory into a channel and a target-side message handler draining the channel into memory.)

In our model, we use three parameters to capture the communication cost.

Td (µsec/byte): the rate at which the message handler transports data from consecutive memory locations in its source module to consecutive memory locations in its target module. For the SP-2, Td was measured to be 0.028 µsec/byte. This was measured by computing the ratio of the elapsed time to the number of transported bytes for large messages on the SP-2 using MPL.

To (µsec/message): the startup time for delivering a message. It is the interval from the time the message handler responds to a request for message delivery to the time the first data byte has been pushed into the channel. For the SP-2, To was observed to be 40 µsec/message. This is the minimum time for delivering a small message on the SP-2 using MPL. Since a startup process is needed for delivering a message, the costs of transporting data among modules using one long message or a sequence of short messages can be distinguished. Note that the startup time for long messages and that for short messages may not be the same.

Tc (µsec/byte): the time for a user's program to copy data from one memory location to another memory location. For the SP-2, Tc was measured to be 0.01 µsec/byte. This was measured by copying a large amount of data between two 1D arrays; the copy rate is the ratio of the total copy time to the total number of copied bytes. Since data items need to be moved to consecutive memory locations for delivery using a long message, the cost of transporting data stored at consecutive memory locations can be distinguished from that of transporting data stored at non-consecutive memory locations.

In [40, 43], these parameters have been shown to have significant impact on the strategies used to reduce the remote data-access time, to map initial data to the modules, and to remap the intermediate output to the modules. Our parameters do not try to capture (1) the possible data contention in an interconnection network and (2) the network latency.
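As a back-of-the-envelope illustration of how the three parameters interact, the sketch below plugs in the SP-2 figures quoted above and treats each delivery as one startup plus a per-byte transport term (the cost expression made precise in the next paragraph). The function names and the choice of k = 64 messages are ours.

# SP-2 figures quoted above, in microseconds.
T_o, T_d, T_c = 40.0, 0.028, 0.01

def long_message(m_bytes):
    # Scattered data: copy into consecutive memory locations (message
    # construction), then one startup and one transport.
    return m_bytes * T_c + T_o + m_bytes * T_d

def short_messages(m_bytes, k):
    # k deliveries of m/k bytes each, with no memory-to-memory copies
    # but one startup per message.
    return k * (T_o + (m_bytes / k) * T_d)

for m in (1_000, 100_000, 10_000_000):
    print(m, round(long_message(m), 1), round(short_messages(m, 64), 1))

For small transfers the k startups dominate and the long message wins; for very large transfers the copy term m·Tc eventually exceeds the extra startups. This is precisely the trade-off the parameters are meant to expose.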
The rationales for not capturing these two factors are based on the following observations.

• In the construction of the current generation of distributed memory machines, much attention has been paid to avoiding data contention at the network switches. The interconnection network is designed so that the total network bandwidth scales up as the machine size increases. This reduces the possible data contention in the network.

• In addition, due to current network switch technology, the network latency is a very small fraction of the communication time if there is no data contention at the network switches.

Assume To = t_o, Td = t_d, and Tc = t_c. In our model, the time for a pair of modules to communicate a message consisting of m bytes (stored in consecutive memory locations) is t_o + m·t_d. Note that copying the data into consecutive memory locations takes m·t_c time; copying the data is performed by the user program. In this dissertation, computation time denotes the time taken by the processors to perform useful computations, and communication time denotes the sum of the message construction time, the startup time, and the time for transporting messages among modules.

Our model does not attempt to capture the classes of parallel machines in which the network latency is large or a prefetch mechanism is provided. The rationales are given as follows. A parallel machine with large network latency may be due to (1) insufficient network bandwidth or (2) slow network switches. Current scalable parallel machines are designed with no single bottleneck component; thus, the problems of insufficient network bandwidth and slow network switches are usually solved before the machines are constructed. However, without capturing the network latency in our model, techniques for hiding the network latency cannot be investigated. In parallel machines which support prefetch mechanisms, a message delivery from a module does not delay any computation activity to be performed in the module. In our model, a module cannot deliver a new message until the previous message has been completely pushed into the network; thus, a module which tries to deliver a new message before the old message has been completely pushed into the network stays idle. This implies that our model encourages algorithm designers to schedule the communication activities in their algorithms.

2.3 Comparison with Other Models

Many models of parallel machines have been proposed. In this section, several models of wide acceptance or of current interest are discussed. The PRAM model and the network models have been widely used in analyzing algorithms; those models are compared with our model first. Due to technological trends, several models for distributed memory machines have recently been proposed, including the LogP model, the Postal model, and the BSP model. These models do not consider the topology of the interconnection networks, and they encourage algorithm designers to reduce the number of remote data-accesses. Comparisons between our model and those structure-independent models are also made in this section.

2.3.1 The PRAM Model

The PRAM model [19] is the most popular model for representing and analyzing the complexity of parallel algorithms.
The key assumption regarding algorithmic performance in the PRAM model is that the running time can be measured as the number of parallel memory accesses [13]. It can serve as a good model for representing the logical structure of a parallel algorithm. However, when the PRAM model is used to capture the communication activity in a distributed memory machine, the cost of performing remote data-access cannot be distinguished from that of performing local data-access. That is, the model ignores the extra cost of accessing data from remote memory. Thus, this model does not encourage algorithm designers to reduce the amount of remote data-access when using distributed memory parallel machines. In our model, remote data-access and local data-access are distinguished by the functional elements (the point-to-point channels and the message handlers) included in our model. An extra cost is incurred for accessing data from remote memory compared with that from local memory. The parameters To, Td, and Tc are used to represent the time complexity of remote data-access.

The p processors in the PRAM model can access p memory locations concurrently without causing any access contention. However, remote data-access in current distributed memory machines may incur data-access contention in accessing data items from p different memory locations. This can happen if some of the data items to be accessed are stored at the same module. Thus, the limited communication bandwidth available to a module in a distributed memory machine cannot be captured if the PRAM model is used. In our model, we use message handlers to capture the limited communication bandwidth available to a module: a message handler can only deliver data to and receive data from the channel at a fixed rate.

2.3.2 The Network Models

In network models, the key assumption is that communication is only allowed between pairs of processor/memory modules directly connected by an interconnection network. Many algorithms [2, 41, 42, 49] have been developed for parallel machines with specific networks. The communication structures of these algorithms exploit the topology of the network to achieve high-performance computing. One of the common features of current distributed memory machines is that they allow data to be pipelined into high-bandwidth networks, and data can be transported between a pair of communicating modules without disturbing the computation or communication activities in other modules. Unlike transporting data using a store-and-forward routing scheme [57], the communication cost is not sensitive to the number of hops between a pair of communicating modules. This suggests that the topology of the interconnection network in current distributed memory machines is not a key issue in designing effective algorithms. In our model, instead of capturing the network by its topology, we use point-to-point channels and message handlers to capture the features of an interconnection network. The syntax of the MPLs provided by current distributed memory machines supports point-to-point communications to increase portability.
With a little modification, programs written using MPL can be transformed into programs using MPI [44]. However, for algorithms developed on a network model for a specific architecture, significant effort is required to modify the algorithms to run on different architectures.

2.3.3 The LogP Model

The LogP model [14] is based on four parameters that abstractly specify the network latency, the efficiency of coupling computation to communication, the communication bandwidth, and the computing bandwidth. In the LogP model, the size of a message is fixed. The motivation for proposing the LogP model is based on the technological trends in constructing distributed memory machines. The parameters of the LogP model are described as follows.

L: an upper bound on the latency, or delay, incurred in communicating a message containing a word (or small number of words) from its source module to its target module.

o: the overhead, defined as the length of time that a processor is engaged in the delivery or reception of each message; during this time interval, the processor cannot perform other operations.

g: the gap, defined as the minimum time interval between consecutive message transmissions or consecutive message receptions at a processor. The reciprocal of g corresponds to the available per-processor communication bandwidth.

P: the number of processor/memory modules.

Further, it is assumed that the interconnection network has a finite capacity, such that at most ⌈L/g⌉ messages can be in transit from any processor or to any processor at any time. The model is asynchronous; that is, processors work asynchronously, and the hardware latency experienced by any message is unpredictable but is bounded above by L in the absence of stalls. Our model is different from the LogP model in three aspects: the associated cost of delivering a message, the associated cost of preparing a message, and the message reception policy.

• In the LogP model, L seems to capture the switch latency of an interconnection network. The time for delivering a message includes L, o, and the time for data transportation. Due to advances in network technology, the switch latency in the network is just a small fraction of the startup time for a parallel machine of reasonable size. In our model, we do not attempt to capture the network switch latency. The To in our model captures the software overhead of delivering a message; software overhead refers to the period of time during which a processor cannot perform any useful computation.

• In the LogP model, the message to be delivered is of fixed size; thus, the overhead o grows in linear proportion to the amount of data to be transported. The syntax of the message-passing library (MPL) provided in current distributed memory machines supports data delivery using messages of various sizes, and the data in each message are delivered following a startup process. Although the data to be delivered can be packed into packets of fixed size, algorithm designers are usually not allowed to control this process. In our model, a startup cost (which is a software overhead) is associated with each message delivery, regardless of the size of the message. In general, a message consists of the data stored at consecutive memory locations in a source module and is delivered to consecutive memory locations at a target module.
To construct a message, memory-to-memory copy operations are required. In our model, the operations for constructing a message are captured by the parameter Tc. Since the data in a message cannot be used until the message has been completely received, users of parallel machines are encouraged to choose an appropriate size of smaller messages so that data are delivered as soon as possible; thus, part of the data can be used by the processor in the target module for performing useful computations. However, delivering multiple messages increases the total software overhead. Thus, the suitability of delivering the data of an application problem using a sequence of short messages or one long message should be judged.

• The communication protocol can affect the total communication time. Different machines may provide different protocols. Furthermore, a machine can also provide various communication protocols to suit the requirements of performing efficient communications for various communication patterns. The LogP model leaves algorithm designers to decide the message reception policy. However, the programming languages and most MPLs do not provide any primitive for users to tailor their message reception policies. In our model, the message reception policy is captured by the message handlers, which are a part of the parallel machine.
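The contrast with LogP can be stated as a small calculation. In the sketch below, the LogP figures are illustrative assumptions (not measurements), the LogP expression follows the commonly cited pipelining bound for a sequence of fixed-size messages, and the second function restates the cost of one variable-size message in our model.

# Illustrative LogP values (microseconds) and the SP-2 figures from
# Section 2.2; none of the LogP numbers are measured.
L, o, g = 5.0, 2.0, 1.0
T_o, T_d, T_c = 40.0, 0.028, 0.01

def logp_time(m_bytes, w=8):
    # m/w fixed-size messages of w bytes: send overhead, then one message
    # per gap (or per overhead, whichever is larger), then latency and
    # the receive overhead of the last message.
    n = m_bytes // w
    return o + (n - 1) * max(g, o) + L + o

def our_model_time(m_bytes, scattered=True):
    # One variable-size message; scattered data pays the construction copy.
    copy = m_bytes * T_c if scattered else 0.0
    return copy + T_o + m_bytes * T_d

print(logp_time(4096), our_model_time(4096))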
2.3.4 The Postal Model

The Postal model [5] is designed to model message-passing parallel machines. The model focuses on three aspects of the communication features of these machines: full connectivity, simultaneous I/O, and network latency. In the Postal model, the size of messages is fixed. The nature of the Postal model is described as follows.

Full Connectivity: each processor Pi in the machine can send a point-to-point message to any other processor Pj in the machine.

Simultaneous I/O: each processor Pi can simultaneously send one atomic piece of data to processor Pj and receive another atomic piece of data from processor Pk.

Communication Latency: if at time t processor Pi sends an atomic piece of data to a processor Pj, then processor Pi is busy sending the data during the time interval [t, t + 1], and processor Pj is busy receiving the data during the time interval [t + λ − 1, t + λ].

The Postal model operates in asynchronous mode. After a module has delivered a message, it is free to perform other operations, including other message deliveries. Thus, this model captures the send-and-forget nature of communicating messages in a message-passing parallel machine. Our model differs from the Postal model in five aspects: the associated cost of delivering a message, the associated cost of preparing a message, the message reception policy, the capacity of the network, and the limited communication bandwidth available to a module. The reasons for the first three are similar to those given for the LogP model; only the reasons for the last two are explained here.

• In the Postal model, the limited capacity of an interconnection network is not captured. Connected to an interconnection network with infinite capacity, a source module can push any amount of data into the network without considering the data reception rate at its target module. However, the network capacity in current distributed memory machines is limited; in this case, the communication activity of the source modules can be blocked by a saturated network. Since the Postal model does not consider the limited capacity of the network, this blocking effect can be overlooked. In our model, the network capacity is captured by specifying the maximum allowable amount of data in the point-to-point channels.

• The Postal model seems to use the software overhead paid for each message to capture the limited communication bandwidth available to a module. For a machine which can communicate messages of various sizes over a finite-capacity network, the distinction between software overhead and limited communication bandwidth is very important. For example, in a machine in which each module has infinite communication bandwidth, messages delivered at the same time arrive at their target module at the same time; the message reception policy is less important for such a machine. However, for a machine with finite communication bandwidth, the message reception policy affects the communication and computation activities in the source modules.

2.3.5 The BSP Model

The performance of the BSP model [58] is described in terms of three types of functional elements:

Processor/memory modules: each module performs arithmetic and/or memory functions.

Facility for synchronization: it synchronizes all or a subset of the modules at regular intervals of T time units. The execution of a program consists of a sequence of super-steps. In each super-step, each module is allocated a task consisting of some combination of local computations, message deliveries to other modules, and message receptions from other modules. After each period of T time units, a global check is made to determine whether the super-step has been completed by all the components. If it has, the machine proceeds to the next super-step.

A router: it delivers point-to-point atomic information between pairs of processor/memory modules. The basic task of the router is to realize arbitrary h-relations, or, in other words, super-steps in which each module sends and receives at most h pieces of atomic information.

The BSP model is designed from the viewpoint of parallel machine users. The major differences between our model and the BSP model can be found in two aspects: the associated cost of constructing a message and the functional elements for describing an interconnection network. The reason for the first is similar to that given for the LogP model; the latter is explained as follows. During a super-step, a processor may send at most h pieces of atomic information and receive at most h pieces of atomic information. Such a communication pattern is called an h-relation. In the BSP model, either the network should be powerful enough to realize any h-relation in constant time, or the super-step should be adjustable to accommodate any h-relation. The capacity of the network may need to be infinite for realizing any h-relation through a network in constant time; this is not realistic under current technology. To adjust a super-step to accommodate any h-relation on a network with finite capacity, the network should have an intelligent data transportation strategy which can change dynamically to suit the requirements of performing a given communication pattern.
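For reference, the cost of a BSP super-step is usually summarized as w + g·h + l, where w is the maximum local work, h the largest number of atomic messages sent or received by any module, g the router's throughput parameter, and l the synchronization cost. This standard accounting is our paraphrase rather than a quotation from [58]; a minimal sketch:

def bsp_superstep_time(work, sent, received, g, l):
    # Standard BSP accounting: the slowest module's local work, plus the
    # h-relation realized by the router, plus barrier synchronization.
    w = max(work)
    h = max(max(sent), max(received))
    return w + g * h + l

# Four modules with unequal work and message counts (illustrative values).
print(bsp_superstep_time(work=[10.0, 12.0, 9.0, 11.0],
                         sent=[3, 1, 4, 2],
                         received=[2, 4, 1, 3],
                         g=0.5, l=20.0))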
In our model, we use message handlers and point-to-point channels to capture the features of the interconnection network in current distributed memory machines: finite network capacity, limited communication bandwidth available to a module, and the message reception policy. Based on these functional elements, users of these parallel machines can tailor their algorithms or application programs to achieve high-performance computing on the parallel machines.

2.4 Summary

Current scalable distributed memory machines are formed by a collection of processor/memory modules combined by an interconnection network. In this chapter, a realistic model is proposed to bridge algorithm designers and the parallel machines. The model captures the common features of distributed memory machines to which an algorithm designer should attend in order to achieve large speed-ups on these machines. The model does not try to capture all the details of a specific parallel machine. The attributes of our model describe the integral behavior of the system software and machine hardware; thus, by setting the attributes of our model according to a certain class of parallel machines, the model can capture the common behavior of that class. The communication facilities in current distributed parallel machines incur substantial overhead in initializing message deliveries and scheduling message receptions. The integral behavior of performing communication is captured by using point-to-point channels and message handlers, and the time complexity of performing communication is described by using three parameters (To, Td, Tc). The attributes of our model encourage machine users (1) to localize most of the data-accesses, (2) to schedule the data transportation based on the message reception policy of the message handlers, and (3) to partially overlap the communication activities.

Chapter 3

Initial Data-mapping

In this chapter, data-mapping strategies for reducing communication time are investigated in two respects: (1) a methodology is proposed for evaluating techniques for mapping input data items to the modules (module-mapping) of a distributed memory machine with bounded memory sizes, and (2) approaches are proposed for mapping input data items to memory locations (memory-mapping). Data dependencies during computations force the modules to access data from other modules. The goal of applying a module-mapping strategy is to localize most of the remote data-accesses in solving an application problem; thus, the total communication time in solving the application problem can be reduced. In general, the data items in a message are stored at consecutive memory locations in its source module and are delivered to consecutive memory locations in its target module. Thus, memory-to-memory copy operations are needed for the source or target modules to copy the data items to appropriate memory locations. We call this process message construction. The goal of applying a memory-mapping strategy is to totally or partially eliminate the memory-to-memory copy operations. By eliminating these operations, the time for communicating messages among the modules can be reduced.
In Section 3.1, a meter for measuring the scalability of a parallel system is given, and the syntax of the commands for transporting data among the modules is discussed. The suitability of applying various module-mapping strategies for achieving large speed-ups on a distributed memory machine is investigated in Section 3.2; the suitability of the module-mapping techniques is measured using a scalability meter. In Section 3.3, approaches are proposed to reduce the time for accessing data from other modules; this is achieved either by storing the intermediate output of each module at appropriate memory locations (so that the time for constructing a message can be reduced) or by choosing an appropriate communication mechanism.

3.1 Background

Current distributed memory machines are formed by a collection of processor/memory modules connected by an interconnection network. Each module accesses data from other modules over the interconnection network. The computing power of a distributed memory machine can be scaled up by increasing the number of processor/memory modules. All the modules of the parallel machine cooperate to execute a task. During the execution of a task, communication and synchronization may be needed to exchange data among the modules. Communication and synchronization are the major overheads in performing a parallel task, and they have an adverse effect on the performance of a parallel system. The term system refers to a machine-algorithm pair. Thus, minimizing these overheads is important for a parallel system to achieve large speed-ups.

Scalability analysis can help machine users evaluate the performance of a parallel system. Efficiency is an important performance meter for a parallel system: it indicates the percentage of the available computing power of a parallel machine that has been dedicated to the useful computations of a task. It is well known that to maintain the efficiency of a parallel machine at a desired level, there exists a relationship between the problem size of an application and the number of processors in the parallel machine on which the application problem is solved. In [34], the authors state that the scalability of a parallel algorithm on a scalable architecture is a measure of its capability to effectively utilize an increasing number of processors. Thus, they developed a scalability meter called isoefficiency [33]. It relates the problem size of an application to the number of processors necessary for an increase in speed-up in proportion to the number of processors used in the application. The relationship between the isoefficiency function of a system and some performance measures has been discussed in [22]. Scalability analyses have been made on several models which address the topologies of the interconnection networks; the analyses consider the suitability of using various module-mapping strategies in solving problems on parallel machines. However, due to advances in network technology, it is reasonable for the users of a distributed memory machine to ignore the topology of the interconnection network. Thus, the scalability of a parallel system needs to be evaluated based on a model which does not use the network topology to describe an interconnection network.
Even when an optimal module-mapping is applied to map input data to the modules, communications may still be needed to exchange intermediate output among the modules. Communication tools, such as a message-passing library (MPL), are provided by the machine vendors for users to access data from other modules. The syntax of many commands of these communication tools supports data delivery using messages of various sizes. Using this type of command, each source module transports data items stored at consecutive memory locations to consecutive memory locations in the target modules. Thus, memory-to-memory copy operations may be necessary for a source module to move data stored at different locations to consecutive memory locations. Similarly, such operations may also be necessary for the target modules to move data items stored at consecutive memory locations to their final memory locations. The time for the IBM SP-2 to copy one byte from a memory location to another memory location is 0.01 µsec; this is more than one third of the data delivery rate (0.028 µsec/byte). For processors with slower local memory access rates than the processor (RS/6000) used in the IBM SP-2, the local memory accesses needed for moving data items lead to a large communication time. Thus, the intermediate output should be stored at appropriate memory locations in each module to reduce the total communication time.

3.2 Mapping Data to Modules with Bounded Memory Sizes

Communication overhead in a distributed memory machine includes remote data-access and synchronization. In distributed memory machines, memory units are physically distributed across the modules. As the number of processor/memory modules of a distributed memory machine increases to suit the timing requirement of a large-scale application, the total memory also increases in proportion to the number of modules. To maintain the efficiency of a parallel system at a desired level, there is an upper bound on the number of modules which can be applied to a given application of a specified problem size. The maximum number of modules depends on the application, the initial module-mapping strategy, and the problem size. For a given application of a specified problem size, different module-mapping strategies may have different memory requirements. Furthermore, as the size of an application problem increases, the memory requirements can increase at different rates for different module-mapping strategies. In this case, the memory requirement of an application of a given problem size may exceed the memory space available in the parallel machine. This implies that the scalability issue should consider the impact of the available memory size.

The isoefficiency function has been used to analyze the scalability of a parallel system. The isoefficiency function of a parallel system depends on the execution time for useful work and the communication overhead (extra work) performed in the parallel system. The isoefficiency analysis does not consider the memory constraint in a distributed memory machine. In this section, the scalability of a parallel system is evaluated by considering the suitability of using various module-mapping strategies on distributed memory machines with bounded memory sizes.
The scalability analysis is conducted on our structure-independent model.

3.2.1 Definitions and Notations

The isoefficiency function [22] relates the problem size to a function f(p) of the number of processors. In general, the isoefficiency can be computed by using the following equation:

E = Te(n) / (p × T(p, n)),    (3.1)

where

• E: the desired efficiency of a parallel system. It is a constant.

• Te(n): problem size, i.e., the amount of computation taken by an optimal sequential algorithm with input size equal to n.

• T(p, n): parallel computation time, where p is the number of processor/memory modules in the parallel system.

In spirit, Equation (3.1) evaluates the capability of a parallel machine to solve an application problem using a processor-time optimality approach. The total work (including useful work and communication overhead) performed by a parallel system is p × T(p, n). For a non-processor-time-optimal parallel algorithm employed on a parallel machine, the order of magnitude of p × T(p, n) can be larger than that of Te(n). In this case, the efficiency approaches zero as the problem size increases, and the isoefficiency function of this parallel system does not exist. This implies that the scalability of some machine-algorithm pairs cannot be evaluated using the current definition of efficiency.

In some applications, an optimal sequential algorithm for an application problem may not be chosen for parallelization because of the difficulty of parallelizing it or its impracticality for actual use. The impracticality of an optimal sequential algorithm may be due to a large constant factor in its time complexity. Another reason for users to choose a non-optimal sequential algorithm as the basis for a parallel version is that current technology supplies high-performance processors while substantial time is lost performing remote data-access. This may force users to consider choosing practical sequential algorithms or applying appropriate module-mapping techniques, such as replicating data items among the modules, for achieving large speed-ups.

Let T(1, n) be the time taken by a single-module system to execute the algorithm. Accessing data from remote memory and synchronization among modules are not required for performing computations in a single-module system; thus, the computations performed in T(1, n) time units exclude this type of operation. In this section, we give an alternate definition of the efficiency of a parallel system for deriving its isoefficiency function. The definition, which uses T(1, n) and T(p, n), is given as follows:

E = T(1, n) / (p × T(p, n)).    (3.2)

The alternate definition of efficiency uses T(1, n) as a base. The isoefficiency function derived by using Equation (3.2) shows how well the useful work of an algorithm can be parallelized. In spirit, Equation (3.2) evaluates the suitability of a parallel algorithm for solving an application problem of increasing size on a scalable parallel machine. Since the scalability is evaluated based on a sequential algorithm and its parallel version, we call the isoefficiency function derived by this definition the absolute isoefficiency function. In this section, we use the absolute isoefficiency function to measure the scalability of a parallel system.
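The two definitions can be compared mechanically. In the sketch below (function names ours), the relative meter of Equation (3.1) uses the optimal sequential work Te(n) as its base, while the absolute meter of Equation (3.2) uses the single-module time T(1, n) of the algorithm actually implemented; the sample numbers are illustrative only.

def relative_efficiency(T_e, T_p, p):
    # Equation (3.1): E = T_e(n) / (p * T(p, n))
    return T_e / (p * T_p)

def absolute_efficiency(T_1, T_p, p):
    # Equation (3.2): E = T(1, n) / (p * T(p, n))
    return T_1 / (p * T_p)

# A non-optimal algorithm whose sequential time is ten times the optimal
# work: the relative meter stays low while the absolute meter reports how
# well the chosen algorithm itself has been parallelized.
p = 16
T_e = 1_000.0            # optimal sequential work
T_1 = 10_000.0           # chosen algorithm on a single module
T_p = (T_1 / p) * 1.25   # parallel time with 25 percent overhead
print(relative_efficiency(T_e, T_p, p), absolute_efficiency(T_1, T_p, p))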
The advantage of taking the absolute isoefficiency function as a scalability measure is as follows. For a parallel algorithm which can solve a problem in a small period of time by using a larger amount of operations (than those performed by an optimal sequential algorithm), the isoefficiency function is still defined. Thus, users of a parallel machine on which a non-optimal algorithm is implemented can evaluate the possibility of increasing the number of processors to achieve linear speed-ups. In contrast to the absolute isoefficiency function, the isoefficiency function derived using the definition in Equation (3.1) is called the relative isoefficiency function. In the following analysis, we use the absolute isoefficiency function instead of the relative isoefficiency function to measure the scalability of a parallel system.

In [34], the authors also define the Memory Overhead Factor (MOF) for a machine-algorithm pair. The memory overhead factor is the ratio of the total memory required by a parallel algorithm to solve an application problem of a given size to the amount of memory required by an optimal sequential algorithm to solve the same problem of the same size.

3.2.2 Scalability Analysis with Bounded Memory Sizes

In the design of a parallel algorithm on a distributed memory machine, effectively mapping data to the processor/memory modules of the parallel machine is a key issue for achieving high-performance computing on the parallel system. Various module-mapping strategies [2, 41, 42, 49, 35] have been developed for network models. The mapping techniques partition data into several groups and map each of the groups to the modules based on the data dependency between computations and the topologies of the interconnection networks. For some applications, the number of remote data-accesses and the number of synchronizations among modules can be reduced by distributing several copies of the same algorithm or data to each of the modules and performing the computations concurrently. The effects of module-mapping strategies in which the input data items are replicated among the modules have been investigated in [35, 55, 56]. By replicating the data, the isoefficiency function of such a system may improve. However, the total memory in the distributed memory machine may be used up quickly because multiple copies of data or programs need to be stored. The concept of virtual memory can be used to compensate for the insufficiency of physical memory; in this case, the large space of I/O devices may be used as temporary storage, and data is transferred between the I/O devices and the memory as needed by the modules. Under current technology, processors are getting much faster, but I/O devices are improving mostly in capacity, not performance. This implies that the gap between fast processors and I/O devices is becoming wide. In this case, the time for transferring data between I/O devices and physical memory dominates the total execution time. The advantage of virtual memory becomes obscure, and virtual memory is not suitable for applications in which fast execution is the first consideration. This implies that we need to consider the constraint imposed by the memory size of a parallel machine in evaluating the scalability of a parallel system.
The basic requirement for solving an application problem of a given problem size is that the amount of memory space needed by the application problem run on a distributed memory machine should be less than the total memory size provided by the distributed memory machine. The following inequality conveys this concept:

Malg(n) < March(p, m),    (3.3)

where

• Malg(n): the memory size needed to solve an application problem with the amount of input data equal to n, when the application problem is solved using Algorithm alg. It is a function of the input size n.

• March(p, m): the total memory of a distributed memory machine arch on which Algorithm alg is run. It is a function of the number of processor/memory modules p in the distributed memory machine and the local memory size m of each module.

Algorithm All-pairs-shortest-path
begin
  for k = 1 to n do
    for i = 1 to n do
      for j = 1 to n do
        d_k[i,j] = min( d_{k-1}[i,j], d_{k-1}[i,k] + d_{k-1}[k,j] )
      endfor
    endfor
  endfor
end {All-pairs-shortest-path}

Figure 3.1: Sequential Floyd Algorithm.

From Equation (3.2), the isoefficiency function f(p) can be derived. The isoefficiency function sets an upper bound on the number of modules that can be applied to an application of a given size. The inequality in (3.3) sets a lower bound on the number of modules that should be used to store the input data. As the problem size of an application increases, the lower bound and the upper bound may approach a common value. This implies that the distributed memory machine may only be extended to a certain size for solving an application problem of increasing size. If the number of modules exceeds this bound, then it is impossible for the parallel system to maintain its efficiency at the desired level E. We call this range the expansion range of a machine-algorithm pair.

In the following, we use the all-pairs-shortest-path problem as an example to illustrate our concept. We assume that the all-pairs-shortest-path problem is solved by the Floyd Algorithm [1], which is shown in Figure 3.1. Using a checkerboard partition, the n × n cost matrix is divided into p sub-matrices, each of size (n/√p) × (n/√p). These p sub-matrices are mapped to the p modules. Assume that the entries of each sub-matrix are stored at the local memory locations using row-major order. The communication pattern used in the checkerboard partition is illustrated in Figure 3.2.

Figure 3.2: Data dependency in the checkerboard partition. (The figure shows that computing an entry in iteration k requires entries of the k-th row and the k-th column from iteration k − 1.)

The total communication time of the parallel checkerboard version of the Floyd Algorithm is

2n × (t_o + (n/√p)(t_c + t_d)) × log √p + n·t_bs,

where t_bs is the time for synchronizing all the modules of the parallel machine; t_bs can be O(1) for a parallel machine such as the TMC CM-5, or O(log p) for a parallel machine such as the IBM SP-2. The total computation time of the parallel version of the Floyd Algorithm is O(n³/p). The total overhead due to performing remote data-accesses and synchronizations is

n(t_o + (n/√p)(t_c + t_d)) × p log p + n·p·t_bs.

The worst case of the sequential Floyd Algorithm has time complexity

Te(n) = Θ(n³).    (3.4)

According to Equation (3.2), to keep the efficiency of the parallel system at a desired level E, we must have the isoefficiency function

f(p) = Θ(p^1.5 (log p)³).    (3.5)
Assume each of the modules in a distributed memory machine contains a local memory of size m; thus, the total memory of the distributed memory machine is pm. According to Equations (3.5) and (3.4), we have

n = Ω(p^0.5 log p).    (3.6)

Since the total memory size provided in a parallel machine should be larger than the memory requirement of an algorithm running on the parallel machine, we have

p × m = Ω(n²).    (3.7)

Combining Equations (3.6) and (3.7), we have p = O(2^√m). This equation suggests that to solve a large-scale application problem, linear speed-ups can be achieved by expanding the distributed memory machine from a single module to p modules, where p = O(2^√m) and each module has a local memory of size m. If we need to use more than p modules, then efficiency cannot be maintained at the desired level E (i.e., linear speed-up is not possible) due to the large overhead.

Base alg.   Parallel Variant          Isoefficiency       MOF   Expansion Range
Dijkstra    Source-Partition          p³                  p     [1, m^0.5]
Dijkstra    Source-Parallel           (p log p)^1.5       n     [1, m²/(log m)³]
Floyd       Stripe                    (p log p)³          1     [1, m/(log m)²]
Floyd       Checkerboard              p^1.5 (log p)³      1     [1, 2^√m]
Floyd       Pipelined Checkerboard    p^1.5               1     ∞

Figure 3.3: Expansion ranges of various parallel systems.

3.2.3 Observations and Discussions

Based on the above analysis, we consider the expansion ranges of the parallel systems partially listed in [35]. The results are shown in Figure 3.3. The terminology used in Figure 3.3 for describing module-mapping strategies is the same as in [35]. In Figure 3.3, [x, y] means that the parallel system can maintain its efficiency at a desired level for a number of modules between x and Θ(y); ∞ means that for any number of modules used in the parallel system, the system can still maintain efficiency at a desired level. From Figure 3.3, we can observe some interesting results: some parallel systems with good isoefficiency functions may have smaller expansion ranges, while some systems with poor isoefficiency functions have larger expansion ranges. For example,

• Floyd Checkerboard (System A):
  - f(p) = Θ(p^1.5 (log p)³).
  - Expansion range is [1, 2^√m].

• Dijkstra Source-Parallel (System B):
  - f(p) = Θ((p log p)^1.5).
  - Expansion range is [1, m²/(log m)³].

The isoefficiency function of System A is poorer than that of System B; however, the expansion range of System A is larger than that of System B. The speed-up of a parallel system is equal to efficiency × number of processors. Assume Systems A and B need to maintain efficiency at some desired level. For a problem size which can be run on both systems, the speed-up of System B is higher than that of System A (due to the lower order of magnitude of the isoefficiency function of System B). However, for problem sizes up to a certain value, only System A can be used to solve the large-scale problem without degrading the efficiency of the parallel system; this is because System A has the larger expansion range. According to the definition of expansion range, linear speed-ups can be observed for System A as well as System B within their expansion ranges. Since System A has the larger expansion range, larger speed-ups (compared with System B) may be observed for a given application which requires a large amount of input data.
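The checkerboard derivation above can be replayed numerically. The rough check below sets every constant factor to one (the Θ and Ω notation deliberately hides them), so the absolute values are illustrative; only the growth of the bound matters.

import math

def max_modules_checkerboard(m):
    # Floyd checkerboard: the isoefficiency bound needs n on the order of
    # sqrt(p) log p, and the memory bound needs p*m >= n*n. Together they
    # force m >= (log2 p)**2, i.e. p <= 2**sqrt(m) up to constants.
    p = 1
    while math.log2(2 * p) ** 2 <= m:
        p *= 2
    return p

for m in (64, 256, 1024):   # per-module memory, in matrix entries
    print(m, max_modules_checkerboard(m), 2 ** int(math.sqrt(m)))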
3.3 Mapping Data to Memory Locations

Many distributed memory machines provide several mechanisms for effectively communicating long messages of various sizes and for effectively communicating short messages of fixed size. Data exchange among the modules of a distributed memory machine can be achieved either by using one long message or by using a sequence of short messages of fixed size. A message consists of data items stored at consecutive memory locations; thus, data items need to be moved to consecutive memory locations before they can be delivered using a long message. The operations of moving data between memory locations may not be necessary if a sequence of short messages is used to deliver the data; however, since the number of messages used to deliver the data increases, the total startup time also increases. Thus, a decision must be made in choosing an appropriate communication mechanism for delivering data among the modules.

To capture the communication activities in current distributed memory machines, two types of communication mechanisms (Cs and Cb) are defined. Assume the data delivery rates Td = t_d (µsec/byte) are the same for both types of communication mechanisms. A numerical comparison of the two mechanisms is sketched after the definitions.

• Cs: a communication mechanism which delivers data using a sequence of small messages of k bytes. According to the definition of our model, communicating data of M bytes between a pair of modules takes at most

(M/k)(T'_o + (k − 1)t_d) + t_d,    (3.8)

where T'_o is the startup time for delivering a short message. Equation (3.8) can be simplified to M·t'_o + t_d, where

t'_o = T'_o/k + (1 − 1/k)t_d.    (3.9)

According to the definition of the startup time in our model, the first byte of a message is in the channel after the startup process. Thus, if data of M bytes are delivered using M short messages of one byte, the time for delivering the data is

M·t'_o + t_d,    (3.10)

where t'_o is the startup time for delivering a message of one byte. Comparing Equation (3.9) with Equation (3.10), the time for delivering M bytes using messages of k bytes with startup time T'_o is equivalent to that for delivering the same amount of data using messages of one byte with startup time T'_o/k + (1 − 1/k)t_d. In this case, we can virtually consider that Cs delivers data using a sequence of messages of one byte with startup time t'_o, where t'_o = T'_o/k + (1 − 1/k)t_d.

• Cb: a communication mechanism which delivers data using one long message. The size of a long message is in proportion to the amount of data to be delivered. The data items need to be moved to consecutive memory locations before they can be delivered. Denote the startup time for delivering a message using Cb by t_o.
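To see how the choice between Cs and Cb plays out numerically, the sketch below evaluates Equations (3.8) and the Cb cost with the SP-2 figures from Chapter 2. The short-message startup T'_o is assumed equal to To, and whether the data must first be copied is passed as a flag; all names are ours.

T_o, T_d, T_c = 40.0, 0.028, 0.01   # SP-2 figures, in microseconds
T_o_short = T_o                     # assumed: same startup for short messages

def time_Cs(M, k):
    # Equation (3.8): M/k messages of k bytes; a message's startup ends
    # when its first byte enters the channel, and the last byte of the
    # final message arrives one T_d later.
    return (M / k) * (T_o_short + (k - 1) * T_d) + T_d

def time_Cb(M, scattered=True):
    # One long message; scattered data is first copied to consecutive
    # memory locations (message construction).
    copy = M * T_c if scattered else 0.0
    return copy + T_o + M * T_d

for M in (512, 65_536):
    print(M, round(time_Cs(M, k=64), 1), round(time_Cb(M), 1))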
By carefully m apping data to th e m odules, the m em ory to m em ory copy operations can be p artially or to tally elim inated. By choosing an appropriate com m unication m echanism , either th e m em ory to m em ory copy operations can be elim inated or the to tal tim e for sta rtu p processes can be reduced. In [43], th e tim e for copying d ata from a m em ory location to another m em ory location has been considered for deciding an optim al m odule-m apping strategy. In th is section, strategies are proposed for reducing com m unication tim e for perform ing windows operations. T he developm ent of th e strategies is in two aspects: (1) for a given m em ory-m apping, the strategies for applying Cb and Cs are investigated and, (2) a m em ory-m apping strategy is proposed for com m unication using Cb- W indow operations are widely used in perform ing low level vision tasks. In perform ing window operations, the o u tp u t for a pixel e in an image depends on th e inform ation of the pixels enclosed in a rectangular area of the im age. In general, the pixel e is located at the center of th e rectangular area. E xam ples of window operations are edge detection [45] and im age com ponent labeling [2, 15]. To parallelize window operations, an im age consisting of n x n bytes is partitio n ed in to p subim ages. T he p subimages is m apped th e p m odules of a d istrib u ted m em ory m achine. This procedure is called “m odule-m apping” . A m odule-m apping strategy for perform ing window operations is given in Figure 3.4. In Figure 3.4, the im age has been partitio n ed into subim ages / 0, h , ..., /is (each of the subim ages consists of y | bytes). Subim age /, is assigned to m odule i, for 0 < i < 15. During perform ing com m unication, the boundary d ata item s of for 0 < i < 15, in a m odule need to be delivered to the o th e r eight m odules which store the neighboring subim ages of /;. 48 cmo o o o x > o o o o o o o o o n x n im age p -1 6 Figure 3.4: M odule-m apping for window operations. We first investigate th e suitability of various com m unication m echanism s for perform ing window operations using th e m odule-m apping strategy in F igure 3.4 is used. To illustrate our idea, we analyze the tim e needed to deliver d a ta using C s and Cfc. We assum e th a t each of th e processor/m em ory m odules uses a two- dim ensional array to store a subim age. T he entries of each of th e tw o-dim ensional array are stored in the local m em ory using row m ajor order. In window operations, th e boundary o u tp u t d ata item s of /,, for 0 < i < 15, need to be used by its neighboring subim ages for com puting results. If a mask of 2h x 2h bytes are used, then th e boundary interm ediate o u tp u t item s of a subim age which have the distance (to th e boundary of the subim age) less th an h need to be delivered to o ther m odules. Assum e th e d a ta item s in a module to be delivered to other m odules are partitio n ed into eight groups: A, B , C , D , F, G , H, I as shown in F igure 3.5. According to th e locations of th e d ata in the subim age, the groups can be form ed into eight sets of d ata groups: GO, G l , ..., G l as shown in F igure 3.5; each set is sent to one of the eight m odules which contains the neighboring subim ages. T h e following result considers the tim e for perform ing a com m unication using C3 and Cb for a given m em ory-m apping A (as shown in Figure 3.6). 
The following result considers the time for performing a communication using Cs and Cb for a given memory-mapping A (as shown in Figure 3.6). The time for performing a communication for window operations is given by

(1) 4h(n/√p + h)t'_o + 8t_d, if Cs is used;

(2) 8t_o + 2h(n/√p + 2h)(t_d + 2t_c), if Cb is used.

Figure 3.5: Eight sets of data groups. (In a subimage with boundary width h, the data to be sent are partitioned into groups A through I, with E the interior; the sets are G0 = {A, B, C}, G1 = {C, F, I}, G2 = {I, H, G}, G3 = {G, D, A}, G4 = {A}, G5 = {C}, G6 = {I}, G7 = {G}.)

Figure 3.6: Memory-mapping A. (The data items of {A, B, C}, of {D, E, F}, and of {G, H, I} are interleaved in the local memory in row-major order.)

From the above results, we know that if the startup time t'_o is large compared with t_d and t_c, then Cb is the desirable communication mechanism for performing window operations. However, if the startup time t'_o is close to the data delivery rate t_d, then using Cs can lead to fast communication. In a systolic array, a small amount of data can be efficiently communicated between the processing elements during each communication phase due to the small startup time. Another communication strategy can be employed on memory-mapping A: delivering data using multiple various-sized messages without any memory-to-memory copy activity. Each of these messages delivers data which are initially stored in consecutive memory locations; thus, more than one message may need to be communicated between a pair of neighboring modules. Using this strategy, the time for performing a communication for window operations is

2h(n/√p + 2h + 2)t_o + (2hn/√p)t_d.

It is easy to see that by using this strategy, the memory-to-memory copy operations can be totally eliminated.

For delivering data using a long message, the data to be delivered should be stored in consecutive memory locations. Thus, for a given module-mapping, appropriately mapping data into the local memory locations may further reduce the time of performing communication using Cb. A memory-mapping to achieve this goal is illustrated in Figure 3.7. Memory-mapping B can partially eliminate the memory-to-memory copy operations in performing communication using Cb. Using memory-mapping B on the module-mapping shown in Figure 3.4, only the data items in group A need to be copied to the reserved area. The time for delivering data using Cb in performing window operations is then

8t_o + 4h(n/√p + h)t_d + 2h²t_c.

It is easy to see that the time for performing memory-to-memory copy operations is only 2h²t_c under this memory-mapping approach.

Figure 3.7: Memory-mapping B. (The groups are laid out in the order E, A, B, C, F, I, H, G, D in the local memory, with an area reserved for a copy of A.)
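Taking the four expressions above at face value (they are our reconstruction of print-damaged formulas, so the constants should be treated with caution), the options can be tabulated for sample parameter values:

import math

T_o, T_d, T_c = 40.0, 0.028, 0.01   # SP-2 figures; t'_o assumed equal to T_o

def window_comm_times(n, p, h):
    s = n / math.sqrt(p)             # subimage side, n / sqrt(p)
    return {
        "Cs on mapping A":           4*h*(s + h)*T_o + 8*T_d,
        "Cb on mapping A":           8*T_o + 2*h*(s + 2*h)*(T_d + 2*T_c),
        "various-sized messages":    2*h*(s + 2*h + 2)*T_o + 2*h*s*T_d,
        "Cb on mapping B":           8*T_o + 4*h*(s + h)*T_d + 2*h*h*T_c,
    }

for name, t in window_comm_times(n=4096, p=16, h=2).items():
    print(f"{name:24s} {t:12.1f} microseconds")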
3.4 Summary

The size of a scalable parallel machine can be scaled up by increasing the number of processor/memory modules in the machine; the available computing power and the available physical memory also increase proportionally to the number of modules. As the problem size increases, the memory requirement for a given module-mapping may increase at a different rate. This implies that for a given module-mapping of an application problem, the requirements in memory size and computational power can be very different. In this chapter, a methodology has been proposed to evaluate the suitability of various module-mappings applied to a distributed memory machine with bounded memory size. The evaluation is based on a scalability meter, the isoefficiency function of a parallel system. We have shown that although the isoefficiency function of a parallel system may exist, the system may not maintain efficiency at a desired level for any number of modules exceeding its expansion range. Our result is useful for a programmer choosing an appropriate module-mapping strategy for solving a problem. In addition, based on the requirements in memory size and computational power for solving a certain class of applications of typical problem sizes, machine manufacturers can decide on an appropriate memory size to be installed in a machine.

Many current distributed memory machines provide several communication mechanisms to suit the requirements of performing various communications. To reduce communication time, it is important for machine users to choose an appropriate communication mechanism based on a given memory-mapping strategy. In this chapter, approaches to reduce the time for performing communications on machines with various mechanisms are studied. Furthermore, we also propose a memory-mapping strategy for performing window operations. The strategy reduces the communication time by placing the intermediate output destined for the same target module at consecutive memory locations.

Chapter 4

Communication Latency Reducing

Accessing data from remote modules is a major overhead in exploiting parallel processing on distributed memory machines. The time required for data to be transported from its source module to its target module is defined to be the communication latency. The communication latency includes the time for moving data items to consecutive memory locations (if necessary), the startup time, the data transportation time, and the time for moving data items from consecutive memory locations to their final memory locations (if necessary). In our model, the processor in a module is not able to access the data in a message until the message has been completely received by the module; thus, the computation activity of a processor may be suspended if the required data cannot be accessed by the processor. To achieve high-performance computing on distributed memory machines, it is important to reduce the time for a module to access data from other modules. In Section 4.1, the importance of the problems considered in this chapter is discussed. Techniques for reducing the communication time of several fundamental communication patterns are investigated in Section 4.2. In Section 4.3, a message-grain balancing technique is proposed to avoid the module-access contention which may happen in distributing data among the modules, and a closed form is derived for evaluating the necessary condition of applying the message-grain balancing process.

4.1 Background

Several programming paradigms are used by programmers in designing code for solving application problems. The programming paradigms can be classified into two categories: either the computation activities of some modules overlap with the communication activities of the other modules, or all the modules synchronously alternate between computation phases and communication phases. The latter programming paradigm makes it easier for programmers to develop their code and to debug errors in their programs.
Using this programming paradigm, the activities of the modules in a machine can be described as follows. During a computation phase, each module either stays idle, or accesses data from its local memory, performs computations based on the data, and stores the computed results back to its local memory. During a communication phase, each module either stays idle, or delivers data to and (or) receives data from remote modules. For this type of programming paradigm, techniques are needed for globally accelerating the data delivery activities in any of the communication phases so as to reduce the communication latency.

In Section 4.2, techniques are proposed for performing several fundamental communication patterns: one-to-one, one-to-many, and many-to-one. We only consider the case in which each source module transports an equal amount of data to the target module(s). For the fundamental communication patterns considered in that section, module-access contention only happens in the many-to-one pattern. Since each of the source modules contains data items that target the same module, the rate of data reception at the target module is the only factor affecting the communication time. Many-to-many communication which delivers a fixed amount of data among the modules can be performed using multiple one-to-one communications; thus, we do not treat transporting fixed amounts of data among the modules as a separate many-to-many pattern.

Distributing data among the modules is of practical use in many application problems. Examples are balancing load among the modules of a parallel machine and performing parallel histogramming on a parallel machine. In performing load balancing and parallel histogramming, varying amounts of data in each of the modules are transported to the other modules based on the load information or on predetermined data movement rules. In Section 4.3, a technique for distributing varying amounts of data among the modules is proposed. During the data distribution process, each module delivers data to the other modules, and the amounts of data to be delivered to the other modules can be very different. In general, data distribution is many-to-many communication. Based on our model, many-to-many communication may suspend the communication activities in some modules for a period of time, due to module-access contention. To avoid module-access contention, many-to-many communication can be performed using linear permutation [8]. The linear permutation consists of p - 1 steps; during step j, 1 <= j <= p - 1, module i delivers data to module ((i + j) mod p). However, this approach directly applies only to the case in which each module distributes a fixed amount of data to the other modules. For distributing varying amounts of data among the modules, module-access contention still exists. This suggests that a novel approach is required for data distribution in which each module distributes varying amounts of data to the other modules. Our technique is proposed to handle this scenario. Our data distribution algorithm solves the problem by balancing the message-grains first.
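For reference, here is a minimal sketch of the linear permutation of [8] for the fixed-amount case, again in Python with mpi4py; the string payloads merely stand in for equal-sized data blocks.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
p = comm.Get_size()

# blocks[j] is the fixed-size block this module owes module j.
blocks = [f"block {rank}->{j}" for j in range(p)]
received = [None] * p
received[rank] = blocks[rank]            # the local block stays in place

for j in range(1, p):                    # the p - 1 permutation steps
    dest = (rank + j) % p                # step j: module i sends to i + j
    src = (rank - j) % p                 # ... and receives from i - j
    received[src] = comm.sendrecv(blocks[dest], dest=dest, source=src)
```

Because every module sends to a distinct destination in every step, no module is hit by two messages at once, which is exactly what breaks down when the block sizes vary.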
We also derive a closed form for judging the necessary condition for applying the message-grain balancing technique to the data distribution process.

4.2 Communication Activity Hiding

In this section, techniques are proposed for the modules to access data from other modules with reduced communication latency. Several fundamental communication patterns are considered: one-to-one, one-to-many, and many-to-one. We only consider the case of delivering a fixed amount of data to other modules. In this section, the term scattered data items (in bytes) refers to data items which are not stored in consecutive memory locations. Scattered data items can be delivered using a sequence of short messages of fixed size or using longer messages of various sizes. Delivering data using a sequence of short messages of fixed size can be performed without moving the data items to consecutive memory locations; a sequence of short messages is delivered by issuing a sequence of commands, such as read or write, from the source processor. For delivering data using a long message, memory to memory copy operations may be required to move the scattered data items to consecutive memory locations before the message can be sent. Based on our model, the copy operations can overlap with the data transportation. Thus, there is an opportunity to reduce the communication latency by overlapping the communication activities; by doing so, the communication costs can be partially hidden. For comparison purposes, we also derive the time for delivering data using a sequence of messages of fixed size.

The notations Cs and Cb have been defined in Section 3.3 to capture the features of different communication mechanisms provided by a distributed memory machine. For clarity, we repeat the definitions of these notations.

• Cs: a communication mechanism which delivers data using a sequence of small messages of k bytes. According to the definition of our model, communicating data of M bytes between a pair of modules takes at most

(M/k)(T'_0 + (k - 1) t_d) + t_d,    (4.1)

where T'_0 is the startup time for delivering a short message. Equation (4.1) can be simplified to M t'_0 + t_d, where

t'_0 = T'_0/k + (1 - 1/k) t_d.    (4.2)

According to the definition of the startup time in our model, the first byte of a message is in the channel after the startup process. Thus, if data of M bytes are delivered using M short messages of one byte, the time for delivering the data is

M t_0 + t_d,    (4.3)

where t_0 is the startup time for delivering a message of one byte. Comparing Equation (4.2) with Equation (4.3), the time for delivering M bytes using messages of k bytes with startup time T'_0 is equivalent to that for delivering the same amount of data using messages of one byte with startup time T'_0/k + (1 - 1/k) t_d. In this case, we can virtually consider that Cs delivers data using a sequence of messages of one byte, and that the startup time for delivering a message is t'_0, where t'_0 = T'_0/k + (1 - 1/k) t_d.

• Cb: a communication mechanism which delivers data using a long message. The size of a long message is in proportion to the amount of data to be delivered.
The data items need to be moved to consecutive memory locations before they can be delivered. Denote the startup time for delivering a message using Cb by t_0.

4.2.1 One-to-one

In one-to-one communication, a source module delivers scattered data items of m bytes to their target module, and the data items are stored scatteredly at the target module. One-to-one communication is illustrated in Figure 4.1.

[Figure 4.1: An illustration of one-to-one communication. Module i holds m scattered data items that are delivered to scattered locations in module j.]

[Figure 4.2: Overlapping in performing one-to-one communication. The startup, memory copy, and data transportation activities of module i, the channel, and module j are interleaved along the time axis.]

The following results show the time for performing one-to-one communication using either Cs or Cb. If Cs is used to perform one-to-one communication, then the total startup time required for delivering the data is in linear proportion to the amount of data. This leads to the following result: the time for performing one-to-one communication using Cs is m t'_0 + t_d. It is easy to see that the total startup time dominates the time for performing one-to-one communication.

A straightforward approach for performing one-to-one communication using Cb is to use a long message for delivering the data. The approach consists of four non-overlapping steps:

1. Move the scattered data items in the source module to consecutive memory locations.

2. Start up a communication.

3. Transport the data.

4. Copy the data items stored at consecutive memory locations in the target module to their final memory locations.

The time for performing one-to-one communication using this approach is t_0 + m (t_d + 2 t_c).

Some distributed memory machines provide different communication mechanisms [18, 31] for different communication requirements. For example, a lightweight protocol is used for active messages [18] to communicate a short message of fixed size. The startup time for delivering data using such a mechanism is usually smaller than the startup time for delivering data of various sizes. In general, the large startup time in delivering data using messages of various sizes is partially due to the time required for buffer management in the communicating modules. Thus, users should choose an appropriate communication mechanism for transferring scattered data items based on the number of data items to be delivered. Based on the times derived above for performing one-to-one communication using Cs and Cb on our model, Cs should be employed when the number of data items to be delivered is less than w, where w = (t_0 - t_d)/(t'_0 - t_d - 2 t_c), provided t'_0 > t_d + 2 t_c. If t'_0 <= t_d + 2 t_c, then Cs always leads to a faster communication.

For a large amount of data, Cb should be used to deliver the data. In this case, the time for performing one-to-one communication can be reduced by delivering the scattered data items using a sequence of messages of an appropriate size, and by overlapping the message construction operations with the data transportations. This strategy is illustrated in Figure 4.2. The size of the messages chosen to deliver the data depends on the features of the communication mechanisms and on the total amount of data to be delivered. Denote by h the size of the messages chosen for delivering the m scattered data items, by a the value max{t_d, t_c}, and by T(h) the time for performing one-to-one communication.
Thus, we have

T(h) = (m/h) t_0 + 2 h t_c - (m/h + 1) a + h t_d,

which can be simplified to

T(h) = (m/h)(t_0 - a) + h (2 t_c + t_d) - a.

Taking the derivative with respect to h and equating it to zero, we have

-m (t_0 - a)/h^2 + (2 t_c + t_d) = 0.

This implies that h = sqrt(m (t_0 - a)/(2 t_c + t_d)). Thus, the minimum of T(h) is

sqrt(4 m (t_0 - a)(2 t_c + t_d)) - a,

where the number of data items m satisfies m >= (t_0 - a)/(2 t_c + t_d).

4.2.2 One-to-many

In performing one-to-many communication, module 0 delivers scattered data items of pm bytes to the other p modules; each of the p modules receives m bytes. An illustration of one-to-many communication is given in Figure 4.3.

[Figure 4.3: An illustration of one-to-many communication. Module 0 holds pm scattered data items; each of modules 1 through p receives m of them.]

In our model, the first byte of a message has been delivered to the channel after the startup process. According to the definition of delivering data using Cs, a startup time t'_0 is associated with each byte delivered. Thus, based on the definition of Cs and the nature of our model, one-to-many communication using Cs can be performed in pm t'_0 + t_d.

Next, we consider performing one-to-many communication using Cb. A four-step approach (similar to that for one-to-one communication) can be applied. The four steps are as follows.

1. Move all the scattered data items (pm bytes) in the source module to consecutive memory locations.

2. Start up the communications.

3. Transport the data through the channels using p messages of m bytes each.

4. As soon as a message is completely received by any of the target modules, that module moves the arrived data items from consecutive memory locations to their final locations.

The activities in Steps 2 and 3 are alternately executed by the source module p times. The activities in Step 3 overlap with the activities in Step 4. The time for performing one-to-many communication using this approach is p t_0 + (pm - p + 1) t_d + (p + 1) m t_c. This approach is inefficient if the number of modules and the startup time are large.

An approach which partitions the modules into sqrt(p) groups, as employed in [23], can be used to perform one-to-many communication; our approach additionally considers how to overlap the communication activities to reduce the communication latency. Figure 4.4 shows the module-partition approach. Using the module-partition approach, the p target modules are partitioned into sqrt(p) groups; each group contains sqrt(p) modules. Module 0 sends a message of size sqrt(p) m to a specified module in each of the groups. After receiving the message, the specified module in a group immediately sends sqrt(p) messages to the other modules in the group; each of the messages contains data items of m bytes. By overlapping memory copy operations with data transportation, the time for performing one-to-many communication is at most

2 sqrt(p) t_0 + 2 sqrt(p) m t_d + (sqrt(p) + 1) m t_c + pm a.

[Figure 4.4: An illustration of the module-partition approach, with module 0 at the first level and the module-groups at the second level.]

[Figure 4.5: Trade-off in the module-partition approach: the first-level time grows and the second-level time shrinks with the number of groups, so their sum has a minimum.]
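A sketch of the module-partition approach in Python with mpi4py is given below; the payload, the group count, and the divisibility assumptions are ours, chosen only to make the two levels explicit.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()
g = max(1, int(p ** 0.5))              # number of groups (sqrt(p) here)
gs = p // g                            # group size (assume g divides p)
m = 1024                               # bytes per final recipient

def fan_out(packed, base):
    """Second level: forward one m-byte chunk to each group member."""
    for k in range(1, gs):
        comm.send(packed[k * m:(k + 1) * m], dest=base + k)

if rank == 0:
    payload = bytes(m * p)             # the pm bytes, already packed
    for base in range(gs, p, gs):      # first level: one long message per group
        comm.send(payload[base * m:(base + gs) * m], dest=base)
    fan_out(payload[:gs * m], 0)       # serve module 0's own group
elif rank % gs == 0:                   # the specified module of a group
    fan_out(comm.recv(source=0), rank)
else:
    data = comm.recv(source=(rank // gs) * gs)
```

Writing g as a parameter rather than fixing it to sqrt(p) anticipates the trade-off analyzed next.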
However, dividing the modules into sqrt(p) groups for performing one-to-many communication may not be an optimal partition. As shown in Figure 4.5, an optimal number of groups exists for achieving minimal communication time when performing one-to-many communication using the two-level module-partition approach. The number of groups depends on the number of modules in a distributed memory machine, the amount of data to be delivered, and the communication parameters of the machine. Denote by g the number of module-groups into which the modules of a parallel machine are partitioned, and by T(g) the time for performing one-to-many communication. Thus, we have

T(g) = (p/g) t_0 + (pm t_d)/g + (pm t_c)/g + g t_0 + m t_c + pm a.

Taking the derivative with respect to g and equating it to zero, we have

-(p t_0 + pm t_d + pm t_c)/g^2 + t_0 = 0.

This implies that

g = sqrt(p (t_0 + m t_c + m t_d)/t_0).    (4.4)

Two observations can be made based on Equation (4.4):

• The number of groups approaches sqrt(p) if t_0 is very large compared with the term m(t_d + t_c).

• The number of groups increases if (1) the value of m(t_d + t_c) increases and (2) t_0 is small compared with m(t_d + t_c). A large m(t_d + t_c) implies that the amount of data transported among the modules is the dominant factor in performing one-to-many communication. Thus, if we choose the number of module-groups g to be a large value, then the total amount of data transported among the modules decreases. This can lead to a faster communication.

Thus, the minimum of T(g) is 2 sqrt(p t_0 (t_0 + m t_c + m t_d)) + m (pa + t_c).

4.2.3 Many-to-one

In many-to-one communication, one target module receives data items from the other p source modules; each of the modules delivers scattered data items of m bytes to the target module. Many-to-one communication is illustrated in Figure 4.6.

[Figure 4.6: An illustration of many-to-one communication. Modules 1 through p each hold m scattered data items destined for module 0.]

If many-to-one communication is performed using Cs, then the p source modules can send the first bytes of their data concurrently; however, the target module takes the data one byte at a time from the channel. Thus, the communication time depends on the number of communicating modules, the startup time t'_0, and the data transportation rate t_d. Based on these, the time for performing many-to-one communication using Cs is

t'_0 + m max{p t_d, t'_0}.    (4.5)

Two interesting conclusions can be drawn from Equation (4.5).

• If p > t'_0/t_d, then the time for performing many-to-one communication using Cs is smaller than that using Cb. Note that performing communications using Cb needs memory to memory copy operations, and in general t_0 > t'_0.

• If p is much smaller than t'_0/t_d and the amount of data to be delivered is large, then performing many-to-one communication using Cb can lead to a shorter communication time. The reason is that the startup time for sending messages using Cb does not increase in proportion to the total amount of data to be delivered.
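Read as a decision rule, the two conclusions might be coded as follows; this is a sketch, and the numeric factor standing in for "much smaller than" and the large-data cutoff are invented.

```python
def preferred_mechanism(p, m, t0_prime, td, large_m=10_000):
    """Selection rule drawn from Equation (4.5): once p exceeds t0'/td, the
    target's channel is the bottleneck either way and Cs, with its smaller
    startup and no copy operations, wins; for p far below t0'/td and a
    large amount of data, Cb wins because its startup does not grow with m.
    The 0.1 factor and `large_m` are illustrative thresholds only."""
    if p > t0_prime / td:
        return "Cs"
    if p < 0.1 * (t0_prime / td) and m >= large_m:
        return "Cb"
    return "compare measured times"   # the middle ground is machine-specific
```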
A straightforward approach for performing many-to-one communication using Cb is to use a long message for delivering the data. The approach consists of four non-overlapping steps:

1. All the source modules move their scattered data items of m bytes to consecutive memory locations.

2. Each of the source modules starts up the communications.

3. Each of the source modules transports its data.

4. The target module moves the received data items stored at consecutive memory locations to their final memory locations.

Using the four-step approach, the time for performing many-to-one communication is t_0 + m (p t_d + (p + 1) t_c). To further reduce the communication time, the data delivery activities at the source modules can be divided into three phases.

1. In the first phase, each of the modules copies a subset of the scattered data items in its module to consecutive memory locations.

2. In the second phase, the data stored at consecutive memory locations are delivered by message handlers; at the same time, each of the modules copies the rest of the scattered data items in its local memory to consecutive memory locations.

3. In the third phase, each of the modules delivers the message constructed in the second phase to the target module.

This approach is illustrated in Figure 4.7. To overlap the communication activities in the second phase well, the sizes of the messages used to deliver data in the second and third phases are chosen to be m t_c/(p t_d + t_c) and m p t_d/(p t_d + t_c), respectively. Using the three-phase algorithm, the time for performing many-to-one communication using Cb is at most 2 t_0 + pm (t_d + (a - t_d)) + m t_c. For t_c < t_d, this approach hides the cost of memory copy operations by a factor of p.

[Figure 4.7: Overlapping in performing many-to-one communication. The startup, memory copy, and data transportation activities of modules 1 through p, the channel, and module 0 are interleaved along the time axis.]
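The three phases can be sketched with a non-blocking send so that the second copy really overlaps the first transmission. This is our illustration (mpi4py with NumPy buffers); the chunk split follows the ratio m t_c/(p t_d + t_c) derived above, and all numeric parameters are invented.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()
m, td, tc = 4096, 1e-8, 2e-8
split = int(m * tc / (p * td + tc))      # phase-2 share of the m bytes

if rank != 0:                            # the p - 1 source modules
    scattered = np.random.bytes(m)       # stand-in for scattered data items
    # Phase 1: copy the first subset to a contiguous send buffer.
    buf1 = np.frombuffer(scattered[:split], dtype=np.uint8).copy()
    # Phase 2: ship it while copying the remaining items.
    req = comm.Isend(buf1, dest=0, tag=1)
    buf2 = np.frombuffer(scattered[split:], dtype=np.uint8).copy()
    req.Wait()
    # Phase 3: deliver the message built during phase 2.
    comm.Send(buf2, dest=0, tag=2)
else:                                    # the target module gathers everything
    for src in range(1, p):
        for tag, size in ((1, split), (2, m - split)):
            chunk = np.empty(size, dtype=np.uint8)
            comm.Recv(chunk, source=src, tag=tag)
```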
4.3 Message-grain Balancing

Data distribution is an important process by which a machine user globally moves data among the modules. The process operates as follows: each of the modules of a distributed memory machine delivers data to the other modules of the parallel machine. Any module may deliver varying amounts of data to its target modules and receive varying amounts of data from the source modules. Due to (1) the possibly high variance in the message sizes and (2) the limited communication bandwidth available per module, the time required by any module to deliver data to, or receive data from, the other modules can be very large. Without carefully scheduling the data delivery among the modules, module-access contention can seriously degrade the performance of the parallel system. In this section, a deterministic technique is proposed to minimize the module-access contention problem.

4.3.1 Data Distribution

Many application problems [28, 61] need to distribute intermediate output during computations. In [28], data distribution is performed to balance the amount of data among the modules. In their load-balancing algorithm, the data to be distributed have no specified target modules; thus, the communication patterns required by their load-balancing algorithm can be tailored to a pattern which causes very little module-access contention. In [61], a parallel version of columnsort [37] is used to move data among the modules. In the parallel version, each of the modules delivers (and receives) an equal amount of data to (from) each of the other modules between each of the four local sorts. In [30], a mapping technique is proposed to minimize the total amount of data to be moved among the modules in performing data distribution. In this section, however, a technique is proposed for data distributions in which each of the modules delivers (and receives) varying amounts of data to (from) each of the other modules. We also do not assume that any specified data-mapping strategy is used to restrict the communication patterns.

Basically, data distribution is many-to-many communication. The potential problem with many-to-many communication is that many messages can arrive at the same module simultaneously. We refer to this scenario as module-access contention. Module-access contention is caused by the limited communication bandwidth available per module. To avoid module-access contention, the linear permutation algorithm [8] can be applied to distribute data. This algorithm divides the many-to-many communication into p rounds. During round j, 0 <= j < p, module i, 0 <= i < p, sends messages to module ((i + j) mod p). This approach effectively avoids module-access contention in the case where each of the source modules delivers the same amount of data to its target modules. In the other case, in which each of the source modules delivers varying amounts of data, the risk of module-access contention still exists. In [51], barrier synchronization points are inserted between rounds to avoid the module-access contention problem. However, the time to perform the permutation in each round depends on the maximal size of the messages in that round. The communication time is large if there is a pair of modules communicating a long message during each of the rounds; in this case, the total communication time accumulated over the p rounds is unacceptable. Also, the time for performing a barrier synchronization can be very large on machines which perform the synchronization without any hardware support. An example is the IBM SP-2: performing a barrier synchronization on an IBM SP-2 with 64 nodes requires 521 μsecs [10]. Thus, performing data distribution on an IBM SP-2 with 64 nodes using the linear permutation algorithm with inserted barrier synchronization points spends at least 32 msecs on barrier synchronizations alone. Without inserting a synchronization point between rounds, directly applying the linear permutation algorithm does not effectively avoid the module-access contention problem. An example illustrates this. Using the linear permutation algorithm, the data delivery from module j - 1 to module i cannot start until the pending message sent from module j to module i has been completely received by module i. In this case, a long message being received by a module can delay the reception of other incoming messages at that module. This is illustrated in Figure 4.8: module i, which is receiving a message from module j, will not receive messages from modules 0, ..., j - 2, j - 1.

[Figure 4.8: An illustration of module-access contention: while module i receives a long message from module j, modules 0 through j - 1 wait to deliver their data.]
To analyze the scenario of module-access contention, notations are defined as follows.

• A matrix is used to specify a communication pattern. The entry d_ij of the matrix, at row i and column j, specifies the amount of data (in bytes) to be delivered from module i to module j. In this section, we assume that p divides d_ij, for 0 <= i, j < p.

• d_s and d_r are, respectively, the maximum amount of data to be delivered and to be received by any module.

• L = max{d_s, d_r}.

[Figure 4.9: A communication pattern for distributing data among 20 modules. Every entry of the 20 x 20 matrix is a small amount m, except that module i, for 0 <= i <= 9, delivers a large amount M to module 19 - i.]

Consider the communication pattern given in Figure 4.9; we have d_s = d_r = M + 19m. Assume M is much larger than m. In this case, the amount of data to be delivered from module 9 to module 10 is so large as to delay the data delivery from modules 8, 7, ..., 1, 0 to module 10. Similarly, the amount of data to be delivered from module 8 to module 11 is so large as to delay the data delivery from modules 7, 6, ..., 1, 0 to module 11. Based on the communication pattern specified in Figure 4.9, during round 9 - j, for 9 >= j >= 0, the data delivery activity of module 0 will be delayed by module j for (M - m j) t_d, and by modules j - 1, j - 2, ..., 0 for m t_d each. This is illustrated in Figure 4.8. Thus, performing the data distribution specified by the matrix in Figure 4.9 takes at least p t_0 + 10 M t_d. This is significantly large compared with the lower bound of (M + 19m) t_d on the time for performing the communication specified by the matrix. Based on our model, when performing data distribution on a parallel machine with p modules, the communication time can be up to p t_0 + (p/2) M t_d, where M is the maximum amount of data delivered from module i to module j, for 0 <= i, j < p. Thus, if M = 100m and p = 100, then the time for performing data distribution can be up to 100 t_0 + 5000 m t_d.

A technique which simulates a 3-stage Clos network has been proposed in [60] to solve the module-access contention problem. In [60], the same amount of data is delivered among the modules, that is, d_r = d_s. Our technique is proposed to handle the case d_r != d_s.
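To ground the notation, the short NumPy sketch below rebuilds the pattern of Figure 4.9 and computes d_s, d_r, and L; the numeric values of M and m are invented.

```python
import numpy as np

# d[i][j] is the number of bytes module i sends to module j; in Figure 4.9
# one entry per row is a large message M, the rest are small messages m.
p, M, m_small = 20, 10_000, 10
d = np.full((p, p), m_small)
for i in range(10):
    d[i, 19 - i] = M                # the band of large messages in Figure 4.9

ds = d.sum(axis=1).max()            # max bytes any module must send
dr = d.sum(axis=0).max()            # max bytes any module must receive
L = max(ds, dr)
assert ds == dr == M + 19 * m_small
```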
In our technique, each source module delivers the data items which target the same module using one message. Assume that the data items which target the same module have been stored at consecutive memory locations. We refer to the amount of data carried by a message as its message-grain. The major steps of our algorithm are illustrated in Figure 4.10.

Algorithm Data_distribution_1;
begin
  Balancing message-grains among the modules;
  Performing barrier synchronization;
  Performing linear permutation;
end {Data_distribution_1}

Figure 4.10: Algorithm 1 for data distribution.

During the first step, each of the p modules evenly distributes its data items with the same target among the p modules. Thus, all the modules end up with an equal amount of data from each of the p modules. Note that in this step, memory to memory copy operations are required to rearrange the data items. This step can be performed in p t_0 + d_s (t_d + t_c). During the second step, a barrier synchronization is performed to ensure that the message-grains have been balanced. During the third step, the linear permutation algorithm is performed to deliver the data using messages of equal grains. Note that before the linear permutation algorithm can be used, memory to memory copy operations are required to store the data items which target the same module at consecutive memory locations. This step can be performed in p t_0 + d_r t_d + L t_c. Thus, our algorithm can be performed in at most

2 p t_0 + (d_s + d_r) t_d + (L + d_s) t_c + t_bs,

where t_bs is the time for performing a barrier synchronization. For parallel machines which provide fast barrier synchronization mechanisms, such as the Cray T3D (0.33 μsecs [31]) and the TMC CM-5 (5 μsecs [36]), the time for performing barrier synchronization can be ignored. A similar approach is used in [51] for solving the problem incurred by inserting p barrier points into the linear permutation algorithm; that approach uses 2p - 1 barrier points. For the IBM SP-2, a barrier synchronization takes 521 μsecs [10] for 64 nodes, which is significantly large compared with the CM-5 and the T3D. Thus, the number of barrier points inserted into algorithms designed to run on machines with large startup time, such as the IBM SP-2, should be as small as possible.
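The three steps of Algorithm Data_distribution_1 can be sketched compactly with mpi4py's object interface. The payloads, the divisibility assumption, and the use of the collective alltoall for the grain-balancing step are illustrative choices of ours, not the dissertation's implementation.

```python
from mpi4py import MPI

comm = MPI.COMm_WORLD if False else MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

# outgoing[j] is the (arbitrarily sized) byte block this module owes module
# j; block lengths are assumed divisible by p for brevity.
outgoing = [bytes((rank * p + j) % 7) * p for j in range(p)]

def slice_of(block, k):                  # the k-th of p equal slices
    q = len(block) // p
    return block[k * q:(k + 1) * q]

# Step 1: balance message grains -- spread every block over all p modules.
send = [[slice_of(outgoing[j], k) for j in range(p)] for k in range(p)]
got = comm.alltoall(send)                # got[i][j]: source i's slice for j

comm.Barrier()                           # Step 2: grains are now equal

# Step 3: linear permutation with equal-grain messages.
merged = [b"".join(got[i][j] for i in range(p)) for j in range(p)]
final = [None] * p
final[rank] = merged[rank]
for t in range(1, p):
    dest, src = (rank + t) % p, (rank - t) % p
    final[src] = comm.sendrecv(merged[dest], dest=dest, source=src)
```

After step 1, every module holds exactly one p-th of each original block, so the permutation in step 3 moves messages of identical size and no module's channel is oversubscribed.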
The above discussion focuses on delivering all data items with the same target using one long message, that is, performing communication using Cb. If Cs is used to distribute the data items, then a strategy should be designed for scheduling the huge number of short messages transported among the modules. Unlike the activity of delivering data from one module to another using a single message, any data delivery activity can be suspended by incoming short messages at any time. Thus, a strategy is needed to avoid possible module-access contention. Contention can be prevented by inserting 2p - 1 barrier points, but as we mentioned before, performing a barrier synchronization is very time consuming on a parallel machine without hardware support. By employing fine-grained scheduling, the number of barrier points can be significantly reduced. We assume that delivering data of one byte experiences the full startup time t'_0. In the following, we propose a scheduling policy for data delivery that avoids any possible module-access contention at the target modules. Denote by D_ij the set of data items stored at module i to be delivered to module j. Note that we assume p divides |D_ij|. Partition D_ij into p data subsets of equal size, denoted D_ij^0, D_ij^1, ..., D_ij^(p-1). The major steps used to distribute data among the modules are given in Figure 4.11.

Algorithm Data_distribution_2;
begin
  Balancing the number of bytes to be delivered to the same targets among the modules;
  Performing barrier synchronization;
  For l = 1 to d_r/p
    Performing the linear permutation algorithm on p bytes, each of which targets a different module;
end {Data_distribution_2}

Figure 4.11: Algorithm 2 for data distribution.

The data delivery policy used in the first step (for balancing the data items) is described as follows: module i delivers the t-th byte of D_ij^k, for all 0 <= k < p, during the time slots ((p - 1)(t - 1) + 1) t'_0, ((p - 1)(t - 1) + 2) t'_0, ..., (p - 1) t t'_0. The first step takes d_s t'_0 + t_d. During the third step, if bytes targeting p different modules cannot be found, the missing data items are replaced by pseudo bytes. The third step takes d_r t'_0 + t_d. Thus, the total time for performing data distribution is at most 2 t_d + (d_s + d_r) t'_0 + t_bs, where t_bs is the time for performing a barrier synchronization.

4.3.2 On the Overhead of Balancing Message-grains

The overhead of balancing message-grains among the modules includes the data rearrangement (which requires memory to memory copy operations) and the extra messages (p^2 messages) to be delivered among the p modules of a distributed memory machine. In order for our data distribution algorithm to benefit from the message-grain balancing process, the gain from the process should be larger than the overhead paid for it. The analyses made in [51, 59, 60] do not consider this. In the following, a closed form is derived by which machine users can judge the worth of applying the message-grain balancing process to data distribution. Several notations used in our closed form are defined as follows; Figure 4.12 illustrates their meanings.

• M_j, 0 <= j < p, is the maximum of {d_0j, d_1j, d_2j, ..., d_(p-1)j}.

• m_j, 0 <= j < p, is the minimum of {d_0j, d_1j, d_2j, ..., d_(p-1)j}.

• M_a is the maximum of {M_0, M_1, M_2, ..., M_(p-1)}.

• m_a is the maximum of {m_0, m_1, m_2, ..., m_(p-1)}.

• δ = max{Σ_j (d_ij - m_j) | 0 <= i < p}.

During the message-grain balancing process, only part of the data items which target the same module need to be balanced. The amount of data that needs to be balanced is bounded by the box indicated in Figure 4.13. Thus, module i needs to deliver at least Σ_j (d_ij - m_j) data items to the other modules. The time for performing this step is at least p t_0 + δ (t_d + t_c). After this step, the data items are rearranged. After the data rearrangement, all the data items can be moved to their target modules in at least p t_0 + p m_a t_d. The total overhead for message-grain balancing is thus at least

2 p t_0 + (δ + p m_a) t_d + δ t_c.    (4.6)

However, if we do not perform the message-grain balancing, the time for performing data distribution is at most

p t_0 + p M_a t_d.    (4.7)

Based on Equations (4.6) and (4.7), we have: the message-grain balancing process is not required if

(p M_a - p m_a) t_d <= p t_0 + δ (t_d + t_c).

For a parallel machine which has a separate network for performing global reduction operations (such as the CM-5), p M_a - p m_a and δ can be computed fast.
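The closed form is easy to evaluate once the pattern matrix is known; the NumPy sketch below (our code, with our parameter names) returns True only when balancing is expected to pay off.

```python
import numpy as np

def balancing_worthwhile(d, t0, td, tc):
    """Evaluate the closed form above for a pattern matrix d, where d[i, j]
    is the number of bytes module i delivers to module j: balancing pays
    off only if (p*Ma - p*ma) * td exceeds p*t0 + delta * (td + tc)."""
    p = d.shape[0]
    Mj = d.max(axis=0)                   # column maxima M_j
    mj = d.min(axis=0)                   # column minima m_j
    Ma, ma = Mj.max(), mj.max()          # M_a and m_a
    delta = (d - mj).sum(axis=1).max()   # max data any module must rebalance
    return (p * Ma - p * ma) * td > p * t0 + delta * (td + tc)
```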
[Figure 4.12: Notations used in the closed form: the column minima m_j and maxima M_j of the communication matrix, and the derived quantities m_a and M_a.]

[Figure 4.13: An illustration of the amount of data that needs to be balanced: in each row i, only the portion d_ij - m_j of each entry lies inside the box to be balanced.]

Thus, users of such a class of parallel machines can compute the values of p M_a - p m_a and δ first. Based on the computed values, the user of a parallel machine can judge the worth of performing message-grain balancing for a given communication pattern. In some cases, using information accumulated from previous activities among the modules, upper bounds on p M_a - p m_a and δ can be calculated locally.

In [59], a variant of our data distribution algorithm is proposed for performing data distribution on a machine with large startup time. The algorithm employs the message-grain balancing process three times. The total communication time is 5 sqrt(p) t_0 + 5L (t_d + t_c). Using the closed form we derived, it is obvious that the third message-grain balancing process is not necessary. This is because p M_a - p m_a and δ have the same upper bound; if we perform the balancing process, the communication time may not be reduced. Thus, performing data distribution on a parallel machine with large startup time can be achieved in at most 4 sqrt(p) t_0 + 4L (t_d + t_c). Compared with the time for performing data distribution in [59], the time saved by using fewer message-grain balancing processes is sqrt(p) t_0 + L (t_d + t_c).

4.4 Summary

In this chapter, we have shown that using an appropriate message size to deliver data items among modules can reduce the communication time. The choice of the message size is based on the communication attributes of a parallel machine and the communication patterns to be performed on it. Once the message size has been decided, the data items are delivered using a sequence of messages of that size. By overlapping the message construction operations with the message transmission operations, part of the communication activities are hidden. Thus, the total communication time is reduced.

Data distribution is widely used in many applications. The amounts of data in a module to be distributed to the other modules can be very different. Without dynamically scheduling the message delivery operations or employing smart message handlers, module-access contention can occur. In this chapter, a message-grain balancing process is investigated. The process is employed before a straightforward process for distributing data is performed. The balancing process needs extra memory to memory copy operations and extra data transmission operations, and may not be necessary for some communication patterns. To avoid the case in which the gain from the balancing process is less than the overhead paid for it, a closed form is given for judging the worth of performing the balancing process for a given communication pattern.

Chapter 5

Data-remapping

An integrated parallel system which is developed for solving a class of application problems on a parallel machine may consist of several major steps. In each of the major steps, the components of the parallel machine cooperate to execute a set of tasks.
In the integrated parallel system, the major steps need to be executed one by one, and the outputs of the previous steps may need to be used as the inputs of the next steps. When the parallel system is implemented on a distributed memory machine, the outputs of the previous steps may be stored in such a way that most of the data-accesses in the next step need to go over the interconnection network. That is, the data layout of the output causes inter-module data dependency when performing the required computations in the next step. The inter-module data dependency can be eliminated by partitioning the data in each of the modules into several data-groups such that there is little data dependency among the data-groups. Then, each of the data-groups is assigned to a module of the distributed memory machine. In this way, most of the data items which cause data dependency can be localized in the local memory of each of the modules. Thus, the overhead due to accessing data from other modules can be minimized.

In Section 5.1, previous work related to remapping data to minimize the parallel overhead is discussed, and a classification of data-remapping strategies is given. The time complexity of applying data-remapping to an application problem with fixed communication patterns is investigated in Section 5.2. In Section 5.3, a dynamic data-remapping technique is developed to solve the class of application problems in which the pattern of inter-module data dependency can only be known at run-time.

5.1 Background

The performance of a running large-scale parallel system may degrade due to (1) an increasing degree of inter-module data dependency, or (2) increasingly uneven computational load among the modules. Several research efforts concerning these issues can be found in [9, 10, 14, 20, 40, 61]. Their approaches focus on either eliminating the adverse effect of inter-module data dependency or balancing the computational load among the modules, but not both. In [10], a heuristic was used to reduce the adverse effect of inter-module data dependency in a major step of a parallel vision recognition system. In [61], a data-remapping technique was developed to reduce the degree of inter-module data dependency in a major step of a parallel system for detecting buildings in an aerial picture. The suitability of applying fixed data-remapping techniques to the class of application problems with fixed communication patterns is investigated in [14, 40]. In [9, 20], data-remapping techniques for image processing are proposed to balance the load among the modules.

Data-remapping can be classified into two categories: fixed (or static) data-remapping and dynamic data-remapping. In fixed data-remapping, the target modules to which the data should be moved during the remapping process can be determined at the beginning of the execution. It is suitable for the class of application problems in which all the communication patterns are known after an initial module-mapping strategy is applied. This class of application problems includes computing the Fast Fourier Transform (FFT), performing columnsort on a set of numbers, and computing all_pairs_shortest_paths.
In contrast to fixed data-remapping, dynamic data-remapping may need information about the global data layout to determine the target modules to which the data should be moved. The information needs to be gathered at run-time. To gather the information, each of the modules may need to access data from the other modules. Thus, fast data movement and an effective data re-layout policy are necessary for dynamically remapping data items among the modules.

5.2 Fixed Data-remapping

In this section, the time complexities of several approaches for computing the FFT are investigated to highlight the important factors which determine the effectiveness of employing a data-remapping strategy. The Fast Fourier Transform (FFT) computes the Discrete Fourier Transform (DFT) of an n-dimensional complex vector (x_0, x_1, ..., x_(n-1)). We assume p is a power of 2 and p^2 <= n. The butterfly algorithm [12] can be used to compute the FFT; Figure 5.1 shows a 16-point butterfly graph.

[Figure 5.1: An illustration of the 16-point FFT butterfly graph, with columns 0 through 4.]

The initial module-mapping for the butterfly algorithm can be a cyclic layout or a blocked layout. For the cyclic layout on a parallel machine with p modules, the FFT point f_i, 0 <= i < n, is stored at module (i mod p). The inter-module data dependency of the cyclic layout for computing the 16-point FFT butterfly is shown in Figure 5.2. Assume the size of a point is one byte. For the blocked layout, the FFT point f_i, 0 <= i < n, is stored at module floor(i/(n/p)). In [14], a combination of the cyclic layout and the blocked layout (denoted the hybrid layout) has been proposed to reduce the amount of remote data-access. The hybrid layout uses the cyclic layout for a period of time, after which the intermediate output is remapped to form the blocked layout. The inter-module data dependency of the hybrid layout for computing the 16-point FFT butterfly is illustrated in Figure 5.3.

[Figure 5.2: An illustration of the cyclic layout for the 16-point FFT on 4 modules.]

[Figure 5.3: An illustration of the hybrid layout: cyclic for the early butterfly columns, blocked after the remapping communication.]
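The three layouts are simply index maps from an FFT point to a module. The sketch below (our naming) states the rules; the column at which the hybrid layout remaps is left as a parameter, since its best value follows from the analysis below.

```python
def cyclic_module(i, n, p):
    """Cyclic layout: FFT point f_i is stored at module i mod p."""
    return i % p

def blocked_module(i, n, p):
    """Blocked layout: module k holds points k*(n/p) .. (k+1)*(n/p) - 1."""
    return i // (n // p)

def hybrid_module(i, n, p, col, remap_col):
    """Hybrid layout of [14]: cyclic for the butterfly columns before the
    remap, blocked afterwards. `remap_col` is an assumed parameter here."""
    return cyclic_module(i, n, p) if col < remap_col else blocked_module(i, n, p)
```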
5.2.1 On the Complexity of Performing Data-remapping

In this section, the factors affecting the time complexity of performing data-remapping are investigated. Assume parallel systems A and B are developed to solve an application problem. System B employs data-remapping but System A does not. In System B, the amount of remote data-access is less than that of System A. An intuition about these parallel systems is that the execution time of System B is shorter than that of System A. The intuition is correct for some classes of parallel machines, but it is not true for all parallel machines. Computing the FFT will be used to illustrate this. We choose the hybrid layout as our data-remapping strategy in computing the FFT. We will show that for a given problem size, a unique answer to whether the data-remapping strategy should be applied may not exist without looking into the communication features of a parallel machine. To illustrate that different communication features of a parallel machine may lead to different decisions, we consider the two types of communication mechanisms, Cs and Cb, defined in Section 3.3.

If the cyclic layout or the blocked layout is used, then each of the modules accesses (n log p)/p bytes of data from another module. If the hybrid layout is used, then each of the p modules accesses n/p bytes of data from the other p - 1 modules. Since the time for performing communication using Cs is in proportion to the amount of accessed remote data, it is obvious that the communication time required by the hybrid layout is less than that required by the blocked layout. Thus, if Cs is used to compute an n-point FFT, where n > p^2, then the hybrid layout is more desirable than the others.

However, if Cb is used for performing communication, then the situation is very different. Using Cb, data items are delivered using messages of various sizes, and there is a startup time associated with launching a message into the network. Thus, the time for delivering data using Cb depends on the number of different source-target pairs as well as on the amount of accessed remote data. Assume each of the modules delivers all data with the same target directly to that target module using one message. Then the maximal number of different source-target pairs gives a lower bound on the number of messages to be sent. During the computation of the FFT, there are log p source-target pairs per module if the cyclic layout is used, and there are p source-target pairs if the hybrid layout is used. Based on the above analysis, we derive that the communication time for computing the FFT using the cyclic layout is

log_2 p (t_0 + (n/p)(t_d + 2 t_c)).    (5.1)

In the hybrid layout, the communication pattern used in performing the data-remapping is a p-to-p communication. Although the amount of accessed remote data is reduced, the number of different source-target pairs increases. Based on this, we derive that the communication time for computing the FFT using the hybrid layout is

p t_0 + (n/p)(t_d + 2 t_c).    (5.2)

Based on Equations (5.1) and (5.2), the hybrid layout is not desirable if the input size n is less than p (p - log p) t_0 / ((log p - 1)(t_d + 2 t_c)) bytes. From the analyses made in this section, we know that the necessity of applying data-remapping depends on the features of the communication mechanisms provided by a parallel machine. The simplest case is that in which only Cs is provided by the parallel machine; in this case, the total amount of remote data-access reflects the communication time. In the other case, in which only Cb is used, both the amount of remote data-access and the complexity of the communication pattern in performing the data-remapping affect the effectiveness of the data-remapping.

5.2.2 On the Message Sizes for Performing Data-remapping

The time complexity derived in Section 5.2.1 implies that if the FFT problem has a large input size, then employing the hybrid layout always leads to a smaller communication time. Thus, if an FFT problem of large input size runs on a parallel machine which provides various types of communication mechanisms, then an appropriate mechanism may need to be chosen carefully for performing the data-remapping process. Two types of communication mechanisms (Cs and Cb) are investigated in this section. In the following, we show that the choice of communication mechanism should be based on the input size of the problem. Using the hybrid layout for computing an n-point FFT, each of the p modules accesses n/p bytes of remote data.
Thus, the time for remapping data using Cs in computing the FFT is given by

(n/p) t'_0 + t_d.    (5.3)

The time for remapping data using Cb for computing an n-point FFT on a parallel machine of p modules is given in Equation (5.2). The factor p associated with t_0 in Equation (5.2) can be reduced to sqrt(p) if the module-partition strategy of Section 4.2.2 is used; however, that strategy increases the amount of remote data-access. As the results in Section 5.2.1 show, the data-remapping process is required only for large input sizes, and a large input size implies that the amount of data transported among the modules is large. Thus, the module-partition strategy is not applied in this section when comparing against the time for performing data-remapping using Cs. Based on Equations (5.2) and (5.3), it follows that performing data-remapping using Cb leads to a smaller communication time than using Cs if

n > p (p t_0 - t_d) / (t'_0 - t_d - 2 t_c).

Several observations can be made from the above equations. As we mentioned, t_0 is usually larger than t'_0 due to the overhead of buffer management. In addition, under current technology, some communication media have a very high data transmission rate. For computing the FFT in this computational environment (that is, t_0 > t'_0 with t'_0 much larger than t_d and t_c), we have the following result: the mechanism to be applied is Cb if n > p^2 t_0/t'_0, and Cs if p^2 <= n <= p^2 t_0/t'_0. Another interesting observation can be made for a parallel machine without enough physical memory space, which also implies that the input size is very large. In this case, some of the data items must be stored on a slow device such as a disk, and the measured t_c can be large due to the data swapping between the disk and physical memory. If the measured t_c is larger than (t'_0 - t_d)/2, then Cs is the preferable communication mechanism for performing data-remapping. Based on the above analyses, if an application is run on a parallel machine with bounded memory size, then mechanism Cb may be good for a range of problem sizes; beyond or below that range, mechanism Cs can lead to a shorter communication time.
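The resulting decision rule can be written down directly; this is a sketch with our parameter names (t0p stands for t'_0), and the disk-swapping case is handled through the sign of the denominator.

```python
def remap_mechanism(n, p, t0, t0p, td, tc):
    """Pick Cs or Cb for the hybrid-layout remap, per the analysis above:
    Cb wins once n exceeds p*(p*t0 - td) / (t0' - td - 2*tc); Cs wins
    below that point, and whenever t0' - td - 2*tc <= 0 (for example,
    when tc is inflated by swapping between disk and physical memory)."""
    margin = t0p - td - 2 * tc
    if margin <= 0:
        return "Cs"
    return "Cb" if n > p * (p * t0 - td) / margin else "Cs"
```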
5.3 Dynamic Data-remapping

In general, the intermediate-level image understanding (IU) tasks in a vision system operate on clusters of marked pixels (which are the output of low-level analysis) to extract information for high-level analysis [46]. When the system is implemented on a distributed memory machine, the intermediate-level IU tasks may involve arbitrary data distribution, dependency among computations performed in various modules, and frequent irregular inter-module communications. An example is scanning the elements of the contours in an image in parallel; this scanning process is defined to be contour-tracing. The parallel contour-tracing problem captures the common computational features of the intermediate-level IU tasks in a parallel vision system. In this section, a technique for performing data-remapping is developed for the contour-tracing problem. Our data-remapping technique is developed to eliminate the inter-module data dependency and to balance the load among the modules. After the data items have been remapped, the contours can be traced locally in each of the modules. For comparing the performance of our data-remapping technique, a parallel algorithm for in-place contour-tracing is also described. A worst case is considered for analyzing the time complexity of tracing contours with data-remapping and without data-remapping. In general, the worst case generates a larger amount of data to be communicated and a higher degree of inter-module data dependency than the other cases. Since Cb usually performs well for delivering large amounts of data, it is used in analyzing the time complexity of performing the communications.

5.3.1 Definitions and Notations

In general, an n x n image is partitioned into p subimages (of size n/sqrt(p) x n/sqrt(p)) for mapping data to a parallel machine with p modules. An image of size 128 x 128 is shown in Figure 5.4. Assume the subimages are numbered in row-major order. Then, in our module-mapping, subimage i is assigned to module i for performing low-level operations. Each of the p modules is responsible for processing one of the subimages. Using this module-mapping, each of the subimages is surrounded by at most eight subimages (denoted the neighboring subimages). In each module, contour-elements are detected by marking pixels. Contour-segments are formed by linking the marked pixels in the modules. The image of size 128 x 128 in which the marked pixels are linked is shown in Figure 5.5. Information such as the length of a segment and the module which contains the immediate successor of the segment can be easily computed during the linking step. All this information is useful in performing data-remapping.

[Figure 5.4: A 128 by 128 raw image.]

[Figure 5.5: A 128 by 128 image with marked pixels.]

A contour i can be represented by an ordered set S_i of contour pixels. At each contour pixel, properties of the contour-pixel (such as its coordinates, intensity, etc.) are stored. In performing contour-tracing, the contour pixels are scanned one by one following the order in S_i. The length of contour S_i, denoted |S_i|, is defined as the number of elements in the set. A global contour denotes a contour which is embedded in at least two modules. A local contour refers to a contour which is completely embedded in one module. The input to the contour-tracing problem is a collection of contours. The output of tracing S_i is {u_i(0), u_i(1), ..., u_i(|S_i| - 1)}. At any current element, the output is computed based on the output already extracted for the element immediately preceding the current element and on the attributes of the current element. That is, u_i(j) depends on u_i(j - 1) and the data stored at position j in S_i, for 1 <= j <= |S_i| - 1. We assume that:

• it takes t time to compute the output for any contour-element,

• a global contour is segmented into at most sqrt(p) segments, and

• one byte is enough to store any information regarding a pixel.

The last assumption will be used for calculating the communication time.
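The dependence of u_i(j) on u_i(j - 1) is what forces sequential scanning within a contour. A minimal sketch of the recurrence follows; the per-element function and the head output are placeholders.

```python
def trace_contour(S, f, u0):
    """Scan one contour: u_i(j) depends only on u_i(j-1) and the attributes
    stored at position j of S_i, exactly as defined above. `f` is the
    application-specific per-element function; `u0` is the output for the
    head element. Both are hypothetical stand-ins."""
    u = [u0]
    for j in range(1, len(S)):
        u.append(f(u[j - 1], S[j]))
    return u

# Example: count the pixels scanned so far along a short contour.
outputs = trace_contour([(0, 0), (1, 0), (1, 1)], lambda prev, pix: prev + 1, 1)
assert outputs == [1, 2, 3]
```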
5.3.2 In-Place Computations

Contour-tracing can be performed directly without redistributing the contour-elements. We refer to this approach as in-place contour-tracing. The in-place algorithm can be performed in a synchronous or an asynchronous fashion. In the synchronous algorithm, the modules synchronously alternate between computation and communication. Inserting a barrier synchronization between the computation and communication phases can delay the computational activity of some modules even though that activity could proceed. This delay effect can be eliminated if no barrier synchronization point is inserted between the computation and the communication phases. Using the asynchronous algorithm, each module communicates with its neighboring modules as soon as it finishes its computation phase. Our in-place contour-tracing is developed to reduce this computational delay effect.

Let ready contour-segment denote a segment having the information needed to start the tracing process. The asynchronous contour-tracing algorithm operates as follows: each of the p modules alternates between its own computation and communication phases; that is, different modules may be in different phases at any instant. During the computation phase, module i, 0 <= i < p, scans the ready contour-segments in its module. After all of its ready contour-segments have been scanned, module i communicates with its eight neighboring modules. Then, module i proceeds to scan the contour-segments which have recently become ready. The total number of messages does not increase too much, since communication is performed only after all the ready contour-segments have been completely traced. Our algorithm is developed to eliminate the delay effect caused by irrelevant modules. By eliminating this effect, the contours entirely embedded in one module-group are traced independently of the contours entirely embedded in another module-group. This feature is very suitable for performing contour-tracing on clusters of workstations for some types of images, for example images in which the major objects are far from each other. For clusters of workstations, resource sharing is very important: a module-group which finishes tracing its contours can release its computing power to other users without waiting until all the other contours have been traced in the other modules.

The time complexity of performing contour-tracing depends on the input image. In the worst case, the algorithm takes 8 t_0 sqrt(p) + 4n (2 t_c + t_d) communication time in addition to the computation time. The advantage of this algorithm is that the contour-tracing task can take its input directly from the output of the previous task without redistributing the data. The disadvantage of the algorithm is that the number of contour-elements assigned to the modules may be unbalanced. Even if an equal number of contour-elements is stored in each module, inter-module data dependencies can still severely degrade the performance of the parallel machine.
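The asynchronous loop can be sketched as follows with mpi4py; no module waits at a global barrier, and all six arguments are hypothetical application hooks rather than anything specified in the dissertation.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD

def in_place_tracing(ready_segments, trace, neighbors, pack, absorb, done):
    """Asynchronous in-place tracing: each module alternates between its
    own computation and communication phases, so different modules may be
    in different phases at any instant."""
    while not done():
        for seg in ready_segments():         # computation phase
            trace(seg)                       # may make further segments ready
        for nb in neighbors:                 # talk to the (<= 8) neighbors
            comm.isend(pack(nb), dest=nb)
        while comm.iprobe(source=MPI.ANY_SOURCE):
            absorb(comm.recv(source=MPI.ANY_SOURCE))  # may ready new segments
```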
The time complexity of performing contour-tracing depends on the input image. In the worst case, the algorithm takes 8t_o√p + 4n(2t_c + t_d) communication time. The advantage of this algorithm is that the contour-tracing task can take its input directly from the output of the previous task without redistributing the data. The disadvantage of the algorithm is that the number of contour-elements assigned to the modules may be unbalanced. Even if an equal number of contour-elements is stored in each module, inter-module data dependencies can still severely degrade the performance of the parallel machine.

5.3.3 Computations with Data-Remapping

In this section, a parallel contour-tracing algorithm which employs a data-remapping process is developed. After the data-remapping process has been performed, the elements of a contour are localized in a module. Thus, remote data-access is not required while tracing the localized contours. Denote S = Σ_i |S_i|. The strategy of almost even load is also used in the data-remapping process. Almost even load means that either a module stores at most ⌈S/p⌉ contour-elements, or a module stores exactly one contour which consists of more than S/p contour-elements. The data-remapping process has three major steps, as shown in Figure 5.6. Details of the major steps are discussed below.

Algorithm Contour-tracing
begin
Label Clusters;
Move Data;
Trace Localized Contours;
end {Contour-tracing}

Figure 5.6: Major steps in performing data-remapping.

The step 'label clusters' assigns a label l, 0 ≤ l < p, to each of the contours. Based on the assigned labels, the contours are redistributed among the modules during the step 'move data'. The output of module i in the 'label clusters' step is p sets of contour-elements, where set j, for 0 ≤ j < p, stores the contour-elements to be delivered from module i to module j. Denote the weight of a cluster (or contour) to be the amount of computation required to trace the cluster (or contour). The 'label clusters' step includes (1) calculating the weight of each of the clusters, and (2) determining the target modules for all the contours. Figure 5.7 shows the major steps in 'label clusters'.

Algorithm Label Clusters
begin
Calculate the weight of each of the contours;
Assign a label to each of the contours;
Propagate the label of a contour to all the segments of the contour;
end {Label Clusters}

Figure 5.7: Major steps in performing cluster-labeling.

While performing the weight calculation (label propagation), each of the modules needs to gather (multicast) data from (to) the other modules. The communication patterns depend on the approaches used to gather data and to multicast information. Two approaches can be used for these purposes: divide-and-conquer and parallel segment-tracing.

By using divide-and-conquer, the weight calculation consists of log √p steps. During step j, 0 ≤ j ≤ log √p, each block of modules cooperates to get the weights or partial weights of the contours. During step j, there are p/4^j blocks; a block consists of 2^j x 2^j modules. Denote the module at the left-bottom corner of a block as the header of the block. Accordingly, the blocks and their headers have different configurations at different steps. The weight calculation operates as follows. During step j, for 1 ≤ j ≤ log √p, the header of each of the current blocks gathers partial weights from the four headers which are defined for step j - 1 and enclosed in the current block. This approach is illustrated in Figure 5.8. It is obvious that this approach does not depend on the number of segments into which a contour is cut. The communication time for performing this process is 4t_o log p + 4n(t_d + 2t_c). The divide-and-conquer process can operate in reverse to propagate the labels, and the communication time is the same as that for performing the weight calculation.

Figure 5.8: An illustration of the divide-and-conquer approach (steps 0 through 3).
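The block and header structure used at each step can be computed locally from the module index. The fragment below is a sketch under two assumptions: module ranks follow the row-major numbering of the √p x √p module grid, and the header is normalized to the corner with the smallest row and column indices (under a bottom-up row numbering this is the left-bottom corner of the description above).

/* At step j, blocks are 2^j x 2^j modules; side = sqrt(p). */
int is_header(int rank, int side, int j)
{
    int row = rank / side, col = rank % side;
    return (row % (1 << j)) == 0 && (col % (1 << j)) == 0;
}

/* Rank of one of the four step-(j-1) headers enclosed in the block of
 * a step-j header; (dy, dx) ranges over {0, 1} x {0, 1}. */
int subheader(int rank, int side, int j, int dy, int dx)
{
    int half = 1 << (j - 1);
    return (rank / side + dy * half) * side + (rank % side + dx * half);
}

At step j, a header gathers the partial weights from subheader(rank, side, j, dy, dx) for the four (dy, dx) pairs, which is the gather pattern sketched in Figure 5.8.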
Another approach is parallel segment-tracing. In this approach, the segments of each of the contours are scanned one by one. Note that the weights of all the segments have been calculated before the segment-tracing is performed. Thus, the communication time depends only on the maximum number of segments into which a contour is cut; it does not depend on the number of contour-elements. If this maximum is small, then employing parallel segment-tracing is a good choice. To illustrate our idea, an example of data-remapping which employs parallel segment-tracing is shown in Figure 5.9. In addition, the almost even load strategy is also explained using the same figure. In the example, the contours are embedded in a parallel machine with four modules.

After parallel segment-tracing has been performed, the weights of all the contours are stored at the modules which contain the tails of the contours (see (B) in Figure 5.9). After the weights of all the clusters have been calculated, the sum of the weights in each of the modules is calculated (see (C) in Figure 5.9). To label the clusters, the sum in each of the modules is broadcast to all the other modules for determining the target modules to which the contours will be redistributed. Assume the sum of the weights in module i is W_i, and the weight of cluster j in module i is w_ij. The sums in the modules can be broadcast using all-to-all communication (see (D) in Figure 5.9). Module i determines the labels for the clusters S_i0, S_i1, ..., S_ij which have their tails in module i based on the weights W_0, W_1, ..., w_i0, w_i1, ..., w_ij, ..., W_{p-1}. This can be done concurrently at the modules. To achieve balanced load, we use the almost even load strategy to determine the target modules for all the contours. In this strategy, each of the modules partitions W_0, W_1, ..., w_i0, w_i1, ..., W_{p-1} into p groups such that the amount of computation to be performed on each group is almost equal. Then, the inter-module data dependencies are eliminated by assigning all the elements of a contour to one target module. The assignment operates as follows: if the weight of contour S_i falls in group j under this partition, then the label for contour S_i is assigned j (see (E) in Figure 5.9). The label of a global contour needs to be propagated to the modules which store the segments of the global contour (see (F) in Figure 5.9). Parallel segment-tracing is also applied, in reverse, to propagate the labels. After the label propagation process has been performed, all-to-all communication is performed to move the data among the modules (see (G) in Figure 5.9).

Figure 5.9: An example of labeling clusters. (A): Initial state. (B): Weight calculation. (C): Sum of local weights. (D): Weight collection. (E): Labeling operations in the module located at the left-bottom corner. (F): Label propagation. (G): After moving data.

The label assignment is performed in a decentralized way. In the worst case, each module can contain segments of global contours. The weights of the clusters can be calculated in at most O(n) computation time and 8t_o√p + 4n(t_d + 2t_c) communication time. Since there are at most n²/(2p) clusters in a module, the sum of the weights in a module can be computed in O(n²/p) computation time. Collecting the sums from all the other modules can be performed in p·t_o + p·t_d communication time. Label assignment can be performed locally in O(p + n²/p) computation time. The communication time for performing the label propagation is at most 8t_o√p + 4n(t_d + 2t_c). After the label of each of the contours has been propagated to the modules which contain the segments of the contour, the data movement can be performed in at most 2p·t_o + 4n(2t_c + t_d) communication time using the technique developed in Section 4.3.1. Therefore, the total execution time for p < n can be asymptotically expressed as follows: the computation time is O(n + n²/p + p) = O(n²/p) and the communication time is O(p + n) = O(n).
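One way to realize the almost even load strategy above is a greedy prefix-sum sweep over the weight sequence. The C sketch below assigns each cluster the label of the group in which the sweep currently stands; the handling of a cluster that straddles a group boundary is an assumption here, since the exact rule depends on which pixel of the contour is taken as its representative.

/* Partition the cluster weights w[0..m-1] into p groups of roughly
 * equal total weight and record a target-module label per cluster.
 * A cluster heavier than the quota forces the sweep into the next
 * group afterwards, approximating the almost-even-load rule. */
void assign_labels(const long *w, int m, int p, int *label)
{
    long total = 0;
    for (int i = 0; i < m; i++)
        total += w[i];
    long quota = (total + p - 1) / p;   /* ceil of total weight / p */

    long acc = 0;
    int group = 0;
    for (int i = 0; i < m; i++) {
        label[i] = group;               /* whole cluster goes to one module */
        acc += w[i];
        if (acc >= quota && group < p - 1) {
            group++;                    /* start filling the next group */
            acc = 0;
        }
    }
}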
The parallel in-place algorithm also has O(n) communication time; our data-remapping strategy can therefore lead to a time complexity of lower order if the length of the longest contour is O(n²/p).

According to the definition of the contour-tracing problem, the elements of every contour need to be scanned one by one. Thus, the lower bound on the time for tracing contours on a parallel machine is equal to the time for tracing the longest contour S_0 on a serial machine. We show that after redistributing the contours, the contours can be traced in minimal time or with asymptotic processor-time optimality. Our data-remapping technique is designed to eliminate interprocessor data dependency as well as to achieve load balance. After data-remapping has been performed, each of the modules has at most ⌈S/p⌉ contour-elements or one long contour with length larger than S/p. Thus, the time for performing contour-tracing is at most max{⌈S/p⌉, |S_0|} x t. It is easy to see that tracing the contours results in (1) processor-time optimality, if S/p ≥ |S_0|, or (2) minimal running time, if S/p < |S_0|.

The merit of contour-tracing with data-remapping is that no communication is needed among the modules after the data-remapping process has been performed. Thus, each of the modules can perform contour-tracing independently. The shortcoming of this scheme is that there may be a high variance in the lengths of the contours. Thus, the number of contour-elements assigned to a module can be very large. In the worst case, some of the modules may not be able to store the elements assigned to them due to limited local memory space. This shortcoming can be overcome by distributing the contour-elements equally among the modules. To avoid excessive communication, the contour-elements then need to be carefully assigned to the modules. In addition, scheduling of the computations is needed to minimize the adverse effect of the possible inter-module data dependencies.

5.3.4 Implementation Details and Experimental Results

The algorithms were implemented on an SP-2 with a dedicated pool of 64 modules at the Maui High Performance Computing Center. We compare our algorithms using various sizes of images (128 x 128, 256 x 256, 512 x 512, 1K x 1K). Each of the images was run on configurations of 4, 8, 16, 32, and 64 modules. The codes were written using C and the MPL message passing library. With MPL, the startup time for communicating a message is 40 μsec and the peak data transmission rate is 35.54 Mbytes/sec.

We did not target any specific linear approximation algorithm. Thus, a dummy loop was used to simulate the computation time t for comparing our algorithms. Note that t denotes the amount of computation needed by a module to extract information from a contour-element. By changing the number of iterations of the loop, t will be different. t (in μsec) can be calibrated using the following formula: t = 14 + 0.104 x number of loop iterations.
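A minimal version of such a calibrated dummy loop is sketched below in C; the volatile sink is an assumption added to keep a compiler from optimizing the loop away, and the constants come directly from the calibration formula above.

/* Simulate t microseconds of per-element work:
 *   t = 14 + 0.104 * iterations  =>  iterations = (t - 14) / 0.104 */
static volatile long sink;

void simulate_work(double t_usec)
{
    long iters = (long)((t_usec - 14.0) / 0.104);
    for (long k = 0; k < iters; k++)
        sink = k;   /* dummy body; only the elapsed time matters */
}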
At the time of this writing, data items are redistributed without employing message-grain balancing. We denote this as one-phase data redistribution. Data movement that employs message-grain balancing is denoted two-phase data redistribution. To make a decision on the choice between one-phase and two-phase, experiments were conducted on large image sizes. Our results show that the one-phase algorithm is better than the two-phase algorithm; Figure 5.10 illustrates this. There are several reasons why one-phase is the better choice:

• only a small dedicated pool of modules (≤ 64) was available during the experiments, and
• a small variance in the sizes of the messages was observed, due to the features of the tested images.

Figure 5.10: Communication times for moving data using various approaches (one-phase and two-phase, for images of size 128 x 128 through 1K x 1K over various numbers of processors).
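For reference, the one-phase redistribution amounts to a single all-to-all exchange of the labeled contour-elements. Our implementation used MPL; the sketch below expresses the same pattern with MPI (discussed in Chapter 6) purely as an illustration, and assumes the per-destination byte counts have already been produced by the labeling step.

#include <mpi.h>
#include <stdlib.h>

/* One-phase move: each module ships every contour-element directly to
 * its target module in one collective exchange (one byte per pixel
 * datum, per the assumption in Section 5.3.1). */
void one_phase_move(const char *sendbuf, const int *sendcnt,
                    char *recvbuf, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);

    int *recvcnt = malloc(p * sizeof *recvcnt);
    int *sdispl  = malloc(p * sizeof *sdispl);
    int *rdispl  = malloc(p * sizeof *rdispl);

    /* First tell every module how many bytes to expect from us. */
    MPI_Alltoall((void *)sendcnt, 1, MPI_INT, recvcnt, 1, MPI_INT, comm);

    sdispl[0] = rdispl[0] = 0;
    for (int j = 1; j < p; j++) {
        sdispl[j] = sdispl[j - 1] + sendcnt[j - 1];
        rdispl[j] = rdispl[j - 1] + recvcnt[j - 1];
    }
    MPI_Alltoallv((void *)sendbuf, (int *)sendcnt, sdispl, MPI_CHAR,
                  recvbuf, recvcnt, rdispl, MPI_CHAR, comm);

    free(recvcnt); free(sdispl); free(rdispl);
}

A two-phase variant would insert a message-grain balancing step before the exchange, trading an extra round of communication for more uniform message sizes.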
Denote the algorithm which traces contours without redistributing data as Algorithm A, and the algorithm which traces contours employing data-remapping as Algorithm B. Comparisons of the performance of Algorithm A and Algorithm B are shown in Figures 5.11, 5.12, 5.13, and 5.14. For image sizes 128 x 128 (Figure 5.11), 256 x 256 (Figure 5.12), and 512 x 512 (Figure 5.13), Algorithm B always takes less time than Algorithm A (regardless of the size of the parallel machine and the number of loop iterations). However, there is a cross-over point between the two algorithms when the 1K x 1K image is used on the SP-2 with 4 modules; Figure 5.14 illustrates this. The cross-over point between the algorithms was observed at 200 iterations of the dummy loop. Compared with Algorithm A, Algorithm B is faster if the number of iterations is large. That is, if t is large, the gain from the data-remapping is larger than the overhead incurred by the data-remapping. This implies that the computational load of a module can amplify the adverse effect of inter-module data dependencies. We believe that the total execution time of Algorithm A can be reduced if load balancing is performed based on the local contours. The reason is that by redistributing the local contours, no extra communication is introduced and the maximal computational load among the modules is reduced.

Figure 5.11: Execution times for a 128 x 128 image.
Figure 5.12: Execution times for a 256 x 256 image.
Figure 5.13: Execution times for a 512 x 512 image.
Figure 5.14: Execution times for a 1024 x 1024 image.

In our experiments, as the number of modules increases, the speed-up of Algorithm A decreases more rapidly than that of Algorithm B. In Algorithm A, communication is performed after all the current ready-segments (of local contours and global contours) have been scanned. This approach delays the processing of the successor segments of the global contours. The delay accumulates and becomes severe as the amount of inter-module data dependency increases. If the number of modules increases, the inter-module data dependencies also increase. Thus, Algorithm A performs worse than Algorithm B when a large number of modules is used. This also explains why Algorithm A performs worse than Algorithm B for the smaller images (of size 128 x 128, 256 x 256, and 512 x 512).

In our experiments, we found that the data dependency between modules is the most important factor affecting the speed-ups for this kind of problem. This can be illustrated using our experimental data. Consider the case of tracing contours in a 512 x 512 image using Algorithm B on the SP-2 with 4 modules. After the data-remapping process had been performed, the observed numbers of contour-elements in the modules were 9638, 9657, 9559, and 9610. The loads of the modules before remapping the data were 9288, 10245, 9002, and 9929. The loads among the modules are not significantly different. However, the execution time of Algorithm B is always smaller than that of Algorithm A (see Figure 5.13). This is because the elements of the global contours (which were observed to be 8% of the total contour-elements) have been localized after the data-remapping process has been performed. This implies that balancing the loads among the modules cannot by itself guarantee large speed-ups.

Consider another case: tracing contours in a 1K x 1K image using Algorithm B on the SP-2 with 4 processors. In this test image, it was observed that only 1% of the contour-elements belong to global contours, a small fraction compared with the previous case. After the data-remapping, the numbers of contour-elements in the modules were observed to be 6057, 5842, 5892, and 5927. The load is balanced compared with the initial distribution, which was 11277, 12441, 0, and 0. In this case, Algorithm A can provide faster execution than Algorithm B only if the amount of computation performed at a contour-element is very small (see Figure 5.14). Thus, data-remapping may not be necessary merely for load redistribution.

From the two cases, it seems that unbalanced load is not the major issue in obtaining large speed-ups for this application; the inter-module data dependency is the major adverse factor. This adverse factor can be amplified if the load on the modules becomes large or the load is imbalanced.

5.4 Summary

We conclude this chapter with the following statements. In this chapter, methodologies for performing fixed data-remapping and dynamic data-remapping were investigated. For fixed data-remapping, we show that not only the amount of data to be redistributed but also the complexity of the communication patterns can affect the effectiveness of a data-remapping strategy. For dynamic data-remapping, two strategies for performing contour-tracing were investigated. The first strategy traces contours without performing a data-remapping process. The second strategy traces contours after the data items have been remapped.
The data-remapping process is applied to eliminate inter-module data dependencies as well as to balance the loads among the modules. Our results show that data dependency is the major adverse factor in tracing contours. Even if the number of global contours is small, the adverse effect becomes large if the computational loads of the modules become heavy or the loads among the modules are imbalanced. Thus, we suggest that, in performing intermediate-level IU tasks on distributed memory machines, one should first eliminate the inter-module data dependencies and then try to balance the loads among the modules.

Chapter 6

Conclusion

The computing power of conventional serial computers has steadily increased to match the computational requirements of applications. However, the computing power of a serial computer is ultimately limited by the speed of light. Due to the increasing complexity of emerging applications, the performance of serial computers may not satisfy the time constraints imposed by the applications. Thus, it is natural to employ an ensemble of processors, i.e., a parallel machine, to meet the computational requirements. Parallel processing has made a tremendous impact on many applications. To achieve large speed-ups in solving application problems on a parallel machine, novel approaches to effectively utilize the computing power of parallel machines are needed. In this chapter, the contributions of the dissertation are identified and some plausible directions for future research are outlined.

6.1 Contributions of this Dissertation

Currently, many vendors are offering distributed memory parallel machines. These machines are formed by a collection of essentially processor/memory modules combined by an interconnection network. Accessing data in a remote memory is performed over the interconnection network. The time for a module to access data in a remote memory is much longer than the time for the module to access data in its local memory. In general, remote data-access is needed for the modules of a parallel machine to exchange data. To reduce the overhead of data exchange, the amount of inter-module data-access should be minimized.

In this dissertation, we have proposed a structure-independent model to capture the features of current distributed memory machines. These features are important to the users of these machines for designing fast algorithms or developing applications which meet their timing requirements. Our structure-independent model is realistic and robust. On such parallel machines, high performance computing can be achieved by localizing most of the data-accesses. In this dissertation, several techniques are proposed to map data onto the parallel machines such that most of the data-accesses are to local memory. The impact of these techniques is also investigated based on our model. The techniques we developed for performing high performance computing on distributed memory machines include (1) strategies for mapping input data items to modules and to memory locations, (2) approaches for reducing communication latency by partially overlapping communication activities and by balancing message-grains, and (3) techniques for fixed data-remapping and dynamic data-remapping.
The technique developed for dynamic data-remapping is applied to applications which have arbitrary data distribution and dependencies among the computations to be performed in the various modules. To show the usefulness of our approach, experiments were conducted on an IBM SP-2 with a dedicated pool of 64 modules. Section 1.3 provides a detailed summary of the research contributions made in this dissertation.

6.2 Future Directions

Parallelism for many large-scale applications has generated significant interest in the recent past. Although we have addressed several key issues in reducing communication costs for designing algorithms or developing applications on distributed memory machines, a thorough study is required to understand the inherent problems in implementing an integrated parallel system for large-scale applications. Various future research avenues have emerged from this dissertation.

• In our model, we abstract the communication facility of a parallel machine into point-to-point channels and message handlers. One of the important roles of a message handler is as a scheduler for receiving incoming messages. The message handler switches among the channels according to its message reception policy to receive the data items delivered through the channels from the memory in other modules. Currently, the message reception policy used in our model to analyze the communication time is first-come-first-served. However, we believe that for certain communication patterns, different message reception policies can lead to very different communication costs. For example, compared with first-come-first-served, scheduling message reception using shortest-message-first or round-robin may be a better policy for transporting data whose amounts vary widely. By investigating various message reception policies for the message handlers, machine designers can decide what kinds of policies should be provided in a distributed memory machine to suit the requirements of various applications. This research is also useful for algorithm designers or compiler developers in choosing an appropriate reception policy (protocol) from those provided by a distributed memory machine.

• Due to their lower price/performance ratio compared with MPPs, clusters of workstations will become a major resource for performing parallel and distributed computations in the future. For evaluating the performance of a given algorithm on clusters of workstations, the architectural features of network computing need to be investigated. In general, the number of links and switches used by a cluster of workstations may be smaller than in an MPP with the same number of processor/memory modules. Thus, the total available communication bandwidth provided by the network may not be large enough for all the modules to perform remote data-access at the same time, even when the communication pattern does not cause any module-access contention at the target modules. Examples are workstations connected by Ethernet or FDDI, in which all data traffic shares a single physical medium. In addition, larger packets can be used in the network to transport data between any pair of processor/memory modules to reduce the communication time. An example is the packet format used by Myrinet, in which the payload of a Myrinet packet is of arbitrary length [7].
In tu itively, transporting d a ta using a large packet is p rone to suffering from d a ta contention in th e netw ork com pared w ith th at using small packets. Thus, 108 th e users of the clusters of w orkstations may need to pay atten tio n to the d ata contention in th e network in developing th e ir applications. For a bad com m unication p a tte rn to clusters of w orkstations, the tim e for perform ing the p attern can be very long due to the d ata contention in th e netw ork. To reduce the com m unication tim e, users of such m achines may be needed to use several effective com m unication p attern s to carry o ut the bad com m unication pattern . • An analytical tool is required for designing algorithm s which effectively over lap com putations and com m unications. T he purpose of th e tool is to ease th e developm ent and analyses of th e algorithm s. The m ethodologies and notations used by the tool should be concise and consistent. C urrently, al gorithm designers use asym ptotic expression to represent com putation tim e and exact expression to represent com m unication tim e. Due to th e inconsis ten t representation, it is very difficult to see to w h at degree th e com putation activities can overlap w ith the com m unication activities. • O ur data-rem apping technology has been developed to handle inter-m odule d ata dependency and im balanced load among th e modules. We have inves tigated th e linear d ata dependency am ong the subtasks. For som e applica tions, the d a ta dependency am ong th e subtasks can be m ore com plex. In this case, a graph can be used to describe the d a ta dependency am ong th e subtasks. To investigate the effectiveness of our data-rem apping technique on com plex d ata dependency, we will develop a program w hich random ly generates th e load for th e m odules and the d a ta dependency am ong the m odules, and will apply our technique to the generated task. Furtherm ore, various data-rem apping techniques will be proposed for running applications on distributed m em ory m achines. T h e effectiveness of applying these tech niques on distributed m em ory m achines will be discussed and th e suitability of these techniques will be classified according to th e inherent n atu res of th e applications. 109 • Message passing is a paradigm used widely on certain classes of parallel machines. M PI (M essage Passing Interface) is the new stan d ard for m u lti com puters introduced by th e M essage-Passing Interface Forum (M P IF ) in April 1994 [44]. The goal of M PIF is to develop a widely used stan d ard for w riting m essage-passing program s. T he m ain advantages of establishing a m essage-passing standard are portability and ease-of-use. T he perform ance of the M PI highly depends on their im plem entations. C urrently, the perfor m ance of th e M PI available on several com m ercial m achines is not good. To suit the requirem ent of high perform ance com putation, th e com m unication prim itives of th e M PI need to be carefully im plem ented. In th e future, we will investigate th e M PI com m unication prim itives and explore th e possibility of designing fast com m unication prim itives for th e M PI. 110 B ibliography [1] A. V. Aho and J. E. Hopcroft and J. D. Ullm an, D ata Structures and A l gorithm s, C om puter Science and Inform ation Processing, Addison-W esley, Reading, MA, 1983. [2] H. M. Alnuweiri and V. K. 
[2] H. M. Alnuweiri and V. K. Prasanna, "Parallel Architectures and Algorithms for Image Component Labelling," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 14, No. 10, pp. 1014-1034, 1992.

[3] H. Aoyama and M. Kawagoe, "A Piecewise Linear Approximation Method Preserving Visual Feature Points of Original Figures," CVGIP: Graphical Models and Image Processing, Vol. 53, No. 5, pp. 435-446, 1991.

[4] D. Ballard, "Generalizing the Hough Transform to Detect Arbitrary Shapes," Pattern Recognition, Vol. 13, pp. 111-122, 1981.

[5] A. Bar-Noy and S. Kipnis, "Designing Broadcasting Algorithms in the Postal Model for Message-Passing Systems," in Proceedings of the 4th ACM Symposium on Parallel Algorithms and Architectures, pp. 13-22, 1992.

[6] Gordon Bell, "ULTRACOMPUTERS: A Teraflop Before Its Time," Communications of the ACM, Vol. 35, No. 8, August 1992.

[7] N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, and W. Su, "Myrinet: A Gigabit-per-Second Local Area Network," IEEE Micro, pp. 29-39, February 1995.

[8] S. H. Bokhari, "Complete Exchange on the iPSC/860," ICASE Technical Report, No. 91-4, NASA Langley Research Center, January 1991.

[9] A. N. Choudhary, B. Narahari, and R. Krishnamurti, "An Efficient Heuristic Scheme for Dynamic Remapping of Parallel Computations," Parallel Computing, 19, pp. 621-632, 1993.

[10] Y. Chung, V. K. Prasanna, and C.-L. Wang, "A Fast Asynchronous Algorithm for Linear Feature Extraction on IBM SP-2," in Proceedings of the Workshop on Computer Architectures for Machine Perception, 1995.

[11] Y. Chung and V. K. Prasanna, "Timing Measurement for Performing Basic Communication Patterns on IBM SP-2," Technical Report, Department of EE-Systems, University of Southern California, 1995.

[12] J. M. Cooley and J. W. Tukey, "An Algorithm for the Machine Calculation of Complex Fourier Series," Math. Comp., 19, pp. 297-301, 1965.

[13] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms, MIT Press, 1989.

[14] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, "LogP: Towards a Realistic Model of Parallel Computation," in Proceedings of the 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 1-12, 1993.

[15] R. Cypher, J. L. C. Sanz, and L. Snyder, "Algorithms for Image Component Labeling on SIMD Mesh-Connected Computers," IEEE Transactions on Computers, Vol. 39, No. 2, pp. 276-281, 1990.

[16] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina, "Architectural Requirements of Parallel Scientific Applications with Explicit Communication," in Proceedings of the 20th Annual International Symposium on Computer Architecture, pp. 2-13, 1993.

[17] M. Dubois, C. Scheurich, and F. A. Briggs, "Memory Access Buffering in Multiprocessors," in Proceedings of the 13th Annual International Symposium on Computer Architecture, pp. 434-442, 1986.

[18] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, "Active Messages: A Mechanism for Integrated Communication and Computation," in Proceedings of the 19th Annual International Symposium on Computer Architecture, pp. 256-266, 1992.

[19] S. Fortune and J. Wyllie, "Parallelism in Random Access Machines," in Proceedings of the 10th Annual Symposium on Theory of Computing, pp. 114-118, 1978.
[20] D. Gerogiannis and S. C. Orphanoudakis, "Load Balancing Requirements in Parallel Implementations of Image Feature Extraction Tasks," IEEE Transactions on Parallel and Distributed Systems, Vol. 4, No. 9, September 1993.

[21] Grand Challenge: High Performance Computing and Communications, Committee on Physical, Mathematical, and Engineering Science, National Science Foundation, 1991.

[22] A. Gupta and V. Kumar, "Analysis of the Performance of Large Scale Parallel Systems," Technical Report 92-32, Department of Computer Science, University of Minnesota, Minneapolis, 1992.

[23] S. E. Hambrusch and A. A. Khokhar, "C3: An Architecture-Independent Model for Coarse-Grained Parallel Machines," in Proceedings of the Sixth IEEE Symposium on Parallel and Distributed Processing, pp. 544-551, 1994.

[24] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, 1990.

[25] T. Heywood and S. Ranka, "A Practical Hierarchical Model of Parallel Computation: I. The Model," Journal of Parallel and Distributed Computing, Vol. 16, pp. 212-232, 1992.

[26] A. Huertas, C. Lin, and R. Nevatia, "Detection of Buildings from Monocular Views of Aerial Scenes using Perceptual Groupings and Shadows," in Proceedings of the ARPA Image Understanding Workshop, 1993.

[27] IEEE, Symposium Record - Hot Chips IV, August 1992.

[28] J. JaJa and K. W. Ryu, "The Block Distributed Memory Model for Shared Memory Multiprocessors," in Proceedings of the International Parallel Processing Symposium, pp. 752-756, 1994.

[29] D. V. James, A. T. Laundrie, S. Gjessing, and G. S. Sohni, "Distributed-Directory Scheme: Scalable Coherence Interface," IEEE Computer, 23(6), pp. 74-77, 1990.

[30] E. T. Kalns and L. M. Ni, "Processor Mapping Techniques Toward Efficient Data Redistribution," in Proceedings of the International Parallel Processing Symposium, pp. 469-476, 1994.

[31] R. K. Koeninger, M. Furtney, and M. Walker, "A Shared Memory MPP from Cray Research," Digital Technical Journal, Vol. 6, No. 2, Spring 1994.

[32] J. Kuskin et al., "The Stanford FLASH Multiprocessor," in Proceedings of the 21st Annual International Symposium on Computer Architecture, pp. 302-313, 1994.

[33] V. Kumar and V. N. Rao, "Parallel Depth-First Search, Part II: Analysis," International Journal of Parallel Programming, 16(6), pp. 501-519, 1987.

[34] V. Kumar and A. Gupta, "Analyzing Scalability of Parallel Algorithms and Architectures," Technical Report 91-18, Department of Computer Science, University of Minnesota, Minneapolis, 1991.

[35] V. Kumar and V. Singh, "Scalability of Parallel Algorithms for the All-Pairs Shortest-Path Problem," Journal of Parallel and Distributed Computing, Vol. 13, pp. 124-138, 1991.

[36] T. T. Kwan, B. K. Totty, and D. A. Reed, "Communication and Computation Performance of the CM-5," in Proceedings of Supercomputing '93, pp. 192-201, 1993.

[37] T. Leighton, "Tight Bounds on the Complexity of Parallel Sorting," IEEE Transactions on Computers, C-34(4), pp. 344-354, April 1985.

[38] D. Lenoski, J. Laudon, K. Gharachorloo, W. D. Weber, A. Gupta, J. Hennessy, M. Horowitz, and M. Lam, "The Stanford DASH Multiprocessor," IEEE Computer, pp. 63-79, March 1992.

[39] K. Li, "Scalability Issues of Shared Virtual Memory for Multiprocessors," in Dubois and Thakkar (eds.), Scalable Shared-Memory Multiprocessors, Kluwer Academic Publishers, Boston, MA, 1992.
[40] C.-C. Lin and V. K. Prasanna, "Analysis of Cost of Performing Communications Using Various Communication Mechanisms," in Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, pp. 290-297, 1995.

[41] C.-C. Lin and V. K. Prasanna, "Scalable Parallel Extraction of Linear Features on MP-2," in Proceedings of the Workshop on Computer Architectures for Machine Perception, pp. 352-362, 1993.

[42] W. M. Lin and V. K. Prasanna, "Parallel Algorithms and Architectures for Consistent Labeling," in Parallel Processing for Artificial Intelligence, L. Kanal, V. Kumar, H. Kitano, and C. Suttner, Editors, Elsevier Science Publishers B.V., 1994.

[43] S. Miguet and V. Poty, "Revisiting the Allocation of Regular Data Arrays to a Mesh of Processors," in Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, pp. 62-69, 1992.

[44] Message Passing Interface Forum, "MPI: A Message-Passing Standard," Technical Report CS-94-230, Computer Science Department, University of Tennessee, Knoxville, 1994.

[45] R. Nevatia and K. R. Babu, "Linear Feature Extraction and Description," Computer Graphics and Image Processing, 13, pp. 257-269, 1980.

[46] R. Nevatia, Machine Perception, Prentice-Hall Inc., 1982.

[47] D. J. Palermo, E. Su, J. A. Chandy, and P. Banerjee, "Communication Optimizations Used in the Paradigm Compiler for Distributed-Memory Multicomputers," in Proceedings of the International Conference on Parallel Processing, 1994.

[48] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann Publishers, San Mateo, California, 1994.

[49] V. K. Prasanna and C. S. Raghavendra, "Array Processors with Multiple Broadcasting," Journal of Parallel and Distributed Computing, 4, pp. 173-190, 1987.

[50] A. Rosenfeld, R. A. Hummel, and S. W. Zucker, "Scene Labeling by Relaxation Operations," IEEE Transactions on Systems, Man, and Cybernetics, SMC-6, pp. 420-423, 1976.

[51] S. Ranka, R. Shankar, and K. Alsabti, "Many-to-Many Personalized Communication with Bounded Traffic," in Proceedings of the Symposium on the Frontiers of Massively Parallel Computation, February 1995.

[52] A. P. Reeves, "Parallel Algorithms for Real-Time Image Processing," in Multicomputers and Image Processing, K. Preston and L. Uhr (Eds.), Academic Press, pp. 7-18, 1982.

[53] C. Reinhart and R. Nevatia, "Efficient Parallel Processing in High Level Vision," in Proceedings of the ARPA Image Understanding Workshop, 1990.

[54] Howard J. Siegel et al., "Report of the Purdue Workshop on Grand Challenges in Computer Architecture for the Support of High Performance Computing," Journal of Parallel and Distributed Computing, 16, pp. 199-211, 1992.

[55] Xian-He Sun and Lionel M. Ni, "Another View on Parallel Speed-up," in Proceedings of Supercomputing '90, pp. 324-333, 1990.

[56] Xian-He Sun, "Parallel Computation Models for Scientific Computing on Multicomputers," Ph.D. Dissertation, Computer Science Department, Michigan State University, 1990.

[57] A. S. Tanenbaum, Computer Networks, Prentice-Hall Inc., New Jersey, 1981.

[58] L. G. Valiant, "A Bridging Model for Parallel Computation," Communications of the ACM, Vol. 33, No. 8, pp. 103-111, 1990.
[59] C.-L. Wang and V. K. Prasanna, "Parallelization of Perceptual Grouping on Distributed Memory Machines," in Proceedings of the Workshop on Computer Architectures for Machine Perception, 1995.

[60] C.-L. Wang, V. K. Prasanna, H. Kim, and A. A. Khokhar, "Scalable Data Parallel Implementations of Object Recognition using Geometric Hashing," Journal of Parallel and Distributed Computing, pp. 96-109, March 1994.

[61] C.-L. Wang and V. K. Prasanna, "Low Level Vision Processing on Connection Machine CM-5," in Proceedings of the Workshop on Computer Architectures for Machine Perception, pp. 352-362, 1993.