MAPPING PARALLEL ALGORITHMS ONTO PARALLEL ARCHITECTURES

by

Mohamed Moez Ayed

A Thesis Presented to the
FACULTY OF THE SCHOOL OF ENGINEERING
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
MASTER OF SCIENCE IN COMPUTER ENGINEERING

May 1987

Copyright 1987 Mohamed Moez Ayed

UMI Number: EP43876. All rights reserved. INFORMATION TO ALL USERS: The quality of this reproduction is dependent upon the quality of the copy submitted. In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if material had to be removed, a note will indicate the deletion. UMI EP43876, published by ProQuest LLC (2014). Copyright in the Dissertation held by the Author. Microform Edition © ProQuest LLC. All rights reserved. This work is protected against unauthorized copying under Title 17, United States Code. ProQuest LLC, 789 East Eisenhower Parkway, P.O. Box 1346, Ann Arbor, MI 48106-1346.

This thesis, written by Mohamed Moez Ayed under the guidance of his Faculty Committee and approved by all its members, has been presented to and accepted by the School of Engineering in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering. Date: May 1987. Faculty Committee Chairman.

TABLE OF CONTENTS

LIST OF FIGURES
ABSTRACT
INTRODUCTION
CHAPTER 1
1. Formal definition of the problem
2. Graph isomorphism and the mapping problem
3. Mapping using the FEM as array processor
4. Future research
CHAPTER 2
1. Formalization of the problem
2. A more restricted problem
2.1 Algorithms using single transfers
2.2 Algorithms using different transfers
2.3 Graph numbering and the mapping problem
2.3.1 Numbering of full binary trees
2.3.2 An optimal numbering algorithm for full binary trees
2.3.3 Algorithms using different transfers
3. Future research
CHAPTER 3
1. Definition of the problem
2. Data mapping
2.1 Static mapping
2.2 Dynamic mapping
3. Future research
CHAPTER 4
1. Approach to the problem
2. A restricted problem
3. Future research
REFERENCES

LIST OF FIGURES

1.1 A 6 by 6 FEM
2.1 Mapping F of logical memories to physical memories
2.2 Mapping found using the theorem
2.3 Row major order mapping
2.4 Shuffled row major order mapping
2.5 Construction of the shuffled row major order mapping
2.6 Linearly connected array of memories
2.7 Symmetric of a node
2.8 Two nodes on adjacent levels of TR
2.9 Mapping using the shuffle & exchange permutations
3.1 Expression tree
3.2 The j'th transfer of a variable
4.1 Digraph for AX^3 + BX^2 + CX + D
4.2 Mapping for Y <- AX^3 + BX^2 + CX + D

Abstract

This thesis is a survey of the general problem of mapping parallel algorithms onto parallel architectures. In this problem, the objective is to find a mapping that minimizes the communication time of the algorithm. Four instances of the mapping problem are discussed. The first one deals with parallel algorithms made up of several modules executing in parallel; Bokhari developed a heuristic algorithm to map modules executing in parallel onto the finite element machine. In the second instance, the algorithm is represented by data communication between logical memories, which are mapped to physical memories connected by an SIMD network. Considering this same instance, Kung & Stevenson showed that for any injective partial transfer P, there exists a mapping F such that, on the linearly connected network, the transfer P can be performed in at most 4 routing steps. Irani & Chen related this problem to the graph numbering problem and obtained an optimal numbering algorithm for any full binary tree. Thompson & Kung showed that with respect to a certain mapping, the Bitonic sort is optimal to within a constant factor when executed on the ILLIAC-IV network. In the third instance, algorithms applicable to array variables with special types of index functions are considered. In the fourth instance, we discuss the mapping of low-level machine operations to low-level parallel processors.
Introduction

In this thesis, we discuss the general problem of mapping parallel algorithms onto parallel architectures. Given a parallel algorithm and a parallel architecture, we want to execute the algorithm on the architecture so that the execution time is minimum, without changing the algorithm or the architecture. The execution of a parallel algorithm can be seen as a sequence of alternating computation and communication steps. A computation step operates on the data to produce intermediate results. The communication step consists of transferring data in preparation for the next computation step. Thus the total execution time of the algorithm is equal to the sum of the communication time and the computation time.

Until recently, researchers did not realize the importance of the communication time. When trying to minimize the execution time of an algorithm, they ignored the communication time and tried to minimize the computation time. Whenever they executed an algorithm on an architecture, they considered the trivial identity mapping, so the communication time of the algorithm was assumed to be constant. It was shown that finding a mapping that minimizes the communication time can improve the execution time of the algorithm greatly. Usually, the communication time is comparable to the computation time; hence trying to minimize the communication time can reduce the total execution time considerably.

In this thesis, we assume that the computation time is fixed and we try to minimize the communication time. The approach used is to find a suitable mapping of the algorithm organization onto the architecture organization. There are many instances of the mapping problem; in this thesis four instances are discussed. In the first one, the parallel algorithm is made up of several modules that execute in parallel. The objective is to map modules to processors such that pairs of modules that communicate with each other are mapped to processors as close as possible (ideally adjacent), in order to minimize communication time between processors. In the second instance, we consider a mapping problem in which the algorithm is represented by data communication between logical memories, which are mapped to physical memories connected by an SIMD network. In the third instance, we consider algorithms applicable to array variables with special types of index functions; in this problem, we map the initial data and the computations to an array of processors. In the fourth instance, we consider the mapping of low-level machine operations to low-level parallel processors.

The general mapping problem is very hard. It was shown that it is equivalent to the graph isomorphism problem and that it is related to the graph numbering or bandwidth reduction problem. For this reason, researchers considered subproblems of the general mapping problem. Bokhari [1] developed a heuristic algorithm to map modules executing in parallel to the finite element machine (FEM), which is an array of microcomputers interconnected in an 'eight-nearest neighbor' interconnection pattern. In the second instance, Kung and Stevenson [2] considered the mapping of parallel algorithms onto the linearly connected network. They showed that for any injective partial transfer P, there exists a mapping F such that the transfer P can be performed in at most four routing steps. Irani & Chen [3] related this problem to the graph numbering problem. They were able to give an optimal numbering algorithm for any full binary tree; however, the proof of this optimality is still an open problem. Considering this same instance, Thompson & Kung [2] showed that with respect to a certain mapping, the Bitonic sort is optimal to within a constant factor when executed on the ILLIAC-IV network. Irani & Chen [4] considered the third instance of the mapping problem, assuming that a linearly connected network is available. They obtained some results about optimal alignment of operands in processors, optimal computation ordering, and mapping and remapping of data. In the fourth instance, McDowell & Appelbe [5] considered a linearly connected network and algorithms whose computation graphs are binary trees. They obtained a heuristic algorithm to map the nodes of the tree (data and computations) to processors.

We should note that due to the difficulty of the problem, most of the proposed solutions are heuristics.

This thesis is organized as follows: in chapters 1, 2, 3 and 4, the first, second, third and fourth instances of the mapping problem stated above are discussed respectively.
Chapter 1

In this chapter, we consider algorithms made up of several modules that execute in parallel. We are given an incompletely connected array. Our task is to map the modules onto the processors such that the number of communicating modules that are mapped onto directly connected processors is maximum, so as to minimize the communication time between modules.

1. Formal definition of the problem

We represent the algorithm by a graph G = (V,E), where the nodes in V correspond to the different modules, and an edge between nodes x and y exists if and only if modules x and y communicate with each other. Similarly, we represent the array of processors by a graph Ga = (Va, Ea), where Va is the set of processors, and an edge between two processors exists if and only if they are directly connected (i.e., adjacent). In our analysis, we assume that |V| = |Va|. A mapping of modules to processors (PE's) is a function f: V -> Va which is one to one and onto. The cardinality of the mapping f, denoted by |f|, is the number of edges (x,y) of E such that (f(x),f(y)) is in Ea. Our objective is to find a mapping f with maximum cardinality among all possible mappings; ideally this cardinality is equal to |E|. In order to derive an expression for |f|, we use the following notation. Any graph G = (V,E) can be represented by a function G: V x V -> {0,1} such that, for all x and y in V, G(x,y) = 1 if (x,y) belongs to E and G(x,y) = 0 otherwise.

Theorem

|f| = (1/2) Σ_{x,y ∈ V} G(x,y) · Ga(f(x), f(y))

Proof: Let A = {(x,y) ∈ E : (f(x),f(y)) ∈ Ea}, and let |A| be the cardinality of A. Then |f| = |A|. For all (x,y) in A, G(x,y) = Ga(f(x),f(y)) = 1, which is equivalent to G(x,y) · Ga(f(x),f(y)) = 1. Therefore

Σ_{x,y ∈ V} G(x,y) · Ga(f(x),f(y)) = 2|A|,

because in summing over all (x,y) in V², each edge is counted twice. □
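As an illustration of the cardinality formula above, the short Python sketch below (ours, not from the thesis) counts, for a given bijection f, how many algorithm edges land on array edges. The edge-set representation and the tiny example graphs are our own assumptions for illustration.

```python
# Sketch (not from the thesis): cardinality |f| of a mapping f, i.e. the number
# of edges of the algorithm graph G that fall onto edges of the array graph Ga.

def cardinality(edges_G, edges_Ga, f):
    """edges_G, edges_Ga: sets of frozensets {x, y}; f: dict mapping V -> Va."""
    return sum(1 for e in edges_G
               if frozenset(f[v] for v in e) in edges_Ga)

# Tiny example: a 4-node algorithm graph mapped onto a 4-node linear array.
edges_G  = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (0, 3)]}
edges_Ga = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}   # linear array
f = {0: 0, 1: 1, 2: 2, 3: 3}                                  # identity mapping
print(cardinality(edges_G, edges_Ga, f))   # 3 of the 4 edges are preserved
```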
2. Graph isomorphism and the mapping problem

Definition: Graph isomorphism
Two graphs G1 = (V1,E1) and G2 = (V2,E2) with |V1| = |V2| are isomorphic to each other if there exists a function f: V1 -> V2 which is one to one and onto, such that for all x,y in V1, (x,y) ∈ E1 if and only if (f(x),f(y)) ∈ E2.

Theorem
If a polynomial time algorithm A for solving the general mapping problem (given arbitrary G and Ga) exists, then we would be able to solve the graph isomorphism problem in polynomial time.

Proof: Suppose that the polynomial time algorithm A exists. Then for any two arbitrary graphs G1 and G2, we can determine an optimal mapping f of G1 onto G2 with maximum cardinality in polynomial time.
Case 1: If |f| < |E1|, then no isomorphism exists between G1 and G2.
Case 2: If |f| = |E1|, then for all x,y in V1, if (x,y) ∈ E1 then (f(x),f(y)) ∈ E2. Now assume that (f(x),f(y)) ∈ E2.
If |E1| = |E2|, then (x,y) ∈ E1; thus f is an isomorphism from G1 to G2, and therefore G1 and G2 are isomorphic.
If |E1| ≠ |E2|, then no isomorphism between G1 and G2 exists.
Hence we could find out whether G1 and G2 are isomorphic in polynomial time. □

Lemma
The general mapping problem is computationally equivalent to the graph isomorphism problem.

Proof: The proof follows directly from the previous theorem.

Since the general mapping problem using arbitrary G and Ga is NP-complete, researchers considered some subsets of this problem. Bokhari [1] considered a specific array processor (the finite element machine, FEM), which is discussed next.

3. Mapping using the FEM as array processor

Definition: FEM
The FEM is an array of microcomputers interconnected in an 'eight-nearest neighbor' interconnection pattern, as shown in figure 1.1.

[Figure 1.1: A 6 by 6 FEM.]

Given the graph G = (V,E) of the algorithm and the graph Ga = (Va,Ea) of the array, finding an optimal mapping of G onto Ga is the same as using the identity mapping (nodes in G and Ga are numbered from 1 to N, where N = |V| = |Va|) and trying to renumber the nodes of G so that the number of edges in E that fall onto edges in Ea is maximum. If we use the adjacency matrices M and Ma of G and Ga respectively, then renumbering the nodes in G is the same as permuting the rows and columns of M. More precisely, exchanging the numbering of two nodes i and j is equivalent to exchanging row i and row j, and column i and column j, of the adjacency matrix. An exact, efficient (i.e., polynomial time) algorithm to solve this problem hasn't been found. Bokhari [1] developed a heuristic algorithm, called Mapper, that is not guaranteed to find an optimal mapping. This is discussed next.
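To make the structure of Ga concrete, the following sketch (ours, not from the thesis) builds the edge set of an N by N array in which each processor is connected to its eight nearest neighbors, as in the FEM of figure 1.1.

```python
# Sketch (not from the thesis): adjacency of an N x N array whose processors are
# connected in an 'eight-nearest neighbor' pattern, as in the FEM.

def fem_edges(N):
    """Undirected edges of an N x N eight-nearest-neighbor array.
    Processors are numbered in row-major order: (r, c) -> r * N + c."""
    edges = set()
    for r in range(N):
        for c in range(N):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if (dr, dc) == (0, 0):
                        continue
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < N and 0 <= cc < N:
                        edges.add(frozenset({r * N + c, rr * N + cc}))
    return edges

# At most 4 * 36 links for a 6 by 6 array; fewer in practice because of the boundary.
print(len(fem_edges(6)))
```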
Algorithm Mapper

Input: adjacency matrix Ma of the FEM, adjacency matrix M of G.
Output: a permutation of M such that the mapping defined by this new matrix has a better cardinality.

In the following program, cardinality(X) is the cardinality of the mapping defined by matrix X. The algorithm uses a very straightforward method. The following is an informal description; for complete details, consult [1] (section 5).

Algorithm:
Begin
1. MAT <- M  /* initial mapping */
   BEST <- M  /* best mapping found so far */
2. Block Search:
   For each node in graph G, try the pairwise exchange of this node with all other nodes; perform the exchange that leads to the best improvement in cardinality and update MAT. When no more improvements can be made using these pairwise exchanges (i.e., we have reached a dead-end situation), go to 3.
3. If cardinality(MAT) > cardinality(BEST) then go to 4. Otherwise (i.e., no improvement), output BEST as the result and HALT.
4. Block Jump (probabilistic jump):
   BEST <- MAT.
   Interchange randomly n pairs of nodes of MAT in order to break out of the dead-end situation. Go to step 2.
End.

Performance of the algorithm

We note that block Search is not guaranteed to lead to the best mapping, and that sometimes we reach a dead-end situation. Block Jump often leads to a poorer mapping, but it may lead to a better mapping after we execute block Search on the new matrix MAT obtained after the probabilistic jump. Since we have no way of knowing the cardinality of the optimal mapping, we cannot tell how close to optimal the matrix BEST is. However, we can use the number of edges |E| in G as a measure of the performance of the algorithm, keeping in mind that the optimal cardinality is less than or equal to |E|. Experiments showed that in most cases the algorithm performed well. When the algorithm was used to map the FEM onto itself (in which case we know that the optimal cardinality is |E|), the results showed that the performance of the algorithm was very good.

Time complexity of the algorithm

The number of edges in an N-node FEM is equal to 4N. Thus the cardinality of any mapping cannot exceed 4N, and the total gain in cardinality cannot exceed 4N. In the worst case, each pass through block Search (block Augment in the original algorithm in [1]) leads to a gain of 1. Thus in the worst case, we execute O(N) passes through this block. Since the total time t of the algorithm is t = O(number of passes through block Search x time taken by this block), we have t = O(N^3).
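The following Python sketch captures the flavor of this pairwise-exchange search with a probabilistic jump. It works on edge sets rather than adjacency matrices, swaps the globally best pair at each step, and stops after a fixed number of jumps; it is our own simplification under these assumptions, not Bokhari's Mapper itself.

```python
# Sketch (our simplification, not Bokhari's Mapper): hill-climbing by pairwise
# exchange of node assignments, with a random jump when a dead end is reached.
import random

def cardinality(edges_G, edges_Ga, f):
    return sum(1 for e in edges_G if frozenset(f[v] for v in e) in edges_Ga)

def block_search(edges_G, edges_Ga, f):
    """Repeatedly apply the pairwise exchange that improves the cardinality most."""
    nodes = list(f)
    while True:
        base = cardinality(edges_G, edges_Ga, f)
        best_gain, best_pair = 0, None
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                u, v = nodes[a], nodes[b]
                f[u], f[v] = f[v], f[u]            # try exchanging u and v
                gain = cardinality(edges_G, edges_Ga, f) - base
                f[u], f[v] = f[v], f[u]            # undo the exchange
                if gain > best_gain:
                    best_gain, best_pair = gain, (u, v)
        if best_pair is None:                       # dead end: no exchange helps
            return f
        u, v = best_pair
        f[u], f[v] = f[v], f[u]

def mapper_like(edges_G, edges_Ga, nodes, jumps=10, seed=0):
    """Hill-climb from the identity mapping, with random jumps out of dead ends."""
    random.seed(seed)
    f = dict(block_search(edges_G, edges_Ga, {v: v for v in nodes}))
    best = dict(f)
    for _ in range(jumps):
        for _ in range(len(nodes)):                 # Block Jump: random exchanges
            u, v = random.sample(nodes, 2)
            f[u], f[v] = f[v], f[u]
        f = block_search(edges_G, edges_Ga, f)
        if cardinality(edges_G, edges_Ga, f) > cardinality(edges_G, edges_Ga, best):
            best = dict(f)
    return best

# Example: map a 4-cycle onto a 4-node linear array (3 preserved edges is optimal).
edges_G  = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (0, 3)]}
edges_Ga = {frozenset(e) for e in [(0, 1), (1, 2), (2, 3)]}
print(cardinality(edges_G, edges_Ga, mapper_like(edges_G, edges_Ga, [0, 1, 2, 3])))
```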
4. Future research

An interesting research problem would be to choose some other array that uses a different interconnection network, such as the linear network or the mesh-connected network. Furthermore, since there are many known results in graph theory about binary trees, an attempt should be made to use binary-tree-connected arrays.

Chapter 2

In this chapter, we consider parallel algorithms which use logical memories to represent their organization. During the execution of the algorithm, data is transferred between these memories; this is called a logical transfer. Data transfer between physical memories is called a physical transfer.

1. Formalization of the problem

We are given an array processor with n memory modules M0, ..., M_{n-1} connected by an interconnection network. The parallel algorithm uses n logical memories m0, ..., m_{n-1}. We assume that during the execution of the algorithm, r logical transfers take place at r different time instances. These logical transfers are partial functions Pj, 0 <= j <= r-1, mapping {0,1,...,n-1} into itself. Saying that a logical transfer Pj takes place is equivalent to saying that at time j, m_i sends data to m_{Pj(i)}, for all i such that Pj(i) exists. Each m_i, 0 <= i <= n-1, is mapped to M_{F(i)}, where F is a bijection on {0,1,...,n-1}; F is called a mapping of logical memories to physical memories. The mapping problem is to find a mapping F such that, when the algorithm is executed, the number of routing steps of the algorithm is minimum among all possible mappings F. (See figure 2.1.)

[Figure 2.1: Mapping F of logical memories to physical memories. The connections among m0,...,m_{n-1} are defined by the algorithm; the bijection F relates them to the connections among M0,...,M_{n-1} defined by the interconnection network.]

Routing

When a logical routing occurs from m_i to m_{Pj(i)}, a physical routing takes place from M_{F(i)} to M_{F(Pj(i))}. Assume that a logical transfer Pj, 0 <= j <= r-1, is executed. Then each M_i such that fj(i) exists sends data to M_{fj(i)}, where fj, called the physical transfer corresponding to the logical transfer Pj, is some partial function on {0,1,...,n-1}. Logically, this corresponds to m_{F^{-1}(i)} sending data to m_{F^{-1}(fj(i))}. Since, with respect to Pj, m_{F^{-1}(i)} sends data to m_{Pj(F^{-1}(i))}, we have the equality

F^{-1}(fj(i)) = Pj(F^{-1}(i)),

which is equivalent to F^{-1} fj = Pj F^{-1}. Therefore we have the following result:

fj = F Pj F^{-1}.

Let us define R(F) to be the number of routing steps of the algorithm with respect to the mapping F, and D(fj) to be the minimum number of routing steps executed when M_i sends data to M_{fj(i)} for all i such that fj(i) exists (i.e., the minimum number of routing steps needed for transfer Pj to be executed). Then

R(F) = Σ_{j=0}^{r-1} D(fj) = Σ_{j=0}^{r-1} D(F Pj F^{-1}).

2. A more restricted problem

The general problem stated above is NP-complete. Many researchers considered a fixed interconnection network. In the following, we restrict ourselves to a fixed interconnection network and consider a general parallel algorithm.

2.1 Algorithms using single transfers

In this section, we consider the linearly interconnected array of memories [2]. We assume that P0 = P1 = ... = P_{r-1} = P.

Theorem
For any injective logical transfer P, there exists a mapping F such that, when P is executed, the number of routing steps that occur in the linearly connected array is <= 4.

Proof: We can represent P by a digraph G = (V,E) such that the nodes correspond to the logical memories m0, m1, ..., m_{n-1}, and there is a directed edge from m_i to m_j if and only if P(i) = j (i.e., m_i sends data to m_j). Now consider numbering the nodes of G such that, if node m_i gets integer k, then m_i is mapped to M_k. Thus this numbering gives us a mapping of logical memories to physical memories.

Case 1: Assume that P is a total function. Then P is a bijection, and therefore G is nothing more than a collection of disjoint cycles. We number G by numbering the cycles one by one, using breadth-first numbering starting at any node of the cycle. In this way, the integers assigned to two adjacent nodes differ by at most 2. Thus on the linearly connected array of memories, and with respect to the mapping defined by this numbering, the number of routing steps executed when the transfer P takes place is <= 4. □

Case 2: Assume that P is not a total function. Then G is a collection of disjoint cycles and linear lists. We number the cycles and the linear lists one by one using breadth-first numbering. All the above results still hold. □

Example [2]: Let P be the perfect shuffle permutation on {0,1,...,15}. The digraph G representing P is shown in figure 2.2. In figure 2.2, node i represents logical memory m_i, and the numbers between brackets represent the integer assigned to each node.

[Figure 2.2: Mapping found using the theorem, when the logical transfer is the perfect shuffle permutation on {0,1,...,15}.]
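As a concrete check of this construction, the sketch below (ours, not from [2]) decomposes an injective partial transfer into its cycles and lists, numbers each component breadth-first, and reports the largest difference between the numbers assigned to the endpoints of an edge; twice this difference bounds the routing steps on the bidirectional linear array.

```python
# Sketch (not from the thesis): breadth-first numbering of the digraph of an
# injective (partial) transfer P, given as a dict i -> P(i).
from collections import deque

def numbering_for_transfer(P, n):
    adj = {i: set() for i in range(n)}          # undirected view of the digraph
    for i, j in P.items():
        adj[i].add(j)
        adj[j].add(i)
    num, next_num = {}, 0
    for start in range(n):
        if start in num:
            continue
        queue = deque([start])                  # BFS over one cycle or list
        num[start] = next_num; next_num += 1
        while queue:
            u = queue.popleft()
            for v in sorted(adj[u]):
                if v not in num:
                    num[v] = next_num; next_num += 1
                    queue.append(v)
    return num

# Perfect shuffle on {0,...,15}: left-rotate the 4-bit index, i.e. i -> 2i mod 15,
# with 15 mapped to itself.
P = {i: (2 * i) % 15 for i in range(15)}
P[15] = 15
f = numbering_for_transfer(P, 16)
stretch = max(abs(f[i] - f[j]) for i, j in P.items())
print(stretch)       # 2, so P needs at most 4 routing steps under this mapping
```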
In the following, we consider an example of algorithms using single transfers.

Example: Matrix transposition [2]
Let A be a 2^r by 2^r matrix stored in row major order in the memories m_i. We need an algorithm that obtains the transpose of A (i.e., A stored in column major order; equivalently, we want A_ij and A_ji to exchange storage). Stone showed that we can do this by performing r perfect shuffles P on the m_i's. The usual procedure, which is to use the identity mapping (m_i is mapped to M_i), would take O(r·n) routing steps, where n is the number of logical memories (i.e., n = 2^r x 2^r), because each transfer P takes O(n) routing steps. Using the mapping given in the theorem, each transfer P takes at most 4 routing steps, so the complete algorithm takes no more than 4r routing steps. Another way to find an optimal mapping is to consider the logical transfer P', in which the logical memory containing A_ij sends data to the logical memory containing A_ji. Let us represent P' by a digraph G' in a manner similar to the one given in the proof of the above theorem. Then G' consists of disjoint cycles, each consisting of 2 nodes. Using the numbering given in the previous theorem, the memory m_k containing A_ij and the memory m_l containing A_ji are mapped to adjacent physical memories. Thus, with respect to this mapping, the algorithm takes only 2 routing steps.

2.2 Algorithms using different transfers

In this section, the logical transfers Pj are different. Recall that the number of routing steps required by the algorithm with respect to a mapping F is R(F) = Σ_{j=0}^{r-1} D(fj), where fj = F Pj F^{-1}. One way to minimize R(F) is to minimize D(fj) for each transfer Pj. However, a mapping that minimizes D(fj) for a transfer Pj does not, in general, minimize D(fj') for another transfer Pj'. Therefore it is difficult to find an optimal mapping for an algorithm that uses different transfers. In the following, we consider an example of algorithms that use different transfers.

Example: The Bitonic sort on a mesh-connected network [2]
We consider a mesh-connected network (such as the ILLIAC-IV) of n = 2^r memories M0, M1, ..., M_{n-1}. We want to execute Batcher's Bitonic sort algorithm on n = 2^r elements, where r is an even positive integer. The algorithm uses r different logical transfers P0, ..., P_{r-1}. For each transfer Pj, m_i sends data to m_{Pj(i)}, where i and Pj(i) differ in the j'th bit of their binary representations, for all i. When the algorithm is executed, each transfer Pj is executed (r - j) times.

Let us consider the case r = 4, using a 4 by 4 ILLIAC-IV-like network. First let us map the array in row major order, as indicated in figure 2.3; we call this mapping I. In this case D(f0) = 2, D(f1) = 4, D(f2) = 2, D(f3) = 4. Therefore R(I) = Σ_{j=0}^{r-1} (r - j) D(fj) = 28.

[Figure 2.3: Row major order mapping; m0...m3 in the first row, m4...m7 in the second, m8...m11 in the third, and m12...m15 in the fourth.]

Since transfer Pj is performed more often than transfer P_{j+1}, one way to improve this mapping is to choose a mapping such that Pj is made less expensive than P_{j+1}. Figure 2.4 represents one such mapping Fs, called 'shuffled row major order', discovered by Kung & Stevenson. In this case D(f0) = 2, D(f1) = 2, D(f2) = 4, D(f3) = 4, and R(Fs) = 26, which is less than R(I).

[Figure 2.4: Shuffled row major order mapping.]

The general way to construct this mapping for an arbitrary size array is a recursive procedure. First construct a block of 4 memories in row major order. Then, using 4 such blocks numbered from 0 to 3, connect them in row major order, and continue this process recursively until all memories are used. To illustrate this, consider the construction for r = 4 (figure 2.5): the first block is m0, m1 over m2, m3; the four blocks are block 0 = {m0,...,m3}, block 1 = {m4,...,m7}, block 2 = {m8,...,m11}, block 3 = {m12,...,m15}; the four blocks are then connected in row major order, with blocks 0 and 1 on top and blocks 2 and 3 below.

[Figure 2.5: Construction of the shuffled row major order mapping for r = 4.]

Consider the general case of an array of size n = 2^r, where r is an even non-negative integer. Using the row major order mapping I,

D(fj) = 2^{j+1} for 0 <= j <= r/2 - 1, and D(fj) = 2^{j - r/2 + 1} for r/2 <= j <= r - 1.

Thus

R(I) = r·2 + (r-1)·2² + ... + (r/2 + 1)·2^{r/2} + (r/2)·2 + (r/2 - 1)·2² + ... + 1·2^{r/2}.

Since each term in the summation is at most r·2^{r/2}, R(I) = O(r²·2^{r/2}); therefore R(I) = O(√n log² n).

If we use the shuffled row major order mapping Fs, then

D(fj) = 2^{j/2 + 1} if j is even, and D(fj) = D(f_{j-1}) if j is odd.

Thus

R(Fs) = r·2 + (r-1)·2 + (r-2)·2² + (r-3)·2² + ... + 2·2^{r/2} + 1·2^{r/2}.

Here the powers of two grow geometrically while the coefficients (r - j) shrink linearly, so the sum is dominated by its last few terms, each of which is at most a constant times 2^{r/2}. Therefore R(Fs) = O(2^{r/2}) = O(√n). Thus this mapping is much better than the trivial mapping I. This example shows that choosing a good mapping instead of the trivial identity mapping improves the execution time of the algorithm greatly.
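The recursive block construction of the shuffled row major order mapping is easy to express directly. The sketch below (ours, not from [2]) places memory m_i at the grid position obtained by recursing on the two most significant bits of its index, and prints the 4 by 4 layout described in the construction above for r = 4.

```python
# Sketch (not from the thesis): shuffled row major order placement of n = 2^r
# memories on a 2^(r/2) x 2^(r/2) mesh, built by the recursive block construction.

def shuffled_row_major(i, r):
    """(row, col) of memory m_i on the 2^(r/2) x 2^(r/2) mesh; r must be even."""
    if r == 0:
        return (0, 0)
    half = 2 ** (r // 2 - 1)                        # side length of one quadrant
    quadrant, rest = divmod(i, 4 ** (r // 2 - 1))   # top two bits pick the quadrant
    row, col = shuffled_row_major(rest, r - 2)
    return (row + half * (quadrant // 2), col + half * (quadrant % 2))

r = 4
side = 2 ** (r // 2)
grid = [[None] * side for _ in range(side)]
for i in range(2 ** r):
    row, col = shuffled_row_major(i, r)
    grid[row][col] = i
for line in grid:
    print(line)
# [0, 1, 4, 5]
# [2, 3, 6, 7]
# [8, 9, 12, 13]
# [10, 11, 14, 15]
```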
2.3 Graph numbering and the mapping problem [3]

Definitions
Let G = (V,E) be a graph (directed or undirected) with n nodes. A numbering of G is a one to one function f from V onto {0,1,...,n-1}. The bandwidth of G relative to a numbering f is

β_f(G) = max { |f(U) - f(W)| : (U,W) ∈ E }   (taken to be 0 if E is empty).

The bandwidth of G is β(G) = min_f β_f(G), over all numberings f.

Assume that we have the linearly connected array of n memories (see figure 2.6). Let P be a logical transfer performed during the execution of the algorithm, and let G = (V,E) be a graph representing the transfer P as described in the previous theorem. Let F be the physical transfer that corresponds to P, as defined earlier. Let f be a numbering of G. Thus f is a mapping of logical memories onto physical memories (i.e., f(m_i) = j if and only if m_i is mapped to M_j).

[Figure 2.6: Linearly connected array of memories M0 - M1 - ... - M_{n-1}.]

With respect to this mapping f, D(F) = β_f(G) if communication occurs in one direction only on the linear network, and D(F) = 2·β_f(G) if communication occurs in both directions. In general, when we execute P, communication occurs in both directions on the linear array, so we will use D(F) = 2·β_f(G). Since our goal is to minimize D(F), we can see that the mapping problem is reduced to the bandwidth reduction problem. In other words, we want to find a numbering f0 such that β_{f0}(G) is minimum among all possible numberings f of G; obviously β_{f0}(G) = β(G), the bandwidth of G. Unfortunately, the bandwidth reduction of a graph is an NP-complete problem. In fact, most of the numbering algorithms available do not achieve the bandwidth of the graph.

2.3.1 Numbering of full binary trees [3]

We assume that the graph G representing P is a full binary tree with N levels and n nodes. An optimal mapping is a numbering f0 of G for which β_{f0}(G) is minimum among all possible numberings; obviously β_{f0}(G) = β(G), the bandwidth of G. Let us estimate β(G). Assume that the algorithm consists of finding the minimum of l integers. One way to do this is to store these integers in the leaves of G, assuming that G has l leaves. Then in each non-leaf node, starting with the nodes in the highest level (the bottom level), we store the minimum of the 2 integers stored in its children. In this manner, the minimum of all l integers will be stored at the root of G. The minimum communication time t_cmin of this algorithm (using mapping f) satisfies t_cmin <= 2·β_f(G)·(N-1). In addition, we know that on the linearly connected array, t_cmin >= n-1. Thus

β_f(G) >= ⌈(n-1) / (2(N-1))⌉,

and therefore β(G) >= ⌈(n-1) / (2(N-1))⌉. The bandwidth of G with respect to the sequential numbering (for example, starting at the root and going from left to right downwards) is 2^{N-1}. Since this is not close to the optimal bandwidth β(G), we need to find a better mapping.
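The bandwidth of a numbering is straightforward to compute. The sketch below (ours) evaluates β_f(G) for a given numbering and illustrates it on a small full binary tree with the sequential, level-order numbering.

```python
# Sketch (not from the thesis): bandwidth of a graph numbering,
# beta_f(G) = max |f(U) - f(W)| over all edges (U, W).

def bandwidth(edges, f):
    return max((abs(f[u] - f[w]) for u, w in edges), default=0)

# Full binary tree with N = 3 levels (7 nodes), sequential level-order numbering:
# node k has children 2k+1 and 2k+2.
N = 3
n = 2 ** N - 1
edges = [(k, 2 * k + 1) for k in range((n - 1) // 2)] + \
        [(k, 2 * k + 2) for k in range((n - 1) // 2)]
f_seq = {k: k for k in range(n)}
print(bandwidth(edges, f_seq))     # 4 = 2^(N-1) for the sequential numbering
```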
2.3.2 An optimal numbering algorithm for full binary trees [3]

We present an algorithm that gives a numbering f such that β_f(G) is optimal (i.e., equal to β(G)). Let TL and TR be the left and right subtrees of the root, respectively. The algorithm assigns numbers to the nodes of TR and to the root of G. Then, for each node V in TL, we assign to it the number 2^N - i, the complement of i with respect to 2^N, where i is the number assigned to the node in TR corresponding to V (i.e., the node symmetric to V with respect to the vertical axis through the root; see figure 2.7). Let β = β(G), the bandwidth of G.

[Figure 2.7: Symmetric of a node; V' is the node in TL corresponding to V.]

Algorithm
Begin
I- Assign numbers to subtree TR of G as follows:
  1- Assign 1 to the right-most terminal node.
  2- For the numbers 2, 3, ..., perform the following sequential assignment (in increasing order). For each number i:
    (i) If i - β has been assigned to the right child of some node U (U must not have been numbered yet), then f(U) = i;
    Else (ii) if there are unnumbered terminal nodes, assign i to the right-most one;
    Else (iii) assign i to a node such that the difference between i and the number assigned to its right child is largest. (This difference is not greater than β; otherwise the parent node would already have been assigned, in step 2(i), the value β + (number assigned to its right child).)
II- The subtree TL is numbered by complementing the numbering of TR.
III- Assign 2^{N-1} to the root of G.
End
Proof of the optimality of the algorithm

A complete proof of this optimality is still an open problem. We want to show that β_f = β. This is equivalent to saying that property P is satisfied, where

P = { [ |f(any non-terminal node) - f(its right child)| <= β ] AND [ |f(any non-terminal node) - f(its left child)| <= β ] }.

Let us prove property P for TR. We claim that:
(1) f(right child(V)) <= f(V) <= f(right child(V)) + β, for any non-terminal node V in TR;
(2) f(left child(V)) > f(right child(V)), for all non-terminal nodes V in TR.

Proof of claim (1)
We note that all non-leaf nodes V in TR are numbered according to step 2(i) or 2(iii) of the algorithm, so f(V) > f(right child(V)). Now assume that f(V) > f(right child(V)) + β, i.e., f(V) = f(right child(V)) + β + x with x > 0. However, the integer y = f(right child(V)) + β is smaller than f(V), so it was assigned before f(V) was; and since at that time V was not yet numbered, y would have been assigned to V, which contradicts our assumption. Therefore f(V) <= f(right child(V)) + β. □

Proof of claim (2)
Lemma: For any 2 nodes on the same level in TR, f(node at the left) > f(node at the right).
Proof of lemma: We use induction on the level number k, 0 <= k <= N.
Basis step: level k = 0 (leaf nodes). These nodes are numbered according to step 2(ii), so the lemma holds for them.
Induction step: Assume that the lemma holds for level k, 0 <= k < N-1; let us prove that it holds for level (k+1). Consider any two nodes U and V on level (k+1), with U to the left of V, and let x and y be the right children of U and V respectively (see figure 2.8). From the hypothesis, f(x) > f(y). Assume that f(U) < f(V); then U was numbered before V, and U was numbered according to step 2(i) or 2(iii), because it is not a leaf.

[Figure 2.8: Two nodes U and V on level (k+1) and their children on level k.]

Case 1: Assume that U was numbered according to 2(i). Then f(U) = f(x) + β. Since f(y) + β < f(x) + β, the number f(y) + β was assigned before f(x) + β; at that time V was not numbered, so f(y) + β would have been assigned to V. Thus V was numbered before U, which contradicts our assumption that f(U) < f(V).
Case 2: Assume that U was numbered according to step 2(iii), and let f(U) = m. For U to be chosen in step 2(iii), the difference m - f(x) must be larger than m - f(y); but f(x) > f(y) implies m - f(x) < m - f(y), a contradiction.
Conclusion: U was numbered after V, so f(U) > f(V). □

The proof of claim (2) follows from the lemma. □

Claim (1) implies that |f(V) - f(right child(V))| <= β for all non-leaf nodes V in TR. Applying the algorithm to some examples shows that for most non-terminal nodes V in TR, f(V) > f(left child(V)). For the other non-terminal nodes V in TR, for which f(V) < f(left child(V)), we have f(left child(V)) - f(V) <= β (this still has to be proved). This observation, which still has to be proved formally, together with claim (2), implies that |f(V) - f(left child(V))| <= β for all non-leaf nodes V in TR. Therefore property P is true for TR. □

We note that for any pair of nodes U, V in TR and their corresponding nodes U', V' in TL,

|f(U') - f(V')| = |(2^N - f(U)) - (2^N - f(V))| = |f(U) - f(V)|.

Thus property P is also true for TL. Hence, to prove that P is true for G, it remains to prove that |f(root of G) - f(its right child)| <= β and |f(root of G) - f(its left child)| <= β. Note that these two absolute values are equal, and that f(root of G) > f(V) for all V in TR. Therefore it only remains to prove that f(root of G) - f(its right child) <= β, which is still an open problem. □
2.3.3 Algorithms using different transfers

When the algorithm uses r different transfers P0, ..., P_{r-1}, instead of representing each Pj by a different graph Gj, we can superpose all the graphs Gj to obtain a resulting graph G. Thus G represents all the transfers P0, ..., P_{r-1}. The problem is then to find a numbering of G such that β_f(G) is minimum among all possible numberings (i.e., an f such that β_f(G) = β(G)). However, if r is very large, this becomes difficult to do.

Example [2]
We use the shuffle and exchange permutations as logical transfers. Let r = 2, P0 = S (the shuffle function) and P1 = E (the exchange function). Assume that we have n memories M0, ..., M_{n-1} connected by a linear interconnection network. With respect to the identity mapping I, R(I) = D(F0) + D(F1), where Fj is the physical transfer corresponding to Pj. Thus R(I) = (n - 2) + 2 = n. Let n = 16. The graph G representing P0 and P1, together with a numbering f of G which is better than I, is shown in figure 2.9; the numbers in brackets are the numbers assigned to the corresponding nodes. With respect to this mapping f, an S transfer followed by an E transfer can be done in 12 steps, rather than the 16 needed if the mapping I is chosen.

[Figure 2.9: Mapping using the shuffle and exchange permutations as logical transfers.]
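To see where the figure R(I) = n comes from, the sketch below (ours, not from the thesis) computes the cost of each transfer on the bidirectional linear array under the identity mapping, taking the cost of a transfer as 2·max|i - P(i)| in line with the bidirectional convention D(F) = 2·β_f(G) used above.

```python
# Sketch (not from the thesis): routing cost of a transfer on the bidirectional
# linear array under the identity mapping, taken as 2 * max |i - P(i)|.

def cost_identity(P):
    return 2 * max(abs(i - j) for i, j in P.items())

n = 16
shuffle  = {i: (2 * i) % (n - 1) for i in range(n - 1)}
shuffle[n - 1] = n - 1
exchange = {i: i ^ 1 for i in range(n)}      # flip the lowest bit

print(cost_identity(shuffle), cost_identity(exchange))   # 14 and 2, so R(I) = 16 = n
```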
3. Future research

In sections 2.1 and 2.3, we considered only linear arrays. An interesting problem is to consider other kinds of interconnection networks, such as the mesh-connected array, the hypercube or the shuffle-exchange network; for these networks, we should also try to redefine the bandwidth. Also, the numbering of arbitrary trees is still an open problem.

Chapter 3

In this chapter we consider algorithms applicable to array variables with N components and with special types of index functions. The architecture on which the algorithm is to be executed has a circular unit-shift interconnection network (i.e., a linearly interconnected array) with N processors PE0, PE1, ..., PE_{N-1}. We assume that the algorithm is a collection of parallel expressions S executed sequentially. Each parallel expression S is a parallel assignment statement (PAS) of the form

W(i) <- U1(i + k1) θ1 U2(i + k2) θ2 ... θ_{m-1} Um(i + km),

where 0 <= i <= N-1, the addition "+" is modulo N, θ1, ..., θ_{m-1} are binary operations, and k1, k2, ..., km are integers. An example of a PAS is W(i) <- U1(i+2) + U2(i-2) x U3(i+1).

Mapping function
Each component V(i) of a vector V in the PAS is mapped to (or equivalently, stored in) PE_{F_V(i)}, where F_V is a bijection on {0,1,...,N-1}. F_V is called the mapping function for V.

Definition 1: Displacement
The displacement associated with variable Uj is disp(Uj) = kj, 1 <= j <= m.

Definition 2: AP
The alignment point (AP) of a binary operation on A(i + k1) and B(i + k2) is the value AP such that the operation takes place in PE(i + AP).

Computation ordering
For each PAS, we have to determine the order of computation. Once this is done, we can represent the PAS by an expression tree.

Example
W(i) <- U1(i+1) + U2(i-1) - U3(i). Let us choose the following ordering: [U1(i+1) + U2(i-1)] - U3(i). The corresponding expression tree is shown in figure 3.1.

[Figure 3.1: Expression tree for [U1(i+1) + U2(i-1)] - U3(i).]

Alignment of operands
Once an expression tree for a PAS is determined (i.e., a computation ordering is chosen), we must determine the alignment point AP for every binary operation. For a binary operation on A(i + k1) and B(i + k2), we must move these two components to the same PE so that the operation can take place. This is called alignment of operands.

Logical transfers
Suppose that during the alignment of operands, we move a vector U to another vector V. The logical transfer P_UV associated with moving U to V is a partial function on {0,1,...,N-1} such that U(i) is transferred to the PE storing V(P_UV(i)).

1. Definition of the problem

Our goal is to execute the algorithm so that the total communication time is minimized. Thus we have to:
1- determine an expression tree (i.e., a computation ordering) for every PAS in the algorithm;
2- determine the alignment of operands for all binary operations in all the PAS's;
3- determine a mapping F_V for every vector variable V used in the PAS's;
such that the total communication time of the algorithm is minimum.

Communication time
The total communication time T_c is the sum, over all PAS's, of the communication time t_c of each PAS, where t_c is the sum of the communication times due to all the binary operations in the PAS. Consider a binary operation on U(i + k1) and V(i + k2). This operation takes place in PE(i + AP), where AP is the alignment point of the operation. Let the resulting vector be called W(i + AP). Thus we have two logical transfers, P_UW and P_VW, associated with moving U to W and V to W respectively.

Let us find the communication time for a logical transfer P_XY associated with moving vector X to vector Y. PE(i) stores X(F_X^{-1}(i)), which is transferred to the PE where Y(P_XY(F_X^{-1}(i))) is stored, namely PE(F_Y(P_XY(F_X^{-1}(i)))). Thus the physical transfer f associated with P_XY is

f = F_Y P_XY F_X^{-1}.

Let D(f) be the minimum number of communication steps needed to perform the transfer f. Then the communication time for P_XY is D(f) = D(F_Y P_XY F_X^{-1}). Therefore the communication time for P_UW is D(F_W P_UW F_U^{-1}) = D(P_UW F_U^{-1}), because F_W is the identity; similarly, the communication time for P_VW is D(P_VW F_V^{-1}). Therefore the total communication cost for the binary operation on U(i + k1) and V(i + k2) is D(P_UW F_U^{-1}) + D(P_VW F_V^{-1}).

To find the total communication cost for a PAS S, we represent S by an expression tree E. Let V1, V2, ..., Vk be the vector variables involved in S, and let W1, W2, ..., Wt be the internal nodes of E, where each Wj represents the partial result of some binary operation on variables and/or other internal nodes. It is clear that the total communication time for S (using the above results) is

t_c = Σ_{Vi leaf variable, Wj = parent(Vi)} D(P_{Vi Wj} F_{Vi}^{-1}) + Σ_{Wi internal node, Wj = parent(Wi)} D(P_{Wi Wj}).

In the following, we consider the mapping of data onto PE's. The analysis of computation ordering and alignment of operands is quite lengthy and has therefore been omitted; the interested reader should consult [4] (section III).

2. Data mapping

In this section we assume that a computation ordering and an alignment of operands have been chosen for every PAS. The sequence of logical transfers for all variables can then be obtained from the alignment of operands. There are two kinds of data mappings: static mapping and dynamic mapping.

2.1 Static mapping

The mapping is static when, for each variable Vi, we use the same mapping during the entire execution of the algorithm. We assume that Fi(k) = k + xi, where 0 <= k <= N-1 and xi is an integer. Let us find the total communication cost due to variable Vi. Let ri be the total number of logical transfers involving Vi, and consider the j'th such transfer: Vi is moved from PE(Fi(k + AP(Vi))) to PE(k + AP(parent(Vi))). (See figure 3.2.)

[Figure 3.2: The j'th transfer of Vi, where Wi = parent(Vi), AP(Wi) is the AP of the binary operation at Wi, and AP(Vi) = disp(Vi).]

The cost of the j'th transfer involving Vi is

c_ij = | Fi(k + AP(Vi)) - (k + AP(parent(Vi))) | = | xi - [AP(parent(Vi)) - AP(Vi)] | = | xi - d_ij |,

where d_ij = AP(parent(Vi)) - AP(Vi) during the j'th transfer of Vi. Thus the total communication cost involving Vi is

C_i = Σ_{j=1}^{ri} | xi - d_ij |.

If the algorithm uses l vector variables V1, V2, ..., Vl, then the total communication cost for the algorithm is T_c = Σ_{i=1}^{l} C_i. The xi must be chosen so that T_c is minimized. Since Fi does not affect any C_{i'} with i' ≠ i, we need only minimize C_i when choosing Fi. C_i is minimized when

xi = median of { d_ij : 1 <= j <= ri }.

Hence the optimal Fi, 1 <= i <= l, are determined.
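The choice of the offset xi is a one-dimensional median problem: Σ_j |x - d_ij| is minimized at a median of the d_ij, exactly as stated above. The short sketch below (ours) picks the offset and evaluates the resulting cost C_i for an illustrative, hypothetical list of shift distances.

```python
# Sketch (not from the thesis): optimal static offset x_i for one variable,
# chosen as a median of its shift distances d_ij, and the resulting cost C_i.

def optimal_offset(d):
    d = sorted(d)
    return d[(len(d) - 1) // 2]           # a median minimizes sum |x - d_j|

def cost(d, x):
    return sum(abs(x - dj) for dj in d)

d = [2, -1, 3, 2, 0]                       # hypothetical shift distances for V_i
x = optimal_offset(d)
print(x, cost(d, x))                       # offset 2, cost 6
```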
2.2 Dynamic mapping (remapping)

When using dynamic mapping, each vector variable has a sequence of mapping functions instead of only one.

Definition
Let L be the sequence of all logical transfers of a variable V. A remapping schedule I for V is

I = [ (L1, L2, ..., Ls), (f1, f2, ..., fs) ],

where L1, L2, ..., Ls are subsequences of L obtained by dividing L, and f1, f2, ..., fs are static mappings of V such that f1 is the initial mapping (done before any logical transfer takes place) and, for 2 <= i <= s, fi is a remapping done before Li and after L_{i-1} take place.

Remark: Note that the cost of remapping data must be included in the total communication cost.

Definition
A remapping schedule I for V is said to be optimal if the total communication cost with respect to this schedule is minimum among all possible schedules.

The complete procedure for finding an optimal remapping schedule for a vector V is quite lengthy and complicated; the following is only a summary. For more details, refer to [4] (section IV-B).

Finding an optimal remapping schedule
Let d1, d2, ..., dk be the shift distances of the first, second, ..., k'th logical transfer on vector V, respectively. (Since all variables involved in the algorithm are of the form V(i + disp(V)), 0 <= i <= N-1, all logical transfers Pj are of the form Pj(i) = i + dj.) Note that for the j'th transfer Pj, dj is equal to AP(parent(V)) - AP(V) for that transfer. Let Ij = I(dj, d_{j+1}) be the interval defined by the two consecutive shift distances dj and d_{j+1}, and let I_L = ∩_{j=1}^{|L|-1} Ij (for |L| > 1). Let d_M be the median of the sequence d1, d2, ..., dk; then F_V(i) = i + d_M is the optimal static mapping for V.

Theorem
If I_L ≠ ∅, then no remapping schedule I for V will result in a communication cost less than the communication cost obtained using the optimal static mapping F_V(i) = i + d_M.

This theorem implies that we should choose (L1, L2, ..., Ls) such that every Li is maximal, in the sense that I_{Li} ≠ ∅ while extending Li with the element of L that precedes it (for i ≠ 1) or with the element that succeeds it (for i ≠ s) would make the corresponding intersection empty. The mappings f1, f2, ..., fs are then chosen such that fi is optimal with respect to the subsequence Li (as determined in section 2.1). The remapping schedule I that results is optimal. A theorem was proposed to find the sequence (L1, L2, ..., Ls) of maximal Li's (see [4], section IV-B, algorithm 2, for details); a brief description of the proof of the optimality of the remapping schedule I is also given in that reference.
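The theorem above reduces the question of whether remapping can pay off to an interval-intersection test. The sketch below (ours) assumes that I(d_j, d_{j+1}) is the closed interval between the two consecutive shift distances and checks whether all such intervals have a common point; if they do, the median-offset static mapping of section 2.1 is already optimal.

```python
# Sketch (not from the thesis): test whether the intervals I(d_j, d_{j+1}) defined
# by consecutive shift distances intersect; if so, static mapping is optimal.

def static_mapping_suffices(d):
    lo = max(min(a, b) for a, b in zip(d, d[1:]))
    hi = min(max(a, b) for a, b in zip(d, d[1:]))
    return lo <= hi                       # non-empty common intersection I_L

print(static_mapping_suffices([0, 1, 5, 6]))     # False: remapping may pay off
print(static_mapping_suffices([1, 3, 2, 3, 2]))  # True: the median offset already wins
```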
Also a brief description of the proof of the optim ality of the remapping schedule I is given in the refer ence stated above. 3 . F u tu r e resea rch One im portant research problem would be to consider some other kinds of array processors. Also in this chapter, the vectors considered have some special kinds of index functions. This consideration is not realis tic and should be eliminated. Furtherm ore, another restriction th at was used in this chapter (section 3) is th at all the mapping functions were of the form F, (k )=k +x{, which doesn’t necessarily lead to an optimal map ping among all possible mappings. Thus this restriction should also be avoided. 41 Chapter 4 This chapter deals w ith the mapping of low-level operations (such as machine level operations) onto a low-level parallel processor (LLPP). A LLPP is a parallel processor on which low-level operations are executed. The LLPP chosen is a linear array of processing elements (PE’s). The parallel algorithm consists of sequential assignment statem ents of the kind Y • < — (arithmetic expression involving some variables X 1,X2, • ■ ■ ,Xn ). E xam ple Y*-A x X 3+B X X 2+C XX+D . The low-level operations are C XX ,X XX ,B X(X2 ) ,etc ■ ■ Our task is to execute the algorithm on the given array of P E ’s, so that the communication time of the algorithm is minimum. Since the state ments are executed sequentially , we want to minimize the communication time of each statem ent individually. 1. A p p r o a c h to th e p ro b lem Consider a statem ent S of the algorithm. To execute S we must: 1- Determine a computation ordering for the low-level operations. 2- Map the data and the computations to the P E ’s. kZ G raph m odel We assume th at the computation ordering for statem ent S is already chosen. Thus we can represent S by a graph G =(V ,E ), where the nodes are the data and the low-level operations, and the edges correspond to communication of data. E xam ple Let S be Y * — A x X 3+ fl x X 2+C XX + D . We choose the fol lowing ordering: Time C XX , 1 x 1 Time t2: B x{X2) , (C xX)+D , {X2)x X Time t3: A x (X 3) , (B x X 2)+{C x X + D ) Time f4: (A XX3)+(B X X 2+C XX+D) The corresponding digraph is shown in figure 4.1 . Now we have to map the nodes of the graph to P E ’s such th at the nodes th a t are on the same level are mapped to different P E ’s (because they are executed at the same time) , and the total communication time for state m ent S is minimum E xam ple Consider the previous example, where 5 : Y<-A XX3+ B X X 2+CXX+D. The mapping chosen is indicated in graph G, corresponding to S, in figure 4.2 . As illustrated by th above example, we represent the graph by slots, whereL Time t Time t Time t2 ....(A Time t Time t figure 4.1 . Digraph for A X 3± B X 2+CX+D I I 1 PEC PE , j PE*, PE3 Time *4 Time t3 Time t2 Time t x Time £0 figure 4.2 . Mapping for Y < —A X 3+BX2+ CX+D each slot corresponds to a PE num ber and a time . We assign each node to a different slot , depending on when and where it is executed. D efin ition A perfect mapping is a mapping such th at for all pairs of nodes U and V such th at (U,V) or (V,U) is a directed edge, U and V are mapped to adjacent P E ’s. k 5 D efin ition A graph. G corresponding to a statem ent S of the algorithm is labelable if a perfect mapping for G exists. 2. A r e str ic te d p ro b lem The above problem is very difficult to solve. In fact it is NP- complete. In the rest of this chapter, we assume the following: 1- The graph G representing a statem ent S is a binary tree. 
2. A restricted problem

The above problem is very difficult to solve; in fact it is NP-complete. In the rest of this chapter, we assume the following:
1- the graph G representing a statement S is a binary tree;
2- the number of PE's in the linear array is not bounded.
Even this restricted problem is NP-complete, and some heuristic solutions have been proposed. When trying to solve this problem, we want to answer the following questions:
1- Does a perfect mapping exist?
2- If a perfect mapping exists, what is it?
3- If no perfect mapping exists, find an optimal mapping. (An optimal mapping is one that leads to minimum communication cost among all possible mappings.)

Definition
The height of a tree is the length of the longest path from the root to a leaf. The width of a tree is the maximum number of nodes at any one level.

Theorem
A binary tree has a perfect mapping only if width <= 2 x height + 1.

Proof: Assume that a perfect mapping for a binary tree T exists. Let the width of T be W and its height be H. Consider the subtree T' of T whose root is the root of T and whose leaves are the nodes on the level L that has W nodes (if more than one level has W nodes, consider the one closest to the root). Let H' be the height of T'. Number the nodes at level L from 1 to W, starting at the left-most node, in increasing order. Nodes number 1 and W can be separated by at most 2 x H' edges. Since we have a perfect mapping, each edge (U,V) has length at most one (i.e., U and V are mapped to adjacent PE's). The maximum value of W occurs when all slots on level L between nodes 1 and W are filled; thus this maximum value cannot exceed 2 x H' + 1. Therefore W <= 2 x H' + 1, and since H' <= H, we have W <= 2 x H + 1. □
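This necessary condition is easy to test on a candidate tree. The sketch below (ours) computes the height and width from a children-list representation and checks width <= 2 x height + 1.

```python
# Sketch (not from the thesis): necessary condition for a perfect mapping,
# width <= 2 * height + 1, computed from a children-list representation.

def levels(children, root):
    """Return the list of levels (lists of nodes) of the tree, root first."""
    out, frontier = [], [root]
    while frontier:
        out.append(frontier)
        frontier = [c for node in frontier for c in children.get(node, [])]
    return out

def may_have_perfect_mapping(children, root):
    lvls = levels(children, root)
    height = len(lvls) - 1
    width = max(len(level) for level in lvls)
    return width <= 2 * height + 1

# A full binary tree of height 2 has width 4 <= 2*2 + 1, so it is not ruled out.
children = {"r": ["a", "b"], "a": ["c", "d"], "b": ["e", "f"]}
print(may_have_perfect_mapping(children, "r"))   # True
```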
Definition
A 2-2 binary tree (2-2-tree) is a binary tree in which there are at most 2 binary nodes at any level of the tree. (A binary node is a node that has two children.)

Theorem
A 2-2-tree is labelable.

Proof: We use induction on the height h of the tree.
Basis step: when h = 1, it is obvious that the theorem holds.
Induction step: Assume that the theorem holds for all 2-2-trees of height h, h >= 1; let us prove that it holds for any 2-2-tree T of height (h+1). Consider the subtree T' of T containing all nodes of T except its leaves. The height of T' is h; thus, by the induction hypothesis, T' is labelable. Let n1, n2, ..., nN be the leaf nodes of T', and let pk be the processor allocated to nk, 1 <= k <= N, by a perfect mapping of T'. Since T' is a 2-2-tree, it has at most two binary leaves. Let ni and nj be these nodes, with pi < pj. (If there is only one binary leaf node, let pj = infinity; if there are no binary leaf nodes, let pi = pj = infinity.) For any node nk, 1 <= k <= N, map its children as follows:
a- if pk < pi, map its child (if any) to (pk - 1);
b- if k = i, map its children to (pk - 1) and pk;
c- if pi < pk < pj, map its child (if any) to pk;
d- if k = j, map its children to pk and (pk + 1);
e- if pk > pj, map its child (if any) to (pk + 1).
The above mapping ensures that the leaf nodes of T are mapped to different processors, and that for any leaf node V of T and its parent U, U and V are mapped to adjacent PE's. Thus this mapping is a perfect mapping for T, and therefore T is labelable. □

Definition
A 2 growth binary tree (2-GB-tree) is a binary tree such that n_l <= n_{l-1} + 2, where n_l is the number of nodes at level l, l >= 1. (We count levels from the root towards the leaves; the root is at level 0.)

Theorem
A 2-GB-tree in which all leaf nodes are at the same level is a 2-2-tree.

Proof: Let T be a 2-GB-tree in which all leaf nodes are at the same level. This implies that each node, except the leaf nodes, has at least one child. Suppose that at some level l there are more than 2 binary nodes. Then n_{l+1} > n_l + 2, which contradicts the fact that T is a 2-GB-tree. Therefore, at any level of T, there are at most 2 binary nodes. Hence T is a 2-2-tree. □

Lemma
A 2-GB-tree in which all leaf nodes are at the same level is labelable.

Proof: The lemma follows immediately from the previous theorem. □

Definitions
1- Let S be a statement of the algorithm and let G be the digraph corresponding to S. A mapping for G is planar if, for any 2 nodes ni and nj of G mapped to ai and aj respectively, ai < aj implies ax < ay for all nodes nx and ny that are ancestors of ni and nj respectively (i.e., a mapping is planar if, when plotted on a grid of processors versus level number, there are no edge crossings).
2- A mapping for G is monotone if it is planar and the left child of a node is always mapped to a lower PE number than the right child.

An algorithm is given in [5] that finds a perfect monotone mapping for any binary tree, when such a mapping exists; for more details, consult that reference (section V, figure 6). Unfortunately, not all labelable trees have monotone mappings. Thus, when a tree does not have a monotone mapping, this algorithm is useless, because it does not tell us whether the tree is labelable or not. A heuristic algorithm was therefore proposed [5] that attempts to minimize the number of edges connecting 2 nodes that are mapped to non-adjacent PE's.

Heuristic algorithm
When the algorithm for finding a perfect monotone mapping fails, this heuristic algorithm is used. The algorithm attempts to find some perfect mapping for a binary tree; however, even if such a mapping exists, this algorithm is not guaranteed to find it.

Algorithm
Begin
1- Make some initial mapping.
2- If this is a perfect mapping then HALT. Else find Eij, the longest directed edge of the corresponding tree.
3- Let Eij = (ni, nj); try to reduce Eij by moving ni towards nj, or nj towards ni, without changing the level of any node. (A node is moved towards its parent or child only if this can be done without increasing the lengths of arcs in the subtree rooted at the node being moved.)
4- If Eij was reduced, go to 2. Else let Eij be the next longest edge; if there are no more edges, go to 5; else go to 3.
5- "Spread out" the assignment by multiplying all ai by 2 (ai is the processor assigned to node ni), then go to 2. (This has the effect of allowing the optimization function to get worse momentarily, so that the stepwise refinement can find a better solution. It also reduces the dependence on the initial guess by allowing the leaves to be rearranged.)
End.

Performance of the algorithm
There is no guarantee of convergence for this algorithm; it may cycle between two or more different suboptimal solutions. However, experiments consisting of implementing this algorithm as a C program and trying it on several thousand randomly generated binary trees showed that the ratio of trees that were successfully mapped to all trees that are labelable is close to 1.
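The quantity driven down by steps 2 through 4 is the length of the longest edge under the current assignment. A small sketch (ours, with a hypothetical fragment) of that measurement:

```python
# Sketch (not from the thesis): edge lengths of a tree under a PE assignment,
# as used by the heuristic to pick the longest edge to shorten.

def edge_lengths(edges, pe):
    """Return edges sorted by decreasing length |pe[u] - pe[v]|."""
    return sorted(edges, key=lambda e: abs(pe[e[0]] - pe[e[1]]), reverse=True)

edges = [("X", "X*X"), ("X*X", "X^3"), ("A", "A*X^3")]   # hypothetical fragment
pe = {"X": 0, "X*X": 1, "X^3": 1, "A": 4, "A*X^3": 2}
longest = edge_lengths(edges, pe)[0]
print(longest, abs(pe[longest[0]] - pe[longest[1]]))      # ('A', 'A*X^3') 2
```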
3. Future research

In the real world, the number of PE's in the array processor is bounded; furthermore, the digraph G associated with a statement of the algorithm is not necessarily a binary tree. Thus research considering a bounded number of PE's and an arbitrary digraph would be more useful. Also, we should try to design algorithms that find an optimal mapping when no perfect mapping exists. Finally, instead of using a linear array of PE's, it would be important to consider other kinds of interconnection networks, such as a mesh-connected network.

REFERENCES

[1] S. H. Bokhari, 'On the mapping problem,' IEEE Transactions on Computers, March 1981.
[2] H. T. Kung and D. Stevenson, 'A software technique for reducing the routing time on a parallel computer with a fixed interconnection network,' Academic Press, New York, 1977.
[3] K. Chen and K. B. Irani, 'Mapping problem and graph numbering,' Interconnection Network Workshop for Parallel and Distributed Processing, 1980.
[4] K. Irani and K. W. Chen, 'Minimization of interprocessor communication for parallel computation,' IEEE Transactions on Computers, November 1982.
[5] C. E. McDowell and W. F. Appelbe, 'Processor scheduling for linearly connected parallel processors,' IEEE Transactions on Computers, July 1986.