Automatic code partitioning for distributed-memory multiprocessors (DMMs)
INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer. The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps. Each original is also photographed in one exposure and is included in reduced form at the back of the book. Photographs included in the original manuscript have been reproduced xerographically in this copy. Higher quality 6" x 9" black and white photographic prints are available for any photographs or illustrations appearing in this copy for an additional charge. Contact UMI directly to order.

UMI
A Bell & Howell Information Company
300 North Zeeb Road, Ann Arbor MI 48106-1346 USA
313/761-4700  800/521-0600

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
AUTOMATIC CODE PARTITIONING FOR DISTRIBUTED MEMORY MULTIPROCESSORS (DMMs)

by
Moez Ayed

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(Computer Engineering)

November 1996
Copyright 1996 Moez Ayed

UMI Number: 9720180
UMI Microform 9720180
Copyright 1997, by UMI Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.
UMI, 300 North Zeeb Road, Ann Arbor, MI 48103

UNIVERSITY OF SOUTHERN CALIFORNIA
THE GRADUATE SCHOOL
UNIVERSITY PARK
LOS ANGELES, CALIFORNIA 90007

This dissertation, written by MOEZ AYED under the direction of kf.?...... Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

of Graduate Studies
Date: May 9, 1996

DISSERTATION COMMITTEE
Chairperson

Dedication

I dedicate this thesis to my father, my brothers and sisters, and especially my mother.

Acknowledgments

I am grateful to all the people who directly or indirectly helped me through the years while I was doing my Ph.D. research. First, I wish to express my sincere gratitude to my advisor, Professor Jean-Luc Gaudiot, for his assistance and guidance. I also thank the other members of my thesis committee, Professor Ellis Horowitz and Dr. Sandeep Gupta, for taking the time to attend my defense exam and for their valuable comments.
My thanks also go to my group members and friends for their assistance in various ways and for their comments. I also would like to thank my Tunisian friends who gave me moral support and who stood by me while I was doing my Ph.D. research. My special thanks go to Raouf Khelil, Anis Kallel, Sabeur Siala, Rym Ben Saiden, Mohamed and Mouna Sellami, Rym M’Hallah, and many others. My deepest thanks go to my parents, brothers, and sisters, who never stopped lending me great support when I was going through difficult times and things were looking very uncertain. My appreciation goes especially to my mother, who was there for me all the time and who supported me in many different ways during my Ph.D. years. She has always been very concerned, and I will never forget her words of encouragement. Her love and caring gave me enough strength to keep trying and to never give up until my goal was reached. She was always in great pain when I was facing obstacles and my frustrations were high. I felt great relief when I announced to her my success at defending my dissertation, which made her cheer so loudly and brought tremendous joy to her heart.

Contents

Dedication
Acknowledgments
List Of Tables
List Of Figures
Abstract

1 Introduction
1.1 Parallel Processing and Multiprocessors
1.2 DMMs: The computers of the future
1.3 Outline of this research

2 Background Research
2.1 Programming styles for multiprocessors
2.2 Existing Implementations of Functional Languages on Multiprocessors and their Inefficiencies
2.2.1 SISAL
2.2.1.1 Overview
2.2.1.2 Compilation
2.2.1.3 Current Implementations
2.2.1.4 Inefficiencies of OSC
2.2.2 VISA
2.2.3 Occamflow
2.2.4 TAM (Threaded Abstract Machine)
2.2.4.1 Brief Description
2.2.4.2 Drawbacks
2.2.5 Sarkar's Work
2.3 Main Phases of Compilers for DMMs
2.3.1 Forms of Parallelism
2.3.2 Identification of Parallelism
2.3.3 Program Code Partitioning
2.3.3.1 Construct based Partitioning
2.3.3.2 Data-flow based partitioning
2.3.3.3 Function-level partitioning
2.3.3.4 Data driven code partitioning
2.3.3.5 Code based data allocation
2.3.3.6 SPMD Model of Computation
2.3.3.7 Programmer intervention
2.3.3.8 Hierarchical Partitioning
2.3.4 Scheduling Issues
2.3.4.1 Static Scheduling
2.3.4.2 Dynamic Scheduling
2.3.4.3 Hybrid static/dynamic scheduling
2.3.5 Distribution of Data
2.3.5.1 Classification of data distribution methods
2.3.5.2 Using the Shared Memory Programming Paradigm for DMMs
2.4 Existing Code Partitioning Methods

3 The Partitioning Problem
3.1 Assumptions
3.2 Definitions
3.3 Some Properties
3.4 Task Graph
3.4.1 Defining Task Graph Weights
3.4.1.1 Node Weights
3.4.1.2 Edge Weights
3.4.2 Communication Between Tasks
3.5 Graph Execution Cost and Effect of Data Distribution
3.5.1 Nodes in the Input Program Graph
3.5.2 Graph Execution Cost
3.5.3 Cost Measures Needed for Partitioning Analysis
3.5.4 Estimation of Execution Cost of RNODES
3.5.5 Task Execution Model and Output Data Storage
3.5.6 Existing work
3.5.7 How to model effect of data distribution in the graph?
3.5.8 Conclusion
3.6 Parallel Execution Time (PARTIME)
3.6.1 Execution Profile Information
3.6.2 Cost Assignment
3.6.3 Multiprocessor Parameters and Communication Model
3.7 Problem Statement
3.7.1 Remarks
3.7.2 Why II*,?
3.7.3 Overall Procedure
3.7.4 Estimating PARTIME
3.7.5 Equivalent Problem Statement
3.7.6 Complexity
3.7.7 The Algorithm
3.7.8 Effect of Merging a Pair of Tasks
3.8 Task Merging
3.8.1 Updating Task Graph Weights as a Result of the Merger
3.8.1.1 Node Weights
3.8.1.2 Edge Weights
3.8.2 Creation of Cycles as a Result of Task Merging

4 Analysis of the Task Graph
4.1 Parallelism Loss Due to Task Merging
4.1.1 Definitions
4.1.2 Parallelism with respect to a Node
4.1.2.1 Condition for Parallelism Loss
4.1.2.2 Amount of Parallelism Lost
4.1.2.3 Remark 1
4.1.2.4 Remark 2
4.1.3 Defining Parallelism
4.1.4 Usable Parallelism
4.1.4.1 Condition for Usable Parallelism Loss
4.1.4.2 Amount of usable Parallelism Lost
4.1.4.3 Another condition for usable Parallelism Loss
4.1.5 Relationship between Parallelism and Usable Parallelism
4.1.6 Upper Bound on Degree of Parallelism
4.1.7 Theorem
4.2 Effect of Task Merging on CPL
4.2.1 Problem Statement
4.2.2 Effect on Path Length
4.2.3 Merging an Edge Belonging to the Critical Path
4.2.3.1 Effect on Execution Paths
4.2.3.2 Effect on Critical Path
4.2.3.3 Conclusion
4.2.4 Merging an Edge Not Belonging to the Critical Path
4.3 Merging Tasks: Effect of Parallelism Loss on CPL
4.3.1 No Parallelism Loss
4.3.2 There is Parallelism Loss
4.4 A Comparison with DSC
4.4.1 Task Merging
4.4.2 Task Clustering
4.4.3 Consequence
4.5 Criteria for Merging

5 The Partitioning Heuristics
5.0.1 Heuristics
5.1 Some Properties
5.2 Time Complexity of Partitioning Algorithm
5.2.1 DAG Traversal
5.2.2 Determining the Notions Used by the Partitioning Algorithm
5.2.3 Time Complexity Using Heuristic 1

6 Performance Analysis
6.1 Partitioning Fork and Join DAGs
6.1.1 Fork DAGs
6.1.2 Join DAGs
6.2 Partitioning Complete Binary Trees
6.2.1 Properties
6.2.2 Optimal Partition
6.2.3 Using Heuristic 1
6.2.4 Performance of Heuristic 1
6.2.5 Using Sarkar's Partitioning Method
6.3 Merger that Results into Higher PARTIME

7 Conclusions and Future Research
7.1 Conclusions
7.2 Future Research

List Of Tables

6.1 Performance of Heuristic 1 for Complete Binary Trees
6.2 Table for Example 1
6.3 Table for Example 2
6.4 Table for Example 3
6.5 Table for Example 4
6.6 Table for Example 5
6.7 Table for Example 6
6.8 Table for Example 7

List Of Figures

2.1 Internal Structure of OSC
2.2 Classification of data distribution methods
3.1 Path from N_i to N_{i+1} (1 ≤ i ≤ m − 1)
3.2 Path from N_i to n_{m+1} (1 ≤ i ≤ m)
3.3 Two examples where {n_1, n_2, ..., n_m} cannot be an anti-parallel set
3.4 || and ⊥ are not transitive
3.5 Number of elements in parallel sets
3.6 Two tasks mapped to the same PE
3.7 Explicit Task Merging
3.8 Explicit Task Merging
3.9 Cycle creation
3.10 Cycle creation
4.1 Parallelism Loss
4.2 Parallelism Loss
4.3 Parallelism Loss
4.4 Example 1
4.5 Example 2
4.6 No parallelism loss
4.7 No parallelism loss
4.8 No parallelism loss
4.9 Example: no parallelism loss
4.10 Example: no parallelism loss
4.11 Example: no parallelism loss
4.12 Example: there is parallelism loss
4.13 Example: there is parallelism loss
4.14 Example: there is parallelism loss
4.15 Example: there is parallelism loss
4.16 Example: there is parallelism loss
4.17 Example of task merging
4.18 Example of task clustering
6.1 Fork DAG
6.2 Determining optimal partition for fork DAGs
6.3 Proof: optimal partition for fork DAGs
6.4 Proof: optimal partition for fork DAGs
6.5 Proof: optimal partition for fork DAGs
6.6 Proof: optimal partition for fork DAGs
6.7 Proof: optimal partition for fork DAGs
6.8 Proof: optimal partition for fork DAGs
6.9 Examples of fork DAGs
6.10 Join DAG
6.11 Merging steps for join DAGs using our heuristic
6.12 All G-trees with 2 levels
6.13 All G-trees with 3 levels
6.14 Optimal G-tree with 3 levels
6.15 Plot of f(i)
6.16 Optimal task graph
6.17 Optimal task graph
6.18 Optimal task graph
6.19 Performance of Heuristic 1
6.20 CPL as a function of the iteration number
6.21 PARTIME Plot for Example 1
6.22 PARTIME Plot for Example 2
6.23 PARTIME Plot for Example 3
6.24 PARTIME Plot for Example 4
6.25 PARTIME Plot for Example 5
6.26 PARTIME Plot for Example 6
6.27 PARTIME Plot for Example 7

Abstract

Two of the main phases of compilers for Distributed Memory Multiprocessors (DMMs) are the code partitioning and scheduling phases. Several satisfactory solutions have been proposed for the scheduling phase. However, much more remains to be done for the code partitioning phase. Existing work on the partitioning problem either considers a specific application and finds an efficient partitioning scheme for it (i.e., no automatic partitioning), or determines a general solution (automatic partitioning) that is too simple and therefore inefficient (e.g., one that exploits only one level of parallelism). Our research deals with the code partitioning phase of the compiler. We propose a data-flow based partitioning method in which all levels of parallelism are exploited. Given a Directed Acyclic Graph (DAG) representation of the program, we propose a procedure that automatically determines the granularity of parallelism by partitioning the graph into tasks to be scheduled on the DMM. The granularity of parallelism depends only on the program to be executed and on the target machine parameters. The output of our algorithm is passed on as input to the scheduling phase.
Finding an optimal solution to this problem is NP-complete. Because of the high cost of graph algorithms, it is nearly impossible to find close-to-optimal solutions that do not have very high (higher-order polynomial) cost. Therefore, we propose heuristics that give good performance and that have relatively low cost.

Chapter 1
Introduction

1.1 Parallel Processing and Multiprocessors

Improvements in device technology are no longer capable of meeting the performance demands of today's applications. Modern sequential computers are approaching the fundamental physical limitation that signals cannot travel faster than the speed of light, yet the requirements of current scientific problems are ever-increasing. Parallel processing research [3, 24, 25, 28, 34] offers architectural solutions to this problem. One example of such a solution is multiprocessor systems, which are gaining more and more popularity in the research community and even in industry. The hope behind designing these machines is that the collective computational power of the multiple processors will enable us to solve very large problems. Because of the importance of multiprocessors, it is expected that these machines will become widely commercialized and widely used to solve complicated and computationally demanding problems.

The advances in the hardware design of parallel computers have not been followed by corresponding advances in the software to program these machines. This is especially true for Distributed Memory Multiprocessors (DMMs), which have no shared memory that can be used by all the Processing Elements (PEs). High-level programming abstractions for these machines are almost non-existent,
leaving programmers the task of explicitly programming these architectures using machine-dependent, low-level abstractions. This approach is error-prone and forces the programmer to deal with many details outside of the application domain. More precisely, the programmer has to deal with all the parallel processing tasks required to program the parallel machine: explicitly partitioning the program code into parallel tasks, scheduling these tasks on the PEs, synchronization, explicitly distributing data among the PEs, and inserting the message passing calls needed to exchange data from one remote memory to another.

Because of the problems mentioned above, providing solutions to ease the task of programmers of multiprocessors has become a very active area of research in the last few years [1, 4, 5, 11, 12, 26, 27, 32, 38, 39, 40, 43, 45, 55, 60]. Much effort is being devoted to making these parallel processing tasks be performed automatically by the compiler of the parallel machine. This way, the user does not have to know the details of the architecture of the machine; his or her main concern is the specification of the algorithm for solving the problem.

Several languages have been proposed for programming multiprocessors, among them concurrent PASCAL, ADA, OCCAM, parallel FORTRAN, and parallel C. In all of these languages, the programmer has to express the parallelism explicitly. Furthermore, the programmer is responsible for partitioning the program, allocating tasks to different PEs, and scheduling the execution of the tasks on the PEs. Hence, programming multiprocessors is still a very complicated problem. To overcome this problem, much research is being done on efficiently programming multiprocessors using functional languages.
For that to be possible, we need sophisticated compilers with powerful optimizers, so that the implementation of these languages can be efficient. The gains are:

• Programming in functional languages is very user friendly and faster than using imperative languages, since reading and correcting functional programs is much easier than reading and correcting programs in conventional languages.

• When using functional languages, the user does not have to know the details of the machine: the programmer is freed from those details.

• Parallelism is expressed implicitly in the program and is extracted automatically by the compiler.

• Partitioning, allocation, and scheduling are done automatically by the compiler rather than by the programmer.

All of these advantages can be summarized by saying that using functional languages to program multiprocessors makes the software much easier to produce and maintain, and thus cheaper. The programmer is provided with a very high level language, where the main concern is the specification of the algorithm for solving a problem. Furthermore, if we compile a functional programming language for a whole group of multiprocessors, then a program executed on one of these machines need not be rewritten to execute on another machine in the group (portability of software).

1.2 DMMs: The computers of the future

In order to solve the very large problems that face our scientific community today, we need computers capable of supporting thousands of powerful processors whose aggregate computing capabilities are sufficiently strong. Shared memory multiprocessors cannot support a large number of PEs, and therefore cannot be used to solve this kind of problem efficiently.
DMMs, on the other hand, are potentially scalable to a very large number of PEs, and hence are the right kind of machine for solving these large-scale problems. A major difficulty with the current generation of Distributed Memory Machines is that they generally lack programming tools for software development at a suitably high level. The user, who is provided with separate address spaces, has to deal with all aspects of the distribution of data and of the distribution of the workload to the processors, must explicitly take care of inter-PE communication by using communication constructs to send and receive data (message passing), and must control the program's execution at a very low level. This results in a programming style similar to assembly programming on a sequential machine: tedious, time consuming, and error prone. The programmer has to face several issues that have no counterpart in sequential programming, such as deadlock, which is a major challenge for programmers of multiprocessors. The programmer also has to decide when it is advantageous to replicate data across processors rather than send data. Moreover, debugging can be extremely difficult. This has resulted in very slow software development cycles and, in consequence, very high software costs.

This research is an attempt at making the programming of DMMs very user friendly and thereby keeping software costs low. Our main objective is to provide the user with a machine-independent programming model which is easy to use and, at the same time, performs with acceptable efficiency. This will make the software portable to different DMMs. Furthermore, changing the parallel program to reflect a change in the specifications of the problem will be an easier task. Some examples of DMMs are the CM5, T3D, Intel Paragon, and the Hypercube.
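The low-level send/receive style criticized above can be sketched in miniature. In the following toy example, Python's multiprocessing pipes merely stand in for the message-passing constructs of a real DMM, and the worker function and sum-of-squares task are invented for illustration; the point is that the two processes share no memory, so the data and the result must each be shipped explicitly, exactly the burden described in the text.

```python
# Illustration of the explicit send/receive programming style: each
# process has its own address space, so every piece of data a "PE"
# needs must be sent to it, and every result must be sent back.
from multiprocessing import Process, Pipe

def worker(conn):
    # "PE 1": receive a block of data, compute on it, send the result back.
    data = conn.recv()                    # blocking receive
    conn.send(sum(x * x for x in data))   # explicit send of the result
    conn.close()

def main():
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send(list(range(10)))          # "PE 0" distributes the data...
    result = parent.recv()                # ...and waits for the result
    p.join()
    return result

if __name__ == "__main__":
    print(main())  # prints 285 (0^2 + 1^2 + ... + 9^2)
```

Even in this two-process toy, the programmer must pair every send with a matching receive and decide which process owns which data; the compiler techniques discussed in this thesis aim to generate such communication automatically.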
1.3 Outline of this research

Our research deals with the code partitioning phase of the compiler. We propose a data-flow based partitioning method where all levels of parallelism are exploited. Given a Directed Acyclic Graph (DAG)⁴ representation of the program, we propose a procedure that automatically determines the granularity of parallelism by partitioning the graph into tasks to be scheduled on the DMM. The granularity of parallelism depends only on the program to be executed and on the target machine parameters. The output of our algorithm is passed on as input to the scheduling phase. Finding an optimal solution to this problem is NP-complete. Because of the high cost of graph algorithms, it is nearly impossible to come up with close-to-optimal solutions that do not themselves have very high cost (higher order polynomial). Therefore, we propose heuristics that give good performance and that have relatively low cost. The rest of this thesis is organized as follows: chapter 2 covers the motivation of this work. In chapter 3 we define the partitioning problem. Chapter 4 describes the analysis done to determine the choice of heuristics. Chapter 5 describes the partitioning heuristics. Chapter 6 presents the performance analysis of our algorithm. Finally, chapter 7 summarizes our work and discusses possible future research.

³This is called message passing.
⁴From now on, Directed Acyclic Graph is abbreviated to DAG.

Chapter 2 Background Research

2.1 Programming styles for multiprocessors

Programming multiprocessors can be broadly classified into four methods:

1. Explicit Imperative Programming: In this case, we use an imperative parallel language such as parallel Fortran or parallel C to program the DMM.
The user is responsible for all dependence analysis and for inserting the parallelizing and synchronizing statements in the correct places. In addition, the programmer is responsible for the data distribution and the data movement statements (for DMMs only). This programming style is comparable to assembly programming for sequential machines. It is very time consuming and error-prone, yet usually produces the most efficient code.

2. Implicit Imperative Programming: Here the programmer uses a conventional sequential language such as Fortran, Pascal or C. It is the job of a very intelligent compiler to extract the parallelism in the program using data dependence analysis, insert the appropriate parallelization and synchronization primitives, distribute the data across the PEs and insert the required message passing routines for inter-PE communication (for DMMs only). Usually the data dependence relations for imperative languages are quite obscure and many false dependencies exist. Therefore, the compiler is forced to make very conservative decisions. This results in under-parallelization of the program. Hence, it is generally very difficult to design such compilers so that they are efficient for a wide range of applications.

3. Hybrid Implicit/Explicit Imperative Programming: In this programming style, the language used is an extension of an existing sequential imperative language such as Fortran or C. In addition to the usual code, the programmer is responsible for specifying the data layout (the distribution of data across the PEs), or the processors on which different pieces of code will execute (such as different iterations of a loop), or both. These specifications can be part of the source code, or take the form of compiler directives or pragmas.
With the help of the user specifications, it is much easier for the compiler to perform the tasks required to parallelize the code. An example of such a programming language is Fortran D [26]. Another approach within this programming style is to use an imperative language augmented with some explicit parallel statements. In this case, the user explicitly specifies which statements or pieces of code execute in parallel. Again, this facilitates the analyses done by the compiler.

4. Functional Programming: The above mentioned problems with imperative languages have led to the investigation of other kinds of languages. Functional languages are an example. Here, the programmer uses a functional language to write the code. All the user needs to know is the programming language that he/she is using, without any concern for the details of the machine on which the program is going to execute. It is the job of the compiler to produce the target program, which is executable directly on the target machine and compiled using the local compiler. This means that the compiler is responsible for partitioning the code (i.e. creating the parallel tasks), scheduling these tasks on the PEs (i.e. task distribution), managing the tasks for efficient execution on the PEs, and memory management (e.g. distributing the data across the PEs so that the number of remote references is minimized). Experience has shown that designing such a sophisticated compiler is a very hard problem. For example, both the optimal partitioning and the optimal data distribution problems are NP-complete. Most existing compilers rely on programmer intervention to help the compiler with the analyses. This is done by either enabling or forcing the programmer to give some hints to the compiler regarding data distribution, task distribution and management, or both.
In summary, there is a tradeoff of performance against programming effort. The more explicit the programming of DMMs is, the better the performance but the greater the programming effort. The more implicit we make this task, the smaller the programming effort becomes, at the expense of lower performance. We are faced with the challenging task of providing the programmer with a high level language capable of abstracting the underlying architecture, implicitly detecting the parallelism in the program, and managing the parallelism for efficient execution on a wide range of multiprocessor systems, and this should not come at the expense of performance. Because of all the above mentioned points, we are convinced that functional languages are the right programming languages to use in order to have good programmability for multiprocessors.

2.2 Existing Implementations of Functional Languages on Multiprocessors and their Inefficiencies

To this day, there is no satisfactory programming environment for multiprocessors, even using functional languages. This is especially true for DMMs. Most of the inefficiencies associated with the existing methods have to do with the code partitioning, data partitioning and scheduling.

2.2.1 SISAL

2.2.1.1 Overview

SISAL¹ [33] is a general purpose functional language that supports data types and operations for scientific computation. It is intended for use on a variety of sequential, vector, multiprocessor and data-flow architectures. A primary goal in the design of SISAL was to express algorithms for execution on computers capable of highly parallel operation. It is expected that SISAL will evolve into a general purpose programming language targeted to run on future parallel computers.
¹Researchers at the Lawrence Livermore National Laboratory (LLNL), in collaboration with individuals from the University of Manchester, Colorado State University and the Digital Equipment Corporation, have developed the programming language SISAL (Streams and Iteration in a Single Assignment Language).

Being an applicative language, SISAL uses functions for all operations to aid the identification of concurrency. This results in a language with very clean semantics. In addition, SISAL has an elegant functional representation in its intermediate forms (the data-flow graphs IF1 and IF2). The language syntax, being similar to Pascal, is easy to learn and read. SISAL is a strongly typed language. All inputs and outputs of expressions and functions are values (no memory address references are used). Each value has an associated SISAL data type. There are basic scalar arithmetic types (character, boolean, integer, real and double precision) and aggregate types (arrays, records, unions and streams). SISAL supports both sequential (non-product form) and parallel (product form) loop constructs. The non-product form resembles sequential iteration in conventional languages, but retains single assignment semantics. The product-form loop allows the programmer to specify iterations that perform inner (dot) and outer (Cartesian) array and stream index computations. All iterations must be independent of one another. The programmer uses this construct to express parallelism explicitly. In addition to iteration forms, SISAL supports program structures for conditional execution. Note that all structured expressions and functions in SISAL can produce two or more values via multi-expressions: comma separated lists of
expressions producing values of any type. Such expressions are both convenient and well suited for parallel evaluation.

2.2.1.2 Compilation

SISAL program → Target Machine M (a parallel computer, e.g. a shared or distributed memory multiprocessor).

The SISAL compiler consists of three parts: a front end, a back end, and a run-time system [9, 44, 8, 16, 50, 53]. Figure 2.1 shows the internal structure of an existing compiler for SISAL, developed at LLNL. This compiler is called OSC (Optimizing SISAL Compiler).

Figure 2.1: Internal Structure of OSC. (The pipeline runs from SISAL source and include files through the parser and IF1 loader, a sequence of IF1 optimizations: inline expansion, invariant removal, constant folding, dead code removal, normalization, record fission, CSE, loop fusion and global CSE, then the IF2 phases IF2MEM and IF2UP, and finally CC/F77 compilation and linking with the libraries into an executable.)

1. Front end: architecture independent. SISAL → IF1 graph. In this step, the syntax analysis of the SISAL program is done. Next, the program is translated into an intermediate dependence graph form in IF1. IF1 [51] is an applicative data-flow graph language.

2. Back end: IF1 graph → optimized IF1 → IF2 graph → optimized IF2. Optimized IF1/2 → language L directly executable on M → use M's local L compiler → execute on M. Some optimization techniques are applied to the IF1 graph to get an optimized IF1. Next, the IF1 graph is extended into an IF2 graph [54], which is a superset of IF1, consisting of the IF1 graph plus some memory requirements and specifications. More precisely, in the IF2 graph we attempt to preallocate array storage whenever possible, in order to reduce the array copying that results from the incremental aggregate construction problem. IF2 is not an applicative language, since it directly references and manipulates memory. This optimization phase from optimized IF1 to IF2 is called build-in-place analysis.
The next phase is the update-in-place analysis. Here the IF2 graph is further optimized to identify at compile time those operations that can execute in place, and to improve the chances for in-place operations at run time when the analysis fails. The result of this phase is the optimized IF2 graph. Note that both build-in-place and update-in-place analysis are optimization phases that try to reduce the aggregate copying overhead incurred because of the single assignment nature of SISAL.

3. Run-time system: This is the library software that provides support for parallel execution, storage management and interaction with the user. This library of routines is called from the program L generated by the SISAL compiler. Program L is then compiled using M's local L compiler, and the result is linked with the run-time system and executed on the target machine.

The compiler analysis up to the optimized IF2 graph is done by the SISAL group at LLNL. The analysis up to the optimized IF1 is completely independent of the architecture. The analysis from the optimized IF1 through the optimized IF2 is architecture independent as well, but was done with the assumption that the target machine has a single shared memory. All aggregate data is assumed to be allocated in a contiguous block of memory.

Portability of OSC

OSC was designed primarily to target shared memory multiprocessors. Complete implementations exist for various shared memory machines. It is quite easy to port OSC to different shared memory multiprocessors. All that is needed is to make some minor modifications to some low-level routines and some library routines, to reflect the new run-time system and low-level routines of the new target machine. It is, however, much harder to port OSC to DMMs.
This is so because, when writing a parallel program to target a DMM, we have to deal with data partitioning, which is not an issue for shared memory multiprocessors. Hence the compiler has to take care of the non-local memory accesses and the message passing mechanism, which are not included in the OSC compiler. However, the parts of OSC which are architecture independent can still be used. As for our project, we can use all of the OSC analysis which is architecture independent. In addition, the analysis that involves IF2 and the corresponding optimizations can be used as well, despite the fact that IF2 was designed with a single address space in mind. This is true because, as we will see later in the proposed research, a virtual shared memory mechanism is used. As for the graph partitioning, our proposed method is much more complex than the one used in OSC, and therefore that part of the compiler will have to be redesigned.

2.2.1.3 Current Implementations

SISAL has proven to be an efficient language for solving many scientific applications. It can be executed on both conventional and novel architectures. Current implementations of SISAL exist for sequential machines and for shared memory multiprocessors [9, 16, 6, 7, 52, 2, 56, 37, 35, 10, 30, 29, 36]. It has been targeted to most Unix-based uniprocessors. There is also ongoing research in distributed memory SISAL implementations [19, 20, 21, 23, 22, 41]. In addition, the intermediate data-flow graph representation of SISAL programs can be executed on data-flow machines. SISAL has competed very well with the sequential and parallel execution performance of imperative languages such as C and Fortran, on uniprocessor machines as well as on various multiprocessors and vector architectures.
However, we still have to come up with efficient implementations of SISAL on DMMs that can compete with implementations of conventional languages on these machines. Some of the machines on which SISAL has been implemented are: Sun workstations, Vax machines, the Macintosh II, the Sequent Balance, the Cray X/MP, the Alliant FX/8, the Encore Multimax, the Warp machine, the Connection Machine, the nCUBE/2, the HEP, Transputers, and the University of Manchester Data-flow Machine.

2.2.1.4 Inefficiencies of OSC

The main problem with the OSC compiler is the simplicity of the partitioning scheme used. It is syntax based and exploits only the parallelism in FORALL loops. Hence the granularity of parallelism is defined by language constructs, and the programming style affects the multiprocessor performance.

2.2.2 VISA

VISA [21, 23, 22] is a system that targets SISAL to DMMs. It uses the same simple partitioning scheme used by OSC. The virtual shared memory paradigm is used for data partitioning. The mechanism for translating virtual addresses to physical ones is very costly, and therefore introduces tremendous run-time overhead. This system uses a dynamic scheduling scheme, where each PE keeps its own ready queue. This introduces much run-time overhead. Also, because of the distributed ready queues, this method causes load imbalance.

2.2.3 Occamflow

Occamflow [31, 17] is an implementation of SISAL on a distributed memory multiprocessor of transputers. The compiler takes as input a SISAL program and generates an OCCAM program loadable on the network of transputers. Because the target code of the compiler was OCCAM, many drawbacks followed. First of all, using OCCAM, the router (for communication between PEs) has to be explicitly written as part of the code.
Because of the nature of the OCCAM programming environment, there is no way for the compiler to generate this router automatically, or to produce a universal router that works for all applications. Hence, the programmer has to write this router manually in OCCAM. In addition, the programmer has to add some OCCAM code to the output generated by the compiler. For example, all variable declarations in OCCAM have to be written manually by the programmer. Furthermore, since OCCAM does not allow recursive function calls, no implementation of recursive calls was done. More importantly, the partitioning scheme was too simple (syntax based) and was not even implemented. The compiler generates the code for one transputer, and it is the job of the programmer to partition and load the code on the network. This was mainly due to the primitive nature of the programming environment of the transputers available at that time (for example, there was no operating system available that provided routines taking care of the low level details, such as the routing between PEs). Also, the data partitioning has to be done manually by the programmer using OCCAM.

2.2.4 TAM (Threaded Abstract Machine)

2.2.4.1 Brief Description

TAM [13, 14] defines a self-scheduled machine language of parallel threads, which provides a path from data-flow program representations to conventional control flow. It presents a model that exploits fine grain parallelism and fine grain synchronization without any specialized hardware support (with minimum hardware support). It is an attempt to prove that exploiting fine grain parallelism is a compiler and program representation issue rather than a hardware issue, and that a conventional parallel machine, coupled with the right program representation and the right compiler, is able to do so efficiently.
The overall goal in compiling to TAM is to produce code that is latency tolerant, yet obtains processor efficiency and locality. All memory transactions and message passing primitives are split-phase. This encourages a latency tolerant style of code generation. All synchronization, scheduling, and storage management is explicit and under compiler control, yet dynamic. This enables the compiler to optimize the use of processor resources for the expected case rather than the worst case. The TAM model shows that implicit scheduling in hardware is of questionable value, as it prevents register usage beyond thread boundaries. Exposing scheduling to the compiler allows it to synthesize particular scheduling policies in specific portions of the program. Note that the goal of TAM research is not to prove that the exploitation of fine grain parallelism is the most efficient approach, or that it is better than the other existing methods. It is merely an attempt to come up with a software approach for fine grain parallelism and see how much performance can be obtained from it. In fact, all TAM performance results give statistics regarding context switch frequency, dynamic thread length, the duration of a quantum, etc. Nothing is mentioned about the absolute performance of TAM, such as the total execution time of real applications and its comparison with currently existing approaches. It is quite obvious that the absolute performance of TAM is slower than that of other approaches that use coarser grain parallelism.

2.2.4.2 Drawbacks

In attempting to exploit fine grain parallelism without any hardware support, TAM tends to introduce extra run-time overhead, due to the software support needed to implement its model. This overhead is due in part to the explicit scheduling of threads and frame activations. Exploiting fine grain scheduling is an expensive process.
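One component of this expense is communication: each network message pays a fixed start-up cost on top of the per-datum transfer cost. This can be made quantitative with the usual linear cost model T(n) = α + βn (the values of α and β below are invented for illustration and do not describe any particular machine): sending the same data as many small messages pays the start-up cost many times over.

```python
# Illustrative linear message-cost model: T(n) = alpha + beta * n.
# The constants are made-up example values, not measurements.
ALPHA = 100.0  # start-up cost per message (arbitrary time units)
BETA = 1.0     # transfer cost per data item

def transfer_time(total_items, n_messages):
    """Cost of moving total_items split evenly over n_messages."""
    return n_messages * ALPHA + total_items * BETA

coarse = transfer_time(1000, 1)    # one large message:   1100.0
fine = transfer_time(1000, 100)    # 100 small messages: 11000.0
```

With these example constants, moving the same 1000 items as 100 small messages costs ten times as much as one large message, which is exactly the overhead fine grain threads and inlets risk incurring.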
Although the fine grain parallelism in TAM is exploited within a single processor and not across processors², it still generates more inter-PE communication overhead than coarse grain parallelism does. Because fine grain parallelism is exploited within single processors, roughly the same amount of remote data will be requested by the processors as with coarse grain parallelism. However, since the code inside each processor is divided into threads and inlets, it is more likely that a larger number of smaller messages will be requested, for threads and inlets use smaller messages. Since each message sent on the network has a start-up time in addition to the time taken to communicate the data, this results in higher communication overhead. In addition to the above problems with the TAM model, we have to mention that it would be difficult to come up with an automatic compiler to target a high-level language to the parallel assembly language defined by TAM to represent programs (TL0). It is quite hard to partition the code into threads and inlets in a way that results in efficient execution. Furthermore, code-blocks³ correspond to function bodies and loop bodies; hence the partitioning into code-blocks is too simple.

2.2.5 Sarkar's Work

Sarkar [47, 46, 48, 49] developed a partitioning and scheduling method for functional languages represented by data-flow like graphs. The graphs that he used are a generalization of the IF1 graphs used for SISAL. His method targets both shared and distributed memory multiprocessors. The multiprocessor model is applicable to all kinds of multiprocessors, including both shared and distributed memory multiprocessors. This model was so general that it was too simple and did not accurately represent real machines.
²Code-block activations are distributed across the processors, but each code-block activation is mapped to a single processor, and therefore all threads within a code-block are executed inside a single processor.
³This is the unit of parallelism between processors, since each entire code-block activation is mapped to a single processor.

For instance, the architecture model does not take into account the true characteristics of DMMs and their limitations, such as the high cost of inter-processor communication. Also, the program execution model is not efficient for DMMs. It allows any compound node in the graph to execute in parallel, in which case the compound node and the nodes belonging to it (i.e. nodes that belong to subgraphs of the compound node) execute on separate processors. For DMMs this is not efficient, since it can generate too much communication overhead at run-time. For instance, LOOPA and LOOPB nodes⁴ are allowed to execute in parallel. To do this, the nodes that belong to the subgraphs of the LOOP⁵ node are distributed across the processors, and the processor where the LOOP node executes is responsible for distributing the input(s) and gathering the output(s) of the LOOP node. This generates too much traffic in the network, since for each iteration of the loop we have to communicate messages between the processors involved in the execution of the LOOP node and the nodes belonging to its subgraphs. Another problem with Sarkar's approach (refer to the compile-time partitioning and scheduling part) has to do with compound non-FORALL nodes which are macro nodes⁶ (let us call these nodes nm). When partitioning a graph g, all subgraphs of the nm nodes are partitioned first (using a recursive call to the partitioning algorithm), and then the nm nodes are assigned to tasks.
Therefore an nm node and the nodes belonging to its subgraphs will belong to different tasks. Hence the tasks are not guaranteed to be independent of each other. This violates the convexity constraint and makes the compiler analysis more complicated. The code partitioning method proposed by Sarkar is too simple. More will be said about this later in this chapter. Finally, Sarkar's work does not solve the data distribution problem for DMMs.

⁴LOOPA and LOOPB nodes correspond to the (Repeat ... Until) and (While ... Do) constructs respectively in SISAL.
⁵LOOP stands for LOOPA and LOOPB nodes.
⁶A macro node is a node that is allowed to execute in parallel.

2.3 Main Phases of Compilers for DMMs

In addition to all the phases required by compilers for conventional sequential machines, compilers for DMMs also include phases required for parallel processing. The most important of these phases are: identification of parallelism, program code partitioning, scheduling, data partitioning (also called data distribution), and insertion of the appropriate message passing calls needed to exchange data from one remote memory to another.

2.3.1 Forms of Parallelism

The parallelism in a program is exposed implicitly by a programming language or a compiler, or explicitly by the programmer. The granularity of parallelism is the size of the schedulable unit of parallelism, called the grain. The different forms of parallelism are characterized by the grain size and are as follows:

• Procedure or loop level parallelism uses entire loops or procedures (or functions), or different iterations of the same loop, as grains. Because the granularity here is large, we call this coarse grain parallelism.

• Thread level parallelism uses basic blocks as grains.
A basic block is a sequential piece of code of medium size that does not contain any loops or jump instructions. These blocks are also called threads. A thread is called blocking if it has long latency operations, such as reads and writes; it is called non-blocking if it does not. This form of parallelism is called medium grain parallelism.

• Instruction level parallelism uses individual instructions as grains. Since the granularity in this case is very small, this is called fine grain parallelism. It offers the largest amount of parallelism, at the expense of much run-time overhead.

Managing the exposed parallelism in an application is a hard problem. Once the program is partitioned into grains, we need to do the scheduling, load balancing, and synchronization, among other tasks. As the granularity decreases, so does the load imbalance; the scheduling and synchronization overhead, however, increases as the granularity decreases [48].

2.3.2 Identification of Parallelism

During this phase, we have to identify all operations that can execute concurrently. Usually this is done by drawing the dependency graph of the program. For imperative languages, this task is quite difficult, due to the side effects caused by the updating of variables. For functional languages, this process is quite simple, since all data refers to values and not memory cells. The parallelism is implicit at all levels, and data dependencies are the only sequencing constraints. As soon as an instruction has all its data ready, we can safely execute it. Even though IF2 is not purely applicative, it is designed in such a way that the data dependencies ensure that when all the data of an actor is ready, we can safely execute it.
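The identification step is indeed direct for single assignment code, as a toy sketch shows (the four-statement program below is an invented example, not taken from any SISAL benchmark): a statement depends exactly on the statements that define its inputs, and those edges are the only sequencing constraints.

```python
# Toy single-assignment program as (defined value, input values) pairs.
program = [
    ("a", []),          # a = input()
    ("b", []),          # b = input()
    ("c", ["a", "b"]),  # c = a + b
    ("d", ["a"]),       # d = a * 2
]

def dependence_edges(program):
    """For single-assignment code, dependence analysis is immediate:
    statement j depends on statement i iff j reads the value i defines."""
    defined_by = {target: i for i, (target, _) in enumerate(program)}
    return {(defined_by[v], j)
            for j, (_, inputs) in enumerate(program) for v in inputs}

edges = dependence_edges(program)
# {(0, 2), (1, 2), (0, 3)}: c needs a and b, and d needs a.
# No path connects statements 2 and 3, so they may run concurrently.
```

With imperative updates of memory cells the same analysis would require alias and side-effect reasoning; here the value names tell the whole story.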
2.3.3 Program Code Partitioning

Once the parallelism is identified, the first thing that we have to do in the back end compiler analysis is to partition the IF2 (or IF1, depending on which intermediate form we use) graph. Partitioning of a parallel program is the separation of the program's operations into sequential tasks that can be executed concurrently. In other words, the partition specifies the sequential units of computation in the program⁷. More precisely, during this phase we group the concurrent operations identified during the previous step into sequential tasks to be executed in parallel. Therefore, when partitioning the code, one of the things that we have to decide on is the granularity of the partition⁸. There is a trade-off between fine and coarse grain partitioning. The finer the partition is, the more available parallelism we have, and the smaller the load balancing overhead is. However, this comes at the expense of higher overhead to exploit the parallelism, such as higher communication and synchronization overhead.

⁷We call these sequential units tasks.
⁸We can have fine grain partitioning, coarse grain partitioning, or something in between.

For DMMs, it is particularly important to reduce both inter-PE communication and load imbalance. The partitioning problem for a general DAG, where nodes represent computations and arcs carry the data values, is NP-complete, ruling out an exact polynomial-time algorithm unless P = NP. Therefore, all we can do is try to come up with a heuristic algorithm that gives us performance as close to optimal as possible. Generally, this is still a very hard problem. An algorithm that gives a satisfactory solution has yet to be found. Some partitioning methods were proposed in [48, 47, 46, 49, 57, 12].
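To give a flavor of what such a heuristic looks like (a generic sketch only, and explicitly not the heuristic proposed in this thesis): one simple family of methods starts fully fine-grained and greedily merges DAG nodes joined by expensive edges into a single sequential task, coarsening the grain so that the costliest communication becomes internal to a task.

```python
def partition_by_edge_cost(nodes, edges, threshold):
    """Greedy coarsening sketch: nodes joined by an edge whose
    communication cost exceeds `threshold` are merged into one task.
    `edges` maps (u, v) -> communication cost; returns node -> task id."""
    task = {n: i for i, n in enumerate(nodes)}  # start fully fine-grained
    # Consider the most expensive edges first.
    for (u, v), cost in sorted(edges.items(), key=lambda e: -e[1]):
        if cost > threshold and task[u] != task[v]:
            old, new = task[v], task[u]
            for n in task:                      # merge v's task into u's
                if task[n] == old:
                    task[n] = new
    return task

# Invented example DAG with communication costs on the arcs.
nodes = ["a", "b", "c", "d"]
edges = {("a", "c"): 10, ("b", "c"): 1, ("c", "d"): 8}
tasks = partition_by_edge_cost(nodes, edges, threshold=5)
# a, c and d are merged; only the cheap b -> c edge crosses tasks.
```

Even this crude rule exhibits the trade-off discussed above: a low threshold coarsens aggressively (less communication, less parallelism), while a high threshold keeps the partition fine. A realistic heuristic must also respect constraints such as convexity of the tasks, which this sketch ignores.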
The partitioning of a program can be done either at compile-time or at run-time. Run-time partitioning has the advantage of using run-time information about the behavior of the program, which may lead to a better partition. However, this comes at the expense of introducing tremendous extra overhead during program execution. Hence, the partitioning algorithm has to be very simple. We can also have hybrid compile-time/run-time partitioning. Here, an initial partition is made at compile-time. Then, at run-time, we can use some information regarding the behavior of the program to repartition the code and come up with a better partition. This method suffers from the same drawbacks as the pure run-time scheme: mainly, it introduces too much run-time overhead. The different partitioning methods for distributed memory machines can be classified as follows:

2.3.3.1 Construct based partitioning

This is also called syntax-based partitioning. Here we try to exploit only the parallelism offered by the syntax of the language. For example, in SISAL we have the pipelined parallelism in streams and the concurrency in FORALL loops. Function calls can also be spawned as separate tasks. Since in this case the granularity of parallelism is defined by language constructs (e.g. compound expressions and user-defined functions), the programming style dramatically affects the multiprocessor performance. Clearly, this is undesirable. This is the simplest partitioning method in terms of analysis of the program. However, it usually gives us the least amount of parallelism.

2.3.3.2 Data-flow based partitioning

In this method, we try to exploit all kinds of parallelism available in the DAG. Any two nodes are allowed to execute in parallel, provided that no data dependency exists between them. Any node is allowed to execute as soon as all of its input data is available.
For imperative languages, much analysis is required to determine all the parallelism available in the program. For functional languages, however, this is a straightforward task.

2.3.3.3 Function-level partitioning

All functions which are independent of each other may be executed in parallel. For functional programs, two functions are independent of each other if the input of neither one is the output of the other. The Church-Rosser property ensures that the order of evaluation of functions which are independent of one another will not affect the outcome of the program.

This method might result in a granularity which is too coarse, and therefore we might not get enough parallelism to keep all PEs busy.

2.3.3.4 Data driven code partitioning

In this approach, the data is partitioned and mapped to the PEs. A processor is then thought of as owning the data assigned to it; these data elements are stored in its local memory. The work is then distributed according to the data distribution: computations that define the data elements owned by a processor are performed by it. This is called the owner-computes paradigm9. Note that we can apply the owner-stores paradigm10 instead. Our hope in doing so is that the code will be mapped to the PEs in such a way that all (or at least most) of the data references are local, resulting in less communication on distributed memory machines.

9 The owner-computes rule states that all computations updating a given datum are performed by the processor owning that datum.

The success of this approach is very program dependent. For programs where information about some of the data (e.g. array sizes) can only be determined at run time, implementing this method might be quite difficult and could introduce much run-time overhead.
Furthermore, this approach over-emphasizes the locality issue. When we distribute the data first and then distribute the code so as to preserve locality of reference, the resulting code partition may be unbalanced, leading to poor performance. For an example of a data driven code partitioning approach, refer to [26].

2.3.3.5 Code based data allocation

In the code based approach, the program is partitioned so that each processor gets approximately an equal share of the program code (load balancing). Then, depending on the data references, the data is allocated to the different PEs so that communication is minimal. Here, we can also apply the owner-computes or owner-stores paradigms (or possibly a mixture of the two).

Again, the success of this method is very program dependent. For programs where information about some of the data (e.g. array sizes) can only be determined at run time, implementing this method might be quite difficult and could introduce much run-time overhead. Furthermore, the goals of data locality and load-balanced code can conflict, and for certain types of codes are impossible to achieve simultaneously. In addition, this approach may lead to high communication overhead that would nullify all the benefits of parallelism.

10 The owner-stores rule states that the right-hand side expression of an assignment is computed by a processor which owns data appearing in that expression, and the result is then sent to the processor owning the left-hand side datum.

2.3.3.6 SPMD Model of Computation

In the SPMD (Single Program Multiple Data) approach, also called the Data Parallel Model of Computation, we make use of the regularity of most numerical computations. The processors execute essentially the same code in parallel, each on the data stored locally.
In other words, the multiprocessor executes in a way similar to an SIMD (Single Instruction Multiple Data) machine. Note however that this approach is applicable only to specific constructs such as forall loops, and not all algorithms can be executed in this way. An algorithm that can be handled using this approach has to be code that is executed several times in parallel using multiple sets of data, such as a parallel loop. The advantages of this model of computation are its simplicity and the fact that there is no inter-PE communication. However, for most real-life algorithms, there is no way of avoiding communication between PEs.

2.3.3.7 Programmer intervention

Sometimes the user is required to supply information to the compiler, through compiler directives, assertions, etc., with regard to global, high-level properties of the algorithm whose detection by even the most capable systems may be intractable. One example of this is the specification of FORALL loops to indicate the possible parallel execution of loop iterations. Another example is when the programmer asserts some information about a variable in the code (e.g. that the variable is a prime number), which enables the compiler to make decisions regarding the execution of that code.

Some parallel languages require the programmer to specify information regarding the partitioning of the program in the source code. For instance, Hiranandani et al. [26] use the data driven partitioning scheme. They define a language extension to Fortran called Fortran D. In this language, they include constructs for managing data distribution in non-shared address spaces. The user is responsible for specifying the data layout. The compiler then uses the information regarding the data structure decomposition and the owner-computes rule to partition the program.
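As a toy illustration of this style of programmer intervention, the sketch below parses an invented DISTRIBUTE-style directive (loosely inspired by Fortran D, not its actual syntax) and lets a simulated "compiler" apply the owner-computes rule to decide which iterations each PE performs.

```python
# Sketch: programmer-specified data layout driving owner-computes code
# partitioning. The directive syntax and helper names are hypothetical.

P = 4  # number of simulated PEs

def parse_layout(directive, n):
    # e.g. "DISTRIBUTE A(BLOCK)" -> owner function for an n-element array
    if "BLOCK" in directive:
        return lambda i: min(i * P // n, P - 1)   # contiguous blocks
    if "CYCLIC" in directive:
        return lambda i: i % P                     # round-robin elements
    raise ValueError("unknown layout")

n = 8
owner = parse_layout("DISTRIBUTE A(CYCLIC)", n)
# Owner-computes: PE p performs exactly the iterations writing elements
# it owns.
work = {p: [i for i in range(n) if owner(i) == p] for p in range(P)}
print(work)  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
```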
This approach makes the task of the compiler much simpler. However, the goal of making the programming of multiprocessors easy and convenient, freeing the programmer from the details of the machine, is defeated here. Furthermore, due to compile-time unknowns, this method might lead to poor performance for some programs.

Many researchers believe that no compiler can on its own suffice to support the highly complex and challenging task of producing efficient programs for parallel systems [60, page 285]. They believe that advanced compilation systems will be integrated into a sophisticated programming environment that includes an extensive set of programming support tools. These will be needed to provide guidance in a number of forms. Their claim is that parallelizing compilers cannot always perform well without assistance from the user. The programmer may play an important role, informing the system via assertions of global relationships (some of which may be due to high-level properties of the algorithm) that an automatic state-of-the-art tool cannot detect. One of the main reasons for this is the undecidability or intractability of many relevant problems and the lack of adequate heuristics for handling them.

2.3.3.8 Hierarchical Partitioning

Here the program to be partitioned is represented by a hierarchical graph. A hierarchical graph is one whose nodes may contain subgraphs; these subgraphs may in turn have nodes containing subgraphs, recursively and without any limit. IF1 and IF2 are examples of hierarchical graphs. The hierarchical partitioning procedure is a recursive one, which takes the program graph as argument and is then applied recursively to the subgraphs. Hence, the partitioning starts from the lowest-level subgraphs and proceeds to the next higher-level subgraphs, until we reach the original program graph.
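A minimal sketch of this bottom-up recursion follows, assuming a hypothetical dictionary encoding of hierarchical graphs and a placeholder leaf partitioner; it shows only the recursion structure, not any real partitioning criterion.

```python
# Sketch: recursive hierarchical partitioning. Inner subgraphs are
# partitioned first, then each level partitions its own nodes, treating a
# nested node as a single task. The graph format is hypothetical.

def partition_hierarchical(graph, partition_level):
    # Recurse into subgraphs first (lowest levels partitioned first).
    for node in graph["nodes"]:
        if "subgraph" in node:
            partition_hierarchical(node["subgraph"], partition_level)
    # Then partition this level; nested nodes count as single tasks here.
    graph["partition"] = partition_level(graph["nodes"])

# Trivial level partitioner (one task per node), for illustration only.
ifgraph = {"nodes": [{"id": "n1"},
                     {"id": "loop",
                      "subgraph": {"nodes": [{"id": "body"}]}}]}
partition_hierarchical(ifgraph, lambda ns: [[n["id"]] for n in ns])
print(ifgraph["partition"])                           # [['n1'], ['loop']]
print(ifgraph["nodes"][1]["subgraph"]["partition"])   # [['body']]
```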
The partitioning method proposed by Sarkar [48] is an example of a hierarchical partitioning procedure.

2.3.4 Scheduling Issues

Once the program code has been partitioned into parallel tasks, we need to manage these tasks (i.e. manage the parallelism) for efficient parallel execution on the multiprocessor. The management of these tasks is called scheduling. Stated more formally, scheduling consists of assigning the tasks that result from partitioning the program (i.e. exposing the parallelism) to the available processors so as to minimize the parallel execution time. For each task (or process), we have to decide when to execute that task and where11 (i.e. on which processor) to execute it. The most obvious answer to the when question is to execute tasks as soon as their inputs are available12 [19, 20]. However, this could cause the system to saturate and hence deadlock: some parent tasks that are being executed might not find any available PEs on which to spawn the child subtasks that produce the values they need to complete. Thus, a throttle is needed.

As was mentioned before, for DMMs it is particularly important to reduce both inter-PE communication and load imbalance. Scheduling is necessary to achieve good processor utilization and to optimize inter-PE communication in the target multiprocessor.

It is well known that finding the optimal schedule is an NP-complete problem. Although scheduling is an old, notorious problem with numerous versions that has attracted the attention of many researchers in the past, the results known to date offer little help when dealing with real parallel machines. The complexity of the scheduling problem has led the computing community to adopt heuristic approaches for each new machine. For maximum performance, the problem should be entirely left to the user13.
However, this makes programming the multiprocessor very user-unfriendly, tedious, and time consuming. Worse yet, this work cannot be ported to other parallel machines if the need arises to run the same problem on a different machine. On the other hand, it is very difficult or impossible to find a universal solution for problems such as scheduling, minimization of interprocessor communication, and synchronization. The reason is that these problems are architecture dependent. A realistic goal would be to find systematic and automatic solutions to the scheduling problem for large classes of machine architectures. For some existing scheduling methods, refer to [46, 48, 42, 27].

11 This is called task distribution.
12 This is due to the functional nature of our intermediate form.
13 This involves hand-coding and manually inserting system calls in the program.

When performing the scheduling, several factors have to be taken into account, among them the communication and scheduling overhead, and the task granularity. Low task granularity results in high scheduling and inter-PE communication overhead.

The scheduling methods can be broadly distinguished into three classes: static14, dynamic15, and hybrid static/dynamic.

2.3.4.1 Static Scheduling

In static scheduling, processors are assigned tasks at compile time, before execution starts. When program execution starts, each processor knows exactly which tasks to execute.

When static scheduling is used, there is no overhead due to run-time scheduling; all inter-PE synchronization and communication is directly compiled into the code. Both task scheduling overhead and load balancing overhead are eliminated. Further, there is a greater opportunity to optimize inter-PE communication when the processor assignment is known at compile-time.
A global compile-time analysis reduces communication overhead for the entire program; such an analysis cannot be done on the fly at run-time. However, the efficiency of the schedule is questionable in this case, because many facts about the behavior of the program, such as memory access patterns, are only known at run-time16. Also, many "adaptive" applications change their access patterns and data locations over their execution lifetime. For these kinds of programs, a compile-time solution might lead to very inefficient execution, and therefore a run-time or a hybrid run-time/compile-time approach should be used. For programs with execution times that are fairly predictable at compile-time, this approach can be very efficient.

14 Static scheduling is also called compile-time scheduling.
15 Dynamic scheduling is also called run-time scheduling.
16 Examples include array subscripts which are functions with values unknown at compile-time, conditional statements, etc.

2.3.4.2 Dynamic Scheduling

A scheduling scheme is called dynamic when the actual processor allocation is performed by hardware or software mechanisms during program execution (i.e. at run-time). Therefore, during dynamic scheduling, decisions for allocating processors are taken on the fly for different parts of the program, as the program executes.

With dynamic scheduling, the run-time overhead becomes a critical factor and may account for a significant portion of the total execution time of the program. For static scheduling, the compiler or an intelligent preprocessor is responsible for making the scheduling decisions. For dynamic scheduling, however, this decision must be made at run-time on a case-by-case basis, and the time spent on this decision-making process is reflected in the program's total execution time.
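As a concrete (and much simplified) illustration of on-the-fly processor allocation, the sketch below lets idle worker threads pull tasks from a shared ready queue, so that work is assigned only as processors become free; the task set is hypothetical.

```python
# Sketch of dynamic scheduling with a shared ready queue: worker threads
# (the simulated "PEs") fetch a task whenever they become idle, so load
# balancing is implicit. Tasks here just square a number.

import queue
import threading

ready = queue.Queue()        # shared ready queue (internally synchronized)
results = []
results_lock = threading.Lock()

def worker():
    while True:
        task = ready.get()
        if task is None:     # sentinel: no more work for this worker
            break
        with results_lock:
            results.append(task * task)

for t in range(8):           # enqueue all ready tasks first
    ready.put(t)

workers = [threading.Thread(target=worker) for _ in range(4)]
for w in workers:
    w.start()
for _ in workers:            # one sentinel per worker, after all tasks
    ready.put(None)
for w in workers:
    w.join()

print(sorted(results))       # [0, 1, 4, 9, 16, 25, 36, 49]
```

Which worker executes which task is decided entirely at run-time; the cost of the queue operations stands in for the decision-making overhead discussed above.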
Note that we can have the scheduler make scheduling decisions for a chunk of the program while other parts of the program are already executing on the processors. This may reduce the overhead but does not eliminate it. The large overhead of run-time analysis necessitates very simple scheduling algorithms.

For shared memory multiprocessors, dynamic scheduling is much easier to deal with. Shared memory implementations of SISAL use a shared ready queue to enqueue all the tasks that are ready to be executed17. This queue is allocated from shared memory, and is therefore accessible to all processors. Whenever a processor becomes idle, it fetches a task to execute from the ready queue. A throttle has not been needed for this approach. The advantage of this scheme is that no dynamic load balancing algorithm is needed, since load balancing is done implicitly by the use of the shared ready queue. The disadvantage is that the queue is a shared resource that must be accessed within a critical section. This results in contention for the shared resource, causing run-time overhead that consists of the time required to execute the lock protocol and the time a process has to wait for the lock (in case it has to wait).

17 These are either newly created tasks, or tasks that were previously blocked (e.g. waiting for memory to become available or for some value to be computed) and have become unblocked.

Dynamic scheduling for DMMs can be performed through a central control unit18 or it can be distributed. Dynamic scheduling through a central control unit usually creates too much communication overhead and results in a bottleneck at the processor where the central control unit executes. Hence it is usually more efficient to adopt a self-scheduling scheme, a special case of scheduling through distributed control units.
As implied by the term, there is no single control that makes global decisions for allocating processors; rather, the processors themselves are responsible for determining what task to execute next. For an example of a self-scheduling method called guided self-scheduling19, refer to [42].

Another way to implement distributed dynamic scheduling is to establish a fixed spawning pattern for each node20. Here, the processor executing a task uses a predefined method to decide where (i.e. on which processor) to execute the child tasks. This method needs some load balancing scheme to try to keep the work evenly distributed over all processors. In fact, when a dynamic scheduling mechanism is used, dynamic load balancing is usually needed.

One more scheme to implement distributed scheduling is to adapt the shared ready queue idea of the shared memory implementation of SISAL. Here, each processor is given its own private ready queue. Each processor monitors its own private ready list, which can now be done without having to obtain exclusive access, and executes any task that arrives. Some mechanism is still needed to decide to which processor's queue a task should be sent when it is ready to execute (for instance, when a child task is newly created by its parent, or when a previously blocked task is ready to resume execution). Unlike in the shared memory case, we no longer need to obtain critical section locks to access the ready queue. Hence, the contention overhead is eliminated, and a scalable number of processors can be employed. However, we no longer have implicit load balancing, and therefore some dynamic load balancing algorithm is needed, for it is now possible for the system load to become imbalanced. For a detailed implementation of this scheme, refer to [23].

18 Some people call this unit the arbitrator.
19 This method applies to shared memory machines and is restricted to arbitrary nested parallel loops.
20 This is the method that was first used for the shared memory implementation of SISAL.

2.3.4.3 Hybrid static/dynamic scheduling

We can also have hybrid static/dynamic scheduling. In this case, we start with an initial schedule at compile-time, and allow the assignment of tasks to PEs to change at run-time (task migration) if we know that this will give better performance. For example, we might have to do some task migration in order to balance the load. During run-time, it is not possible to have full knowledge of the topology of the program; compiler support of some form is required for that. Therefore, a hybrid static/dynamic scheduling is probably the best approach to this problem. In this scheme, the compiler helps the arbitrator or the processors21 in making scheduling decisions. The hybrid approach can combine the low overhead of static scheduling with the better task assignment of dynamic scheduling.

2.3.5 Distribution of Data

Data distribution is the task of dividing the data structures that a program uses among the memory elements so as to minimize the total execution time of the program. This is equivalent to distributing the data structures so that the number of remote references is minimal. Therefore, it is necessary to keep the task distribution and the data distribution closely tied, and to keep the two aligned at all times. Ideally, we want all memory accesses to be local; clearly, this is not possible.

The data distribution problem is a very difficult and complex one. In fact, finding the optimal data partition is an NP-complete problem. Nevertheless, if good performance is to be achieved, this problem has to be addressed. Since the optimal partition is very hard to find, we have to settle for heuristic methods that give us performance as close to that of the optimal partition as possible.
An appropriate heuristic method for automatically determining a nearly optimal data partition has yet to be found.

21 Depending on whether we use a central or a distributed scheduling scheme.

Figure 2.2: Classification of data distribution methods

2.3.5.1 Classification of data distribution methods

Usually, data partitioning methods are classified according to whether they are implicit or explicit, and compile-time or run-time (see figure 2.2).

• Compile-time data distribution: The distribution of data structures is done at compile-time. This is also called static data distribution; by static we mean that the distribution remains the same throughout the lifetime of the program. There are several ways this is done, of which we list a few:

1. The compiler uses some distribution function to partition the data. Distribution functions can be either predefined or user supplied. Typically, a distribution function has many parameters that control the way data structures are partitioned. Ideally these parameters are chosen automatically by the compiler, after some formal analysis, such as analyzing the access patterns, or with the help of run-time profiles [Sar89]. In the case of run-time profiles, the compiler watches several characteristic runs and notes the distribution patterns used for those runs. The compiler then selects a distribution function that will come closest to the observed reference behavior. Note that this approach is inefficient if the profile runs are not characteristic of the actual reference patterns, or if the reference patterns vary with the input data.
Due to the complexity of this task, many compilers rely on some user intervention, either in the form of language extensions or pragmas (compiler directives). For an example, refer to [26].

2. The compiler uses a random distribution of the data structures. In this case, the compiler tries to make the distribution even across all PEs without regard to the access patterns. Naturally this is a very simple scheme, but its performance can be unacceptable.

• Hybrid compile-time/run-time data distribution: The term hybrid refers to the possibility of changing the distribution associated with a data structure at run-time. An initial distribution is done by the compiler. Then, during run-time, a redistribution (re-mapping) of the data is done in order to reduce remote accesses. This method is useful when there are many compile-time unknowns, such as array subscripts which cannot be determined at compile time. Also, for programs whose access pattern changes during execution, it might not be possible to find one mapping of the data structures that results in minimal remote references both before and after the access pattern changes. In other words, the mapping that gives the fewest remote references up to the point just before the access pattern changes might give a large number of remote references from the point where the access pattern changes onward. For example, in the FFT algorithm the access pattern changes over the iteration space. For these kinds of problems, re-mapping the data to recapture local references can yield some improvement. The problem with this method is that the analysis required to determine when re-mapping is worth performing is a very difficult one.
Also, for DMMs, where inter-PE communication is expensive, this method might cause tremendous run-time overhead due to the extra inter-PE communication incurred by moving the data around during the re-mapping, and by updating the descriptors of the re-mapped data structures to record the new distribution. Hence, it is important that the communication incurred by the redistribution of the data (done to minimize communication during a computation) does not exceed the communication overhead which that redistribution was intended to reduce.

• Run-time data distribution: Here all decisions regarding the distribution of data are made at run-time. This is also called dynamic data distribution. One way to do this is to make the decisions regarding the distribution of data functions of parameters that are only known at run-time.

• Explicit data distribution: The programmer controls the data distribution explicitly22. In this case, the programmer explicitly inserts the appropriate inter-PE communication primitives. All of this work is done "by hand" and contradicts the goal of raising programming to a higher level of abstraction.

• Hybrid explicit/implicit data distribution: In this method, we use compile-time, hybrid compile-time/run-time, or run-time data distribution, with some user intervention. More precisely, the programmer gives some hints regarding how the distribution of data should be done, and the compiler or the run-time system uses those hints to decide how to distribute the data structures. For example, the hints could be in the form of compiler directives, pragmas, or assertions. This method puts some burden on the programmer, but has more potential for success than the purely automatic data partitioning methods.
This is so because the programmer may be very aware of the data access patterns of the program, and therefore may know an optimal or near-optimal partitioning of its data structures. Some researchers feel that the compiler on its own will not be able to choose an efficient data decomposition for all programs, and that it therefore needs additional information from the user. Fortran D [26] is an example of a language that requires the programmer to specify the data layout of the program. The compiler then uses that information to distribute the data structures (arrays) to the processing elements and to insert the appropriate message passing primitives.

22 Usually, the programmer uses an explicit imperative programming method. Hence there is no compiler that further processes the code to perform the parallel processing tasks (code and data partitioning, scheduling, etc.), since the programmer does all of these tasks explicitly.

• Implicit data distribution: Using the implicit data distribution method, the compiler or the run-time system is responsible for all the partitioning of the data structures, without any help from the user. It is usually quite difficult to design a system that produces very efficient data partitioning unless some user intervention is available. Note that this method can be compile-time, run-time, or a hybrid of the two.

2.3.5.2 Using the Shared Memory Programming Paradigm for DMMs

• Description: In this method, a shared memory programming model is supported on top of the distributed memory architecture. This is called a Distributed Shared Memory (DSM) or Virtual Shared Memory (VSM) system. The compiler or programmer is provided with a shared memory abstraction, and a set of primitives for allocating and accessing shared data structures within a virtual address space.
The programmer (or the compiler) assumes that there is a contiguous, single address space (a virtual space) shared by all PEs in the network, and uses this virtual memory to store and retrieve the data structures in the program. All memory access routines in the program operate on the virtual memory. It is then the job of some software interface to map the virtual space onto the distributed physical memories. All message passing required for accessing remote values is handled implicitly by this interface, through the use of a message passing abstraction. This message passing abstraction provides an abstract interface to the host operating system. Each data structure allocated in the virtual space receives a contiguous set of virtual addresses shared among all the PEs23. Note that the interface will be nothing more than the library of memory access routines called by the user program. This library of routines is responsible for the mapping between the virtual memory and the physical memories, and for all the memory management needed.

23 In fact, data can be represented in two ways: as a local memory variable (i.e. stored in the local memory of the PE in question) or as a shared addressing space variable (i.e. stored in the virtual shared memory).

This library will be part of the run-time system, and hence the method described here is called run-time support for a single addressing space.

• Advantages: The main advantage of this scheme is that the library created can be written as a stand-alone module in a generic (language independent) way. This module can therefore be plugged into any other run-time system for compilers of other languages. With some minor modifications to the message passing routines used, this library could even be ported to other DMMs.
Another advantage is that this message passing abstraction makes the task of the programmer (or compiler) much simpler, since all message passing routines and data distribution are handled by the library of memory access routines. However, the way the data distribution is done should be dictated by the compiler (or the programmer), since these library routines have no information regarding the behavior of the program. One way to implement this is to include the mapping function as a parameter of the routine that allocates virtual space to data structures, so that this routine can use that information to distribute the data across the PEs.

• Implementation issue: Using the virtual shared memory scheme has another advantage as far as the implementation goes. Instead of writing a new compiler, we can simply use the existing OSC compiler with some minor modifications24.

• Disadvantages: The main disadvantage of this method is the run-time overhead caused by the translation of virtual addresses to physical ones.

• Address Translation: When the compiler encounters an array access A[i], it translates it into code in the output program that computes the address of the element A[i] and then fetches that location. In our case, this address is a virtual one. Therefore, we have to translate this virtual address to the corresponding physical one, which for DMMs consists of a processor number and the local physical address on that processor. Clearly, this translation process is done by the library routines that manage the virtual address space.

One way to do the translation is to have a table with one entry for each virtual address and the corresponding physical address. We would need to replicate this table on all PEs in order not to create too much communication traffic.

24 OSC was written for shared memory multiprocessors.
Although this method makes the translation process very cheap in terms of time, since it requires a single access to this table, it is extremely costly in terms of space and therefore not feasible. The other way to do the translation is to use the information regarding the virtual space allocated to the data structure and the way it is distributed. From this, we can use some formulas to deduce the physical address. This method is not costly in terms of space, since we don't have to store the physical addresses corresponding to virtual ones; however, it is very costly in terms of time because of the computations involved.

There are ways to optimize the address translation using the second method. For instance, for virtual addresses corresponding to local physical ones (physical addresses on the current PE), we can use some tricks to recognize these virtual addresses and obtain a direct mapping to physical ones, without any computations [21, 23].

• Note that using this method does not affect the inter-PE communication overhead. Very little extra memory space is needed to record the information required for translating the virtual addresses of each array to physical ones.

• This distributed shared memory method was used to implement SISAL on DMMs25 [21, 23]. The results were not very impressive, however. Some of the problems were the run-time overhead introduced by the VISA calls26, and the fact that multi-dimensional arrays were not implemented as true multi-dimensional arrays27.

25 VISA is the name of the library that was designed.
26 The main overhead was caused by the address translation.
27 SISAL and its intermediate forms IF1 and IF2 assume that all arrays are one-dimensional. Multi-dimensional arrays are implemented as arrays of arrays.
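The formula-based translation, with the local fast path just mentioned, might look as follows; the block distribution, block size, and address-space layout are illustrative assumptions, not the actual VISA design.

```python
# Sketch: formula-based virtual-to-physical address translation for a
# block-distributed virtual segment. A virtual address maps to a pair
# (pe, local_offset); addresses owned by the current PE take a fast path
# that replaces the arithmetic with a single range check.

BLOCK = 256  # hypothetical number of virtual addresses per PE

def translate(vaddr):
    # Full computation: which PE owns this address, and at what offset.
    return vaddr // BLOCK, vaddr % BLOCK

def translate_fast(vaddr, my_pe, local_lo, local_hi, local_base):
    # local_lo..local_hi is the virtual range owned by my_pe; local_base is
    # where that range starts in my_pe's physical memory.
    if local_lo <= vaddr < local_hi:          # local fast path
        return my_pe, local_base + (vaddr - local_lo)
    return translate(vaddr)                   # remote: full computation

# PE 1 owns virtual addresses [256, 512), starting at physical offset 0.
print(translate_fast(300, 1, 256, 512, 0))   # (1, 44): local reference
print(translate_fast(600, 1, 256, 512, 0))   # (2, 88): remote reference
```

The fast path trades a small amount of per-segment bookkeeping (the local range bounds) for the division and modulo on every local access, which is the space-for-time trade-off discussed in the next item.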
• In order to make this method more attractive, we have to devise some address translation scheme that does not involve many computations. We should be able to afford to sacrifice some memory space for less time, provided that the space traded for time is not too large.

2.4 Existing Code Partitioning Methods

Existing work regarding the partitioning problem either considers a specific application and tries to come up with an efficient partitioning scheme for it (i.e. no automatic partitioning), or comes up with a general solution (automatic partitioning) that is too simple and therefore not efficient (e.g. one that exploits only one level of parallelism). The partitioning methods that belong to the second class and that have been implemented in real systems exploit only certain levels of parallelism, with certain granularity. Some methods use syntax-based partitioning, others use function-level partitioning, and some systems exploit instruction-level parallelism, leading to a very fine grain size. We believe that the granularity of parallelism should depend on the particular application that we are solving and on the target machine. The best partitioning scheme that we are aware of is the one proposed by Sarkar [48, 47, 46, 49], described briefly next. As was mentioned before, this method is too simple.

Sarkar's Partitioning Method

Sarkar considers both static and dynamic scheduling, and only static program partitioning. His partitioning method for static scheduling differs from the one for dynamic scheduling. Since we assume static scheduling, we only describe Sarkar's partitioning method for static scheduling.

Algorithm

In this section, we describe the overall algorithm without going into any details.

1. All communication edges in the program graph are sorted in decreasing order of their communication cost.

2.
For each edge e in the graph, starting with the one that has the highest communication cost, we merge e if the merger does not cause an increase in the parallel execution time (the critical path length of the graph).

Time Complexity

Let E be the number of edges and N be the number of nodes in the program graph. Sorting the E edges takes O(E²) time in the worst case. The algorithm requires that the parallel execution time be computed for each iteration (i.e. each edge examined) of the algorithm. The parallel execution time can be computed by traversing the graph. In the worst case, this takes O(E + N) time. Since there are E edges in the graph (i.e. E iterations in the algorithm), this will take O(E(E + N)) time. Hence the overall time complexity of the algorithm is O(E(E + N)). We will show later on in this thesis that E < N². Therefore, the time complexity can be written as O(E·N²).

Chapter 3
The Partitioning Problem

3.1 Assumptions

• We assume that we have a weighted Directed Acyclic Graph (DAG) representation of the program. The DAG is flat (no hierarchical graphs such as IF1 and IF2, the intermediate graph representations of SISAL), the nodes represent instructions (primitive or compound statements), and the edges represent data. The assumption that we don't have any hierarchical graphs makes the analysis simpler, and it allows us not to use hierarchical partitioning algorithms, which are in general more complex.

• One way to get the input DAG described earlier is to use IF1 or IF2, and a variation of Sarkar's graph expansion method1 ([47, 46, 48, 49]) to do some preprocessing to get a non-hierarchical graph. Doing this, some compound nodes (compound nodes are nodes that contain subgraphs, and correspond to compound statements in SISAL) in the original graph (IF1 or IF2) will be nodes in the final flat DAG used as input to our analysis.
In this case, all subgraphs of the compound nodes will be transparent, and we only look at the functionality of the node (i.e. given some input, we are interested in what output the node generates). The advantage in using IF1 or IF2 is that we can use the SISAL compiler to generate the intermediate form (IF1 or IF2), then preprocess this graph to get the final input DAG used by our partitioning analysis.

1 Sarkar's graph expansion method is better suited for shared memory multiprocessors. It does not take into account the high cost of inter-PE communication for DMMs, and therefore is in general not efficient when used to target DMMs.

• We assume that the DAG is applicative and therefore we don't have any side effects when we execute it. The data carried by the edges represent mathematical variables and not memory cells. This makes the detection of parallelism straightforward.

• All the inputs of the DAG are assumed to be ready when the program starts execution.

• The target DMM is assumed to have point-to-point communication links (no busses).

• The output of our partitioning analysis is the input to the scheduling phase. The best scheduling method known so far is DSC (Dominant Sequence Clustering), designed by Tao Yang ([18, 58, 59]).

3.2 Definitions

Code Partitioning: Given a multiprocessor M and a DAG g to be executed on M, a partition of g is a set Π = {τ1, τ2, ..., τn}, where each τi is a non-empty set consisting of nodes in g that have to be executed on the same PE,2 where all the τi's are disjoint, and where their union forms the set of all the nodes that belong to g. Each set τi is called a task.

Definition: The trivial partition is the one for which each task τi is a singleton set (i.e. this is the partition that puts each node in a single task). All other partitions are said to be non-trivial.
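To make the definition concrete, the following sketch (hypothetical names; Python is used only for exposition) checks the three conditions on a candidate partition — non-empty tasks, pairwise disjointness, and coverage of all nodes of g — and constructs the trivial partition:

```python
def is_partition(tasks, nodes):
    """Check the definition: non-empty tasks, pairwise disjoint,
    whose union is the set of all nodes of the DAG g."""
    if any(len(t) == 0 for t in tasks):
        return False
    union = set()
    for t in tasks:
        if union & t:              # two tasks share a node: not disjoint
            return False
        union |= t
    return union == set(nodes)     # tasks must cover all nodes of g

def trivial_partition(nodes):
    """The partition that puts each node in a task by itself."""
    return [{n} for n in nodes]

nodes = ["n1", "n2", "n3", "n4"]
print(is_partition([{"n1", "n2"}, {"n3", "n4"}], nodes))  # True
print(is_partition(trivial_partition(nodes), nodes))      # True
print(is_partition([{"n1"}, {"n1", "n2"}], nodes))        # False (tasks overlap)
```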
Input and Output Nodes: Given a DAG g, an input (entry) node of g is any node for which an input edge carries an input datum, and an output (exit) node of g is any node for which an output edge carries an output datum.

2 Nodes belonging to the same τi could be independent of each other, and therefore could in theory be executed in parallel.

There is no predecessor node to an input node and there is no successor node to an output node. Input nodes are also called root nodes, and output nodes are also called leaf nodes.

Execution Path: Given a DAG g, an execution path of g is any path from an input node to an output node.

Critical Path: Given a DAG g, a critical path of g is any longest execution path in g. The critical path length of g is the length of a critical path of g.
Let Pcrit := a critical path of the DAG.
Let CPL := the Critical Path Length of the DAG.

Independent Nodes: Given a DAG, two nodes are dependent if and only if there is a path between them. Otherwise they are independent. We define || as the independency relationship, and ⊥ as the dependency relationship. n1 || n2 means n1 and n2 are independent; n1 ⊥ n2 means n1 and n2 are dependent.

Independent Sets
1. Given a DAG g, an independent set is a set of nodes in g in which each pair of nodes are independent (i.e. all nodes are pairwise independent).
2. Given a DAG g and a set S of nodes belonging to g, an independent set of S is an independent set which is contained in S.

Dependent Sets
1. Given a DAG g, a dependent set is a set of nodes in g in which each pair of nodes are dependent (i.e. all nodes are pairwise dependent).
2. Given a DAG g and a set S of nodes in g, a dependent set of S is a dependent set which is contained in S.
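Since two nodes are dependent exactly when a path exists between them in one direction or the other, the ⊥ and || relationships can be checked by simple reachability. A minimal sketch (the adjacency-list DAG representation is an assumption made for illustration):

```python
from collections import defaultdict

def reachable(adj, src, dst):
    """Depth-first search: is there a path from src to dst in the DAG?"""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(adj[n])
    return False

def dependent(adj, a, b):     # a ⊥ b: a path exists in one direction or the other
    return reachable(adj, a, b) or reachable(adj, b, a)

def independent(adj, a, b):   # a || b: distinct nodes with no path between them
    return a != b and not dependent(adj, a, b)

# A DAG in the spirit of figure 3.4-a: n1 -> n2 <- n3
adj = defaultdict(list)
adj["n1"].append("n2")
adj["n3"].append("n2")
print(dependent(adj, "n1", "n2"))    # True
print(dependent(adj, "n2", "n3"))    # True
print(independent(adj, "n1", "n3"))  # True: dependency is not transitive
```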
Figure 3.1: Path from Ni to Ni+1 (1 ≤ i ≤ m − 1)

Remarks
• If a set is not independent, it does not necessarily imply that it is dependent.
• Any path in the DAG constitutes a dependent set.

Parallel Sets
1. Given a DAG g, a parallel set is an independent set which is not contained in any other independent set.
2. Given a DAG g and a set S of nodes belonging to g, a parallel set of S is an independent set of S which is not contained in any other independent set of S.

Anti-Parallel Sets
1. Given a DAG g, an anti-parallel set is a dependent set which is not contained in any other dependent set.
2. Given a DAG g and a set S of nodes belonging to g, an anti-parallel set of S is a dependent set of S which is not contained in any other dependent set of S.

Theorem: Let g be a DAG. Let n1, n2, ..., nm be nodes in the graph.
{n1, n2, ..., nm} (m ≥ 2) is a dependent set ⟺ ∃ a path containing nodes n1, n2, ..., nm (the path could contain other nodes as well).

Figure 3.2: Path from Ni to nm+1 (1 ≤ i ≤ m)

Proof
1. ∃ a path containing nodes n1, n2, ..., nm ⇒ {n1, n2, ..., nm} is a dependent set: Trivial.
2. {n1, n2, ..., nm} is a dependent set ⇒ ∃ a path containing nodes n1, n2, ..., nm: We use proof by induction.
Base Case: m = 2. n1 ⊥ n2 ⇒ there is a path from n1 to n2 or from n2 to n1.
Induction Step: We assume that the property is true for m ≥ 2. Let's prove that the property is true for m + 1. Assume that {n1, n2, ..., nm, nm+1} is a dependent set ⇒ {n1, n2, ..., nm} is a dependent set ⇒ ∃ a path containing nodes n1, n2, ..., nm ⇒ there is a path from Ni to Ni+1 (1 ≤ i ≤ m − 1), where Ni ∈ {n1, n2, ...
, nm} (1 ≤ i ≤ m), and all the Ni's are different from each other (see figure 3.1).
nm+1 ⊥ N1 ⇒ there is a path from N1 to nm+1 or from nm+1 to N1. If there is a path from nm+1 to N1, then there is a path containing nodes N1, N2, ..., Nm, nm+1 ⇒ ∃ a path containing nodes n1, n2, ..., nm, nm+1. Now assume that there is a path from N1 to nm+1.
nm+1 ⊥ N2 ⇒ there is a path from N2 to nm+1 or from nm+1 to N2. If there is a path from nm+1 to N2, then there is a path containing nodes N1, N2, ..., Nm, nm+1 ⇒ ∃ a path containing nodes n1, n2, ..., nm, nm+1. Now assume that there is a path from N2 to nm+1.
nm+1 ⊥ N3 ⇒ there is a path from N3 to nm+1 or from nm+1 to N3. If there is a path from nm+1 to N3, then there is a path containing nodes N1, N2, ..., Nm, nm+1 ⇒ ∃ a path containing nodes n1, n2, ..., nm, nm+1. Now assume that there is a path from N3 to nm+1.
We keep doing this reasoning until we reach node Nm.
nm+1 ⊥ Nm ⇒ there is a path from Nm to nm+1 or from nm+1 to Nm. If there is a path from nm+1 to Nm, then there is a path containing nodes N1, N2, ..., Nm, nm+1 ⇒ ∃ a path containing nodes n1, n2, ..., nm, nm+1. Now assume that there is a path from Nm to nm+1 (see figure 3.2) ⇒ there exists a path containing nodes N1, N2, ..., Nm, nm+1 ⇒ ∃ a path containing nodes n1, n2, ..., nm, nm+1.
Hence the property is true for m + 1.

Theorem: Let g be a DAG.
Let n1, n2, ..., nm be nodes in the graph.
{n1, n2, ..., nm} (m ≥ 2) is an anti-parallel set ⇒ ∃ a path containing only nodes n1, n2, ..., nm.

Proof
Assume that {n1, n2, ..., nm} (m ≥ 2) is an anti-parallel set. Then {n1, n2, ..., nm} is a dependent set. Therefore, from a theorem stated earlier, there exists a path p containing nodes n1, n2, ..., nm. Hence, there is a path p(i,i+1) from Ni to Ni+1 (1 ≤ i ≤ m − 1), where Ni ∈ {n1, n2, ...
, nm} (1 ≤ i ≤ m), and all the Ni's are different from each other (see figure 3.1). We claim that path p(i,i+1) is equal to the edge (Ni, Ni+1), 1 ≤ i ≤ m − 1. To see why this is the case, assume that the claim is incorrect. Then ∃ a node Nj, j ≠ i and j ≠ i + 1, such that Nj ∈ p(i,i+1). Clearly Nj ≠ Nk, 1 ≤ k ≤ m, otherwise we would have a cycle in path p. Since Nj ⊥ Nk, 1 ≤ k ≤ m, then {N1, N2, ..., Nm, Nj} is a dependent set. Hence, {n1, n2, ..., nm, Nj} is a dependent set. This means that {n1, n2, ..., nm} is not an anti-parallel set, which contradicts our original assumption. Therefore, our claim is correct, which means that path p contains only nodes n1, n2, ..., nm.

Remark: Given a DAG g. ∃ a path containing only nodes n1, n2, ..., nm (m ≥ 2) ⇏ {n1, n2, ..., nm} is an anti-parallel set. To see why this is the case, assume that there exists a path p containing only nodes n1, n2, ..., nm. Then there could exist a path p′ which contains p, p′ ≠ p (see figure 3.3 for 2 examples of such a situation), in which case there exists at least one node n, n ≠ ni, 1 ≤ i ≤ m, such that n ⊥ ni, 1 ≤ i ≤ m. Hence, {n1, n2, ..., nm, n} is a dependent set. Therefore, {n1, n2, ..., nm} cannot be an anti-parallel set.

Theorem: Given a DAG g. Let n1, n2, ..., nm be nodes in the graph.

Figure 3.3: 2 examples where {n1, n2, ..., nm} cannot be an anti-parallel set

∃ a unique path p containing nodes n1, n2, ..., nm (m ≥ 2) AND p contains only nodes n1, n2, ..., nm ⇒ {n1, n2, ..., nm} is an anti-parallel set.

Proof
Assume that there exists a unique path p containing nodes n1, n2, ..., nm (m ≥ 2) AND p contains only nodes n1, n2, ..., nm. Clearly, S = {n1, n2, ..., nm} is a dependent set.
Let's assume that S is not an anti-parallel set. Then there exists at least one node n, n ≠ ni, 1 ≤ i ≤ m, such that {n1, n2, ..., nm, n} is a dependent set. From a previous theorem, this implies that there exists a path p′ containing nodes n1, n2, ..., nm, n. Since p is the unique path containing nodes n1, n2, ..., nm, then p = p′. But p contains only nodes n1, n2, ..., nm, which contradicts the fact that n ∈ p. Hence, our assumption that S is not an anti-parallel set is wrong.

3.3 Some Properties

⊥ Relationship
1. The ⊥ relationship is not reflexive.
2. The ⊥ relationship is symmetric.
3. The ⊥ relationship is not transitive. In figure 3.4-a, n1 ⊥ n2, n2 ⊥ n3, but n1 || n3.

|| Relationship
1. The || relationship is not reflexive.
2. The || relationship is symmetric.
3. The || relationship is not transitive. In figure 3.4-b, n1 || n2, n2 || n3, but n1 ⊥ n3.

Figure 3.4: || and ⊥ are not transitive

Parallel Sets
Consider a DAG g. The parallel sets don't always have the same number of elements.
In figure 3.5-a, the parallel sets are: {n6, n1, n4}, {n6, n2, n5}, {n6, n3, n5}, {n7, n2, n5}, {n7, n1, n4}, {n7, n1, n5}, {n7, n3, n5}. In this case, all parallel sets have the same number of elements.
In figure 3.5-b, the parallel sets are: {n6, n1}, {n6, n4}, {n6, n5}, {n7, n2, n4}, {n7, n2, n5}, {n7, n1}, {n7, n3, n5}. In this case, not all parallel sets have the same number of elements.

Figure 3.5: Number of elements in parallel sets

Anti-Parallel Sets
Consider a DAG g and a set S of nodes belonging to g.
In general, the set S has zero or more anti-parallel sets, and a set Sind of zero or more nodes which don't belong to any dependent set3. Assume that the anti-parallel sets are S1, S2, ..., Sk, and that Sind is {n1, n2, ..., nm}, where4 k ≥ 0 and m ≥ 0. In general, the anti-parallel sets are not necessarily disjoint. Furthermore, 2 nodes belonging to 2 different anti-parallel sets could be dependent. For an example of this, consider figure 3.5-b. Let S = {n1, n2, ..., n7}. The anti-parallel sets are: {n6, n7}, {n6, n2, n3}, {n1, n2, n3}, {n1, n4, n3}, {n1, n4, n5}. In this case Sind = ∅.
Figure 3.5-a shows another example. Let S = {n1, n2, ..., n7}. In this case, the anti-parallel sets are: {n6, n7}, {n6, n2}, {n1, n2, n3}, {n2, n3, n4}, {n4, n5}. Also for this case, Sind = ∅. Note that in this case, S can be written as S = S1 ∪ S2 ∪ S3, where S1 = {n1, n2, n3}, S2 = {n4, n5} and S3 = {n6, n7}. S1, S2 and S3 are disjoint anti-parallel sets. Note also that in some cases, the set S can be expressed as S = S1 ∪ S2 ∪ ... ∪ Sk ∪ Sind, where the Si's are disjoint anti-parallel sets, and Sind is a set of one or more nodes which don't belong to any dependent set.

3 Each node in Sind and any other node in S are independent, and clearly Sind is an independent set of S. However, Sind is not a parallel set.
4 k = 0 represents the case where S has no dependent sets (i.e. S is an independent set), and m = 0 represents the case where each element in S belongs to some dependent set.

3.4 Task Graph

Definition: We define the task dependency graph (or task graph for short) of a partition to be the directed graph whose nodes are the tasks in the partition, and where the arcs between nodes represent the data dependencies between tasks (i.e. there is an edge from task τi to task τj if and only if data has to be transmitted from τi to τj).
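The task graph, together with its weights, can be derived mechanically from the program DAG and a partition. The sketch below is a hypothetical illustration (the representations of the DAG, execution times and partition are assumptions): each task's node weight sums the execution times of its actors, and all data sent between the same pair of tasks is combined into a single weighted edge.

```python
from collections import defaultdict

def build_task_graph(dag_edges, exec_time, task_of):
    """dag_edges: {(src, dst): data_size}; exec_time: node -> cost;
    task_of: node -> task.  Returns (node_weights, edge_weights) of the
    task graph, combining all data between a pair of tasks into one edge."""
    node_w = defaultdict(int)
    for n, cost in exec_time.items():
        node_w[task_of[n]] += cost          # sum of the task's actor times
    edge_w = defaultdict(int)
    for (u, v), size in dag_edges.items():
        tu, tv = task_of[u], task_of[v]
        if tu != tv:                        # intra-task edges carry no cost
            edge_w[(tu, tv)] += size        # one combined message tu -> tv
    return dict(node_w), dict(edge_w)

exec_time = {"a1": 3, "a2": 2, "b1": 5}
task_of = {"a1": "t1", "a2": "t1", "b1": "t2"}
dag_edges = {("a1", "b1"): 8, ("a2", "b1"): 4, ("a1", "a2"): 2}
node_w, edge_w = build_task_graph(dag_edges, exec_time, task_of)
print(node_w)  # {'t1': 5, 't2': 5}
print(edge_w)  # {('t1', 't2'): 12}
```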
3.4.1 Defining Task Graph Weights

3.4.1.1 Node Weights
The weight of a node in the task graph is the sum of the execution times of all actors that belong to the task represented by the node.

3.4.1.2 Edge Weights
Consider an edge e = (ni, nj) in the task graph, where node ni represents task τi and node nj represents task τj. The weight of e is the total amount of data transmitted from τi to τj during one execution of the program.

3.4.2 Communication Between Tasks
Consider 2 tasks τ1 and τ2 in the task graph such that there is an edge from τ1 to τ2. Assume that actors a1, a2, ..., an belong to τ1, actors b1, b2, ..., bn belong to τ2, and that there is an edge from ai to bi (1 ≤ i ≤ n) in the original DAG (the input program graph). The question is: do we send the messages from ai to bi individually, or do we combine them into ONE larger message that is sent from τ1 to τ2?
In other words, do we wait for all messages destined from τ1 to τ2 to be ready and send them all together in one larger message, or do we send each message individually as soon as it is ready? One advantage of sending messages individually is that the destination actors wait less time for their input data to arrive (as soon as an actor inside a task finishes execution, we send the outputs of the actor to their destination actors). However, as will be seen later, our execution model does not allow any partial execution of tasks. Hence, this will not be of any benefit to us. Note that if partial task execution were allowed, then sending the outputs of actors inside a task individually as soon as they are ready could be of great benefit.
One popular optimization technique used for DMMs is to combine smaller messages going to the same destination into larger ones before sending them, whenever possible. This usually reduces the communication overhead.
This is true because the dominant component in the cost to communicate a message between PEs is the message start-up time. The other component, namely the delay component (the duration from the time the message is sent to the time it is received), is small compared to the message start-up component. Therefore, when we combine smaller messages into a larger one, we pay only one message start-up for the new larger message, rather than one for each smaller message. Because of the reasons mentioned above, we chose to combine all messages destined from τi to τj into one larger message.

3.5 Graph Execution Cost and Effect of Data Distribution

3.5.1 Nodes in the Input Program Graph
Two kinds of nodes:
1. RNODEs: Nodes whose execution always involves one or more memory accesses. These are also the nodes whose execution may involve one or more remote accesses (in case the memory accessed is remote).
2. LNODEs: Nodes whose execution does not involve any memory accesses. These are also the nodes whose execution never involves any remote accesses.
Examples of RNODEs: Array manipulation nodes (ABuild, AFill, AElement, AReplace, ACatenate, AScatter, AGather).
Examples of LNODEs: Arithmetic and Boolean nodes.

3.5.2 Graph Execution Cost
1. Node computation cost.
2. Inter-node communication cost.

Node Computation Cost
1. LNODEs: Given the target machine, we can determine the execution cost of these nodes.
2. RNODEs: The execution cost of an RNODE depends on the PE to which the node is assigned, and the memory (i.e. the memory of which PE) that has to be accessed. Therefore, the way data is distributed across the PEs affects the execution cost of RNODEs.

Inter-Node Communication Cost
Given the assignment of nodes to PEs and the size of the data transmitted along an edge in the graph, we can determine the cost of transmitting the data along the edge.
1.
If an edge links two nodes assigned to different PEs, then the cost of transmitting data along this edge is the cost of communicating the data between the PEs5.
2. If an edge links two nodes assigned to the same PE, then we assume that the cost of transmitting data along the edge is zero.

5 An edge connecting two nodes mapped to different PEs doesn't necessarily cause inter-PE communication. For more details, refer to the section regarding local and non-local edges.

3.5.3 Cost Measures Needed for Partitioning Analysis
As will be seen later, during the partitioning analysis, we need to compute the CPL of the task graph. Evaluation of the CPL of the task graph requires knowledge of the node computation costs and the inter-node communication costs.
1. Inter-node communication cost: Since we assume that each node in the task graph is mapped to a different virtual PE, and that we have minimum non-zero communication overhead, then given the size of the data transmitted along an edge, we can figure out the inter-PE communication cost caused by this transmission.
2. Node computation cost:
(a) LNODEs: Given the target machine, we can estimate the cost.
(b) RNODEs: During the partitioning phase, we don't know the assignment of nodes to the physical PEs yet. Hence we have no way of telling whether an RNODE involves a remote access or not. Therefore, it is not possible to determine accurately the execution time of RNODEs. This is true even if we know the way the data is partitioned across the PEs.

3.5.4 Estimation of Execution Cost of RNODEs
Data Distribution Procedure: We assume that it is a function of the number of PEs and the distance between each pair of PEs. It does not depend on code partitioning and scheduling (i.e. it can be done before the code partitioning and scheduling phases).
Method 1: Assume that all RNODEs do local memory accesses only.
In this case, inter-PE communication can be caused by edges only.
Method 2: Apply the data distribution procedure to map data to the virtual PEs to which the nodes in the graph are assigned.

Problems with Method 2
• There is a very large number of nodes in the graph. This number could be on the order of 10,000 or more, and therefore the number of virtual PEs used could be on the order of 10,000 or more. Hence, we might not have enough data to distribute across all virtual PEs.
• At each step of the partitioning algorithm, nodes in the graph are merged. Therefore, fewer virtual PEs are used, and we will have to apply the data distribution procedure again to take into account the change in the number of virtual PEs. One way to get around this problem is to assume that each time 2 nodes are merged, the corresponding virtual PEs PEi and PEj to which these nodes are assigned are replaced by another virtual PE PEk, and all the data which was mapped to PEi and PEj is assigned to PEk. For this to be possible, we need some way to keep track of this data reassignment. This could be costly in terms of time or memory space.
• Usually for DMMs, the cost to determine the physical addresses of the data used is quite high. Hence, this method could add too much to the time complexity of the compiler analysis.

3.5.5 Task Execution Model and Output Data Storage
• Convexity Constraint: A task receives all its inputs before starting execution, then it executes to completion without interruption (i.e. no partial task execution). For this not to cause any deadlock situations, we have to make sure that the task graph is acyclic at all times.
• At the end of execution, all outputs are sent immediately to the destination tasks.
• Output data is not stored in memory.
• Data is sent right away to the PE where the destination task executes, using the send primitive.
• The destination task gets its data by executing a receive command right before starting execution.

Convexity Constraint Versus Partial Task Execution
As was mentioned previously, our execution model does not allow any partial task execution. The question that arises here is: will this affect the utilization of the multiprocessor? For a typical application, there will be many tasks ready to execute most of the time during run-time. Therefore, it is most probable that we can keep the multiprocessor busy most of the time even without allowing any partial task execution. Using the convexity constraint, our execution model is simpler, and there is no need for context switching during run-time.

3.5.6 Existing Work
• Partitioning and scheduling methods for DMMs don't take into account the effect of data distribution.
• Execution of nodes is assumed not to cause any remote accesses, and therefore does not cause any inter-PE communication. Inter-PE communication can be caused by edges only.

3.5.7 How to Model the Effect of Data Distribution in the Graph?
Node Cost:
• LNODE: x, where x := computation cost.
• RNODE: (x, y), where
— x := computation cost,
— y := communication cost due to remote access, if there is any; y := 0 otherwise.
Problem: During the partitioning phase, y is unknown, since the assignment of nodes to PEs hasn't been done yet.

3.5.8 Conclusion
• Communication can be caused by non-local edges and RNODEs.
• Because the assignment of nodes to PEs is done after the code partitioning phase, during the code partitioning analysis it won't be possible to take into account the communication caused by RNODEs.
• However, during the scheduling phase, we could use the effect of the communication caused by RNODEs.
As soon as an RNODE is assigned to a PE, its corresponding y value can be determined, assuming that the data partitioning has already been done.

3.6 Parallel Execution Time (PARTIME)
Since all our analysis is done at compile-time, we have to devise some way of estimating the parallel execution time of the program at compile-time. Obviously, the only way to determine the exact execution time of the program is to run it on the multiprocessor.

3.6.1 Execution Profile Information
In recent years, execution profile information has become widely used in automatic compiler optimizations. In our case, it provides us with:
• Average data sizes for all communication edges.
• Average frequency values of function calls for each function in the program.
• Average frequency values for subgraphs of parallel and compound nodes.
In order to be as accurate as possible, the only information generated from profiles is counts and sizes. Execution time costs are not used since, unlike counts and sizes, they cannot be measured exactly from profiles. Execution times and communication costs are estimated from the information obtained from the profile. This information can be generated by any execution (sequential or parallel) of the program on the target machine. A drawback of execution profiles is their sensitivity to changes in program inputs. Clearly, this could affect the optimizations done by the compiler. For this reason, it is more efficient to average the information over several inputs. Note that if we change the target machine, then we have to regenerate the profile information even if we use the same program. This is true since the data sizes for the new machine might be different from those for the previous one6.

3.6.2 Cost Assignment
An important piece of information for our compiler analysis is the average execution time of the nodes in the graph.
Here we are concerned with the sequential execution time of the nodes. Our approach is the same as the one in [48]. Mainly, we assume that the average execution time fa(n) of each simple node n is one of the target multiprocessor parameters. There are many possible techniques for estimating the execution time of simple nodes, and these techniques vary for different architectures. One simple scheme is to add the execution times of the target instructions which implement the simple node. All average execution times of non-simple nodes are derived from fa and from the profile information. This derivation is based on the following 3 simple rules:
1. The average execution time of a graph is the sum of the average execution times of all its nodes.
2. The average execution time of a parallel node is the product of its average number of iterations and its subgraph's average execution time.
3. The average execution time of a non-parallel compound node is the sum of the products of each subgraph's average frequency and execution time.

6 The graph frequencies remain the same, however.

3.6.3 Multiprocessor Parameters and Communication Model
The multiprocessor parameters needed for our analysis are the communication overhead between PEs and the simple node average execution time cost function fa; fa(n) for a simple node n is the execution time of n. The communication overhead between 2 PEs is assumed to have two components [48]:
1. Processor component: the duration for which a processor participates in a communication. If the processor is sending data, then this is the time it takes to write and prepare the message, before it is sent on the network. We represent this time by a function Wc. If the processor is receiving a message, then this is the time it takes to read the message, after it arrives at the PE from the network.
We represent this time by a function Rc. Note that when a message is sent between 2 processors, the sum of the Wc and Rc caused by this message constitutes the message start-up component (allocating buffers, copying data to or from buffers, etc.)7.
2. Delay component: this is the duration from the time the message is sent on the network by the source processor (after it is written), to the time it is received by the destination processor (before it is read). This is also the fraction of the communication time during which the sender and receiver processors are free to execute other instructions. We also call this time the network component. We represent this time by a function Dc. Dc is a function of a 4-tuple (i, j, s, l), where i is the source processor number, j is the destination processor number (i ≠ j), s is the size of the message, and l is the total communication load in the multiprocessor at the time of the message transfer. Rc is a function of the couple (j, s), and Wc is a function of the couple (i, s).

7 The message start-up component also includes the time to execute the routing algorithm, the time to establish an interface between the local processor and the router, etc.

DMMs use message passing as the means of communication between PEs (synchronization and remote data access). The message passing protocol uses the send and receive commands. We assume that send is non-blocking and receive is blocking. Assume that processor PEi communicates a message to processor PEj. Let t1 be the time when PEi executes the send command and t2 be the time when PEj executes the receive command8. The message arrives at PEj at time t3 = t1 + Wc + Dc. Let's compute the time taken by PEi and PEj due to this communication.

Sending Processor
For processor PEi, there will be no idle time since the send command is non-blocking.
Thus the time taken by the sending processor due to this communication is simply Wc.

Receiving Processor

There are two possibilities:

1. If the receive command is executed at any time after t3 (t2 ≥ t3), then PEj will never idle to wait for the message to arrive: idletime = 0. PEj will spend another Rc time to read the data, and therefore the data will be available at time t2 + Rc. We say that all the communication delay has been overlapped with computation in PEj. The time taken by PEj due to this communication is: idletime + Rc = Rc.

2. If on the other hand the receive command is executed before t3 (t2 < t3), then PEj will idle for t3 − t2 time to wait for the message to arrive:

idletime = t3 − t2 = t1 + Wc + Dc − t2 = Δt + Wc + Dc,

where Δt = t1 − t2 is the difference between the time when PEi executes the send command and the time when PEj executes the receive command. In this case, the data will be available in PEj at time t3 + Rc. The time taken by PEj due to this communication is: idletime + Rc = Δt + Wc + Dc + Rc.

In conclusion, for the receiving processor, the time taken due to the communication is idletime + Rc, where idletime is equal to zero or Δt + Wc + Dc, depending on when the receive command was executed relative to the send command. Another expression for idletime is

idletime = α (Δt + Wc + Dc), where
α = 0 if t2 ≥ t1 + Wc + Dc,
α = 1 otherwise.

Since Wc and Dc are functions of (i, s) and (i, j, s, l) respectively, idletime is a function of (i, j, s, l, t1, t2).
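The receiving-processor cost model above can be captured in a few lines. The following Python sketch is purely illustrative (the function and parameter names are our own, not from an implementation); it treats Wc, Dc and Rc as fixed numbers for one given message.

```python
# Sketch of the receiving-processor cost model of Section 3.6.3.
# Times are in arbitrary units; a global clock is assumed, as in the text.

def receiver_time(t1, t2, Wc, Dc, Rc):
    """Time charged to the receiving PE for one message.

    t1: time the sender executes `send`; t2: time the receiver executes
    `receive`. The message arrives at t3 = t1 + Wc + Dc.
    """
    t3 = t1 + Wc + Dc
    if t2 >= t3:
        idletime = 0          # delay fully overlapped with computation
    else:
        idletime = t3 - t2    # = (t1 - t2) + Wc + Dc
    return idletime + Rc

# Case 1: receive issued after arrival -> only the read cost Rc.
assert receiver_time(t1=0, t2=10, Wc=2, Dc=5, Rc=1) == 1
# Case 2: receive issued early -> idle until t3 = 7, then read.
assert receiver_time(t1=0, t2=3, Wc=2, Dc=5, Rc=1) == (7 - 3) + 1
```

The two assertions correspond exactly to the two cases enumerated above.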
3.7 Problem Statement

Definition: Given a multiprocessor M and a DAG g to be executed on M, we say that an execution of g on M does not violate partition Π = {τ1, τ2, ..., τn} if and only if each task τi of Π is executed in a single PE (i.e. all the nodes in τi are executed on the same PE). Note that 2 nodes that are in different tasks may be executed on the same PE. However, 2 nodes that are in the same task have to be executed on the same PE.

Example: The trivial partition is not violated by any execution of graph g on multiprocessor M.

Definition: A partition Π1 is said to be contained in partition Π2 if and only if each task τi of Π1 is included in some task τj of Π2 (τi could be the same as τj). We write Π1 ⊆ Π2 (this is not the same as the usual set inclusion; we simply borrow the notation for convenience). We also say that Π2 is smaller than Π1.

Example: The trivial partition is contained in any non-trivial partition.

Corollary: For a multiprocessor M and DAG g, if an execution of g on M does not violate some partition Π1, then it does not violate any partition Π2 which is contained in Π1.

Corollary: For any 2 partitions Π1 and Π2, Π1 ⊆ Π2 implies that Π1 has at least as many tasks as Π2 does.

Proof: Straightforward, by contradiction: since both are partitions of the same set of nodes and each task of Π1 lies wholly inside one task of Π2, each task of Π2 is a union of one or more tasks of Π1; if Π1 had fewer tasks than Π2, some task of Π2 would be empty, which is false.

Definition: Given a multiprocessor M and a DAG g to be executed on M, a universal partition is a non-trivial partition which is not violated by any execution of g on M that leads to minimal parallel execution time, for any number of PEs in M (including an infinite number). Stated differently, a universal partition is a non-trivial partition Πu which is contained in any optimal partition Πopt (i.e.
a partition that results in minimal parallel execution time) for any number of processors: Πu ⊆ Πopt.

The Partitioning Problem: Given a target multiprocessor M and a DAG g to be executed on M, the code partitioning problem consists of finding the smallest universal partition for g.

The Idea: The reason for choosing a universal partition is that when we start from such a partition and start lumping tasks together, we are able to get to the optimal partition for the number of processors that are available, provided that our algorithm leads to an optimal solution. This is true since a universal partition is contained in the optimal partition. When we start from the smallest universal partition, we save work, since the smallest partition has the least number of tasks. This lumping process is done in the scheduling phase, as will be seen later (when 2 tasks are assigned to the same processor, we say that they are lumped together).

Remark (intuitive fact): The number of tasks in the smallest universal partition should be greater than the number of PEs in the multiprocessor. If it is not, then the problem that we are trying to run does not have sufficient parallelism for all the PEs.

Theorem: Given a target multiprocessor M and a DAG g to be executed on M, let Π∞ be the optimal partition for the following case: we have a virtual DMM (DMMV) satisfying the following 2 properties:

1. Infinite number of virtual PEs,

2. Communication overhead between the PEs is minimal (not zero).

Then Π∞ is the smallest universal partition.

Proof:

1. First, let's prove that Π∞ is a universal partition. For any task τi in Π∞, all nodes in τi have to be executed in the same processor to get optimal parallel execution time, given the best-case situation of an infinite number of processors with minimum communication overhead.
Therefore, in the realistic case of a finite number of processors with varying communication overhead, all the nodes in τi have to be executed in the same processor as well, in order to get optimal parallel execution time. Hence, any execution that leads to optimal performance does not violate Π∞. As a consequence, Π∞ is a universal partition.

(Here we assume that the communication overhead between any 2 virtual processors is minimal. In other words, we assume that the distance between any two processors is one hop, i.e. all processors are directly linked with one another. Also, we assume that the total communication load in the multiprocessor network is always negligible and does not affect the communication time between processors, and therefore we can ignore it. Hence in this case, the delay component Dc does not depend on the source processor, the destination processor, or the total communication load; Dc is then a function of the size of the message only. In addition, we assume that Wc and Rc for any virtual processor are equal to the minimum value of Wc and Rc respectively over all physical processors. Hence, Wc and Rc are also functions of the message size only.)

2. Now we have to prove that Π∞ is the smallest universal partition. Since Π∞ is an optimal partition, any universal partition Πu is contained in it. Therefore Π∞ is smaller than any universal partition. Since it is itself a universal partition, it is the smallest universal partition.

3.7.1 Remarks

• Sarkar uses the same definition for the partitioning problem. However, our definition is much more formal.

• The code partitioning is usually done at compile time. It is very unusual for parallel compilers to use dynamic code partitioning schemes. In this work, the compile-time method is used.

3.7.2 Why Π∞?
Here, we give a more informal and easier-to-understand explanation for the choice of Π∞. Consider a task τi in Π∞. Let τi = {a1, a2, ..., an}, where the aj's are actors in the input program graph. Since under the ideal case of an infinite number of PEs and minimum non-zero communication overhead actors a1, a2, ..., an belong to the same task, under the realistic case of a finite number of PEs and actual communication overhead they have to belong to the same task as well. Hence all the actors that belong to the same task in Π∞ belong to the same task in the optimal partition for the realistic case. Therefore, by doing some further merging of the tasks in Π∞, we can obtain the optimal partition for the realistic case. If our scheduling algorithm is optimal, then the tasks that should be merged together will be assigned to the same PE. In our approach, Π∞ is passed as the input to the scheduling phase, and we rely on the scheduling algorithm to assign the tasks that should be merged together to the same PE.

3.7.3 Overall Procedure

• Start with the trivial partition.

• Perform a sequence of partitioning refinements.

• At each step, the algorithm tries to improve on the previous partition by choosing a pair of tasks to be merged using some heuristic (merging 2 tasks τi and τj means replacing them by a new task τk which contains all the actors in τi and τj: τk = τi ∪ τj). We record the parallel execution time (from now on abbreviated to PARTIME) corresponding to the new partition.

• Stop when the singleton partition is reached.

• Choose the partition with the lowest PARTIME.

Remarks

• Most partitioning algorithms use the above overall approach.

• The main work of the algorithm is to find the appropriate tasks to be merged during each step. Therefore, we have to study very carefully the effects of task merging and understand its impact on the CPL, the available parallelism in the task graph, the reduction in communication overhead, etc.
• We keep merging tasks until we reach the coarsest (singleton) partition. We will see in a later section that we need to keep merging tasks even if the merger results in a higher PARTIME. This is done so that we don't get caught at a local minimum.

• The parallelism granularity is determined by the size of the tasks in the partition that results from the partitioning analysis.

[Figure 3.6: Two tasks mapped to the same PE]

3.7.4 Estimating PARTIME

We make the following 2 assumptions:

1. At each step of the algorithm, each task of the current partition is mapped to a separate PE.

2. All the inputs of the program graph are ready before the program starts execution.

Therefore, we can estimate PARTIME by the CPL of the task graph of the current partition.

Remark

To see why the assumption stating that each task of the current partition is mapped to a separate PE is necessary, consider the task graph shown in figure 3.6. Nodes n1 and n3 are mapped to the same PE. When the program starts execution, these 2 nodes cannot start execution at the same time. Assume that n1 starts execution first. Thus n3 can start execution only when n1 finishes. Therefore we have to add comp(n1) to the length of path (n3, n6, n7) when we figure out PARTIME.

Notations and Definitions

Given a task graph, let n be a node in the graph, let e = (n1, n2) be an edge in the graph connecting 2 nodes n1 and n2, and let p be a path in the graph.

comp(n) := Cost of computation in node n.
data(e) := Amount of data communicated along edge e during 1 execution of the program. data(ni, nj) := 0 if there is no edge (ni, nj).

comm(e) := Cost of communicating the data on edge e during 1 execution of the program.

comp(e) := comp(n1) + comp(n2).

L(p) := Length of path p. L(p) = Σ_{n ∈ p} comp(n) + Σ_{e ∈ p} comm(e).

Lb(p) := Length of path p before a merger. La(p) := Length of path p after a merger. CPLb and CPLa are defined to be the CPL of the task graph before and after the merger respectively.

Consider two virtual PEs PE1 and PE2 belonging to DMMV. Let fc be the cost to communicate a message from PE1 to PE2. fc is a function of the message size s only, because of the characteristics of DMMV.

fc(s) := Cost to communicate a message of size s from PE1 to PE2. fc(s) has 2 components: a message start-up component Tstart and a delay component delay(s):

fc(s) = Tstart + delay(s).

delay(s) is proportional to s. Since the message start-up component is a constant and does not depend on the message size,

fc(s1 + s2) ≠ fc(s1) + fc(s2),

i.e. fc(s) is not proportional to s. We have delay(s1 + s2) = delay(s1) + delay(s2), and fc(s1 + s2) = Tstart + delay(s1 + s2). Thus fc(s1 + s2) = Tstart + delay(s1) + delay(s2). Hence

fc(s1 + s2) = fc(s1) + delay(s2).

We define delay(s) := 0 if s = 0. Finally,

comm(e) = Tstart + delay(data(e)).

3.7.5 Equivalent Problem Statement

The optimal partition is the one for which the corresponding task graph has the shortest CPL among all partitions. The task graph corresponding to the optimal partition is called the optimal task graph.

3.7.6 Complexity

The partitioning problem is NP-complete [48]. Therefore, all we can do is find heuristics that give a performance as close to the optimal as possible.
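Under the definitions above, PARTIME is estimated as the CPL of the task graph: the longest path counting comp(n) on nodes and comm(e) = Tstart + delay(data(e)) on edges. The following Python sketch is a minimal illustration (the constants Tstart and DELAY_PER_UNIT and all names are assumptions for the example, not thesis parameters):

```python
# Illustrative CPL (critical-path length) of a task graph under the cost
# model above, with delay(s) proportional to s.
from functools import lru_cache

Tstart = 5          # message start-up component (assumed value)
DELAY_PER_UNIT = 2  # delay(s) = DELAY_PER_UNIT * s (assumed value)

def comm(data):
    return Tstart + DELAY_PER_UNIT * data

def cpl(comp, edges):
    """comp: {node: computation cost}; edges: {(u, v): data(u, v)}.
    Returns the longest-path length; assumes the graph is a DAG."""
    succ = {}
    for (u, v), d in edges.items():
        succ.setdefault(u, []).append((v, d))

    @lru_cache(maxsize=None)
    def longest_from(n):
        tails = [comm(d) + longest_from(v) for v, d in succ.get(n, [])]
        return comp[n] + max(tails, default=0)

    return max(longest_from(n) for n in comp)

comp = {'n1': 3, 'n2': 4, 'n3': 2}
edges = {('n1', 'n2'): 1, ('n1', 'n3'): 1}
# Longest path: n1 -> n2 = 3 + (5 + 2*1) + 4 = 14
assert cpl(comp, edges) == 14
```

Note how the Tstart term on every edge makes coarser partitions attractive: merging n1 and n2 would remove the start-up cost on that edge.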
3.7.7 The Algorithm

An informal description of the algorithm for the code partitioning follows:

ALGORITHM PartitionGraph
BEGIN
    Partition := Trivial_Partition       /* Start with the trivial partition. */
    PARTIME := CPL of current task graph /* Parallel execution time
                                            corresponding to current partition. */
    best_partition := Partition          /* Best partition found so far. */
    best_time := PARTIME                 /* Best parallel execution time found so far. */
    WHILE |Partition| >= 2 DO
    BEGIN                                /* Perform a merging iteration. */
        Partition := Merge(H)            /* Choose a pair of tasks to be merged
                                            using heuristic H, and merge them.
                                            Partition = partition after merger. */
        PARTIME := CPL of new task graph
        IF (PARTIME < best_time) THEN
        BEGIN
            best_partition := Partition
            best_time := PARTIME
        END
    END
END

At the end of the execution of the algorithm, variable best_partition is the partition chosen by the algorithm, and variable best_time is its corresponding PARTIME.

3.7.8 Effect of Merging a Pair of Tasks

• If the tasks are independent:
  - No reduction in communication cost.
  - Loss in parallelism.

• If the tasks are connected by an edge:
  - Reduction in communication overhead.
  - Possible loss in parallelism.

(Note that during the scheduling phase, these two tasks may be assigned to the same physical PE, in which case no reduction in communication overhead results from merging them. However, since we assume that we have an infinite number of virtual PEs, and that each task is mapped to a different virtual PE, there is a reduction in communication overhead as a result of the merger.)

Merging Tasks

Goal: Minimize the CPL of the task graph.
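The PartitionGraph loop of Section 3.7.7 can be rendered as a short executable sketch. The heuristic H and the CPL evaluation are stubbed out as function parameters; the toy merge rule and cost table below are illustrative assumptions, not the thesis's heuristic.

```python
# Minimal executable rendering of the PartitionGraph loop. `merge_step`
# stands in for heuristic H (it must return a partition with one fewer
# task); `cpl_of` returns the PARTIME of a partition.

def partition_graph(trivial_partition, merge_step, cpl_of):
    partition = trivial_partition
    best_partition, best_time = partition, cpl_of(partition)
    while len(partition) >= 2:
        partition = merge_step(partition)   # one merging iteration
        partime = cpl_of(partition)
        if partime < best_time:
            best_partition, best_time = partition, partime
    return best_partition, best_time

# Toy run: tasks are sets of actors; the stub heuristic merges the two
# smallest tasks, and a fake cost table makes 2 tasks the optimum.
def merge_smallest(p):
    a, b = sorted(p, key=len)[:2]
    return [t for t in p if t not in (a, b)] + [a | b]

fake_times = {4: 10, 3: 8, 2: 6, 1: 9}
best, t = partition_graph([{1}, {2}, {3}, {4}],
                          merge_smallest,
                          lambda p: fake_times[len(p)])
assert t == 6 and len(best) == 2
```

Note that the loop runs all the way down to the singleton partition even after PARTIME worsens, exactly as the text requires to escape local minima.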
Rule: Merge only pairs of nodes connected by an edge, since there is no gain from merging nodes not connected by edges.

PARTIME := Parallel execution time of the program on the DMM.

PARTIME = Tc + To, where Tc := computation time component, and To := overhead component (communication overhead only, no scheduling overhead).

There is a trade-off between the computation component and the overhead component: the more parallelism we exploit, the smaller Tc and the larger To will be, and vice versa. CPL = Tc + To. For DMMs, communication cost is quite high, so we try to reduce communication as much as possible. In general, merging tasks increases Tc (loss in parallelism and more sequentialization) and decreases To (reduction in communication overhead).

3.8 Task Merging

There are 2 possibilities:

1. Explicit merging: In this method, we keep track of what the task graph looks like during the execution of the partitioning algorithm. Each time tasks are merged, we update the task graph to reflect the new partition (i.e. after each merging step, we determine the task graph of the new partition). This is called explicit merging, because we explicitly reflect the task merging in the task graph.

Using explicit merging helps the partitioning analysis. For instance, it allows us to compute the parallel execution time at each step of the merging process, which is simply the CPL of the corresponding task graph. Also, it allows us to keep track of the available parallelism between tasks and of the dependency relationships between tasks, which help with the choice of

[Figure 3.7: Explicit Task Merging (before merger)]
[Figure 3.8: Explicit Task Merging (after merger)]

appropriate tasks to be merged. Since at each merging step we have to update the task graph, we have to make sure that this updating of the graph has low time complexity. For an example, refer to figures 3.7 and 3.8: figure 3.7 shows a task graph before merging nodes n1 and n2, and figure 3.8 shows the task graph after the merger.

2. Implicit merging: In this method, we do not keep track of what the task graph looks like during the execution of the partitioning algorithm. The problem with this approach is that without knowing what the task graph looks like during each merging step, we cannot figure out the parallel execution time, and it is hard to figure out the parallelism available in the graph.

For our analysis, we use the explicit merging method.

Explicit Merging Procedure

In this section, we show how we update the task graph when tasks are merged. For an example, refer again to figures 3.7 and 3.8.

Assumption: We assume that all messages inside a task which are destined to the same task are grouped together into a bigger message. There is no loss in doing so, since no partial task execution is allowed because of the convexity constraint. Since tasks are generally small to medium grains, messages are never too big.

Initially, each actor in the input program graph is put in a separate task. Therefore, the task graph has one node for each actor in the program graph. The edges in the task graph are determined by the edges between the actors in the program graph.
Each time 2 nodes n1 and n2 in the task graph connected by an edge (n1, n2) are merged into a node n1,2, we do the following:

• Nodes n1 and n2 are replaced by node n1,2.

• Edge (n1, n2) is deleted.

• Any edge going from n1 to any other node ni (i ≠ 2) is replaced by an edge going from n1,2 to ni, and any edge going from ni to n1 is replaced by an edge going from ni to n1,2.

• Any edge going from n2 to any other node ni (i ≠ 1) is replaced by an edge going from n1,2 to ni, and any edge going from ni to n2 is replaced by an edge going from ni to n1,2.

• If there is an edge from n1 to ni and an edge from n2 to ni (i ≠ 1 and i ≠ 2), then edges (n1, ni) and (n2, ni) are replaced by one edge (n1,2, ni).

• If there is an edge from ni to n1 and an edge from ni to n2 (i ≠ 1 and i ≠ 2), then edges (ni, n1) and (ni, n2) are replaced by one edge (ni, n1,2).

Merging an Edge in the Task Graph

Let e = (n1, n2) be an edge in the task graph. Merging the edge e means merging tasks n1 and n2 together.

Time Complexity of Explicit Merging

Let N be the number of actors in the input program graph. Initially, the task graph has N nodes as well. Let's consider the cost of merging nodes n1 and n2. Replacing these 2 nodes by node n1,2 takes a constant amount of time. Each of these 2 nodes is connected to at most (N − 1) other nodes. Therefore, the total cost to replace all edges during this merging step is O(N). Hence, it costs at most O(N) time to explicitly merge nodes n1 and n2. Since there is a total of (N − 1) merging steps, the total cost to do the explicit merging, counting all merging steps in the algorithm, is O(N²).

Note that we over-estimated this time complexity because we assumed the worst-case scenario.
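The node- and edge-replacement rules above can be sketched on an edge-set representation of the task graph. This is an illustration only (names are ours; edge weights are omitted here, their update is covered in Section 3.8.1): representing the graph as a set of (u, v) pairs makes parallel edges collapse automatically, which is exactly the pairwise edge-replacement rule.

```python
# Sketch of the explicit-merging update rules on an unweighted task graph.

def merge_nodes(edges, n1, n2, merged='n1,2'):
    """edges: set of (u, v) pairs of a DAG containing edge (n1, n2).
    Replaces n1 and n2 by `merged`, deletes (n1, n2), and redirects all
    other incident edges; duplicate redirected edges collapse to one."""
    new_edges = set()
    for u, v in edges:
        if (u, v) == (n1, n2):
            continue                       # the merged edge disappears
        u2 = merged if u in (n1, n2) else u
        v2 = merged if v in (n1, n2) else v
        new_edges.add((u2, v2))
    return new_edges

g = {('n1', 'n2'), ('n1', 'n3'), ('n2', 'n3'), ('n4', 'n1'), ('n4', 'n2')}
# (n1,n3) and (n2,n3) become one edge (n1,2 -> n3); likewise for n4.
assert merge_nodes(g, 'n1', 'n2') == {('n1,2', 'n3'), ('n4', 'n1,2')}
```

With adjacency lists instead of a flat set, the same update visits only the neighbors of n1 and n2, giving the O(N) per-step bound discussed above.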
For real applications, the 2 nodes merged are not connected to all other nodes in the graph; on average, the nodes in the task graph are connected to 3 or 4 nodes. Therefore, each merging step takes a constant amount of time. Hence the total cost to do the explicit merging, counting all merging steps in the algorithm, is O(N).

3.8.1 Updating Task Graph Weights as a Result of the Merger

Assume explicit merging. Consider 2 nodes n1 and n2 in the DAG, connected by an edge (n1, n2) and merged into a node n1,2. For an example, refer to figures 3.7 and 3.8: figure 3.7 shows a DAG g before merging nodes n1 and n2, and figure 3.8 shows the graph g after the merger. The thick edges represent edges that carry more data (i.e. the edge weight has increased). The larger node represents a node that has more computations (i.e. the node weight has increased).

3.8.1.1 Node Weights

comp(n1,2) = comp(n1) + comp(n2).

Here we assume that all PEs are simple and are not capable of doing parallel computations (even simple computations are done sequentially).

3.8.1.2 Edge Weights

1. If an edge e' replaces one edge e: data(e') = data(e).

2. If an edge e' replaces two edges e1 and e2: data(e') = data(e1) + data(e2). Since all messages inside a task that have the same destination task are combined together into a bigger message,

comm(e') < comm(e1) + comm(e2),
comm(e') = Tstart + delay(data(e1)) + delay(data(e2)),
comm(e') = comm(e1) + delay(data(e2)),
comm(e') = comm(e2) + delay(data(e1)).

3.8.2 Creation of Cycles as a Result of Task Merging

The task graph should remain acyclic at all times, so that we guarantee that no deadlock situation occurs because of the convexity constraint rule. Hence the task graph should always be a DAG. Initially the task graph is acyclic, because we assume that the input program graph is acyclic.
When we merge tasks, we have to make sure that no cycles are created as a result of the merger. Consider 2 nodes n1 and n2 in the task graph, connected by an edge (n1, n2) and merged into a node n1,2. Cycles could be introduced after the merger, because new edges are created. All newly created edges are connected to the newly created node n1,2.

Theorem: Given an acyclic task graph, consider 2 nodes n1 and n2 in the graph, connected by an edge (n1, n2) and merged into a node n1,2. A cycle is created after the merger if and only if there exists a path from n1 to n2 other than (n1, n2) before the merger.

Proof:

1. There exists a path from n1 to n2 other than (n1, n2) before the merger implies that a cycle is created after the merger: The proof of this is quite obvious. Refer to figure 3.9.

[Figure 3.9: Cycle creation]

2. A cycle created after the merger implies that there exists a path from n1 to n2 other than (n1, n2) before the merger: For a cycle to be introduced, a newly created edge has to have created it. Since any new edge is connected to n1,2, any created cycle contains the node n1,2 (see figure 3.10-a). Before the merger, the portion of the graph in figure 3.10-a used to be the one shown in figure 3.10-b. We cannot have edges going from both nodes n1 and n2 to na, because that would create a cycle, and we know that the graph is acyclic before the merger. Also, we cannot have edges going from nb to both nodes n1 and n2, because that would also create a cycle. Assume that we have an edge from n1 to na. Then we cannot have an edge from nb to n1, because that would create a cycle. Therefore, we can only have an edge from nb to n2 (see figure 3.10-c). Assume that we have an edge from n2 to na. Then having an edge from nb to n1 or an edge from nb to n2 would create a cycle.
Hence we cannot have an edge from n2 to na. In conclusion, the only possibility is the one shown in figure 3.10-c: an edge from n1 to na, a path from na to nb, and an edge from nb to n2, i.e. a path from n1 to n2 other than (n1, n2).

[Figure 3.10: Cycle creation: (a) cycle created as a result of the merger; (b) before the merger; (c) the only possible solution]

A Merging Rule

If there is a path from n1 to n2 in the task graph other than (n1, n2), then nodes n1 and n2 should not be merged; otherwise we create a cycle in the graph. Keeping the task graph free of cycles guarantees that no deadlock situation occurs because of the convexity constraint.

Effect on Time Complexity

Because of the merging rule stated above, each time 2 tasks are chosen to be merged, we have to check for cycle creation before we do the merger. This check will add to the time complexity of the partitioning algorithm. In order to be efficient, we should try to find a way to do this check at a low time cost.

Effect on Efficiency of the Partitioning Algorithm

Consider the 2 nodes ni and nj in the task graph that are the best candidates to be merged (i.e. their merger results in the best improvement in the partition). If their merger results in a cycle, then we cannot merge them, even though their merger results in the best improvement in the partition. To get around this problem, we might consider merging nodes ni and nj together with all the nodes that belong to the cycles that would be created by merging ni and nj alone. This guarantees that no cycles are created as a result of the merger. However, there is no guarantee that the partition obtained as a result of the merger is better than the one before the merger.
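By the theorem above, the cycle-creation check reduces to a reachability test: is there a path from n1 to n2 that avoids the direct edge (n1, n2)? A minimal Python sketch of this check (illustrative names, plain DFS; not the low-cost scheme the text calls for):

```python
# Sketch of the merging rule: n1 and n2, connected by edge (n1, n2), may
# be merged only if no other n1 -> n2 path exists before the merger.

def merge_creates_cycle(edges, n1, n2):
    """edges: set of (u, v) pairs of a DAG containing edge (n1, n2).
    DFS from n1 while ignoring the direct edge (n1, n2)."""
    succ = {}
    for u, v in edges:
        if (u, v) != (n1, n2):
            succ.setdefault(u, []).append(v)
    stack, seen = [n1], set()
    while stack:
        u = stack.pop()
        if u == n2:
            return True          # an indirect n1 -> n2 path exists
        if u not in seen:
            seen.add(u)
            stack.extend(succ.get(u, []))
    return False

# The figure 3.10-c pattern: n1 -> na -> nb -> n2 alongside edge (n1, n2).
g = {('n1', 'n2'), ('n1', 'na'), ('na', 'nb'), ('nb', 'n2')}
assert merge_creates_cycle(g, 'n1', 'n2')
assert not merge_creates_cycle(g - {('nb', 'n2')}, 'n1', 'n2')
```

A single DFS per candidate pair costs O(N + E); doing better than that per merger is the efficiency concern raised above.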
Chapter 4

Analysis of the Task Graph

4.1 Parallelism Loss Due to Task Merging

In this section, we study the effect of task merging on the available parallelism in the task graph.

Question: Given a task graph, could there be any loss in parallelism when two nodes connected by an edge are merged?

Answer: Yes. Some parallelism may be lost, even though these 2 nodes are connected by an edge and are therefore dependent.

Proof: Assume that the answer is no. Then when we merge two nodes connected by an edge, there is a reduction in communication overhead, and in addition no parallelism is lost. Therefore the optimal partition would be obtained by repeatedly merging any two nodes connected by an edge, which would result in a graph consisting of only ONE node (i.e. an optimal partition consisting of a single task)!

Why? Consider two nodes n1 and n2 connected by an edge (n1, n2) and merged into a node n1,2.

[Figure 4.1: Parallelism Loss]

Let's examine node n1: let n3 be a node in g such that n1 and n3 are independent, and n2 and n3 are dependent. Then after the merger, n1 and n3 become dependent (to be more accurate, it is n1,2 and n3 which are dependent, since n1 by itself no longer exists after the merger). Thus n1 and n3 cannot be executed in parallel any longer. This represents a loss in parallelism with respect to n1 in the task graph. For an example, refer to figure 4.1.

4.1.1 Definitions

Parallel Set of a Node: Given a node n in a DAG g, the parallel set of n is ParSet(n) := {n' ∈ g | n and n' are independent}. These are the nodes that can be executed in parallel with n. Given 2 nodes n1 and n2 in a DAG,
(n may not be executed in parallel with all nodes in ParSet(n) simultaneously, since ParSet(n) is not necessarily an independent set)

ParSet(n1, n2) := ParSet(n1) ∪ ParSet(n2).

Dependent Set of a Node: Given a node n in a DAG g, the dependent set of n is DepSet(n) := {n' ∈ g | n and n' are dependent}. Given 2 nodes n1 and n2 in a DAG, DepSet(n1, n2) := DepSet(n1) ∪ DepSet(n2).

Remark: DepSet(n) is evaluated by finding all paths that pass through n.

Lemma: Consider two nodes n1 and n2 connected by an edge (n1, n2) and merged into a node n1,2. Then ParSet(n1,2) = ParSet(n1) ∩ ParSet(n2).

Remark: Also, ParSet(n1,2) = ParSet(n1, n2) − [ParSet(n1, n2) ∩ DepSet(n1, n2)].

Lemma: Consider two nodes n1 and n2 connected by an edge (n1, n2) and merged into a node n1,2. Then DepSet(n1,2) = DepSet(n1) ∪ DepSet(n2).

4.1.2 Parallelism with respect to a Node

Parallelism with respect to a Node: Given a node n in a DAG g, we define the parallelism with respect to n to be |ParSet(n)| (given a set S, |S| is the number of elements in S).

Parallelism Loss with respect to a Node: Given a node n in a DAG g, we say that there is parallelism loss with respect to n as a result of merging nodes if and only if the parallelism with respect to n after the merger is strictly smaller than the parallelism with respect to n before the merger.

4.1.2.1 Condition for Parallelism Loss

Consider two nodes n1 and n2 in the task graph, connected by an edge (n1, n2) and merged into a node n1,2.

Definition: We say that some parallelism is lost in the task graph as a result of the merger if and only if some parallelism is lost with respect to n1 or n2.
Thus if we guarantee that no parallelism is lost with respect to n1 and no parallelism is lost with respect to n2, then we guarantee that no parallelism is lost in the task graph as a result of the merger.

ParSet(n1) represents all nodes which can be executed in parallel with n1 before the merger. Any node belonging to DepSet(n2) cannot be executed in parallel with n1 after the merger. Therefore, the set ParSet(n1) ∩ DepSet(n2) represents all nodes which could execute in parallel with n1 before the merger, and can no longer be executed in parallel with n1 after the merger. The same analysis can be applied to node n2. Hence:

1. Some parallelism will be lost with respect to n1 as a result of the merger if and only if ParSet(n1) ∩ DepSet(n2) ≠ ∅. An equivalent condition is: ParSet(n1,2) ≠ ParSet(n1).

2. Some parallelism will be lost with respect to n2 as a result of the merger if and only if ParSet(n2) ∩ DepSet(n1) ≠ ∅. An equivalent condition is: ParSet(n1,2) ≠ ParSet(n2).

4.1.2.2 Amount of Parallelism Lost

Consider 2 nodes n1 and n2 in the task graph, connected by an edge (n1, n2) and merged into a node n1,2.

1. The amount of parallelism lost (if any) with respect to n1 as a result of the merger is defined to be |ParSet(n1) ∩ DepSet(n2)|. This can also be expressed as |ParSet(n1)| − |ParSet(n1,2)|.

2. The amount of parallelism lost (if any) with respect to n2 as a result of the merger is defined to be |ParSet(n2) ∩ DepSet(n1)|. This can also be expressed as |ParSet(n2)| − |ParSet(n1,2)|.

Definition: The amount of parallelism lost in the task graph as a result of the merger is defined to be the sum of the amount of parallelism lost with respect to n1 and the amount of parallelism lost with respect to n2.

4.1.2.3 Remark 1

Consider the example in figure 4.2. ParSet(n2) = {n3, n4}. DepSet(n1) = {n4, n2}.
ParSet(n2) ∩ DepSet(n1) = {n4} ≠ ∅. Hence there is parallelism loss with respect to n2.

Note that ParSet(n2) is not an independent set. n3 and n4 are dependent, and therefore n2 cannot execute in parallel with these two nodes simultaneously. Hence, if only one of these two nodes becomes dependent with n2 after the merger, n2 can still execute in parallel with the other node. Nevertheless, the CPL can still increase here. However, if both n3 and n4 become dependent with n2 after the merger, then node n2 will not be able to execute in parallel with any of these two nodes. For an example of this refer to figure 4.3. In figure 4.3, ParSet(n2) = {n3, n4}. DepSet(n1) = {n3, n4, n2}. ParSet(n2) ∩ DepSet(n1) = {n3, n4} ≠ ∅.

Figure 4.2: Parallelism Loss After Merger

Figure 4.3: Parallelism Loss

Figure 4.4: Example 1

4.1.2.4 Remark 2

Consider the following two examples.

Example 1 Given the task graph in figure 4.4. Before the merger: ParSet(n1) = {n3, n4, n5, n6}. n1, n3, n4 and n5 can execute in parallel. After the merger: ParSet(n1,2) = {n4, n5, n6}. We have parallelism loss with respect to n1 (one node, n3, lost). However, n3, n4 and n5 can still execute in parallel. In other words, only one node is lost for parallelism (node n1).

Figure 4.5: Example 2

Example 2 Given the task graph in figure 4.5. Before the merger: ParSet(n1) = {n3, n4, n5}. n1, n3, n4 and n5 can execute in parallel. After the merger: ParSet(n1,2) = ∅. All the parallelism with respect to n1 is lost (3 nodes). However, n3, n4 and n5 can still execute in parallel.
In other words, only one node is lost for parallelism (node n1).

Let S = ParSet(n1) ∩ DepSet(n2). For all nodes n ∈ S, there exists at least one path p which goes through n and n2 but not n1 (before the merger). Therefore, p increases in length after the merger (see section 4.2). The more nodes S has, the more execution paths are likely to increase after the merger. Hence defining the amount of parallelism lost with respect to n1 as |S| makes sense. Note that in general the number of execution paths that increase as a result of merging tasks is different from |S|.

4.1.3 Defining Parallelism

It is very hard to define the parallelism available in a task graph formally and precisely. All we can do is give an approximate definition. The definition should depend on what we need it for and how we are going to use it. For instance in our case, our measure of the performance of the partitioning algorithm is the CPL. We will see later on in the analysis that our definition of parallelism is tightly coupled with the CPL of the task graph, and that the way the CPL is affected by the merger of tasks is related to the loss of parallelism (the way we defined it) in the task graph.

Some other possible definitions of the parallelism in a task graph follow. Note that we assume that all nodes have the same weight and all edges have the same weight.

Total Parallelism = Σ_{n ∈ Pcrit} (1 + MaxPar(n)).

Maximum Parallelism = max_{n ∈ Pcrit} {1 + MaxPar(n)}.

4.1.4 Usable Parallelism

Degree of Parallelism: Given a DAG g and a set S of nodes belonging to g, the degree of parallelism in S is |P|, where4 P is the largest parallel set of S.

Usable Parallelism: Given a DAG g and a node n in g, the Usable Parallelism with respect to n is MaxPar(n), defined to be the degree of parallelism in ParSet(n).
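Equivalently, MaxPar(n) is the size of a largest subset of ParSet(n) whose members are pairwise independent. For a small graph this can be found by brute force over subsets; a sketch (graph and node names are made up for illustration):

```python
from itertools import combinations

# MaxPar(n): size of the largest pairwise-independent subset of ParSet(n).
# Brute force over subsets -- exponential, but fine for small examples.

def reachable(g, s):
    seen, stack = set(), [s]
    while stack:
        for t in g.get(stack.pop(), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def dependent(g, a, b):
    return a in reachable(g, b) or b in reachable(g, a)

def max_par(g, n):
    ps = sorted(m for m in g if m != n and not dependent(g, n, m))
    for k in range(len(ps), 0, -1):  # try the largest subsets first
        for sub in combinations(ps, k):
            if all(not dependent(g, x, y) for x, y in combinations(sub, 2)):
                return k
    return 0

# ParSet(n2) = {n3, n4, n5}, but n3 -> n4 makes n3 and n4 dependent,
# so at most two of them (e.g. n3 and n5) can run alongside n2 at once.
g = {"n1": ["n2", "n3", "n5"], "n2": [], "n3": ["n4"], "n4": [], "n5": []}
print(max_par(g, "n2"))   # 2
```

This matches the distinction drawn above: the parallelism with respect to n2 is |ParSet(n2)| = 3, while the usable parallelism MaxPar(n2) is only 2.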
Lemma: Given a DAG g and a node n in g, the usable parallelism with respect to n is the maximum number of nodes in g that can be executed in parallel simultaneously with node n.

4.1.4.1 Condition for Usable Parallelism Loss

Consider two nodes n1 and n2 in the task graph, connected by an edge (n1, n2) and merged into a node n1,2.

1. There is usable parallelism loss with respect to n1 as a result of the merger if and only if MaxPar(n1) > MaxPar(n1,2).

4|P| is the number of elements in the set P.

2. There is usable parallelism loss with respect to n2 as a result of the merger if and only if MaxPar(n2) > MaxPar(n1,2).

4.1.4.2 Amount of Usable Parallelism Lost

Consider two nodes n1 and n2 in the task graph, connected by an edge (n1, n2) and merged into a node n1,2.

1. The amount of usable parallelism lost (if any) with respect to n1 as a result of the merger is defined to be MaxPar(n1) − MaxPar(n1,2).

2. The amount of usable parallelism lost (if any) with respect to n2 as a result of the merger is defined to be MaxPar(n2) − MaxPar(n1,2).

4.1.4.3 Another Condition for Usable Parallelism Loss

In what follows, we assume that two nodes n1 and n2 in the task graph, connected by an edge (n1, n2), are merged into a node n1,2. Let's examine the parallelism lost with respect to n1. Let S = ParSet(n1) ∩ DepSet(n2). S represents all the nodes which could be executed in parallel with n1 before the merger, and can no longer execute in parallel with n1 after the merger. However, ParSet(n1) is not necessarily an independent set, and therefore it is not always the case that n1 can execute with all nodes in ParSet(n1) simultaneously. Let's assume that ParSet(n1) is not an independent set. Hence, there exists at least one dependent set S' ⊆ ParSet(n1). Clearly, no two nodes belonging to S' can execute in parallel.
Hence, n1 can execute in parallel with only one node at a time from S'. Therefore, if after the merger some nodes in S' become dependent with n1 because they used to belong to DepSet(n2) before the merger, no parallelism will be lost with respect to n1 because of that, provided that at least one node in S' remains independent with n1 after the merger.

The Condition

Some usable parallelism will be lost in the task graph as a result of the merger if and only if either of the two following conditions is true:

1. Some usable parallelism with respect to n1 is lost.

2. Some usable parallelism with respect to n2 is lost.

Usable Parallelism Lost With Respect to n1

Case 1 ParSet(n1) is an independent set5. In this case, some parallelism will be lost with respect to n1 if and only if ParSet(n1) ∩ DepSet(n2) ≠ ∅.

Case 2 ParSet(n1) is not an independent set. Therefore, there is at least one dependent set S' ⊆ ParSet(n1). Without any loss of generality, let's assume that ParSet(n1) has k anti-parallel sets: S1, S2, ..., Sk. Let Su = S1 ∪ S2 ∪ ... ∪ Sk. Let S = ParSet(n1) ∩ DepSet(n2). In this case, some parallelism will be lost with respect to n1 if and only if the following two conditions C1 and C2 are satisfied:

1. C1: S ≠ ∅.

2. C2 = C2,1 OR C2,2.

C2,1: S ⊄ Su (i.e. there exists at least one node n ∈ S such that n ∉ Su; n does not belong to any of the sets Si, 1 ≤ i ≤ k; n and any other node in ParSet(n1) are independent.)

C2,2 (footnote 6): S ⊆ Su and at least one of the sets Si ⊆ S, 1 ≤ i ≤ k.

5There is no dependent set S' ⊆ ParSet(n1).

6Here we are assuming that all sets Si's are disjoint, and that no two nodes belonging to different anti-parallel sets of ParSet(n1) can be dependent.
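Whatever the structure of ParSet(n1), the direct condition of section 4.1.4.1 can always be checked mechanically: contract the edge and compare MaxPar before and after. A brute-force sketch (graph shape and node names are illustrative only):

```python
from itertools import combinations

# Check usable parallelism loss (section 4.1.4.1) by contracting the
# edge (a, b) into one node and comparing MaxPar before and after.

def reachable(g, s):
    seen, stack = set(), [s]
    while stack:
        for t in g.get(stack.pop(), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def dependent(g, a, b):
    return a in reachable(g, b) or b in reachable(g, a)

def max_par(g, n):
    ps = sorted(m for m in g if m != n and not dependent(g, n, m))
    for k in range(len(ps), 0, -1):  # largest candidate subsets first
        for sub in combinations(ps, k):
            if all(not dependent(g, x, y) for x, y in combinations(sub, 2)):
                return k
    return 0

def merge(g, a, b):
    """Contract edge (a, b); the merged node inherits all other edges."""
    m = a + "+" + b
    ng = {u: [m if v in (a, b) else v for v in succs]
          for u, succs in g.items() if u not in (a, b)}
    ng[m] = [v for v in set(g[a]) | set(g[b]) if v not in (a, b)]
    return ng

# n1 fans out to n2, n3, n4: ParSet(n2) = {n3, n4} is an independent set.
g = {"n1": ["n2", "n3", "n4"], "n2": [], "n3": [], "n4": []}
print(max_par(g, "n2"))                        # 2
print(max_par(merge(g, "n1", "n2"), "n1+n2"))  # 0: usable parallelism lost
```

Here the merger drags all of ParSet(n2) into the merged node's dependent set, so MaxPar drops from 2 to 0, satisfying the condition MaxPar(n2) > MaxPar(n1,2).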
Usable Parallelism Lost With Respect to n2

The exact same analysis that applies to n1 applies to n2 as well.

Problem With the Above Condition

As was mentioned earlier, for condition C2,2, we assume that all sets Si's are disjoint, and that no two nodes belonging to different anti-parallel sets of ParSet(n1) can be dependent. Clearly, in general this might not be true.

4.1.5 Relationship between Parallelism and Usable Parallelism

Lemma: Given a DAG g and a node n in g, PAR(n) ≥ MaxPar(n), i.e. the parallelism with respect to n is greater than or equal to the usable parallelism with respect to n.

Lemma: Given a DAG g and a node n in g:

There is usable parallelism loss with respect to n ⇒ There is parallelism loss with respect to n.

There is parallelism loss with respect to n ⇏ There is usable parallelism loss with respect to n.

There is no parallelism loss with respect to n ⇒ There is no usable parallelism loss with respect to n.

There is no usable parallelism loss with respect to n ⇏ There is no parallelism loss with respect to n.

Lemma: Given a DAG g and a node n in g, if ParSet(n) is an independent set, then PAR(n) = MaxPar(n).

Lemma: Given a DAG g and a node n in g, if ParSet(n) is an independent set, then:

4.1.6 Upper Bound on Degree of Parallelism

Consider a DAG g and a set S of nodes belonging to g. We find the smallest k such that S = S1 ∪ S2 ∪ ... ∪ Sk ∪ Sind, where each Si is an anti-parallel set, and Sind is a set of zero or more nodes that don't belong to any dependent set. k = 0 represents the case for which S doesn't have any dependent sets (i.e. S is an independent set). Any parallel set of S contains zero or one element from each set Si, plus all elements in Sind. Therefore, the degree of parallelism in S is ≤ k + |Sind|.

Consider the case where S can be written as S = S1 ∪ S2 ∪ ...
∪ Sk ∪ Sind, where each Si is an anti-parallel set, and Sind is a set of zero or more nodes that don't belong to any dependent set, and all Si's are disjoint, and no two nodes which belong to different anti-parallel sets (from the above listed anti-parallel sets) are dependent. In this case, the degree of parallelism = k + |Sind|.

4.1.7 Theorem

Let g be a task graph. Let n1 and n2 be two nodes in g connected by an edge e = (n1, n2). Let S1 := DepSet(n1) − {n2}. Let S2 := DepSet(n2) − {n1}. Then

ParSet(n1) ∩ DepSet(n2) = ∅ AND ParSet(n2) ∩ DepSet(n1) = ∅ ⇔ ParSet(n1) = ParSet(n2) ⇔ S1 = S2.

Proof

1. Assume that: ParSet(n1) ∩ DepSet(n2) = ∅ AND ParSet(n2) ∩ DepSet(n1) = ∅.

(a) ParSet(n1) ∩ DepSet(n2) = ∅:
DepSet(n2) = {n1} ∪ S2.
∀n ∈ S2, n ∉ ParSet(n1) ⇒ ∀n ∈ S2, n ∈ DepSet(n1) (since n ≠ n1) ⇒ ∀n ∈ S2, n ∈ S1 (since n ≠ n2) ⇒ S2 ⊆ S1 (1)
∀n ∈ ParSet(n1), n ∉ DepSet(n2) ⇒ ∀n ∈ ParSet(n1), n ∈ ParSet(n2) (since n ≠ n2) ⇒ ParSet(n1) ⊆ ParSet(n2) (2)

(b) ParSet(n2) ∩ DepSet(n1) = ∅:
DepSet(n1) = {n2} ∪ S1.
∀n ∈ S1, n ∉ ParSet(n2) ⇒ ∀n ∈ S1, n ∈ DepSet(n2) (since n ≠ n2) ⇒ ∀n ∈ S1, n ∈ S2 (since n ≠ n1) ⇒ S1 ⊆ S2 (3)
∀n ∈ ParSet(n2), n ∉ DepSet(n1) ⇒ ∀n ∈ ParSet(n2), n ∈ ParSet(n1) (since n ≠ n1) ⇒ ParSet(n2) ⊆ ParSet(n1) (4)

(1) AND (3) ⇒ S1 = S2.
(2) AND (4) ⇒ ParSet(n1) = ParSet(n2).

2. Assume that: ParSet(n1) = ParSet(n2).
∀n ∈ DepSet(n2), n ∉ ParSet(n2) ⇒ ∀n ∈ DepSet(n2), n ∉ ParSet(n1) ⇒ ParSet(n1) ∩ DepSet(n2) = ∅ (5)
∀n ∈ DepSet(n1), n ∉ ParSet(n1) ⇒ ∀n ∈ DepSet(n1), n ∉ ParSet(n2) ⇒ ParSet(n2) ∩ DepSet(n1) = ∅ (6)
(5) AND (6) ⇒ S1 = S2 (by part 1).

3. Assume that: S1 = S2.
∀n ∈ ParSet(n1), n ∉ DepSet(n1) ⇒ ∀n ∈ ParSet(n1), n ∉ S1 ⇒ ∀n ∈ ParSet(n1), n ∉ S2 ⇒ ∀n ∈ ParSet(n1), n ∉ DepSet(n2) (since n ≠ n1) ⇒ ∀n ∈ ParSet(n1), n ∈ ParSet(n2) (since n ≠ n2) ⇒ ParSet(n1) ⊆ ParSet(n2) (7)
∀n ∈ ParSet(n2), n ∉ DepSet(n2) ⇒ ∀n ∈ ParSet(n2), n ∉ S2 ⇒ ∀n ∈ ParSet(n2), n ∉ S1 ⇒ ∀n ∈ ParSet(n2), n ∉ DepSet(n1) (since n ≠ n2) ⇒ ∀n ∈ ParSet(n2), n ∈ ParSet(n1) (since n ≠ n1) ⇒ ParSet(n2) ⊆ ParSet(n1) (8)
(7) AND (8) ⇒ ParSet(n1) = ParSet(n2).

Corollary

Let g be a task graph. Let n1 and n2 be two nodes in g connected by an edge e = (n1, n2). No parallelism is lost in the task graph as a result of the merger ⇔ ParSet(n1) = ParSet(n2).

Proof

No parallelism is lost as a result of the merger if and only if no parallelism is lost with respect to n1 and no parallelism is lost with respect to n2. This is true if and only if ParSet(n1) ∩ DepSet(n2) = ∅ AND ParSet(n2) ∩ DepSet(n1) = ∅. From the above theorem, that is true if and only if ParSet(n1) = ParSet(n2).

4.2 Effect of Task Merging on CPL

4.2.1 Problem Statement

In all that follows, we assume that two nodes n1 and n2 in the task graph, connected by an edge e = (n1, n2), are merged into a node n1,2.

Let pc := Pcrit of the task graph before the merger.
lb := length of Pcrit of the task graph before the merger. lb = Lb(pc). CPLb := lb.
la := length of Pcrit of the task graph after the merger.

4.2.2 Effect on Path Length

Let p be any path in the task graph. We have 3 possibilities:

1. None of the two nodes merged belongs to p.

2. Only one of the two nodes merged belongs to p.

3. Both nodes merged belong to p: ⇒ e ∈ p. To see why this is true, assume that e ∉ p. ⇒ there are two possibilities:

(a) There is a path from n1 to n2 other than (n1, n2). ⇒ After the merger, a cycle will be created.
Therefore n1 and n2 cannot be merged together.

(b) There is a path from n2 to n1. ⇒ There is a cycle before the merger, because of edge (n1, n2), which is not possible since we have a DAG.

The length of p is affected by the merger if and only if at least one of the following conditions is true:

1. A node in p is replaced by another node that has more computations (this is the case when only one of the two merged nodes belongs to p). Assuming that n1 ∈ p, L(p) is increased by comp(n2).

2. An edge in p is deleted (this is the case when e ∈ p). L(p) is reduced by comm(e).

3. An edge in p is replaced by another edge which carries more data (this is the case when two edges e1 and e2 are replaced by one edge e', and either e1 or e2 belongs to p). Assume e1 ∈ p; then L(p) is increased by delay(data(e2)).

There are 3 cases:

Case 1 None of the two nodes merged belongs to p:
La(p) = Lb(p).

Case 2 Only one of the two nodes merged (say it is n1) belongs to p:
Let np be the predecessor of n1 in p (if any). Let ns be the successor of n1 in p (if any).
La(p) = Lb(p) + comp(n2) + delay(data(np, n2)) + delay(data(n2, ns)).
Note that if (np, n2) and (n2, ns) don't exist (or if np and ns don't exist), then La(p) = Lb(p) + comp(n2).
This increase in the length of p represents a loss in parallelism and an increase in sequentialization by the amount comp(n2) + delay(data(np, n2)) + delay(data(n2, ns)) relative to path p. The terms involving the delay function are due to the fact that some inter-PE communication has to be sequentialized as a result of the merger. For instance, the increase by the amount delay(data(np, n2)) is due to the fact
that before the merger, np used to send the data on edges (np, n1) and (np, n2) to separate virtual PEs in parallel. After the merger, the data on these two edges is combined and sent to the same virtual PE. Clearly this takes more time.
In conclusion, we could have an increase in the CPL, and as a consequence the parallel execution time could increase. For an example of this, refer to figures 3.7 and 3.8. Figure 3.7 shows a task graph before merging nodes n1 and n2. Figure 3.8 shows the graph after the merger. Consider path p1 = (n3, np, n1, ns, n5, n10) in figure 3.7. After the merger, p1 = (n3, np, n1,2, ns, n5, n10).

Case 3 Both nodes merged belong to p:
Let np be the predecessor of n1 in p (if any). Let ns be the successor of n2 in p (if any).
La(p) = Lb(p) − comm(e) + delay(data(np, n2)) + delay(data(n1, ns)).
Note that if (np, n2) and (n1, ns) don't exist (or if np and ns don't exist), then La(p) = Lb(p) − comm(e).
This decrease in the length of p represents a reduction in communication overhead by the amount x = comm(e) − delay(data(np, n2)) − delay(data(n1, ns)) relative to path p (assuming that x > 0, which is true in most cases). Again, the terms involving the delay function are due to the fact that some inter-PE communication has to be sequentialized as a result of the merger.
For an example of this, refer to figures 3.7 and 3.8. Figure 3.7 shows a task graph before merging nodes n1 and n2. Figure 3.8 shows the graph after the merger. Consider path p2 = (n3, np, n1, n2, ns, n5, n10) in figure 3.7. After the merger, p2 = (n3, np, n1,2, ns, n5, n10).

4.2.3 Merging an Edge Belonging to the Critical Path

In what follows, we omit the terms involving the function delay in the expressions giving the length of a path after merging two nodes, in terms of its length before the merger. Let's assume that e ∈ Pcrit of the task graph.
Thus, La(pc) = lb − comm(e).

4.2.3.1 Effect on Execution Paths

Clearly for any execution path p in the task graph, Lb(p) ≤ lb, since lb is the CPL before the merger.

1. Any execution path p that doesn't go through any of the two nodes merged: La(p) = Lb(p).

2. Any execution path p that goes through n1 and not n2: La(p) = Lb(p) + comp(n2).

3. Any execution path p that goes through n2 and not n1: La(p) = Lb(p) + comp(n1).

4. Any execution path p that goes through edge e: La(p) = Lb(p) − comm(e).

4.2.3.2 Effect on Critical Path

Case 1 There is no execution path that goes through only one of the two nodes merged:
Thus for any execution path p, La(p) ≤ Lb(p). Also we know that Lb(p) ≤ lb. Therefore, La(p) ≤ lb. Hence, the CPL will either decrease or remain unchanged as a result of the merger. Pcrit could change as a result of the merger. We have 2 cases:

1. If all execution paths p go through e:
La(p) = Lb(p) − comm(e).
In this case, all execution paths including pc will be reduced in length by the same amount. Hence, Pcrit will not change and the CPL decreases by comm(e) as a result of the merger.

2. At least one execution path doesn't go through e:
Let p1, p2, ..., pk be such execution paths, where k ≥ 1. La(pi) = Lb(pi), 1 ≤ i ≤ k. There are 2 cases:

(a) If at least one of the pi's is such that Lb(pi) = lb:
Pcrit will change and the CPL will remain unchanged.

(b) If Lb(pi) < lb, 1 ≤ i ≤ k:
The CPL will decrease. Pcrit could change. There are 2 possibilities:

i. Lb(pi) < La(pc), 1 ≤ i ≤ k:
The CPL will decrease by comm(e). Pcrit will not change.

ii. At least one of the pi's is such that Lb(pi) > La(pc):
Pcrit will change. The CPL will decrease by an amount smaller than comm(e). Let pm be the pi such that Lb(pm) is the largest among all pi's. After the merger, Pcrit = pm and CPL = Lb(pm). The CPL will decrease by lb − Lb(pm).
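The arithmetic of Case 1 can be checked numerically. The sketch below computes the CPL of a small weighted task graph (comp maps nodes to execution times, succ maps nodes to edge communication times, delay terms omitted as in this section), then contracts an edge; the chain and its weights are invented for illustration:

```python
import functools

# CPL of a weighted task DAG before and after contracting an edge
# (delay terms omitted, as in this section of the analysis).

def cpl(comp, succ):
    @functools.lru_cache(maxsize=None)
    def longest(n):
        tails = [c + longest(s) for s, c in succ[n].items()]
        return comp[n] + (max(tails) if tails else 0)
    return max(longest(n) for n in comp)

def merge(comp, succ, a, b):
    """Contract edge (a, b) into node a+b; parallel edges keep the
    larger comm weight (a simplification of combining the data)."""
    m = a + "+" + b
    ncomp = {n: w for n, w in comp.items() if n not in (a, b)}
    ncomp[m] = comp[a] + comp[b]
    nsucc = {n: {} for n in ncomp}
    for u, edges in succ.items():
        nu = m if u in (a, b) else u
        for v, c in edges.items():
            nv = m if v in (a, b) else v
            if nu != nv:
                nsucc[nu][nv] = max(c, nsucc[nu].get(nv, 0))
    return ncomp, nsucc

# Chain n1 -> n2 -> n3: every execution path goes through e = (n1, n2),
# so merging e must lower the CPL by exactly comm(e) = 10.
comp = {"n1": 1, "n2": 1, "n3": 1}
succ = {"n1": {"n2": 10}, "n2": {"n3": 5}, "n3": {}}
print(cpl(comp, succ))                      # 18
print(cpl(*merge(comp, succ, "n1", "n2")))  # 8 = 18 - comm(e)
```

Because the chain has a single execution path, this instance falls under case 1 of the analysis above and realizes the maximum possible decrease, comm(e).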
Case 2 There is at least one execution path p that goes through only one of the two nodes merged:
Pcrit could change as a result of the merger, and the CPL could increase, since the length of p increases after the merger.
Let p1, p2, ..., pk be all execution paths that go through only one of the two nodes merged, where k ≥ 1. Let ni,1 and ni,2 be the two nodes merged, and let ni,1 be the node that belongs to pi, and let ni,2 be the other node7, 1 ≤ i ≤ k.
La(pi) = Lb(pi) + comp(ni,2), 1 ≤ i ≤ k.
We know that lb ≥ Lb(pi), 1 ≤ i ≤ k, since lb is the CPL before the merger.
There are many possibilities, depending on the lengths of the execution paths before the merger, lb, the value of comp(ni,2), the value of comm(e), etc. Since we already studied the case where no execution path goes through only one of the two nodes merged, let's assume that all execution paths (pc excluded) go through only one of the two nodes merged. This will simplify our analysis. In this case, the execution paths are p1, p2, ..., pk and pc. There are 2 possible situations:

1. If La(pi) < La(pc), 1 ≤ i ≤ k:
Pcrit will not change. The CPL will decrease by comm(e).

2. If there is at least one execution path p such that La(p) > La(pc):
Pcrit will change, but the CPL does not necessarily increase. We have 2 cases:

(a) If La(pi) ≤ lb, 1 ≤ i ≤ k:

i. If at least one of the pi's is such that La(pi) = lb:
Let p0 be this pi. After the merger, Pcrit = p0. The CPL remains unchanged.

ii. If La(pi) < lb, 1 ≤ i ≤ k:
The CPL will decrease by an amount less than comm(e). Let pm be the pi such that La(pm) is the largest among all pi's. After the merger,

7If pi goes through n1 then ni,1 is n1 and ni,2 is n2. If pi goes through n2 then ni,1 is n2 and ni,2 is n1.
Pcrit = pm and CPL = La(pm). The CPL will decrease by lb − Lb(pm) − comp(nm,2).

(b) If there is at least one execution path p such that La(p) > lb:
The CPL will increase. Let pm be the pi such that La(pm) is the largest among all pi's. After the merger, Pcrit = pm and CPL = La(pm). The CPL will increase by Lb(pm) + comp(nm,2) − lb.

4.2.3.3 Conclusion

• Merging an edge that belongs to Pcrit of the task graph does not guarantee a decrease in the CPL.

• The maximum decrease in the CPL is comm(e).

• Merging an edge that belongs to all execution paths guarantees the maximum decrease in the CPL.

• If none of the execution paths go through only one of the two nodes merged, then the CPL will either decrease or remain unchanged. Also the maximum decrease in the CPL could be achieved here.

• If at least one execution path goes through only one of the two nodes merged, then the CPL will either increase, remain unchanged or decrease. Also the maximum decrease in the CPL could be achieved here.

4.2.4 Merging an Edge Not Belonging to the Critical Path

In what follows, we omit the terms involving the function delay in the expressions giving the length of a path after merging two nodes, in terms of its length before the merger.
Let's assume that e ∉ Pcrit. Clearly, e belongs to at least one execution path8. There are 2 possible cases:

Case 1 Pcrit goes through only one of the two nodes merged:
Let nin be the node merged which belongs to pc, and let nout be the node merged which does not belong to pc.
La(pc) = lb + comp(nout).
After the merger, the CPL will increase by at least comp(nout) and Pcrit might change. There are 2 possibilities:

1. If none of the execution paths p (pc excluded) go through only one of the two nodes merged:
La(p) ≤ Lb(p). Since Lb(p) ≤ lb, then La(p) ≤ lb.
Hence the CPL will increase by comp(nout) and Pcrit will not change after the merger.

2. If at least one execution path (pc excluded) goes through only one of the two nodes merged:
Let p1, p2, ..., pk be all the execution paths that go through only one of the two nodes merged (pc excluded), k ≥ 1. Let ni,1 be the node merged which belongs to pi, and ni,2 be the node merged which does not belong to pi.
La(pi) = Lb(pi) + comp(ni,2), 1 ≤ i ≤ k.
If comp(ni,2) ≤ comp(nout) then Pcrit will not change and the CPL will increase by comp(nout). If ni,2 = nout then Pcrit will not change and the CPL will increase by comp(nout). Let pm be the pi such that La(pm) is the largest among all pi's.

8Any edge in the graph belongs to at least one execution path.

There are 2 possible cases:

(a) If La(pi) < La(pc), 1 ≤ i ≤ k:
Pcrit will not change and the CPL will increase by comp(nout).

(b) If at least one pi is such that La(pi) > La(pc):
After the merger, Pcrit = pm and CPL = La(pm). The CPL will increase by an amount greater than comp(nout). The increase in the CPL is Lb(pm) + comp(nm,2) − lb.

Case 2 Pcrit does not go through any of the two nodes merged:
La(pc) = Lb(pc) = lb.
The CPL will either increase or remain unchanged. Pcrit might change. There are 2 possibilities:

1. If no execution path p (pc excluded) goes through only one of the two nodes merged:
La(p) ≤ Lb(p). Since Lb(p) ≤ lb then La(p) ≤ lb. Thus, Pcrit and the CPL will not change.

2. If at least one execution path goes through only one of the two nodes merged:
Let p1, p2, ..., pk be all the execution paths that go through only one of the two nodes merged (pc excluded), k ≥ 1. Let ni,1 be the node merged which belongs to pi, and ni,2 be the node merged which does not belong to pi.
La(pi) = Lb(pi) + comp(ni,2), 1 ≤ i ≤ k.
Hence, Pcrit could change and the CPL could increase.
There are 2 cases:

(a) If La(pi) ≤ lb, 1 ≤ i ≤ k:
Pcrit and the CPL will not change.

(b) If at least one pi is such that La(pi) > lb:
Let pm be the pi such that La(pm) is the largest among all pi's. After the merger, Pcrit = pm and CPL = La(pm). The CPL will increase by Lb(pm) + comp(nm,2) − lb.

Conclusion

• If e ∉ Pcrit then the CPL never decreases (it will either increase or remain unchanged) after the merger.

• If Pcrit goes through only one of the two nodes merged, then the CPL will increase by at least comp(nout) after the merger, where nout is the node merged which does not belong to Pcrit.

• If Pcrit does not go through any of the two nodes merged, then the CPL will either increase or remain unchanged after the merger.

4.3 Merging Tasks: Effect of Parallelism Loss on CPL

In this section, we study the effect of parallelism loss on the CPL of the task graph. We consider two nodes n1 and n2 belonging to the task graph and connected by an edge e = (n1, n2). We study the effect of merging nodes n1 and n2 on the CPL when the merger causes parallelism loss and when it doesn't.

4.3.1 No Parallelism Loss

Theorem: Assume that ParSet(n1) = ParSet(n2), so that there is no parallelism loss when we merge n1 and n2. Then the CPL of the task graph never increases as a result of the merger.

Figure 4.6: No parallelism loss

Figure 4.7: No parallelism loss

Figure 4.8: No parallelism loss
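The theorem can also be checked on a concrete weighted graph (shape and weights invented for illustration; delay terms omitted): n1 and n2 form a chain hanging off n0, n6 is the only node parallel to both, and merging the edge (n1, n2) removes its communication cost without lengthening any path.

```python
import functools

# Concrete check: when ParSet(n1) = ParSet(n2), merging edge (n1, n2)
# does not increase the CPL (delay terms omitted).

def cpl(comp, succ):
    @functools.lru_cache(maxsize=None)
    def longest(n):
        tails = [c + longest(s) for s, c in succ[n].items()]
        return comp[n] + (max(tails) if tails else 0)
    return max(longest(n) for n in comp)

def reachable(succ, s):
    seen, stack = set(), [s]
    while stack:
        for t in succ[stack.pop()]:
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def par_set(succ, n):
    return {m for m in succ if m != n
            and m not in reachable(succ, n) and n not in reachable(succ, m)}

def merge(comp, succ, a, b):
    """Contract edge (a, b); parallel edges keep the larger comm weight."""
    m = a + "+" + b
    ncomp = {n: w for n, w in comp.items() if n not in (a, b)}
    ncomp[m] = comp[a] + comp[b]
    nsucc = {n: {} for n in ncomp}
    for u, edges in succ.items():
        nu = m if u in (a, b) else u
        for v, c in edges.items():
            nv = m if v in (a, b) else v
            if nu != nv:
                nsucc[nu][nv] = max(c, nsucc[nu].get(nv, 0))
    return ncomp, nsucc

comp = {"n0": 1, "n1": 1, "n2": 1, "n5": 1, "n6": 1}
succ = {"n0": {"n1": 2, "n6": 2}, "n1": {"n2": 10},
        "n2": {"n5": 2}, "n6": {"n5": 2}, "n5": {}}
assert par_set(succ, "n1") == par_set(succ, "n2") == {"n6"}  # no loss
print(cpl(comp, succ))                       # 18
print(cpl(*merge(comp, succ, "n1", "n2")))   # 8: the CPL decreased
```

Here every path through n1 also goes through n2 (and vice versa), which is exactly the situation the proof below formalizes.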
Proof

Let's use proof by contradiction. We assume that the CPL increases as a result of the merger. Therefore, there must exist at least one execution path p such that La(p) > CPLb. Clearly, Lb(p) ≤ CPLb. There are 3 possible cases:

1. p goes through both nodes n1 and n2.

2. p goes through n1 but not n2.

3. p goes through n2 but not n1.

The situation where p doesn't go through either of the two nodes n1 and n2 is not possible, since in that case L(p) is not affected by the merger. Let's investigate the 3 possible cases.

Case 1 p goes through both nodes n1 and n2:
Let p = (ni, ..., np, n1, n2, ns, ..., nf). For p to have the maximum increase in length after the merger, we have to have an edge (np, n2) and an edge (n1, ns) (see figure 4.6). Let ΔL := La(p) − Lb(p).
ΔL = delay(data(np, n2)) + delay(data(n1, ns)) − comm(n1, n2).
The comm function includes both the start-up component and the delay component. Since in general the start-up component is much larger than the delay component, ΔL must be negative. Furthermore, in practice the edges (np, n2) and (n1, ns) are most likely not to exist. This means that La(p) < Lb(p). Since Lb(p) ≤ CPLb, then La(p) < CPLb. This is a contradiction since we assumed that La(p) > CPLb.

Case 2 p goes through n1 and not n2:
Let p = (ni, ..., np, n1, ns, ..., nf). Since ParSet(n1) = ParSet(n2) and nodes n1 and ns are dependent, then nodes n2 and ns have to be dependent as well. Thus either there exists a path from n2 to ns or there exists a path from ns to n2. If there exists a path from ns to n2 then there exists a path from n1 to n2 other than (n1, n2).
Hence we have to disregard this case, which means that there exists a path from n 2 to ns (see figure 4.7). Let p — (T I,* , . . . , T ip, Tlj, ^ 2j • • • j ^sj • • • j Tl/). Let AZ := ZQ (p) — Z&(p'). For AZ to have its maximum value, Z0(p) has to be maximized and Lb(p) has to be minimized. Hence, the path from n2 to n , has to be constituted of the siDgle edge (7i2,n a). Also for Za(p) to be maximized, we have to have an edge (np, n 2) (see figure 4.6). Therefore, p — (ti^, •.• , T ip, tix, ti2, tia, . . . , ti^). Hence, AZ = delay(data(np, ti2)) + delay(data(ni,ns)) — comm(nu n 2). Again, AZ must be negative. Furthermore, in practice the edges (np,n 2) and (n2,n ,) are most likely not to exist. This means that Za(p) < Lb{p'). Since Z&(p') < CPLb, then Za(p) < CPLb- This is a contradiction since we assumed that Za(p) > CPLb- C ase 3 p goes through n2 and not Let p — (ti,*, . . . , T ip, T l2, T l^, • . . , Tly). Since ParSet{n x) = Par Set (1 1 2) and nodes n 2 and np axe dependent, then nodes n x and np have to be dependent as well. Thus either there exists a path from n x to np or there exists a path from np to n x. If there exists a path from 7 1 1 to T ip then there exists a path from 7 i x to n 2 other than (n i,n 2). Therefore we will have a cycle in the task graph after the merger. Hence we have to disregard this case, which means that there exists a path from np to rii (see figure 4.8). Let p — (ti,*, * . • , T ip , . . • , T lX , T l2, Tlj, . . . , Tlf'}- Let AL := La(p) - Lb(p')- For AZ to have its maximum value, Z„(p) has to be maximized and Lb(p) has to be minimized. Hence, the path from np to n x has to be constituted of the single edge (np, n x). Also for La(p) to be maximized, we have to have an edge (tii,ti4) (see figure 108 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . Merge Figure 4.9: Example: no parallelism loss 4.6). Therefore, p' = (n,-,. . . , np, ni, n 2, na, . . . 
, nf). Hence,
ΔL = delay(data(np, n1)) + delay(data(n1, ns)) − comm(n1, n2).
Again, ΔL must be negative. Furthermore, in practice the edges (np, n1) and (n1, ns) are most likely not to exist. This means that La(p) < Lb(p'). Since Lb(p') ≤ CPLb, then La(p) < CPLb. This is a contradiction since we assumed that La(p) > CPLb.
Hence, there cannot exist an execution path p such that La(p) > CPLb.

Figure 4.10: Example: no parallelism loss

Figure 4.11: Example: no parallelism loss

Examples

• Figure 4.9 shows a task graph before and after the merger.
ParSet(n1) = {n6, n8}. ParSet(n2) = {n6, n8}. ParSet(n1) = ParSet(n2). Hence, there is no parallelism loss as a result of the merger.
Before merger: CPL = 45. After merger: CPL = 35. The CPL has decreased.

• Figure 4.10 shows a task graph before and after the merger.
ParSet(n1) = {n6, n7, n8, n10, n11}. ParSet(n2) = {n6, n7, n8, n10, n11}. ParSet(n1) = ParSet(n2). Hence, there is no parallelism loss as a result of the merger.
Before merger: CPL = 67. After merger: CPL = 57. The CPL has decreased.

• Figure 4.11 shows a task graph before and after the merger.
ParSet(n1) = {n3, n6, n7, n8, n9, n10, n11, n12}. ParSet(n2) = {n3, n6, n7, n8, n9, n10, n11, n12}. ParSet(n1) = ParSet(n2). Hence, there is no parallelism loss as a result of the merger.
Before merger: CPL = 56. After merger: CPL = 56. The CPL has not changed.

4.3.2 There is Parallelism Loss

Theorem: Assume that there is parallelism loss when we merge n1 and n2.
Then the CPL of the task graph could increase as a result of the merger.

Figure 4.12: Example: there is parallelism loss
Figure 4.13: Example: there is parallelism loss
Figure 4.14: Example: there is parallelism loss
Figure 4.15: Example: there is parallelism loss
Figure 4.16: Example: there is parallelism loss

Proof

Let's prove the claim in the theorem by studying some examples. In all the figures used in the following examples, the number next to a node represents its execution time, and the number next to an edge represents the communication time caused by the edge.

• Figure 4.12 shows a task graph before and after the merger.
ParSet(n_2) = {n_3, n_4}
DepSet(n_1) = {n_2, n_4}
ParSet(n_2) ∩ DepSet(n_1) = {n_4}
Hence, there is parallelism loss with respect to n_2.
Before merger: CPL = PARTIME = 4. After merger: CPL = 5. Thus the CPL has increased. Note that before the merger, the graph had a critical path which contained n_1 and not n_2 ((n_1, n_4)), and that is why we have an increase in the CPL.

• Figure 4.13 shows the same graph as in figure 4.12, except for the weights.
Before merger: CPL = 7. After merger: CPL = 7. Thus the CPL did not change.

• Figure 4.14 shows a task graph before and after the merger.
ParSet(n_2) = {n_3, n_4, n_5}
DepSet(n_1) = {n_2, n_5}
ParSet(n_2) ∩ DepSet(n_1) = {n_5}
Hence, there is parallelism loss with respect to n_2.
Before merger: CPL = 7. After merger: CPL = 7. Thus the CPL did not change.
• Figure 4.15 shows a task graph before and after the merger.
ParSet(n_2) = {n_3, n_4, n_5, n_6}: independent set.
DepSet(n_1) = {n_2, n_3, n_4, n_5, n_6}
ParSet(n_2) ∩ DepSet(n_1) = {n_3, n_4, n_5, n_6} = ParSet(n_2)
Hence, there is parallelism and usable parallelism loss with respect to n_2.
Before merger: CPL = 12. After merger: CPL = 13. Thus the CPL has increased. Note that before the merger, the graph had at least one critical path which contained n_1 and not n_2 (e.g. (n_1, n_3)), and that is why we have an increase in the CPL.

• Figure 4.16 shows a task graph before and after the merger.
ParSet(n_2) = {n_3, n_4, n_5}: independent set.
DepSet(n_1) = {n_2, n_3, n_4, n_5, n_6}
ParSet(n_2) ∩ DepSet(n_1) = {n_3, n_4, n_5} = ParSet(n_2)
Hence, there is parallelism and usable parallelism loss with respect to n_2.
Before merger: CPL = 23. After merger: CPL = 13. The CPL has decreased.

Conclusion

• Given a task graph, if there is parallelism loss as a result of task merging, the CPL can increase, remain unchanged, or decrease.
• Given a task graph, if there is usable parallelism loss as a result of task merging, the CPL can increase, remain unchanged, or decrease.

4.4 A Comparison with DSC

The scheduling problem as defined by Tao Yang [18, 58, 59] (a sequence of task clustering steps) has some similarities with the way we define the partitioning problem (a sequence of task merging steps). Mainly, both assume the availability of an infinite number of PEs and non-zero communication overhead between PEs. As a consequence, both problems use the CPL of the task graph as the parallel execution time.
Figure 4.17: Example of task merging: n_4 and n_5 are merged into n_{4,5}. CPL = 15 before, CPL = 18 after.

However, there is a major difference, since merging tasks involves changes in the task graph⁹, whereas task clustering doesn't¹⁰. Because of this, there is a major difference in the effect of task merging and task clustering on the length of execution paths and, as a consequence, on the CPL of the task graph.

4.4.1 Task Merging

As we saw previously, task merging could increase the CPL of the task graph. As an example, consider the task graph in figure 4.17. Before the merger, the critical path is (n_4, n_5) and the CPL is 15. After the merger, both execution paths (n_1, n_3, n_5) and (n_2, n_3, n_5) increase in length by 4, since they both go through only one of the 2 nodes merged. The CPL increases to 18. This is a side effect of the merger.

⁹When two tasks are merged, they are replaced by a new task and some edges are replaced by new ones.
¹⁰Clustering simply means that all tasks in the same cluster are executed in the same PE. The only change in the task graph is the addition of pseudo-edges between independent tasks in the same cluster to impose an execution order in the PE. Also, all weights of edges between tasks in the same cluster are zeroed.

Figure 4.18: Example of task clustering: n_4 and n_5 are put in the same cluster. CPL = 15 before, CPL = 14 after.

4.4.2 Task Clustering

Task clustering rarely causes the CPL of the task graph to increase. As an example, consider the task graph in figure 4.18. Before the clustering, the critical path is (n_4, n_5) and the CPL is 15. After n_4 and n_5 are put in the same cluster, the execution paths (n_1, n_3, n_5) and (n_2, n_3, n_5) are not affected¹¹. The CPL decreases to 14.

4.4.3 Consequence

Because of the side effect caused by task merging, reducing the CPL using task merging is much harder than using task clustering.
This makes the partitioning problem (as defined in this work) much harder than the scheduling problem (as defined by Tao Yang).

¹¹The only case in which execution paths could increase in length is when pseudo-edges are added.

4.5 Criteria for Merging

In this section, we list some criteria that will be used by the partitioning heuristics to choose the edge to be merged. We use the results of the analysis in the previous sections to obtain these criteria.

• The edge has to belong to a critical path.
• An edge that belongs to all execution paths (if such an edge exists).
• An edge e such that none of the execution paths goes through only one of the 2 nodes connected by e.
• An edge e = (n_1, n_2) such that ParSet(n_1) = ParSet(n_2) (no parallelism loss as a result of the merger).
• The edge e with the largest comm(e). This way, all execution paths which go through e will decrease in length by a maximum quantity.
• The edge e = (n_1, n_2) such that comp(e) is smallest. This way, if there is an execution path p that goes through only one of the two nodes n_1 and n_2, the length of p increases by the smallest possible quantity.
• The edge e = (n_1, n_2) such that the merger causes the minimum loss in parallelism: (|ParSet(n_1)| − |ParSet(n_{1,2})|) + (|ParSet(n_2)| − |ParSet(n_{1,2})|) is the smallest.
• The edge e = (n_1, n_2) such that the merger causes the minimum loss in usable parallelism: (MaxPar(n_1) − MaxPar(n_{1,2})) + (MaxPar(n_2) − MaxPar(n_{1,2})) is the smallest.
• The edge e such that the merger causes the smallest number of execution paths to increase in length. For instance, we could choose the edge e = (n_1, n_2) such that the number of execution paths that go through only one of the 2 nodes n_1 and n_2 is minimum.
• The edge e such that the merger causes the largest number of execution paths to decrease in length. In other words, we look for the edge e such that the number of execution paths that go through e is maximum.

We have to make sure that the criteria used in our partitioning algorithm are not too costly. For instance, the last 2 criteria mentioned above have large time complexities.

Chapter 5

The Partitioning Heuristics

Note: In the partitioning algorithm, if there is more than one critical path, then choose one randomly.

Safe Edges: Let g be a task graph (DAG). An edge e = (n_1, n_2) is said to be a safe edge if merging nodes n_1 and n_2 does not cause any cycles to be created in g. Otherwise, e is said to be an unsafe edge.

A Requirement: An edge e = (n_1, n_2) in a task graph is chosen for merger if and only if e is safe. In other words, there should not exist a path from n_1 to n_2 other than (n_1, n_2).

Lemma: Let g be a task graph (DAG) and e = (n_1, n_2) be a safe edge. If a path p in g goes through both nodes n_1 and n_2, then e ∈ p.

Proof: Assume a path p in g goes through both nodes n_1 and n_2. If e ∉ p, then there are 2 possibilities:
1. There is a path from n_1 to n_2 other than (n_1, n_2). This contradicts our assumption that e is a safe edge.
2. There is a path from n_2 to n_1. This means that g has a cycle, which contradicts our assumption that g is a DAG.
Hence, e ∈ p.

Perfect Edges: We define a perfect edge to be one that belongs to all execution paths in the DAG. Otherwise, the edge is said to be an imperfect edge.

Risky Edges: An edge e = (n_1, n_2) is said to be a risky edge if there exists at least one execution path that goes through only one of the nodes n_1 and n_2.
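Before turning to the heuristics, the merge operation these definitions support can be sketched in code. This is an illustrative sketch, not the thesis's implementation: the dictionary representation of the task graph, the name given to the merged task, and the rule for combining parallel edges (keeping the larger weight) are all assumptions.

```python
def merge_tasks(comp, comm, n1, n2):
    """Merge tasks n1 and n2 of a task graph.

    comp: dict node -> execution time
    comm: dict (u, v) -> communication time (edges of the DAG)
    Returns a new (comp, comm) pair with n1 and n2 replaced by one task.
    """
    merged = n1 + "+" + n2                     # hypothetical name for the merged task
    new_comp = {n: t for n, t in comp.items() if n not in (n1, n2)}
    new_comp[merged] = comp[n1] + comp[n2]     # the merged task executes both

    new_comm = {}
    for (u, v), w in comm.items():
        if (u, v) == (n1, n2):
            continue                           # the internal edge disappears
        u2 = merged if u in (n1, n2) else u
        v2 = merged if v in (n1, n2) else v
        # If redirection creates parallel edges, keep the larger weight
        # (an assumption; summing the transferred data is also plausible).
        new_comm[(u2, v2)] = max(w, new_comm.get((u2, v2), 0))
    return new_comp, new_comm
```

A caller would still have to verify that the chosen edge is safe (no cycle created) before applying this operation.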
5.0.1 Heuristics

In what follows, we show a few heuristics that can be used to choose the edge to be merged during each iteration of the partitioning algorithm. Since these heuristics are used in each merging step (i.e. merging iteration), we also call them merging heuristics.

Heuristic 1

1. Find the heaviest safe edge e in the task graph which is perfect. If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly. If there is no such edge, go to 2; else go to 5.

2. Find the heaviest safe edge e in P_crit which is not risky. If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly. If there is no such edge, go to 3; else go to 5.

3. Find the heaviest safe edge e = (n_1, n_2) ∈ P_crit such that ParSet(n_1) = ParSet(n_2). If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly. If there is no such edge, go to 4; else go to 5.

4. Find the safe edge e = (n_1, n_2) ∈ P_crit such that
(|ParSet(n_1)| − |ParSet(n_{1,2})|) + (|ParSet(n_2)| − |ParSet(n_{1,2})|)
is the smallest among all safe edges in P_crit. If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly.

5. Merge the 2 tasks linked by e.

Heuristic 2

1. Find the heaviest safe edge e = (n_1, n_2) ∈ P_crit such that ParSet(n_1) = ParSet(n_2). If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly. If there is no such edge, go to 2; else go to 3.

2.
Find the safe edge e = (n_1, n_2) ∈ P_crit such that
(|ParSet(n_1)| − |ParSet(n_{1,2})|) + (|ParSet(n_2)| − |ParSet(n_{1,2})|)
is the smallest among all safe edges in P_crit. If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly.

3. Merge n_1 and n_2.

Heuristic 3

1. Find the heaviest safe edge e = (n_1, n_2) ∈ P_crit such that ParSet(n_1) = ParSet(n_2). If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly. If there is no such edge, go to 2; else go to 3.

2. Find the safe edge e = (n_1, n_2) ∈ P_crit such that
(MaxPar(n_1) − MaxPar(n_{1,2})) + (MaxPar(n_2) − MaxPar(n_{1,2}))
is the smallest among all safe edges in P_crit. If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly.

3. Merge n_1 and n_2.

Finding MaxPar(n_{1,2}) without doing the merger: ParSet(n_{1,2}) = ParSet(n_1) ∩ ParSet(n_2). Or better yet: express MaxPar(n_{1,2}) in terms of MaxPar(n_1) and MaxPar(n_2).

Heuristic 4

1. Find the heaviest safe edge e = (n_1, n_2) ∈ P_crit such that ParSet(n_1) = ParSet(n_2). If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly. If there is no such edge, go to 2; else go to 4.

2. Find the heaviest safe edge e = (n_1, n_2) in the task graph such that ParSet(n_1) = ParSet(n_2). If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly. If there is no such edge, go to 3; else go to 4.

3.
Find the safe edge e = (n_1, n_2) in the graph such that
(|ParSet(n_1)| − |ParSet(n_{1,2})|) + (|ParSet(n_2)| − |ParSet(n_{1,2})|)
is the smallest among all safe edges. If there is more than one such edge e, choose the one such that comp(e) has the minimal value. If there is still more than one edge that satisfies that, then choose one randomly.

4. Merge n_1 and n_2.

Heuristic 5

1. Find the safe edge e in P_crit that has the largest merge merit. If there is more than one such edge e, then choose one randomly.

2. Merge the 2 tasks linked by e.

Merge Merit of an Edge: The merge merit of an edge e is

merge(e) := α·comm(e) − β·comp(e),   α, β ≥ 0.

Determining α and β: the ratio of α to β is set according to the communication-to-computation ratio of the target machine.

Remarks

• Heuristic 4, step 2: if (n_1, n_2) ∉ P_crit, then there is no decrease in the CPL!
• Heuristics 3 and 4 have very high time complexities.
• Heuristics 1 and 2 have the lowest time complexities. Heuristic 2 is less costly than heuristic 1, but heuristic 1 is more efficient than heuristic 2.
• We chose heuristic 1 to do the performance analysis of our partitioning algorithm.

5.1 Some Properties

The following properties enable us to reduce the time complexities of the partitioning heuristics, by making it easier and quicker to find the edge to be merged.

Theorem: For any DAG g such that each node in the graph has at most one output, all edges in g are safe.

Proof: We use proof by contradiction. Assume that there exists an edge e = (n_1, n_2) ∈ g such that e is unsafe. Then there exists at least one path p from n_1 to n_2 other than (n_1, n_2). Let p = (n_1, n_x1, n_x2, ..., n_2). There are 2 possibilities:

1. n_x1 = n_2. Then p visits n_2 twice, so g is cyclic: contradiction.

2.
n_x1 ≠ n_2. Therefore n_1 has at least 2 outputs: contradiction.

Therefore there cannot be any unsafe edges in g.

Lemma: Given a task graph (DAG) and a safe edge e = (n_1, n_2) in the graph: e is not a risky edge if and only if node n_1 has only one output edge (e) and node n_2 has only one input edge (e).

Proof

1. Assume that no execution path goes through only one of the 2 nodes connected by edge e.

• If n_1 has more than one output edge (let the other output edge be e' = (n_1, n_3)), then there is at least one execution path p that goes through edge (n_1, n_3). Clearly, p cannot go through edge (n_1, n_2) (otherwise p would contain a cycle, which means that the graph is not acyclic). Thus, p cannot go through n_2 (otherwise p goes through both nodes n_1 and n_2, which implies that it goes through edge e). Hence p goes through n_1 and not n_2. This contradicts our assumption. Therefore, n_1 has only one output edge.

• If n_2 has more than one input edge (let the other input edge be e' = (n_3, n_2)), then there is at least one execution path p that goes through edge (n_3, n_2). Clearly, p cannot go through edge (n_1, n_2) (otherwise p would contain a cycle, which means that the graph is not acyclic). Thus, p cannot go through n_1 (otherwise p goes through both nodes n_1 and n_2, which implies that it goes through edge e). Hence p goes through n_2 and not n_1. This contradicts our assumption. Therefore, n_2 has only one input edge.

2. Assume that node n_1 has only one output edge (e), and node n_2 has only one input edge (e).

• Any execution path p that goes through n_1 has to go through edge e (since n_1 has only one output edge).
• Any execution path p that goes through n_2 has to go through edge e (since n_2 has only one input edge).

Hence, no execution path goes through only n_1 or only n_2.
Corollary: Given a task graph (DAG) such that each node in the graph has at most one output, and a safe edge e = (n_1, n_2) in the graph: e is not a risky edge if and only if node n_2 has only one input edge (e).

Proof: Trivial, from the previous lemma.

Lemma: Given a task graph (DAG) and an edge e = (n_1, n_2) in the graph: e is a perfect edge if and only if node n_1 has only one output edge (e) and node n_2 has only one input edge (e), AND ParSet(n_1) = ParSet(n_2) = ∅.

Proof

Assume that e is a perfect edge. If n_1 has more than one output edge or n_2 has more than one input edge, then clearly there exists at least one execution path that doesn't go through e. Hence, e is not a perfect edge, which contradicts our assumption. Therefore, node n_1 has only one output edge, and node n_2 has only one input edge. Now, let's prove that ParSet(n_1) = ParSet(n_2) = ∅. We know that all execution paths go through e. For any node n in the graph other than n_1 and n_2, n belongs to at least one execution path p. Since p goes through e, then p goes through n_1 and n_2. Hence, n and n_1 are dependent, and n and n_2 are dependent. Therefore ParSet(n_1) = ParSet(n_2) = ∅.

Lemma: Given a task graph (DAG) such that each node in the graph has at most one output, and an edge e = (n_1, n_2) in the graph: ParSet(n_1) = ParSet(n_2) if and only if node n_2 has only one input edge (e).

Proof

1. Assume that ParSet(n_1) = ParSet(n_2). If n_2 has more than one input edge, then it has at least one input edge (n_3, n_2) other than e. There can be no path from n_1 to n_3; otherwise we would have a path from n_2 to n_3 (since n_1 has only one output edge, which is (n_1, n_2)), which together with the edge (n_3, n_2) means that the task graph is cyclic. There can be no path from n_3 to n_1; otherwise we would have a path from n_2 to n_1 (since n_3 has only one output edge, which is (n_3, n_2)), which together with the edge (n_1, n_2) means that the task graph is cyclic.
Hence n_3 ∈ ParSet(n_1). Clearly, n_3 ∉ ParSet(n_2). Therefore, ParSet(n_1) ≠ ParSet(n_2), which contradicts our original assumption. Therefore, n_2 has only one input edge.

2. Assume that n_2 has only one input edge.

• ∀n ∈ ParSet(n_1), there is no path from n_1 to n and no path from n to n_1. There cannot be a path from n to n_2; otherwise we would have a path from n to n_1 (since n_2 has only one input edge). There cannot be a path from n_2 to n; otherwise we would have a path from n_1 to n (because of edge (n_1, n_2)). Hence, n ∈ ParSet(n_2). Thus, ParSet(n_1) ⊆ ParSet(n_2).

• ∀n ∈ ParSet(n_2), there is no path from n_2 to n and no path from n to n_2. There cannot be a path from n_1 to n; otherwise we would have a path from n_2 to n (since n_1 has only one output edge). There cannot be a path from n to n_1; otherwise we would have a path from n to n_2 (because of edge (n_1, n_2)). Hence, n ∈ ParSet(n_1). Thus, ParSet(n_2) ⊆ ParSet(n_1).

Therefore, ParSet(n_1) = ParSet(n_2).

Corollary: Given a task graph (DAG) such that each node in the graph has at most one output, and a safe edge e = (n_1, n_2) in the graph: ParSet(n_1) = ParSet(n_2) if and only if e is not a risky edge.

Proof: From a previous corollary and a previous lemma.

5.2 Time Complexity of the Partitioning Algorithm

Let E be the number of edges and N be the number of nodes in the program graph. Then the initial task graph will have N nodes and at most E edges.

5.2.1 DAG Traversal

As will be seen later, the partitioning algorithm requires traversal of the task graph, which is a DAG. In what follows, we describe the general procedure for DAG traversal.

Let Q be a queue (could be implemented as a linked list).
Q ← ∅.
Insert all root nodes in Q (in any order).
Repeat until Q = ∅
    n ← front of Q. Delete n from Q.
    visit(n)    % Node n is visited here.
    IF n is not a leaf node THEN
        Insert all children of n in Q
        % The way insertion is done depends on the traversal
        % type (e.g. breadth-first, depth-first).

Traversal of general DAGs is different from tree traversal. With general DAGs, if we are not careful, a node might be visited more than once. Clearly, this is not the case for trees. The reason for this is that in a general DAG, a node may have more than one input edge. In order to avoid visiting nodes more than once, when a node is put in the queue Q, it is marked as queued. After a node is visited, only its children which are not marked queued are inserted in Q. Hence the correct version of the general algorithm is as follows.

Let Q be a queue (could be implemented as a linked list).
Q ← ∅.
Insert all root nodes in Q (in any order).    % No need to mark root nodes as queued.
Repeat until Q = ∅
Then n i was never inserted in Q either. This goes on until we reach a root node r (since the graph is acyclic), and establish that r was never inserted in Q. Clearly this cannot be the case since all root nodes are inserted in Q at the beginning of the algorithm. Hence our assumption that there is a deadlock situation cannot be true. Another way to see why we cannot have any deadlock situations is to notice that starting from the root nodes, we can reach any node in the graph by following the paths emanating from the root nodes. Depth-First Traversal For depth-first traversal, the children of the node just visited are inserted at the Front of Q. The order among the children nodes in Q does not m atter. 132 R ep ro d u ced with p erm issio n o f th e copyright ow n er. Further reproduction prohibited w ithout p erm issio n . Breadth-First Traversal For breadth-first traversal, the children, of the node just visited are inserted at the Rear of Q. The order among the children nodes in Q does not m atter. Remarks 1. Breadth-first and depth-first traversals for general DAGs are different from the ones for trees. This is so because for general DAGs a node may have more than one input edge. 2. Time Complexity: Breadth-first and depth-first traversals take at the most 0 (E + N) time complexity. Proof: All nodes in the graph are inserted in Q exactly once and axe visited exactly once. This takes O(N) time. Each tim e a node n is visited, each one of its children is examined to see whether it is marked queued or not. The number of children of n is equal to the number of output edges of n. Therefore, since each node in the graph is visited exactly once, this takes 0 (E ) time. Parents-First Traversal In parents-first traversal, a node is not visited until all of its parent nodes are visited1. The idea here is to keep a counter for each node in the graph (except the root nodes). 
This counter is used to keep track of the number of parent nodes of a given node th at are already visited. When the counter of some node n is equal to the total number of parents of node n, then we know that all the parent nodes of n are already visited. A child node is inserted in the queue Q only when all of its parents are already visited. The procedure is as follows: 'T his is a traversal using a topological order. 133 R ep ro d u ced with p erm issio n o f th e copyright ow n er. Further reproduction prohibited w ithou t p erm issio n . Let Q be a queue (could be implemented as a linked list). Q <-0- Insert all root nodes in Q (in any order). FOR all non-root nodes n in the graph DO % Initialize the counters of the nodes. n.count Number of parent nodes of n Repeat until Q = 0 n < — Front of Q. Delete n from Q. visit(n) % Node n is visited here. n is marked visited. % This marking may not be needed. IF n is not a leaf node THEN FOR all children nodes n' of n DO n'.count < — n!.count — 1 % One more parent visited. IF n!.count = 0 % All parents of n' are vis­ ited. THEN Insert n' at the Rear of Q. Remarks 1. The above algorithm is similar to breadth-first traversal because the inser­ tion of the children nodes is done at the Rear of Q. We could have chosen to do the insertion of the children nodes at the Front of Q. This way the algorithm would have been similar to depth-first traversal. 2. Deadlock situations: The algorithm for parents-first traversal never leads to deadlock situations because our graph has no cycles. To see why this is the case, let’s assume that during the execution of the algorithm, we reach a deadlock situation. This means that Q is empty and there is still at least one non-visited node 134 R ep ro d u ced with p erm issio n o f th e copyright ow n er. Further reproduction prohibited w ithout p erm issio n . n. 2, n has never been inserted in Q. Hence, at least one parent (if any) ni of n hasn’t been visited yet. 
This means that ni has never been inserted in Q either. Hence, at least one parent (if any) n-i of hasn’t been visited yet. This goes on until we reach a root node r (since the graph is acyclic), and establish that r hasn’t been inserted in Q yet. Clearly this is a contradiction, since all root nodes are inserted in Q at the beginning of the algorithm. Hence it is not possible to reach any deadlock situation during the execution of the algorithm. 3. Time Complexity: The time complexity of the parents-first traversal is 0 (E + N). Proof: It takes 0 (N ) to initialize the counters of the nodes. All the nodes in the graph are inserted in Q and visited exactly once. This takes 0 (N ) time. Each time a node n is visited, each one of its children is examined (its counter is updated, and depending of the value of that counter, it may be inserted in Q). Examining a child node takes a constant amount of time. The number of children nodes of n is equal to to number of output edges of n. Hence, since each node in the graph is visited exactly once, this takes 0(E) time. 5.2.2 D eterm ining the N otions U sed by the Partitioning A lgorithm Determ ining the CPL For each node n in the graph, we define length(n) to be the length of the longest path from any input node to n, n excluded. Also, for each node n in the graph, we define pred(n) to be the predecessor node of n along the longest path from any input node to n , n included (if there is more them one such path, we choose any one of them). Furthermore, for each output node n in the graph, we define exec.path.length(n) to be the length of the longest execution path that ends in n. Finally, we define cp.last to be the last node (output node) in the critical path (if there is more than one critical path, we choose anyone of them). 135 R ep ro d u ced w ith p erm issio n of th e copyright ow ner. Further reproduction prohibited w ithou t p erm issio n . 
The idea here is to use a parents-first traversal of the graph to determine lengthen) for each node n, and pred(n) for each non-root node n in the graph. Then, we determine exec.path.length(n) for each output node n. Finally, we choose the leaf node I such that exec.path.length(Z) is the largest among ail leaf nodes. The CPL is equal to exec.path.length(/). Clearly, cp.last is I. To find the critical path Per it-, we use the function pred. I is the last node in Pa-it, h = pred(l) is the node preceding I in Pa-it, h = pred(li) is the node preceding /i in Pa-it, etc., until we reach an input node. The algorithm is as follows: Let Q be a queue (could be implemented as a linked list). Q < - 0- Insert all root nodes in Q (in any order). FOR all non-root nodes n in the graph DO % Initialize the counters of the nodes. n.count f - Number of parent nodes of n FOR all nodes n in the graph DO % Initialize length(n) for all nodes n in the graph. lengthen) < — 0 Repeat until Q = 0 n < — Front of Q. Delete n from Q. visit(n) % Node n is visited here. n is marked visited. % This marking may not be needed. IF n is not a leaf node THEN FOR all children nodes n' of n DO n'.count < — n!.count — 1 % One more parent visited. temp < — length(n) + comp(n) 4- comm(n, n') % temp is the longest path from any input 136 R ep ro d u ced with p erm issio n o f th e copyright ow n er. Further reproduction prohibited w ithout p erm issio n . node to n' % (n' excluded) that goes through edge (n, n'). IF temp > length(n') THEN % temp is the longest path that has been traversed so fax % from any input node to n . length(n') 4— temp pred(n') 4 — n IF n'.count = 0 % All parents of n' axe vis­ ited. THEN Insert n' at the Rear of Q % Find CPL CPL 4- 0 FOR all output nodes n DO exec.path.length(n) 4 — lengthen) + comp(n) IF exec.path.length(n) > CPL THEN CPL 4- exec.path.length(n) cp.last <r- n Time Complexity: The time complexity to find the CPL and the critical path of a DAG is 0 ( E + N). 
Proof: It takes O(N) to initialize the counters of the nodes. It takes 0 ( N ) to initialize length(n) for all nodes n. Each node in the graph is inserted in Q and visited exactly once. This takes 0(N ) time. 137 R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm issio n . Each time a node n is visited, each one of its children n' is examined (its counter is updated and depending of the value of that counter it may be inserted in Q, variable temp is calculated and depending on its value lengthen') and pred(n') could be updated). Examining a child node takes a constant amount of time. The number of children nodes of n is equal to to number of output edges of n. Hence, since each node in the graph is visited exactly once, this takes 0 (E ) time. To determine the CPL and critical path of the graph (once lengthen) and pred(n) for all nodes n has been determined), each leaf node I is examined (exec.path.length(Z) is calculated, and depending on its value CPL and cp.la.st could be updated). Examining I takes a constant amount of time. Hence this takes O(N) time at the most. Finally, starting from cp.last and tracing back along the critical path until we reach an input node takes at the most O(N) time. Perfect Edges Consider an edge e = (n i,n 2). From a previous lemma, we know that if rii has more than one output edge or n2 has more that one input edge, then e is not a perfect edge. This check can be done in constant (0(1)) time. However, if ni has exactly one output edge and n2 has exactly one input edge, then e could be either perfect or imperfect. In this case, we do a special kind of graph traversal to determine whether the edge is perfect or imperfect. The way we traverse the graph is as follows: We do a complete graph traversal (e.g. breadth-first or depth-first) in the usual way with the following exception: when node ni is visited, its child n2 is not inserted in the queue Q. 
If any leaf node is visited, then edge e is not perfect. Otherwise (if no leaf node is visited), edge e is perfect. Hence, whenever a node is visited, we check whether it is a leaf node or not. If it is, we can stop the search immediately and conclude that e is not a perfect edge. If after the search is over none of the nodes visited is a leaf node, then e is a perfect edge. The idea behind the above procedure is to traverse the graph without going through edge e. By not inserting n_2 in the queue Q, we don't traverse edge e. If a leaf node is reached, then there must exist at least one path from an input node to an output node which does not go through e. This means that there must exist at least one execution path which doesn't go through e. If none of the leaf nodes is reached, then there cannot be any path from an input node to an output node which doesn't go through e. This means that all execution paths must go through e.

Time Complexity: The graph traversal described above takes at the most O(E + N) time. Therefore, we need O(E + N) time to determine whether an edge is perfect or imperfect.

Remark: The above procedure can be used to check whether edge e is perfect or not even when n_1 has more than one output edge or n_2 has more than one input edge.

Safe Edges

An edge e = (n_1, n_2) is safe if its merger does not result in a cycle. For this to be true, there should not be a path from n_1 to n_2 other than (n_1, n_2) before the merger. The procedure here is almost the same as the one for perfect edges described above and is as follows: we do a graph traversal (e.g. breadth-first or depth-first) starting from node n_1 (initially the queue Q has only node n_1, instead of the root nodes). After n_1 is visited, all of its children nodes are inserted in Q except for node n_2.
This way, we traverse all paths emanating from n_1 which don't go through edge e. If node n_2 is visited, then we can stop the traversal immediately and conclude that e is not a safe edge (i.e. there must exist at least one path from n_1 to n_2 other than (n_1, n_2)). If after the traversal is complete node n_2 is not visited, then we can conclude that e is a safe edge (i.e. there cannot be a path from n_1 to n_2 other than (n_1, n_2)).

Time Complexity: In the worst case, we will traverse the whole graph (except for edge e). Therefore, the above procedure takes at most O(E + N) time.

Risky Edges

From a previous lemma, we know that an edge e = (n_1, n_2) is not risky if and only if n_1 has only one output edge and n_2 has only one input edge. This check can be done in constant time. Hence we need O(1) time to find out whether an edge is risky or not.

Determining DepSet(n)

Let n be a node in the DAG. First, we do a traversal of the graph starting from node n (initially the queue Q has only node n, instead of the root nodes). This will give us all nodes n' such that there is a path from n to n'. Second, we do a backwards traversal of the graph starting from node n (we follow the opposite direction of the edges). This will give us all nodes n' such that there is a path from n' to n. Initially, we set DepSet(n) to ∅. Each time we visit a node (other than n), we add it to DepSet(n). In the worst case, this takes O(E + N) time (complete graph traversal).

Determining ParSet(n)

Let n be a node in the DAG. One way to determine ParSet(n) is to first determine DepSet(n). By doing that, all nodes n' in the DAG such that n and n' are dependent are marked visited. Initially, we set ParSet(n) to ∅.
Then we do a complete traversal of the graph, and any node which was not marked visited by the traversal to determine DepSet(n) is added to ParSet(n). Note that we have to distinguish between the nodes that are marked visited during the graph traversal to determine DepSet(n), and during the complete graph traversal to determine ParSet(n). This can be done easily by using different markings (for instance, when we are determining DepSet(n), visited nodes are marked with the letter 'D', and when we are determining ParSet(n), visited nodes are marked with the letter 'P'). It takes O(E + N) time to determine DepSet(n). Then it takes O(E + N) to do the complete graph traversal to determine ParSet(n). Hence, it takes O(E + N) time to determine ParSet(n).

5.2.3 Time Complexity Using Heuristic 1

Each iteration of the partitioning algorithm consists of choosing the edge to be merged using some heuristic; then the edge chosen is merged². In what follows, we determine the cost of each step of a merging iteration using heuristic 1.

Step 1: First we determine the critical path and the CPL of the task graph³. Then for each edge e in the critical path, we check whether e is a safe edge. For all safe edges s found, we check whether s is a perfect edge. Finally, among all edges that are found to be both safe and perfect (if any), we choose the heaviest one. It takes O(E + N) time to determine the critical path. This path has at the most E edges. For each edge e in the critical path, it takes O(E + N) time to determine whether e is safe and perfect or not. Given the m edges that are found to be perfect and safe (0 ≤ m ≤ E), it takes O(m) time at the most to determine the heaviest one. Hence the total time complexity for step 1 is O(E(E + N)).
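Step 1 amounts to a filter-and-max over the critical-path edges. A minimal Python sketch follows; passing the edge weight and the safe/perfect tests in as functions is an assumption made for the example, and the name step1_pick_edge is hypothetical.

```python
def step1_pick_edge(cp_edges, weight, is_safe, is_perfect):
    """Among the critical-path edges that are both safe and perfect,
    return the heaviest one, or None if no edge qualifies.

    cp_edges:   edges of the critical path, in any order
    weight:     function edge -> communication cost of the edge
    is_safe,
    is_perfect: predicates implementing the tests of Section 5.2
    """
    # Keep only edges that pass both tests (safe first, as in the text).
    candidates = [e for e in cp_edges if is_safe(e) and is_perfect(e)]
    if not candidates:
        return None
    # Choosing the heaviest candidate takes O(m) time.
    return max(candidates, key=weight)
```

With m candidates this final selection is the O(m) scan mentioned above; the O(E(E + N)) bound comes from running the safe/perfect tests on each critical-path edge.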
Step 2: The critical path and all m (0 ≤ m ≤ E) safe edges in it are determined in step 1. Among these safe edges, we determine the ones that are not risky. This takes O(m) time. Among the m' (0 ≤ m' ≤ m) non-risky edges found, we determine the heaviest one. This takes at the most O(m') time. Hence, it takes at the most O(E) time for step 2.

²This is called a merging iteration.
³These are actually determined before we start the current merging iteration.

Step 3: If the DAG is such that each node has at most one output edge, then using a previous corollary regarding risky edges we conclude that step 2 and step 3 are exactly the same, and therefore step 3 is skipped. Otherwise, we do the following. The critical path and all m (0 ≤ m ≤ E) safe edges in it are determined in step 1. Among these safe edges e = (n_1, n_2), we determine the ones such that ParSet(n_1) = ParSet(n_2). Therefore for all nodes n that belong to such edges, we need to determine ParSet(n). Since there are at most N nodes along the critical path, this takes at the most O(N(E + N)) time. Given 2 sets S_1 and S_2 that have m_1 and m_2 elements respectively, it takes at the most O(m_1·m_2) time to find out whether S_1 = S_2. ParSet(n_1) and ParSet(n_2) have at most N elements each. Hence it takes at the most O(N²) time to find out whether ParSet(n_1) = ParSet(n_2). This check has to be done for all the m safe edges. Hence the total time this takes cannot be more than O(E·N²). Determining the heaviest edge among all edges found (if any) cannot cost more than O(E). Therefore, step 3 takes O(E·N²) time.

Step 4: The critical path and all m (0 ≤ m ≤ E) safe edges in it are determined in step 1. Also, for each safe edge e = (n_1, n_2), we determined ParSet(n_1) and ParSet(n_2) in step 3, and we need to determine ParSet(n_{1,2}) = ParSet(n_1) ∩ ParSet(n_2).
Given 2 sets S_1 and S_2 that have m_1 and m_2 elements respectively, it takes O(m_1·m_2) time at the most to determine S_1 ∩ S_2. Since ParSet(n_1) and ParSet(n_2) have at most N elements each, it takes at the most O(N²) time to compute ParSet(n_{1,2}). This computation has to be done for each one of the m safe edges found. This takes at the most O(E·N²) time. Therefore, it takes at the most O(E·N²) time to execute step 4.

Step 5: As we saw before, it takes at the most O(N) time to explicitly merge 2 tasks.

Conclusion: A merging iteration using heuristic 1 costs at the most O(E(E + N²)). Since there are N − 1 merging iterations in the partitioning algorithm, the total cost of the partitioning algorithm is O(EN(E + N²)).

Relationship between E and N: Let E and N be the total number of edges and nodes respectively in a DAG g. E is equal to the sum of the number of output edges of all nodes n in g⁴:

E = Σ_{n in g} (number of output edges of n).

For each node n in g, the number m of output edges of n is such that 0 ≤ m ≤ N − 1⁵. Therefore 0 ≤ E ≤ N(N − 1). E = 0 is the case when all nodes are output nodes. E = N(N − 1) is the case when there is an edge from each node n in g to all other nodes in g. These 2 cases never occur in practice. In fact, the case when E = N(N − 1) doesn't occur even in theory, since the graph is acyclic⁶.

Another Expression for Time Complexity: Since E ≤ N², the time complexity of the partitioning algorithm using heuristic 1 can be written⁷ as O(E·N³).

Over-Estimation of Time Complexity: In the previous analysis, we over-estimated the time complexity of the partitioning algorithm because we had to assume the worst-case scenario.

⁴We assume that the input nodes don't have any input edges.
⁵m = 0 is the case when n is an output node; m = N − 1 is the case when there is an edge from n to each other node in the graph.
⁶If there is an edge from each node n in g to all other nodes in g, then clearly the graph will have cycles.
⁷Since in this case O(E + N²) is the same as O(N²).

For instance, we used E and N for the numbers of edges and nodes respectively along the critical path. Also we used N for the number of elements in ParSet(n) for various nodes n. In addition, we used E for the number of safe edges along the critical path. Clearly for real applications, the actual numbers are usually much smaller than that. Finally, as was mentioned before, merging 2 tasks takes O(N) time in the worst case, but it takes a constant amount of time in the average case (for real applications).

Average Time Complexity:

Assume:
Average number of nodes along critical path: constant.
Average number of edges along critical path: constant.
Average number of elements in ParSet(n): constant.

Then:
Step 1: O(E + N).
Step 3: O(E + N).
Step 4: O(E + N).

Hence, the average time complexity of the partitioning algorithm using heuristic 1 is O(N(E + N)).

Remark: It is very difficult to determine the average number of edges and nodes along the critical path. These numbers do not necessarily depend on E and N. For instance, we could have a DAG with a very large number of nodes that has a short critical path (i.e. a DAG with a large width), and a DAG with a much smaller number of nodes that has a longer critical path (i.e. a DAG with a small width).
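Before leaving this chapter, the two traversal-based edge tests described above (perfect edges and safe edges) can be sketched concretely; both are breadth-first traversals that simply refuse to cross the edge e = (n_1, n_2) under test. The Python below is an illustrative sketch only — the adjacency-dictionary representation and the function names are assumptions made for the example.

```python
from collections import deque

def is_perfect(children, e):
    """e = (n1, n2) is perfect iff every execution path goes through e.
    Traverse the whole graph from the roots without crossing e; if no
    leaf is reached, every input-to-output path must use e."""
    n1, n2 = e
    parents = {n: 0 for n in children}
    for cs in children.values():
        for c in cs:
            parents[c] = parents.get(c, 0) + 1
    q = deque(n for n in children if parents.get(n, 0) == 0)  # roots
    seen = set(q)
    while q:
        n = q.popleft()
        if not children.get(n):        # a leaf was reached: not perfect
            return False
        for c in children[n]:
            if n == n1 and c == n2:    # do not traverse edge e
                continue
            if c not in seen:
                seen.add(c)
                q.append(c)
    return True                        # no leaf reachable without e

def is_safe(children, e):
    """e = (n1, n2) is safe iff merging it creates no cycle, i.e. there
    is no path from n1 to n2 other than the edge (n1, n2) itself."""
    n1, n2 = e
    q, seen = deque([n1]), {n1}
    while q:
        n = q.popleft()
        for c in children.get(n, []):
            if n == n1 and c == n2:    # skip edge e itself
                continue
            if c == n2:                # another n1 -> n2 path exists
                return False
            if c not in seen:
                seen.add(c)
                q.append(c)
    return True
```

On a diamond r → {a, b} → s, the edge (a, s) is not perfect (the path r, b, s avoids it), while on the chain r → a → s it is; and in a graph with both edge (r, b) and path r → a → b, the edge (r, b) is not safe.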
Chapter 6

Performance Analysis

6.1 Partitioning Fork and Join DAGs

Since a DAG is composed of fork and join components (Tao Yang: [18, 58, 59]), we study the performance of our partitioning algorithm on these primitive structures to further understand its behavior.

6.1.1 Fork DAGs

Consider the fork DAG shown in figure 6.1. Each c_i is the communication cost of edge (r, n_i), and each e_i is the execution cost of node n_i. Also, e is the execution cost of the root node r. Without loss of generality, assume that the leaf nodes are sorted such that c_i + e_i ≥ c_{i+1} + e_{i+1}, 1 ≤ i ≤ m − 1.

[Figure 6.1: Fork DAG]

Optimal Partition

The critical path of the fork DAG shown in figure 6.1 is (r, n_1). Hence initially, the CPL is l_0 = e + c_1 + e_1. Clearly, the CPL of the optimal task graph is l_opt ≤ l_0. The main thing to notice here is that whenever an edge (r, n_i) is merged, all the other execution paths (r, n_j) (j ≠ i) are increased in length by e_i. If l_opt = l_0 then the initial task graph is the optimal one. If l_opt < l_0 then in the optimal partition, nodes r and n_1 have to belong to the same task. Otherwise, l_opt cannot be smaller than l_0. Hence, edge (r, n_1) has to be merged. The result of this merger is shown in figure 6.2 (a). The new critical path is (r_1, n_2), and the new CPL is l_1 = e + e_1 + c_2 + e_2. Again, l_opt ≤ l_1. If l_opt = l_1 then the task graph in figure 6.2 (a) is the optimal one. If l_opt < l_1 then in the optimal partition, the root node r_1 and node n_2 have to belong to the same task. Otherwise, l_opt cannot be smaller than l_1. Hence, edge (r_1, n_2) has to be merged. The result of this merger is shown in figure 6.2 (b). The new critical path is (r_2, n_3), and the new CPL is l_2 = e + e_1 + e_2 + c_3 + e_3. This process goes on, and after the k'th merging step, the task graph is shown in figure 6.2 (c).
The critical path of this task graph is (r_k, n_{k+1}), and its CPL is l_k = e + e_1 + e_2 + ··· + e_k + c_{k+1} + e_{k+1}. This process could go on until all tasks are merged together and the task graph is constituted of a single task. The CPL in this case is l_m = e + e_1 + e_2 + ··· + e_m. Note that the order of merging the edges in the graph does not matter, and we always get to the same result. In other words, to obtain task {r, n_1, n_2, ..., n_k}, we need to merge the k edges whose costs are c_1, c_2, ..., c_k in any order. The above intuitive analysis leads us to the following theorem.

[Figure 6.2: Determining optimal partition for fork DAGs — (a) tasks r and n_1 are merged together; (b) tasks r_1 and n_2 are merged together; (c) tasks r, n_1, ..., n_k are merged together]

[Figure 6.3: Proof: optimal partition for fork DAGs]

Theorem: The optimal partition for the fork DAG is constituted of the following tasks: {r, n_1, n_2, ..., n_k}, {n_{k+1}}, {n_{k+2}}, ..., {n_m}, where 0 ≤ k ≤ m. For k = 0 the optimal partition is the trivial partition, and for k = m the optimal partition is the singleton partition.

Proof: First, let's prove that the optimal partition is constituted of the following tasks: {r, N_1, N_2, ..., N_k}, {N_{k+1}}, {N_{k+2}}, ..., {N_m}, where 0 ≤ k ≤ m and N_i = some n_j (i not necessarily equal to j). In other words, in the optimal partition, there is a task containing r and zero or more other n_i's, and the rest of the n_i's are in separate tasks (i.e. these tasks are constituted of a single n_i). This is also equivalent to saying that in the optimal partition, any task that doesn't contain r cannot contain more than a single n_i.
Intuitively, since merging 2 independent tasks together doesn't reduce communication cost, it will never decrease the parallel execution time (CPL); therefore, any task in the optimal partition should not consist entirely of n_i's. Hence, any task which does not contain r is constituted of a single n_i.

[Figure 6.4: Proof: optimal partition for fork DAGs]

To prove this more formally, let's show that any partition Π that has a task τ not containing r and containing more than one n_i has a CPL that is greater than or equal to the one of the partition obtained from Π by putting each n_i that belongs to τ in a separate task. Without any loss of generality, let τ = {n_i, n_j} (i < j). The task graph of Π is shown in figure 6.3. Clearly, task T contains r¹. x_k is the execution cost of task T_k. y_k is the communication cost of edge (T, T_k). x is the execution cost of task T. The communication cost of edge (T, τ) is y = c_i + delay(data(T, τ)). Also without any loss of generality, assume that y_k + x_k ≥ y_{k+1} + x_{k+1}, 1 ≤ k ≤ p − 1. The CPL of Π is l = x + max(y_1 + x_1, y + e_i + e_j). Now consider the partition Π' obtained from Π by putting n_i and n_j in separate tasks. The task graph corresponding to Π' is shown in figure 6.4. The CPL of Π' is l' = x + max(y_1 + x_1, c_i + e_i). Since y + e_i + e_j > c_i + e_i, then l ≥ l'.

¹T could be constituted entirely of r.

[Figure 6.5: Proof: optimal partition for fork DAGs]

[Figure 6.6: Proof: optimal partition for fork DAGs]

[Figure 6.7: Proof: optimal partition for fork DAGs]
[Figure 6.8: Proof: optimal partition for fork DAGs]

Now let's prove the claim in the theorem. First we use an intuitive reasoning. Let's assume that the claim is not correct. Hence there exists at least one n_s, s > k, such that the optimal partition is constituted of the following tasks: τ = {r, n_1, n_2, ..., n_q, n_s, n_{q+2}, n_{q+3}, ..., n_k}, {n_{k+1}}, {n_{k+2}}, ..., {n_{s−1}}, {n_{s+1}}, {n_{s+2}}, ..., {n_m}. Refer to figure 6.5. The task graph corresponding to this partition is shown in figure 6.6. The CPL of this optimal partition is l_opt = e + e_1 + e_2 + ··· + e_q + e_{q+2} + e_{q+3} + ··· + e_k + e_s + c_{q+1} + e_{q+1}. Consider the partition constituted of the following tasks: {r, n_1, ..., n_q}, {n_{q+1}}, {n_{q+2}}, ..., {n_m}. Its CPL is l = e + e_1 + e_2 + ··· + e_q + c_{q+1} + e_{q+1}. Note that l < l_opt, which should not be. Hence we have a contradiction, and therefore our assumption is not possible. Then the claim of the theorem is correct.

Now let's prove the claim in the theorem more formally. Assume that the claim is not correct. The optimal partition Π_opt is constituted of the following tasks:
τ = {r, N_1, N_2, ..., N_k}, {N_{k+1}}, {N_{k+2}}, ..., {N_m}, where 0 ≤ k ≤ m and N_i = some n_j (i not necessarily equal to j). Without any loss of generality, assume that k ≥ 1. When k = 0, the optimal partition is the trivial partition, and therefore the claim is true. Also, without any loss of generality, assume that the N_i's (1 ≤ i ≤ k) in τ are ordered such that if N_j = n_{j1} and N_{j+1} = n_{j2} then j2 > j1, 1 ≤ j ≤ k − 1. Let N_0 = r and n_0 = r. Let p be the smallest integer such that N_p = n_p and N_{p+1} ≠ n_{p+1}, 0 ≤ p ≤ k − 1. For instance, if τ = {r, n_1, n_2, n_3, n_5, ...} then p = 3. If τ = {r, n_1, n_5, ...} then p = 1. If τ = {r, n_2, ...} then p = 0. We have τ = {r, n_1, n_2, ..., n_p, N_{p+1}, N_{p+2}, ..., N_k}, 0 ≤ p ≤ k − 1. Clearly, N_i ≠ n_{p+1}, p + 1 ≤ i ≤ k. Therefore, n_{p+1} is in a task by itself. The task graph corresponding to Π_opt is shown in figure 6.7. Since one of the leaf nodes is n_{p+1} and all n_i's such that 1 ≤ i ≤ p are in task τ, the critical path of this task graph is (τ, n_{p+1}). Hence the CPL of Π_opt is l_opt = comp(τ) + comm(τ, n_{p+1}) + comp(n_{p+1}). Let the execution time of N_i be E_i, 1 ≤ i ≤ m. comp(τ) = e + e_1 + e_2 + ··· + e_p + E_{p+1} + E_{p+2} + ··· + E_k. comm(τ, n_{p+1}) = c_{p+1}. comp(n_{p+1}) = e_{p+1}. Hence l_opt = e + e_1 + e_2 + ··· + e_p + E_{p+1} + E_{p+2} + ··· + E_k + c_{p+1} + e_{p+1}. Consider the partition Π constituted of the following tasks: τ' = {r, n_1, n_2, ..., n_p}, {n_{p+1}}, {n_{p+2}}, ..., {n_m}. The task graph corresponding to Π is shown in figure 6.8. Its critical path is (τ', n_{p+1}) and its CPL is l = e + e_1 + e_2 + ··· + e_p + c_{p+1} + e_{p+1}. Note that l < l_opt, which means that we have a contradiction. Therefore, our assumption is not correct, and the claim of the theorem is correct.

Using Heuristic 1

The critical path of the fork DAG in figure 6.1 is (r, n_1). Hence the first edge chosen to be merged is (r, n_1). The result of this merger is shown in figure 6.2 (a). The new critical path is (r_1, n_2). Hence the next edge to be merged is (r_1, n_2). The resulting task graph is shown in figure 6.2 (b). This process goes on, and at the k'th merging step, edge (r_{k−1}, n_k)² is chosen for merger. The result of this merger is shown in figure 6.2 (c). This merging process goes on until we reach the singleton partition. It is quite clear that using Heuristic 1, the partitions that we get during the iterations of the partitioning algorithm are constituted of the following tasks: {r, n_1, n_2, ..., n_k}, {n_{k+1}}, {n_{k+2}}, ..., {n_m}, where 0 ≤ k ≤ m.
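These successive CPLs are easy to enumerate from the costs, using the merging-step formula l_k = e + e_1 + ··· + e_k + c_{k+1} + e_{k+1} (and l_m = e + Σ e_i for the singleton partition). The Python sketch below uses small assumed costs chosen for illustration; they are not taken from any figure in the text.

```python
def fork_cpls(e, exec_costs, comm_costs):
    """Return [l_0, ..., l_m] for a fork DAG with root cost e, leaf
    execution costs exec_costs = [e_1..e_m] and edge communication
    costs comm_costs = [c_1..c_m], with the leaves pre-sorted so that
    c_i + e_i is non-increasing."""
    m = len(exec_costs)
    ls = []
    for k in range(m):
        # l_k = e + e_1 + ... + e_k + c_{k+1} + e_{k+1}
        ls.append(e + sum(exec_costs[:k]) + comm_costs[k] + exec_costs[k])
    ls.append(e + sum(exec_costs))   # l_m: all tasks merged together
    return ls

# Illustrative fork DAG: root cost 1, three leaves sorted by c_i + e_i.
cpls = fork_cpls(1, [1, 1, 1], [10, 5, 2])
best_k = min(range(len(cpls)), key=cpls.__getitem__)
```

For these assumed costs the CPLs are 12, 8, 6, 4: each merge reduces the CPL, so the singleton partition Π_3 is optimal here. With smaller communication costs the minimum would occur at an intermediate k instead.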
From the above theorem, we conclude that our algorithm always leads to the optimal partition.

²Assume that r_0 is defined to be r, so that when k = 1, r_{k−1} = r_0 = r.

Some Analysis

Let Π_k be the partition constituted of the following tasks: {r, n_1, n_2, ..., n_k}, {n_{k+1}}, {n_{k+2}}, ..., {n_m}, where 0 ≤ k ≤ m. Let l_k be the critical path length of Π_k. From the above theorem, we know that the optimal partition is one of the Π_k's, 0 ≤ k ≤ m. Note that using our partitioning algorithm with Heuristic 1, Π_k is the partition obtained after the k'th merging step, and l_k is the CPL of the task graph after the k'th merging step. The task graph corresponding to partition Π_k is shown in figure 6.2 (c), with r_0 and e_0 defined to be r and e respectively.

l_k = e + e_1 + e_2 + ··· + e_k + e_{k+1} + c_{k+1}, 0 ≤ k ≤ m − 1.
l_m = e + e_1 + e_2 + ··· + e_m.
Note that l_0 = e + e_1 + c_1.
l_k − l_{k−1} = e_{k+1} + c_{k+1} − c_k, 1 ≤ k ≤ m − 1.

Hence, for 1 ≤ k ≤ m − 1, we have
l_k < l_{k−1} ⟺ c_k > e_{k+1} + c_{k+1} AND
l_k ≥ l_{k−1} ⟺ c_k ≤ e_{k+1} + c_{k+1}.

l_k − l_m = c_{k+1} − e_{k+2} − e_{k+3} − ··· − e_m, 0 ≤ k ≤ m − 1.

Hence, for 0 ≤ k ≤ m − 2, we have
l_k < l_m ⟺ c_{k+1} < e_{k+2} + e_{k+3} + ··· + e_m.

Note that l_{m−1} − l_m = c_m. Hence l_{m−1} ≥ l_m, and therefore Π_{m−1} can never be the optimal partition.

Corollary: Assume that there exists an integer q, 1 ≤ q ≤ m − 2, such that
∀ k, 1 ≤ k ≤ q, c_k > e_{k+1} + c_{k+1} AND
∀ k, q + 1 ≤ k ≤ m − 1, c_k ≤ e_{k+1} + c_{k+1} AND
c_{q+1} < e_{q+2} + e_{q+3} + ··· + e_m.
Then Π_q is the optimal partition.

Proof: Using the analysis above, we obtain the following:
∀ k, 1 ≤ k ≤ q, l_k < l_{k−1} AND
∀ k, q + 1 ≤ k ≤ m − 1, l_k ≥ l_{k−1} AND
l_q < l_m.
Hence, l_q ≤ l_k, 0 ≤ k ≤ m. Therefore, Π_q is the optimal partition.

Examples

1. Consider the fork DAG shown in figure 6.9 (a). Using the above corollary with q = 2, we conclude that Π_2 is the optimal partition.
[Figure 6.9: Examples of fork DAGs]

2. Consider the fork DAG shown in figure 6.9 (b). Using the above corollary with q = 1, we conclude that Π_1 is the optimal partition.

3. Consider the fork DAG shown in figure 6.9 (c). Here we cannot apply the above corollary because q does not exist. l_0 = 70, l_1 = 65, l_2 = 75, l_3 = 70, l_4 = 71, l_5 = 67, l_6 = 52. Therefore, Π_6 is the optimal partition.

4. Consider the fork DAG shown in figure 6.9 (d). Here we cannot apply the above corollary because q does not exist. l_0 = 70, l_1 = 65, l_2 = 75, l_3 = 70, l_4 = 72, l_5 = 74, l_6 = 73. Therefore, Π_1 is the optimal partition.

5. Consider the fork DAG shown in figure 6.9 (e). Here we cannot apply the above corollary because q does not exist. l_0 = 69, l_1 = 85, l_2 = 80, l_3 = 68, l_4 = 70, l_5 = 71, l_6 = 70. Therefore, Π_3 is the optimal partition.

6. Consider the fork DAG shown in figure 6.9 (f). Here we cannot apply the above corollary because q does not exist. l_0 = 70, l_1 = 65, l_2 = 70, l_3 = 65, l_4 = 61, l_5 = 62, l_6 = 61. Therefore, Π_4 and Π_6 are the optimal partitions.

7. Consider the fork DAG shown in figure 6.9 (g). Using the above corollary with q = 3, we conclude that Π_3 is the optimal partition.

Using Sarkar's Partitioning Method

Examples

1. Consider the fork DAG shown in figure 6.9 (a). The edges in the graph sorted in decreasing communication cost are as follows: (r, n_1), (r, n_2), (r, n_3), (r, n_4), (r, n_5), (r, n_6). Initially, the CPL of the graph is 33.
Merging edge (r, n_1) results in a CPL of 30, and therefore it is accepted. Merging edge (r, n_2) results in a CPL of 29, and therefore it is accepted. Merging edge (r, n_3) results in a CPL of 34, and therefore it is not accepted. Merging edge (r, n_4) results in a CPL of 35, and therefore it is not accepted. Merging edge (r, n_5) results in a CPL of 33, and therefore it is not accepted. Finally, merging edge (r, n_6) results in a CPL of 32, and therefore it is not accepted. Hence, the partition obtained using Sarkar's method is Π_2, which is, as was seen earlier, the optimal partition.

2. Consider the fork DAG shown in figure 6.9 (b). The edges in the graph sorted in decreasing communication cost are as follows: (r, n_1), (r, n_3), (r, n_2), (r, n_5), (r, n_4), (r, n_6). Initially, the CPL of the graph is 70. Merging edge (r, n_1) results in a CPL of 65, and therefore it is accepted. Merging edge (r, n_3) results in a CPL of 70, and therefore it is not accepted. Merging edge (r, n_2) results in a CPL of 75, and therefore it is not accepted. Merging edge (r, n_5) results in a CPL of 70, and therefore it is not accepted. Merging edge (r, n_4) results in a CPL of 80, and therefore it is not accepted. Finally, merging edge (r, n_6) results in a CPL of 75, and therefore it is not accepted. Hence, the partition obtained using Sarkar's method is Π_1, which is, as was seen earlier, the optimal partition.

3. Consider the fork DAG shown in figure 6.9 (c). The edges in the graph sorted in decreasing communication cost are as follows: (r, n_1), (r, n_3), (r, n_2), (r, n_4), (r, n_5), (r, n_6). Initially, the CPL of the graph is 70. Merging edge (r, n_1) results in a CPL of 65, and therefore it is accepted. Merging edge (r, n_3) results in a CPL of 70, and therefore it is not accepted. Merging edge (r, n_2) results in a CPL of 75, and therefore it is not accepted. Merging edge (r, n_4) results in a CPL of 70, and therefore it is not accepted.
Merging edge (r, n_5) results in a CPL of 66, and therefore it is not accepted. Finally, merging edge (r, n_6) results in a CPL of 66, and therefore it is not accepted. Hence, the partition obtained using Sarkar's method is Π_1. As was seen earlier, the optimal partition for this case is Π_6. Hence, Sarkar's method does not lead to the optimal partition.

4. Consider the fork DAG shown in figure 6.9 (e). The edges in the graph sorted in decreasing communication cost are as follows: (r, n_2), (r, n_1), (r, n_3), (r, n_5), (r, n_4), (r, n_6). Initially, the CPL of the graph is 69. Merging edge (r, n_2) results in a CPL of 84, and therefore it is not accepted. Merging edge (r, n_1) results in a CPL of 85, and therefore it is not accepted. Merging edge (r, n_3) results in a CPL of 79, and therefore it is not accepted. Merging edge (r, n_5) results in a CPL of 70, and therefore it is not accepted. Merging edge (r, n_4) results in a CPL of 72, and therefore it is not accepted. Finally, merging edge (r, n_6) results in a CPL of 75, and therefore it is not accepted. Hence, the partition obtained using Sarkar's method is Π_0. As was seen earlier, the optimal partition for this case is Π_3. Hence, Sarkar's method does not lead to the optimal partition.

5. Consider the fork DAG shown in figure 6.9 (f). The edges in the graph sorted in decreasing communication cost are as follows: (r, n_1), (r, n_2), (r, n_3), (r, n_4), (r, n_5), (r, n_6). Initially, the CPL of the graph is 70. Merging edge (r, n_1) results in a CPL of 65, and therefore it is accepted. Merging edge (r, n_2) results in a CPL of 70, and therefore it is not accepted. Merging edge (r, n_3) results in a CPL of 75, and therefore it is not accepted. Merging edge (r, n_4) results in a CPL of 70, and therefore it is not accepted. Merging edge (r, n_5) results in a CPL of 66, and therefore it is not accepted.
Finally, merging edge (r, n_6) results in a CPL of 70, and therefore it is not accepted. Hence, the partition obtained using Sarkar's method is Π_1. As was seen earlier, Π_4 and Π_6 are the optimal partitions. Hence, Sarkar's method does not lead to the optimal partition.

6. Consider the fork DAG shown in figure 6.9 (g). The edges in the graph sorted in decreasing communication cost are as follows: (r, n_1), (r, n_2), (r, n_3), (r, n_5), (r, n_4), (r, n_6). Initially, the CPL of the graph is 70. Merging edge (r, n_1) results in a CPL of 65, and therefore it is accepted. Merging edge (r, n_2) results in a CPL of 60, and therefore it is accepted. Merging edge (r, n_3) results in a CPL of 59, and therefore it is accepted. Merging edge (r, n_5) results in a CPL of 62, and therefore it is not accepted. Merging edge (r, n_4) results in a CPL of 64, and therefore it is not accepted. Finally, merging edge (r, n_6) results in a CPL of 67, and therefore it is not accepted. Hence, the partition obtained using Sarkar's method is Π_3, which is, as was seen earlier, the optimal partition.

Theorem: If there exists an integer q, 1 ≤ q ≤ m − 2, such that
∀ k, 1 ≤ k ≤ q, c_k > e_{k+1} + c_{k+1} AND
∀ k, q + 1 ≤ k ≤ m − 1, c_k ≤ e_{k+1} + c_{k+1} AND
c_{q+1} < e_{q+2} + e_{q+3} + ··· + e_m,
then Sarkar's method finds the optimal partition Π_q.

Proof: Clearly, in this case we have ∀ k, 1 ≤ k ≤ q, c_k > c_{k+1}. Also, since c_q > e_{q+1} + c_{q+1}, and we know that c_i + e_i ≥ c_{i+1} + e_{i+1}, 1 ≤ i ≤ m − 1, then c_q > c_k + e_k, q + 1 ≤ k ≤ m. Therefore, c_q > c_k, q + 1 ≤ k ≤ m. Therefore, the edges are merged in the following order: (r, n_1), (r, n_2), ..., (r, n_q), ....

[Figure 6.10: Join DAG]

The merging of (r, n_1) leads to partition Π_1.
Since l_1 < l_0, this merger is accepted. Next, (r, n_2) is merged and we get partition Π_2. Since l_2 < l_1, this merger is accepted. Since l_k < l_{k−1}, 1 ≤ k ≤ q, this process goes on until we reach partition Π_q, which is the optimal partition. The fork DAGs in figure 6.9 (a), (b) and (g) are examples of such a situation.

Theorem: Assume that the optimal partition of the fork DAG is Π_p, 1 ≤ p ≤ m, and that Π_i, 0 ≤ i ≤ p − 1, is not an optimal partition. If there exists an s, 1 ≤ s ≤ p, such that merging edge (r, n_s) is not accepted (i.e. we have an increase in the CPL), then Sarkar's method does not find the optimal partition.

Proof: We know that if Π_i is an optimal partition, then i ≥ p. If edge (r, n_s) is not merged, then we can never obtain any partition Π_i, s ≤ i ≤ m. Hence, the optimal partition can never be obtained. The fork DAGs in figure 6.9 (c), (e) and (f) are examples of such a situation.

6.1.2 Join DAGs

Consider the join DAG shown in figure 6.10. Each c_i is the communication cost of edge (n_i, r), and each e_i is the execution cost of node n_i. Also, e is the execution cost of the leaf node r. Without loss of generality, assume that the root nodes are sorted such that c_i + e_i ≥ c_{i+1} + e_{i+1}, 1 ≤ i ≤ m − 1. The case of join DAGs is the same as the one for fork DAGs, and the analysis here is the same as for fork DAGs. We just have to reverse the direction of the edges in the graph. The critical path of the join DAG shown in figure 6.10 is (n_1, r). Hence initially, the CPL is l_0 = e + c_1 + e_1. Again, the main thing to notice here is that whenever an edge (n_i, r) is merged, all the other execution paths (n_j, r) (j ≠ i) are increased in length by e_i.

Theorem: The optimal partition for the join DAG is constituted of the following tasks: {r, n_1, n_2, ..., n_k}, {n_{k+1}}, {n_{k+2}}, ...,
{n_m}, where 0 ≤ k ≤ m. For k = 0 the optimal partition is the trivial partition, and for k = m the optimal partition is the singleton partition.

Proof: The proof is exactly the same as the one for the theorem for fork DAGs.

Using Heuristic 1

The critical path of the join DAG in figure 6.10 is (n_1, r). Hence the first edge chosen to be merged is (n_1, r). The result of this merger is shown in figure 6.11 (a). The new critical path is (n_2, r_1). Hence the next edge to be merged is (n_2, r_1). The resulting task graph is shown in figure 6.11 (b). This process goes on, and at the k'th merging step, edge (n_k, r_{k−1})³ is chosen for merger. The result of this merger is shown in figure 6.11 (c). This merging process goes on until we reach the singleton partition.

³Assume that r_0 is defined to be r, so that when k = 1, r_{k−1} = r_0 = r.

Figure 6.11: Merging steps for join DAGs using our heuristic. (a) Tasks r and n_1 are merged together. (b) Tasks r_1 and n_2 are merged together. (c) Tasks r, n_1, ..., n_k are merged together.

It is quite clear that using Heuristic 1, the partitions that we get during the iterations of the partitioning algorithm are constituted of the following tasks: {r, n_1, n_2, ..., n_k}, {n_{k+1}}, {n_{k+2}}, ..., {n_m}, where 0 ≤ k ≤ m. From the above theorem, we conclude that our algorithm always leads to the optimal partition.

Some Analysis

Let Π_k be the partition constituted of the following tasks: {r, n_1, n_2, ..., n_k}, {n_{k+1}}, {n_{k+2}}, ..., {n_m}, where 0 ≤ k ≤ m. Let l_k be the critical path length of Π_k. From the above theorem, we know that the optimal partition is one of the Π_k's, 0 ≤ k ≤ m.
Note that using our partitioning algorithm with Heuristic 1, Π_k is the partition obtained after the k'th merging step, and l_k is the CPL of the task graph after the k'th merging step. The task graph corresponding to partition Π_k is shown in figure 6.11 (c), with r_0 and e_0 defined to be r and e respectively.

l_k = e + e_1 + e_2 + ... + e_k + e_{k+1} + c_{k+1}, 0 ≤ k ≤ m − 1.
l_m = e + e_1 + e_2 + ... + e_m.
Note that l_0 = e + e_1 + c_1.
l_k − l_{k−1} = e_{k+1} + c_{k+1} − c_k, 1 ≤ k ≤ m − 1.

Hence, for 1 ≤ k ≤ m − 1, we have
l_k < l_{k−1} ⟺ c_k > e_{k+1} + c_{k+1} AND
l_k > l_{k−1} ⟺ c_k < e_{k+1} + c_{k+1}.

l_k − l_m = c_{k+1} − e_{k+2} − e_{k+3} − ... − e_m, 0 ≤ k ≤ m − 1.

Hence, for 0 ≤ k ≤ m − 2, we have
l_k < l_m ⟺ c_{k+1} < e_{k+2} + e_{k+3} + ... + e_m.

Note that l_{m−1} − l_m = c_m. Hence l_{m−1} > l_m, and therefore Π_{m−1} can never be the optimal partition.

Corollary: Assume that there exists an integer q, 1 ≤ q ≤ m − 2, such that
∀k, 1 ≤ k ≤ q, c_k > e_{k+1} + c_{k+1} AND
∀k, q + 1 ≤ k ≤ m − 1, c_k < e_{k+1} + c_{k+1} AND
c_{q+1} < e_{q+2} + e_{q+3} + ... + e_m.
Then Π_q is the optimal partition.

Proof: The proof is exactly the same as for the corollary for fork DAGs.

Examples: The same examples that were used for fork DAGs can be used here. We just have to reverse the direction of the edges in the graph.

Using Sarkar's Partitioning Method

The case for join DAGs is the same as the one for fork DAGs.

Theorem: If there exists an integer q, 1 ≤ q ≤ m − 2, such that
∀k, 1 ≤ k ≤ q, c_k > e_{k+1} + c_{k+1} AND
∀k, q + 1 ≤ k ≤ m − 1, c_k < e_{k+1} + c_{k+1} AND
c_{q+1} < e_{q+2} + e_{q+3} + ... + e_m,
then Sarkar's method finds the optimal partition Π_q.

Proof: Exactly the same as for the case of fork DAGs.

Theorem: Assume that the optimal partition of the join DAG is Π_p, 1 ≤ p ≤ m, and that Π_i, 0 ≤ i ≤ p − 1, is not an optimal partition.
If there exists an s, 1 ≤ s ≤ p, such that merging edge (n_s, r) is not accepted (i.e., we have an increase in the CPL), then Sarkar's method does not find the optimal partition.

Proof: Exactly the same as for the case of fork DAGs.

6.2 Partitioning Complete Binary Trees

In this section, we assume that the program graph to be partitioned is a complete binary tree. We also assume that all actors in the graph have the same weight, and all edges in the graph have the same weight.

Usefulness

Binary trees represent many useful problems, such as search, sort, finding the minimum or maximum of a list, numerical algorithms constituted of binary operators only, and divide-and-conquer problem-solving techniques. Furthermore, it was shown (see [15]) that many efficient algorithms for several scheduling problems use binary trees (generally complete binary trees). Dekel and Sahni ([15]) used binary trees to design efficient parallel algorithms. More precisely, they used binary trees for parallel computations, and showed that the binary tree is an important and very useful program graph for parallel algorithms.

Definitions Related to Trees

Consider a directed graph constituted of a tree. We assume that all leaf nodes are at the first level (top-most, level 1) of the tree and the root node is at the last level (bottom level, level N, where N is the number of levels in the tree). The tree is thus upside down: the leaf nodes are at the top and the root node is at the bottom. The direction of the arcs is from smaller to larger levels.
6.2.1 Properties

Property 1: G-Trees

Given a program graph g that is a complete binary tree, such that all actors in the graph have the same weight and all edges in the graph have the same weight. For any task graph T that corresponds to a partition of g, there is a task graph T' such that the CPL of T' is less than or equal to the CPL of T, and T' has the following properties: T' is a tree such that
• The number of input arcs of any node is a power of 2.
• All nodes (tasks) at the same level have the same number of actors (and as a consequence the same weight).
• The number of actors in any task (node in the task graph) is equal to the number of input edges of the node corresponding to the task minus 1 (clearly, this is not the case for nodes at level 1).
• The sum of the number of actors in all nodes (i.e., tasks corresponding to the nodes) in T' except the nodes at the top-most level is 2^a − 1, where 2^a is the number of nodes at the top-most level of the tree.

T' is called a G-tree (G stands for Good). Clearly all G-trees satisfy the following 2 properties:
• T' is a complete tree and all execution paths have the same length.
• All nodes at the top-most level (level 1) represent tasks that have 2^{N−a} − 1 actors, where N is the total number of levels in the tree and there are 2^a nodes at the top-most level of the tree.

Property 2

This is a direct consequence of Property 1. Given a program graph g that is a complete binary tree, such that all actors in the graph have the same weight and all edges in the graph have the same weight: the optimal task graph is a G-tree.

6.2.2 Optimal Partition

One way to figure out the optimal partition for complete binary trees with N levels is to find all possible task graphs which are G-trees and determine the one with the minimum CPL. That task graph is the optimal one (from Property 2).
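Property 1 can be checked numerically. The sketch below is an illustration only (the function names are ours): it encodes a G-tree by a tuple of positive integers (a_1, ..., a_m) summing to N — the encoding formalized in the next subsection — where level j of the G-tree has 2^(a_{j+1} + ... + a_m) nodes of 2^{a_j} − 1 actors each, and verifies that the actors of the original complete binary tree are exactly accounted for.

```python
# A G-tree for an N-level complete binary tree is described by positive
# integers (a_1, ..., a_m) with a_1 + ... + a_m = N.
def gtree_levels(a):
    """Per level of the G-tree (top level first): (node count, actors per node)."""
    levels = []
    for j, aj in enumerate(a):
        nodes = 2 ** sum(a[j + 1:])      # level j+1 has 2^(a_{j+2}+...+a_m) nodes
        actors_per_node = 2 ** aj - 1    # each task at this level holds 2^a_j - 1 actors
        levels.append((nodes, actors_per_node))
    return levels

def total_actors(a):
    return sum(nodes * actors for nodes, actors in gtree_levels(a))

# Every G-tree of a 4-level complete binary tree covers all 2^4 - 1 = 15 actors
# (the per-level counts telescope to 2^N - 1).
for a in [(1, 1, 2), (2, 2), (4,), (1, 1, 1, 1)]:
    assert total_actors(a) == 2 ** sum(a) - 1
```

The telescoping sum over levels is exactly why the "sum of actors below the top level is 2^a − 1" bullet of Property 1 holds.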
Hence, we determine all possible G-trees with 1 level, all possible G-trees with 2 levels, ..., all possible G-trees with N − 1 levels, and finally all possible G-trees with N levels. Clearly the task graph consisting of a single node is the only G-tree with 1 level. Also, the task graph corresponding to the trivial partition is the only G-tree with N levels.

Example

Assume that our program graph is a complete binary tree with 4 levels. Figure 6.12 shows all G-trees with 2 levels. Figure 6.13 shows all G-trees with 3 levels. The number next to each node is the number of actors in the task corresponding to the node.

CPL of G-trees

Assume that our program graph is a complete binary tree with N levels. Consider all G-trees with m levels (1 ≤ m ≤ N). For a path p, let A(p) be the sum of the numbers of actors in all nodes in p. Clearly for any G-tree T, A(p) is the same for any execution path p of T. Also, any execution path of T is a critical path. Let us define A(T) to be A(p), where p is any execution path of T. For a G-tree T with m levels, the possible values for A(T) are
(2^{a_1} − 1) + (2^{a_2} − 1) + ... + (2^{a_m} − 1),
where a_1, a_2, ..., a_m are positive integers (not zero) such that a_1 + a_2 + ... + a_m = N.

Figure 6.12: All G-trees with 2 levels

Figure 6.13: All G-trees with 3 levels

(2^{a_1} − 1) is the number of actors in each node at level 1.
(2^{a_2} − 1) is the number of actors in each node at level 2.
...
(2^{a_m} − 1) is the number of actors in each node at level m.

Let e be the execution time of each actor in the original program graph. Let c be the communication time of each edge in the graph⁴.
Hence, the possible values for the CPL of T are
CPL_m = [(2^{a_1} − 1) + (2^{a_2} − 1) + ... + (2^{a_m} − 1)] * e + (m − 1) * c,
where a_1, a_2, ..., a_m are positive integers (not zero) such that a_1 + a_2 + ... + a_m = N.

Examples

1. Let N = 4 and m = 3. Let us find all possible values of (a_1, a_2, a_3). The solution of a_1 + a_2 + a_3 = 4 is: {(1,1,2), (1,2,1), (2,1,1)}. Hence there are 3 possible G-trees with 3 levels. These are shown in figure 6.13.

2. Let N = 6 and m = 3. In this case a_1 + a_2 + a_3 = 6. There are 10 possible values for (a_1, a_2, a_3). The solution is: {(1,1,4), (1,2,3), (1,3,2), (1,4,1), (2,1,3), (2,2,2), (2,3,1), (3,1,2), (3,2,1), (4,1,1)}. Hence there are 10 possible G-trees with 3 levels.

Minimal CPL

Assume that our program graph is a complete binary tree with N levels. The G-tree with m levels which has the minimal CPL among all G-trees with m levels is one for which CPL_m is minimal. Hence, we find a_1, a_2, ..., a_m such that A_m = 2^{a_1} + 2^{a_2} + ... + 2^{a_m} is minimal given that a_1, a_2, ..., a_m are positive integers (not zero) such that a_1 + a_2 + ... + a_m = N.

⁴Note that the weight of the edges in any task graph is the same as the weight of the edges in the original program graph.

A_m is minimal when the a_i's are chosen in the following manner: we divide N as evenly as possible among a_1, a_2, ..., a_m. In other words, if N is a multiple of m, then each a_i will take the value N/m (using integer division). Otherwise, (N MOD m) of the a_i's will take the value (N/m + 1) and the rest take the value N/m (using integer division). Hence when N is not a multiple of m, we have more than one solution for (a_1, a_2, ..., a_m).

Figure 6.14: Optimal G-tree with 3 levels

Examples

1. Let N = 4 and m = 3. The G-tree with 3 levels for which (a_1, a_2, a_3) = (2,1,1) has minimal CPL among all G-trees with 3 levels.
Also the G-tree with 3 levels for which (a_1, a_2, a_3) = (1,2,1) has minimal CPL among all G-trees with 3 levels.

2. Let N = 6 and m = 3. The G-tree with 3 levels and (a_1, a_2, a_3) = (2,2,2) has minimal CPL among all G-trees with 3 levels.
(2^{a_1} − 1) = 3 is the number of actors in each node at level 1.
(2^{a_2} − 1) = 3 is the number of actors in each node at level 2.
(2^{a_3} − 1) = 3 is the number of actors in each node at level 3.
Using Property 1 for G-trees (first point), it is easy to construct this G-tree. This is shown in figure 6.14. The number next to each node is the number of actors in the task corresponding to the node.

Finding Optimal Partition

Given a complete binary tree with N levels as a program graph, one way to find the optimal partition is to determine:
1- Π_1: the G-tree with one level which has minimal CPL among all G-trees with one level.
2- Π_2: the G-tree with 2 levels which has minimal CPL among all G-trees with 2 levels.
...
m- Π_m: the G-tree with m levels which has minimal CPL among all G-trees with m levels.
...
N- Π_N: the G-tree with N levels which has minimal CPL among all G-trees with N levels.
The Π_i which has the smallest CPL among all Π_i's corresponds to the optimal partition.
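The enumeration just described can be carried out mechanically. A minimal sketch (illustrative only; the function names are ours, not the thesis's): generate, for each number of levels m, every composition (a_1, ..., a_m) of N, evaluate CPL_m, and keep the overall minimum.

```python
from itertools import combinations

def compositions(N, m):
    """All ordered tuples of m positive integers summing to N."""
    for cuts in combinations(range(1, N), m - 1):
        bounds = (0,) + cuts + (N,)
        yield tuple(bounds[i + 1] - bounds[i] for i in range(m))

def cpl(a, e, c):
    """CPL_m = [sum of (2^a_i - 1)] * e + (m - 1) * c."""
    return sum(2 ** ai - 1 for ai in a) * e + (len(a) - 1) * c

def optimal_cpl(N, e, c):
    """Minimum CPL over all G-trees, i.e. over all compositions of N."""
    return min(cpl(a, e, c) for m in range(1, N + 1) for a in compositions(N, m))

# N = 5, e = 1, c = 5: the optimum is 15, attained by (a_1, a_2) = (3, 2).
assert optimal_cpl(5, 1, 5) == 15
```

This search is exhaustive (there are 2^{N−1} compositions in total), so it is only practical for small N; the even-division rule of the Minimal CPL paragraph is what makes large N tractable.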
Hence the CPL of the optimal partition can be expressed as

CPL_opt = MIN{ (2^N − 1) * e,
MIN{ [(2^{a_1} − 1) + (2^{a_2} − 1)] * e + c / a_1 + a_2 = N },
MIN{ [(2^{a_1} − 1) + (2^{a_2} − 1) + (2^{a_3} − 1)] * e + 2 * c / a_1 + a_2 + a_3 = N },
MIN{ [(2^{a_1} − 1) + (2^{a_2} − 1) + (2^{a_3} − 1) + (2^{a_4} − 1)] * e + 3 * c / a_1 + a_2 + a_3 + a_4 = N },
...,
MIN{ [(2^{a_1} − 1) + (2^{a_2} − 1) + ... + (2^{a_N} − 1)] * e + (N − 1) * c / a_1 + a_2 + ... + a_N = N } }

Note that {[(2^{a_1} − 1) + (2^{a_2} − 1) + ... + (2^{a_N} − 1)] * e + (N − 1) * c / a_1 + a_2 + ... + a_N = N} = {[(2^{a_1} − 1) + (2^{a_2} − 1) + ... + (2^{a_N} − 1)] * e + (N − 1) * c / a_1 = a_2 = ... = a_N = 1} = {N * e + (N − 1) * c}.

Remark: In general, there could be more than one optimal solution.

A Quicker Way to Find an Optimal Partition

In general, it is not necessary to find all Π_i's in order to find the optimal partition. There are two tricks to do that:

1. It is enough to find Π_1, Π_2, ..., Π_m, where m is such that any G-tree with m + 1 or more levels has a CPL greater than or equal to l_min, and l_min = MIN{CPL(Π_i), 1 ≤ i ≤ m}, where CPL(Π_i) is defined to be the CPL of partition Π_i. Note that if N is large, this method will take too much time, and the next method should be used.

2. Let f(i) := CPL of Π_i. The general shape of the plot of f(i) is shown in figure 6.15⁵. Hence, all we need to do is to find Π_i such that f(i − 1) ≥ f(i) and f(i + 1) ≥ f(i). Π_i corresponds to the optimal partition⁶. In general, by determining several points on the curve of the function f (i.e., we determine several Π_j's and their corresponding f(j)'s), we can determine i. This is shown in the following examples.

Figure 6.15: Plot of f(i)

⁵Note that it is possible for Π_1 to correspond to the optimal partition. For instance, if (N, c, e) = (4, 10, 1) then Π_1 corresponds to the optimal partition.
⁶If Π_1 is the optimal G-tree, then i = 1 and f(i − 1) doesn't make any sense. If f(2) ≥ f(1) then we know that Π_1 is the optimal G-tree.

Figure 6.16: Optimal task graph

Example 1

Assume that our program graph is a complete binary tree with 5 levels. The total number of nodes in the graph is 2^5 − 1 = 31. Assume that the execution time of the actors is 1 and the communication cost of edges is 5.

1. The G-tree with one level is constituted of 1 node. The corresponding task of this node contains 31 actors. Hence the CPL is 31.

2. G-tree with 2 levels that has minimum CPL: a_1 = 5/2 + 1 = 3. a_2 = 5/2 = 2. The corresponding CPL is [(2^3 − 1) + (2^2 − 1)] * 1 + (2 − 1) * 5 = 15.

3. G-tree with 3 levels that has minimum CPL: a_1 = 5/3 + 1 = 2. a_2 = 5/3 + 1 = 2. a_3 = 5/3 = 1. The corresponding CPL is [(2^2 − 1) + (2^2 − 1) + (2^1 − 1)] * 1 + (3 − 1) * 5 = 17.

4. G-tree with 4 levels that has minimum CPL: a_1 = 5/4 + 1 = 2. a_2 = 5/4 = 1. a_3 = 5/4 = 1. a_4 = 5/4 = 1. The corresponding CPL is [(2^2 − 1) + (2^1 − 1) + (2^1 − 1) + (2^1 − 1)] * 1 + (4 − 1) * 5 = 21.

5. The G-tree with 5 levels is the original program graph. Its corresponding CPL is (1 + 1 + 1 + 1 + 1) * 1 + (5 − 1) * 5 = 25.

Hence, the G-tree with 2 levels and (a_1, a_2) = (3, 2) is an optimal task graph (i.e., the partition it represents is optimal). This G-tree is shown in figure 6.16. The number next to each node is the number of actors in the task corresponding to the node.

Figure 6.17: Optimal task graph

Remark: It was not necessary to look at G-trees with 4 and 5 levels. We could have stopped when we found the best G-tree with 1, 2 or 3 levels. This is true for the following reason: knowing that the best G-tree (i.e.
the one that has minimal CPL) with 1, 2 or 3 levels has a CPL of 15, and that any G-tree with 4 or more levels has a CPL greater than or equal to (4 * 1 + 3 * 5) = 19, we can conclude that the best G-tree found for 1, 2 or 3 levels is the optimal task graph.

Figure 6.18: Optimal task graph

Example 2

Assume that our program graph is a complete binary tree with 6 levels. The total number of nodes in the graph is 2^6 − 1 = 63. Assume that the execution time of the actors is 1 and the communication cost of edges is 5.

1. The G-tree with one level is constituted of 1 node. The corresponding task of this node contains 63 actors. Hence the CPL is 63.

2. G-tree with 2 levels that has minimum CPL: a_1 = 6/2 = 3. a_2 = 6/2 = 3. The corresponding CPL is [(2^3 − 1) + (2^3 − 1)] * 1 + (2 − 1) * 5 = 19.

3. G-tree with 3 levels that has minimum CPL: a_1 = 6/3 = 2. a_2 = 6/3 = 2. a_3 = 6/3 = 2. The corresponding CPL is [(2^2 − 1) + (2^2 − 1) + (2^2 − 1)] * 1 + (3 − 1) * 5 = 19.

4. G-tree with 4 levels that has minimum CPL: a_1 = 6/4 + 1 = 2. a_2 = 6/4 + 1 = 2. a_3 = 6/4 = 1. a_4 = 6/4 = 1. The corresponding CPL is [(2^2 − 1) + (2^2 − 1) + (2^1 − 1) + (2^1 − 1)] * 1 + (4 − 1) * 5 = 23.

5. Any G-tree with 5 or more levels: its corresponding CPL is greater than or equal to (1 + 1 + 1 + 1 + 1) * 1 + (5 − 1) * 5. Thus CPL ≥ 25.

Hence, the G-tree with 2 levels and (a_1, a_2) = (3, 3) is an optimal task graph. This G-tree is shown in figure 6.17. Also, the G-tree with 3 levels and (a_1, a_2, a_3) = (2,2,2) is an optimal task graph. This G-tree is shown in figure 6.18.
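The per-level minima computed in Examples 1 and 2 follow directly from the even-division rule. A small sketch (the function name is ours, chosen for illustration):

```python
def min_cpl_m(N, m, e, c):
    """Minimal CPL over G-trees with m levels: split N as evenly as
    possible among a_1..a_m, so (N mod m) of the a_i equal N//m + 1
    and the rest equal N//m."""
    q, r = divmod(N, m)
    actors = r * (2 ** (q + 1) - 1) + (m - r) * (2 ** q - 1)
    return actors * e + (m - 1) * c

# Example 1 (N = 5, e = 1, c = 5): CPLs 31, 15, 17, 21, 25 for m = 1..5.
assert [min_cpl_m(5, m, 1, 5) for m in range(1, 6)] == [31, 15, 17, 21, 25]
# Example 2 (N = 6, e = 1, c = 5): m = 2 and m = 3 tie at 19.
assert min_cpl_m(6, 2, 1, 5) == min_cpl_m(6, 3, 1, 5) == 19
```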
Example 3

Assume that our program graph is a complete binary tree with 5 levels. The total number of nodes in the graph is 2^5 − 1 = 31. Assume that the execution time of the actors is 1 and the communication cost of edges is 10.

1. The G-tree with one level is constituted of 1 node. The corresponding task of this node contains 31 actors. Hence the CPL is 31.

2. G-tree with 2 levels that has minimum CPL: a_1 = 3. a_2 = 2. The corresponding CPL is [(2^3 − 1) + (2^2 − 1)] * 1 + (2 − 1) * 10 = 20.

3. G-tree with 3 or more levels: the corresponding CPL is greater than or equal to (1 + 1 + 1) * 1 + (3 − 1) * 10. Hence CPL ≥ 23.

Hence, the G-tree with 2 levels and (a_1, a_2) = (3, 2) is an optimal task graph (i.e., the partition it represents is optimal). This G-tree is shown in figure 6.16.

Example 4

Assume that our program graph is a complete binary tree with 6 levels. The total number of nodes in the graph is 2^6 − 1 = 63. Assume that the execution time of the actors is 1 and the communication cost of edges is 10.

1. The G-tree with one level is constituted of 1 node. The corresponding task of this node contains 63 actors. Hence the CPL is 63.

2. G-tree with 2 levels that has minimum CPL: a_1 = 3. a_2 = 3. The corresponding CPL is [(2^3 − 1) + (2^3 − 1)] * 1 + (2 − 1) * 10 = 24.

3. G-tree with 3 levels that has minimum CPL: a_1 = 2. a_2 = 2. a_3 = 2. The corresponding CPL is [(2^2 − 1) + (2^2 − 1) + (2^2 − 1)] * 1 + (3 − 1) * 10 = 29.

4. G-tree with 4 or more levels: the corresponding CPL is greater than or equal to (1 + 1 + 1 + 1) * 1 + (4 − 1) * 10. Hence CPL ≥ 34.

Hence, the G-tree with 2 levels and (a_1, a_2) = (3, 3) is an optimal task graph. This G-tree is shown in figure 6.17.

Example 5

Assume that our program graph is a complete binary tree with 100 levels. The total number of nodes in the graph is 2^100 − 1.
Assume that the execution time of the actors is 1 and the communication cost of edges is 10.

• Π_10: a_1 = a_2 = ... = a_10 = 10. f(10) = 10320.
• Π_20: a_1 = a_2 = ... = a_20 = 5. f(20) = 810.
• Π_50: a_1 = a_2 = ... = a_50 = 2. f(50) = 640.
• Π_25: a_1 = a_2 = ... = a_25 = 4. f(25) = 615.
• Π_30: a_1 = a_2 = ... = a_10 = 4. a_11 = a_12 = ... = a_30 = 3. f(30) = 580.
• Π_40: a_1 = a_2 = ... = a_20 = 3. a_21 = a_22 = ... = a_40 = 2. f(40) = 590.
• Π_35: a_1 = a_2 = ... = a_30 = 3. a_31 = a_32 = ... = a_35 = 2. f(35) = 565.
• Π_38: a_1 = a_2 = ... = a_24 = 3. a_25 = a_26 = ... = a_38 = 2. f(38) = 580.
• Π_36: a_1 = a_2 = ... = a_28 = 3. a_29 = a_30 = ... = a_36 = 2. f(36) = 570.
• Π_34: a_1 = a_2 = ... = a_32 = 3. a_33 = a_34 = 2. f(34) = 560.
• Π_33: a_1 = 4. a_2 = a_3 = ... = a_33 = 3. f(33) = 559.
• Π_32: a_1 = a_2 = a_3 = a_4 = 4. a_5 = a_6 = ... = a_32 = 3. f(32) = 566.

Hence, Π_33 is the optimal G-tree and the optimal CPL is 559.

Example 6

Assume that our program graph is a complete binary tree with 50 levels. The total number of nodes in the graph is 2^50 − 1. Assume that the execution time of the actors is 1 and the communication cost of edges is 10.

• Π_25: a_1 = a_2 = ... = a_25 = 2. f(25) = 315.
• Π_20: a_1 = a_2 = ... = a_10 = 3. a_11 = a_12 = ... = a_20 = 2. f(20) = 290.
• Π_15: a_1 = a_2 = ... = a_5 = 4. a_6 = a_7 = ... = a_15 = 3. f(15) = 285.
• Π_14: a_1 = a_2 = ... = a_8 = 4. a_9 = a_10 = ... = a_14 = 3. f(14) = 292.
• Π_16: a_1 = a_2 = 4. a_3 = a_4 = ... = a_16 = 3. f(16) = 278.
• Π_17: a_1 = a_2 = ... = a_16 = 3. a_17 = 2. f(17) = 275.
• Π_18: a_1 = a_2 = ... = a_14 = 3. a_15 = a_16 = ... = a_18 = 2. f(18) = 280.

Hence, Π_17 is the optimal G-tree and the optimal CPL is 275.
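The hand search of Examples 5 and 6 can be mechanized. A sketch (the function names are ours; it assumes, as figure 6.15 suggests, that f decreases monotonically to its minimum and then increases):

```python
def f(N, m, e, c):
    """CPL of the best G-tree with m levels: divide N as evenly as
    possible among a_1..a_m (N mod m of them get N//m + 1)."""
    q, r = divmod(N, m)
    A = r * 2 ** (q + 1) + (m - r) * 2 ** q   # A_m = sum of 2^a_i
    return (A - m) * e + (m - 1) * c          # sum of (2^a_i - 1) = A_m - m

def best_levels(N, e, c):
    """Walk down the curve of f until it stops decreasing."""
    m = 1
    while m < N and f(N, m + 1, e, c) < f(N, m, e, c):
        m += 1
    return m, f(N, m, e, c)

# Example 5: N = 100, c = 10 -> Pi_33 is optimal with CPL 559.
assert best_levels(100, 1, 10) == (33, 559)
# Example 6: N = 50, c = 10 -> Pi_17 is optimal with CPL 275.
assert best_levels(50, 1, 10) == (17, 275)
```

Unlike exhaustive enumeration, this evaluates at most N closed-form points, so it remains cheap even for N = 100.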
6.2.3 Using Heuristic 1

Example 1

Assume that our program graph is a complete binary tree with 3 levels. Assume that the execution time of the actors is 1 and the communication cost of edges is 2. Figure x31 shows the merging steps using Heuristic 1. The number next to each node is the number of actors in the task corresponding to the node. The edge that has an "x" mark next to it is the edge chosen for merger. From the figure, we see that the task graph that has a CPL of 6 corresponds to the best partition using Heuristic 1.

Example 2

Assume that our program graph is a complete binary tree with 5 levels. Assume that the execution time of the actors is 1 and the communication cost of edges is 5. Figure x32 shows the merging steps using Heuristic 1. From the figure, we see that the task graph that has a CPL of 19 corresponds to the best partition using Heuristic 1.

Example 3

Assume that our program graph is a complete binary tree with 6 levels. Assume that the execution time of the actors is 1 and the communication cost of edges is 10. Figure x33 shows the merging steps using Heuristic 1. From the figure, we see that the task graph that has a CPL of 37 corresponds to the best partition using Heuristic 1.

Example 4

Assume that our program graph is a complete binary tree with 8 levels. Assume that the execution time of the actors is 1 and the communication cost of edges is 15. Figure x34 shows the merging steps using Heuristic 1. From the figure, we see that the task graph that has a CPL of 79 corresponds to the best partition using Heuristic 1.
In the last task graph shown in the figure there are two "x" marks, to indicate that both execution paths to which the marks belong are critical paths, and therefore either of the marked edges could be chosen for merger (clearly these are not the only edges that could be chosen).

Best Partition Using Heuristic 1

Assume that the execution time of actors is e and the communication cost of edges is c. We find the smallest integer x such that x ≥ 1 and (2^x − 1) * e ≥ c; that is, the smallest integer x such that x ≥ 1 and 2^x ≥ c/e + 1. The best partition using Heuristic 1 is the one for which the task graph is a complete binary tree with [N − (x − 1)] levels such that all top-most nodes have (2^x − 1) actors and all other nodes have 1 actor. Hence, the corresponding CPL is
CPL = [(2^x − 1) + ((N − (x − 1)) − 1)] * e + [(N − (x − 1)) − 1] * c = (2^x − 1 + N − x) * e + (N − x) * c.

Example 1

Assume that our program graph is a complete binary tree with 5 levels. Assume that the execution time of the actors is 1 and the communication cost of edges is 10. Find the smallest integer x such that x ≥ 1 and 2^x ≥ 11: x = 4. Hence, the best partition using Heuristic 1 is the one for which the task graph is a complete binary tree with 5 − 3 = 2 levels such that all top-most nodes have 2^4 − 1 = 15 actors and all other nodes have 1 actor. The corresponding CPL is CPL = (2^4 − 1 + 5 − 4) * 1 + (5 − 4) * 10 = 26.

Example 2

Assume that our program graph is a complete binary tree with 6 levels. Assume that the execution time of the actors is 1 and the communication cost of edges is 5. Find the smallest integer x such that x ≥ 1 and 2^x ≥ 6: x = 3. Hence, the best partition using Heuristic 1 is the one for which the task graph is a complete binary tree with 6 − 2 = 4 levels such that all top-most nodes have 2^3 − 1 = 7 actors and all other nodes have 1 actor.
The corresponding CPL is CPL = (2^3 − 1 + 6 − 3) * 1 + (6 − 3) * 5 = 25.

Example 3

Assume that our program graph is a complete binary tree with 4 levels. Assume that the execution time of the actors is 1 and the communication cost of edges is 10. Find the smallest integer x such that x ≥ 1 and 2^x ≥ 11: x = 4. Hence, the best partition using Heuristic 1 is the one for which the task graph is a complete binary tree with 4 − 3 = 1 level such that all top-most nodes have 2^4 − 1 = 15 actors and all other nodes have 1 actor; that is, the task graph constituted of 1 node that has all 15 actors. The corresponding CPL is CPL = (2^4 − 1 + 4 − 4) * 1 + (4 − 4) * 10 = 15.

Example 4

Assume that our program graph is a complete binary tree with 8 levels. Assume that the execution time of the actors is 1 and the communication cost of edges is 5. Find the smallest integer x such that x ≥ 1 and 2^x ≥ 6: x = 3. Hence, the best partition using Heuristic 1 is the one for which the task graph is a complete binary tree with 8 − 2 = 6 levels such that all top-most nodes have 2^3 − 1 = 7 actors and all other nodes have 1 actor. The corresponding CPL is CPL = (2^3 − 1 + 8 − 3) * 1 + (8 − 3) * 5 = 37.

Example 5

Assume that our program graph is a complete binary tree with 100 levels. Assume that the execution time of the actors is 1 and the communication cost of edges is 10. Find the smallest integer x such that x ≥ 1 and 2^x ≥ 11: x = 4. Hence, the best partition using Heuristic 1 is the one for which the task graph is a complete binary tree with 100 − 3 = 97 levels such that all top-most nodes have 2^4 − 1 = 15 actors and all other nodes have 1 actor. The corresponding CPL is CPL = (2^4 − 1 + 100 − 4) * 1 + (100 − 4) * 10 = 1071.

Example 6

Assume that our program graph is a complete binary tree with 50 levels.
Assume that the execution time of the actors is 1 and the communication cost of edges is 10. Find the smallest integer x such that x ≥ 1 and 2^x ≥ 11: x = 4. Hence, the best partition using Heuristic 1 is the one for which the task graph is a complete binary tree with 50 − 3 = 47 levels such that all top-most nodes have 2^4 − 1 = 15 actors and all other nodes have 1 actor. The corresponding CPL is
Optimal: II17. R ep ro d u ced with p erm issio n o f th e copyrigh t ow n er. Further reproduction prohibited w ithout p erm ission . N 4 5 6 8 1 0 2 0 3 0 4 0 5 0 1 0 0 1 5 0 2 0 0 lo p t 1 5 2 0 2 4 3 7 4 9 1 0 5 1 6 0 2 1 9 2 7 5 5 5 9 8 4 0 1 1 2 5 I 1 5 2 6 3 7 5 9 8 1 1 9 1 3 0 1 4 1 1 5 2 1 1 0 7 1 1 6 2 1 2 1 7 1 p 1 0 0 7 6 .9 6 4 .9 6 2 . 7 6 0 .5 5 5 .0 5 3 .2 5 3 .3 5 2 .8 5 2 . 2 5 1 .8 5 1 .8 Table 6 .1: Performance of Heuristic 1 for Complete Binary Trees / op£ = 2 7 5 . / = 5 2 1 . p = 5 2 .8 • N = 1 0 0 . Optimal: n ^ . / opt = 5 5 9 . / = 1 0 7 1 . p = 5 2 .2 • N = 1 5 0 . Optimal: II50. /opt = 8 4 0 . / = 1621. p = 5 1 .8 1 9 8 6 « 5 1 .8 2 • N = 200. Optimal: 1167- /opt = 1 1 2 5 . / = 2 1 7 1 . p = 51.81943 « 51.82 Table 6.1 summarizes the above results. Figure 6.19 shows the plot of p as a function of N. We see from the curve that as N exceeds 25, p staxts to decrease very slowly. When N reaches 50, the decrease in p becomes almost negligible. We can safely conclude that the performance of Heuristic 1 is above 50% of the performance of the optimal partitioning algorithm, for any value of N. 190 R ep ro d u ced with p erm issio n o f th e copyright ow n er. Further reproduction prohibited w ithout p erm issio n . Percentage p Percentage p versus N 100 80 100 120 140 160 Number of Levels N 180 200 Figure 6.19: Performance of Heuristic 1 R ep ro d u ced with p erm issio n o f th e copyright ow n er. Further reproduction prohibited w ith out p erm issio n . 6.2.5 U sing Sarkar’s Partitioning M ethod Since all edges in the graph have the same weight, Sarkar’s method chooses edges to be merged randomly. Clearly, this could result in very poor performance. 6.3 M erger that Results into Higher PARTIME In this section, we show why we have to keep doing the merger until the coarsest partition is obtained. We will see that we accept the merger of the edge chosen, even if it results into a higher CPL (PARTIME). 
As was seen earlier, the parallel execution time of the program is PARTIME = T_c + T_o, where T_c := the computation time component, and T_o := the overhead component (communication overhead only, no scheduling overhead).

There is a trade-off between the computation component and the overhead component. The more parallelism we exploit, the smaller T_c and the larger T_o will be, and vice versa. In general, merging tasks results in an increase in T_c (loss of parallelism and more sequentialization) and a decrease in T_o (reduction in communication overhead).

The CPL of the task graph is

CPL = Σ_{n ∈ Pcrit} comp(n) + Σ_{e ∈ Pcrit} comm(e).

Clearly, T_c = Σ_{n ∈ Pcrit} comp(n) and T_o = Σ_{e ∈ Pcrit} comm(e).

The partitioning algorithm consists of a loop in which each iteration consists of a merging step. Assume that there are N iterations in the algorithm. Let Π_i be the partition obtained after iteration i is done⁷. Π_N is the coarsest partition (i.e., the singleton partition). Let l(i) be the CPL of the task graph at the end of iteration i, 1 ≤ i ≤ N, of our partitioning algorithm (using Heuristic 1). l(0) is the CPL of the initial task graph. l(i) = T_c(i) + T_o(i), where T_c(i) and T_o(i) are the computation component and the overhead component at the end of iteration i, respectively.

⁷Iteration 0 is not defined. Π_0 is defined to be the initial partition, before any iteration is performed.

[Figure 6.20: CPL as a function of the iteration number.]

Intuitively, it is easy to see that during the execution of the algorithm, T_c starts to increase and T_o starts to decrease. The net result (i.e., the sum of the two components) is shown in Figure 6.20. From the curve shown in Figure 6.20, we can see that PARTIME can increase momentarily, then decrease.
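The decomposition CPL = T_c + T_o can be made concrete with a small longest-path routine. The following sketch is our own illustration (the comp/comm dictionaries are an assumed representation, not a data structure from this thesis): it computes the CPL of a DAG as the heaviest node-plus-edge path and splits it into its computation and communication components:

```python
from functools import lru_cache

def critical_path(comp, comm):
    """Return (CPL, Tc, To) for a DAG.

    comp: dict node -> execution cost; comm: dict (u, v) -> edge cost.
    CPL is the longest path length counting both node and edge weights;
    Tc and To are its computation and communication parts.
    """
    succ = {}
    for (u, v) in comm:
        succ.setdefault(u, []).append(v)

    @lru_cache(maxsize=None)
    def best(n):
        # Longest path starting at n: (length, tuple of nodes on it).
        options = [(comm[(n, s)] + best(s)[0], (n,) + best(s)[1])
                   for s in succ.get(n, [])]
        length, path = max(options, default=(0, (n,)))
        return comp[n] + length, path

    cpl, path = max(best(n) for n in comp)
    tc = sum(comp[n] for n in path)  # computation along the critical path
    return cpl, tc, cpl - tc

# A 2-way fork DAG: root r feeds a and b over edges of cost 10.
comp = {"r": 1, "a": 2, "b": 3}
comm = {("r", "a"): 10, ("r", "b"): 10}
print(critical_path(comp, comm))  # (14, 4, 10): CPL 14 = Tc 4 + To 10
```

Merging tasks shrinks the comm terms to zero on the contracted edges while growing the comp term of the merged node, which is exactly the T_c versus T_o trade-off described above.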
Therefore, if during an iteration of the algorithm the merger causes an increase in PARTIME, we should still accept this merger and continue with the next merging iterations. Otherwise, we can get caught at a local minimum. It is for this reason that we keep merging tasks until the coarsest partition (the singleton partition) is reached.

Examples

We consider the case of the fork DAG shown in figure 6.1.

T_c(i) = r + n_1 + n_2 + ... + n_i + n_{i+1}, for 0 ≤ i ≤ m - 1;  T_c(m) = r + n_1 + n_2 + ... + n_m.
T_o(i) = c_{i+1}, for 0 ≤ i ≤ m - 1;  T_o(m) = 0.

1. Consider the fork DAG shown in figure 6.9 (a). Here N = 6. The values of T_c, T_o and l(i) for each iteration of the algorithm are shown in table 6.2. The plots for PARTIME, T_c and T_o are shown in figure 6.21.

2. Consider the fork DAG shown in figure 6.9 (b). Here N = 6. The values of T_c, T_o and l(i) for each iteration of the algorithm are shown in table 6.3. The plots for PARTIME, T_c and T_o are shown in figure 6.22.

[Figures 6.21 through 6.25: plots of PARTIME, T_c and T_o versus the iteration number i for Examples 1 through 5.]
[Figure 6.26: PARTIME Plot for Example 6.]
[Figure 6.27: PARTIME Plot for Example 7.]

i | T_c | T_o |  l
0 |  13 |  20 | 33
1 |  15 |  15 | 30
2 |  20 |   9 | 29
3 |  26 |   8 | 34
4 |  30 |   7 | 37
5 |  33 |   5 | 38
6 |  33 |   0 | 33
Table 6.2: Table for Example 1

i | T_c | T_o |  l
0 |  25 |  45 | 70
1 |  40 |  25 | 65
2 |  45 |  30 | 75
3 |  60 |  20 | 80
4 |  65 |  25 | 90
5 |  75 |  15 | 90
6 |  75 |   0 | 75
Table 6.3: Table for Example 2

i | T_c | T_o |  l
0 |  25 |  45 | 70
1 |  40 |  25 | 65
2 |  45 |  30 | 75
3 |  50 |  20 | 70
4 |  51 |  20 | 71
5 |  52 |  15 | 67
6 |  52 |   0 | 52
Table 6.4: Table for Example 3

i | T_c | T_o |  l
0 |  25 |  45 | 70
1 |  40 |  25 | 65
2 |  45 |  30 | 75
3 |  50 |  20 | 70
4 |  52 |  20 | 72
5 |  73 |   1 | 74
6 |  73 |   0 | 73
Table 6.5: Table for Example 4

3. Consider the fork DAG shown in figure 6.9 (c). Here N = 6. The values of T_c, T_o and l(i) for each iteration of the algorithm are shown in table 6.4. The plots for PARTIME, T_c and T_o are shown in figure 6.23.

4. Consider the fork DAG shown in figure 6.9 (d). Here N = 6. The values of T_c, T_o and l(i) for each iteration of the algorithm are shown in table 6.5. The plots for PARTIME, T_c and T_o are shown in figure 6.24.

5. Consider the fork DAG shown in figure 6.9 (e). Here N = 6. The values of T_c, T_o and l(i) for each iteration of the algorithm are shown in table 6.6. The plots for PARTIME, T_c and T_o are shown in figure 6.25.

6. Consider the fork DAG shown in figure 6.9 (f). Here N = 6. The values of T_c, T_o and l(i) for each iteration of the algorithm are shown in table 6.7.
i | T_c | T_o |  l
0 |  35 |  34 | 69
1 |  50 |  35 | 85
2 |  60 |  20 | 80
3 |  63 |   5 | 68
4 |  64 |   6 | 70
5 |  70 |   1 | 71
6 |  70 |   0 | 70
Table 6.6: Table for Example 5

i | T_c | T_o |  l
0 |  25 |  45 | 70
1 |  40 |  25 | 65
2 |  50 |  20 | 70
3 |  55 |  10 | 65
4 |  56 |   5 | 61
5 |  61 |   1 | 62
6 |  61 |   0 | 61
Table 6.7: Table for Example 6

i | T_c | T_o |  l
0 |  25 |  45 | 70
1 |  40 |  25 | 65
2 |  45 |  15 | 60
3 |  50 |   9 | 59
4 |  53 |  11 | 64
5 |  61 |   5 | 66
6 |  61 |   0 | 61
Table 6.8: Table for Example 7

The plots for PARTIME, T_c and T_o are shown in figure 6.26.

7. Consider the fork DAG shown in figure 6.9 (g). Here N = 6. The values of T_c, T_o and l(i) for each iteration of the algorithm are shown in table 6.8. The plots for PARTIME, T_c and T_o are shown in figure 6.27.

It is easy to see that for the join DAGs in figure 6.9 (c), (e) and (f), if we do not accept the mergers that result in a higher PARTIME, we can never reach the optimal partition. In other words, the only way to reach the optimal partition is to keep merging the chosen tasks until the coarsest partition is reached.

Chapter 7

Conclusions and Future Research

7.1 Conclusions

In this thesis, we presented heuristics for automatic program code partitioning and grain-size determination for DMMs. Like most partitioning methods, our approach is compile-time. Given a weighted non-hierarchical (flat) Directed Acyclic Graph (DAG) representation of the program, we proposed a data-flow based partitioning method where all levels of parallelism available in the DAG are exploited. Our procedure automatically determines the granularity of parallelism by partitioning the graph into tasks to be scheduled on the DMM. The granularity of parallelism depends only on the program to be executed and on the target machine parameters. The output of our algorithm is passed on as input to the scheduling phase.
We use the definition of code partitioning given by Sarkar [48]. Mainly, we assume the availability of an infinite number of processing elements and that the communication overhead between the processing elements is minimal but not zero; then we find the optimal partition (i.e., the one that results in minimal parallel execution time). In this case, the parallel execution time of the input program graph is the same as the CPL of the corresponding task graph. Our scheme is based on minimizing the CPL of the task graph by performing a sequence of task mergers. Since we assume that our execution model obeys the convexity constraint, our method guarantees that the task graph remains acyclic at all times, in order to avoid deadlock situations. The algorithm consists of a loop, and during each iteration of the loop a pair of tasks is chosen to be merged¹ using some heuristic. Hence, the main work of the algorithm is to decide which tasks are to be chosen for merger during each merging iteration, with the goal being the minimization of the CPL of the task graph.

In order to come up with the criteria for task merging to be used by the heuristics, we did some analysis of the task graph to better understand the effect of merging tasks on the CPL and on the available parallelism in the task graph. We also studied the effect of parallelism loss due to task merging on the CPL of the task graph. We determined a necessary and sufficient condition for parallelism loss as a result of task merging. Then we presented some rules to determine the mergers that result in the maximum decrease in the CPL, the mergers that result in no increase in the CPL, etc. We showed that the pair of tasks to be merged has to belong to a critical path of the task graph; otherwise, the CPL can never decrease as a result of the merger.
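The acyclicity guarantee mentioned above reduces to a simple test before each merger: contracting tasks u and v is safe only if the sole u-to-v path is the direct edge itself, since any other path would become a cycle through the merged task. A minimal sketch of such a check, under an assumed adjacency-list representation of our own (not the thesis's data structure), is:

```python
def merge_keeps_acyclic(succ, u, v):
    """True iff contracting edge (u, v) leaves the task graph acyclic.

    succ: dict node -> list of successor nodes. The merger is unsafe when
    v is reachable from u by any path other than the direct edge (u, v):
    after contraction that path would form a cycle through the merged task.
    """
    stack, seen = [u], {u}
    while stack:
        n = stack.pop()
        for s in succ.get(n, []):
            if n == u and s == v:
                continue  # skip the edge being contracted
            if s == v:
                return False  # an indirect path u -> ... -> v exists
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return True

# u -> w -> v alongside the direct edge u -> v: contraction would cycle.
print(merge_keeps_acyclic({"u": ["w", "v"], "w": ["v"]}, "u", "v"))  # False
# A plain chain u -> v -> w: merging u and v is safe.
print(merge_keeps_acyclic({"u": ["v"], "v": ["w"]}, "u", "v"))  # True
```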
Finally, we showed that if there is no parallelism loss as a result of the merger, then the CPL of the task graph is guaranteed not to increase. However, if there is parallelism loss, then the CPL could increase as a result of the merger.

Finding an optimal solution to the code partitioning problem is NP-complete. Due to the high cost of graph algorithms², it is nearly impossible to come up with close-to-optimal solutions that do not have very high cost (higher-order polynomial). For instance, some of the criteria that can be used in choosing the pair of tasks to be merged, and that result in good performance in minimizing the CPL, have a very high time complexity. Therefore, we had to use criteria that result in lower performance and that have lower time complexities. In other words, we had to trade performance for lower time complexity. Hence, we proposed heuristics that give reasonably good performance and that have relatively low cost. Our algorithm has a worst-case time complexity of O(E.N³). However, as was explained earlier in this thesis, for real applications the average time complexity is expected to be O(N(E + N)). For fork and join DAGs, our algorithm gives optimal performance. For complete binary trees, the performance is more than 50% of the optimal one.

¹These are called merging iterations.
²For our analysis we had to use various DAG traversal algorithms.

7.2 Future Research

We need to do more experiments using other kinds of regular DAGs and real-life benchmarks to further test our heuristics and improve them. Also, an interesting future research direction would be to consider complete binary trees where not all actors have the same weights and not all edges have the same weights.

In addition, if we can find ways to reduce the time complexity of our partitioning algorithm, our proposed procedure will be even more useful.
Following are some methods that could be used to achieve this goal and optimize our proposed algorithm:

- Use incremental methods to determine the CPL, ParSet(n), DepSet(n), perfect edges and safe edges, instead of recomputing them for each iteration of the algorithm. For instance, we could try to express the CPL at step n as a function of the CPL at step n - 1 and the way the merger is done at step n. Tao Yang [18, 58, 59] uses an incremental way to determine the CPL of a task graph. He defines the tlevel and blevel of a node n in a DAG to be the length of the longest path from an entry node to n, excluding the weight of n, and the length of the longest path from n to an exit node, respectively. Then he defines the priority PRIO(n) of the node n to be tlevel(n) + blevel(n). It is easy to see that the node with the highest priority belongs to the critical path. Using these definitions, Tao Yang was able to make the choice of the edge to be zeroed (i.e., merged) without having to compute the critical path of the task graph, and in an incremental manner. It would be interesting to see if we can use a method similar to his to determine the edge to be merged in our partitioning algorithm.

- Find optimal solutions for restricted classes of DAGs. We expect that the time complexity of our algorithm becomes much lower when we restrict ourselves to special classes of DAGs. For instance, for DAGs in which the nodes have at most one output edge, there are some properties that could enable us to find the edge to be merged in a much cheaper way. Also, we can study DAGs in which each node has at most one input edge (trees).

- In our partitioning algorithm, we keep merging tasks until we reach the coarsest partition (i.e., the partition consisting of a single task). We accept the merger even if it results in an increase in PARTIME.
If we can find a way to stop the merging process much earlier without sacrificing performance, then we will have a reduction in the time complexity of the algorithm without losing any performance.

Furthermore, an interesting question to answer would be: if heuristic H1 gives larger improvements between successive merging iterations than heuristic H2, does that mean that H1 performs better than H2?

Finally, an interesting future research direction would be to investigate merging more than 2 nodes in the task graph (i.e., tasks) at a time.

Bibliography

[1] W. B. Ackerman. Efficient implementation of applicative languages. PhD Dissertation MIT/LCS/TR-323, Massachusetts Institute of Technology, Cambridge, Massachusetts, Mar 1984.

[2] Stephen J. Allan and R. R. Oldehoeft. HEP SISAL: parallel functional programming. In J. Kowalik, editor, Parallel MIMD Computation: The HEP Supercomputer and its Applications, pages 123-150. MIT Press, Cambridge, MA, 1985.

[3] Jean-Loup Baer. Computer Systems Architecture. Digital System Design. Computer Science Press, Inc., 11 Taft Court, Rockville, Maryland 20850, 1980.

[4] Utpal Banerjee, Rudolf Eigenmann, Alexandru Nicolau, and David A. Padua. Automatic program parallelization. In Proceedings of the IEEE, pages 211-243. IEEE, Feb 1993. Vol. 81, No. 2.

[5] Jeffery M. Barth. A practical interprocedural data flow analysis algorithm. Communications of the ACM, 21(9):724-736, Sep 1978.

[6] D. C. Cann. Vectorization of an applicative language: Current results and future directions. In Compcon 91, pages 396-402, Feb 1991. Also available as Technical Report UCRL-JC-105654, Lawrence Livermore National Laboratory, November 1990.

[7] David Cann. Retire Fortran? A debate rekindled. Technical Report UCRL-107018, Lawrence Livermore National Laboratory, Livermore, CA 94550, Jul 1991. Rev. 1.

[8] David Cann and R. R.
Oldehoeft. A guide to the optimizing SISAL compiler. Preprint of a paper intended for publication, UCRL-MA-108369, Lawrence Livermore National Laboratory, Livermore, CA 94550, Sep 1991.

[9] David C. Cann. Compilation techniques for high performance applicative computation. PhD Thesis CS-89-108, Colorado State University, May 1989.

[10] David C. Cann, Ching-Cheng Lee, R. R. Oldehoeft, and S. K. Skedzielewski. SISAL multiprocessing support. Technical report, Lawrence Livermore National Laboratory, Livermore, CA 94550, 1987.

[11] Keith D. Cooper, Mary W. Hall, Robert T. Hood, Ken Kennedy, Kathryn S. McKinley, John M. Mellor-Crummey, Linda Torczon, and Scott K. Warren. The ParaScope parallel programming environment. In Proceedings of the IEEE, pages 244-263. IEEE, Feb 1993. Vol. 81, No. 2.

[12] P. Crooks and R. H. Perrott. Language Constructs for Data Partitioning and Distribution. The Queen's University of Belfast, Belfast BT7 1NN, Northern Ireland. email: p.crooks@v2.qub.ac.uk, r.perrott@v2.qub.ac.uk.

[13] David E. Culler, Seth Copen Goldstein, Klaus Erik Schauser, and Thorsten von Eicken. TAM: a compiler controlled threaded abstract machine. Journal of Parallel and Distributed Computing, July 1993.

[14] David E. Culler, Anurag Sah, Klaus Erik Schauser, Thorsten von Eicken, and John Wawrzynek. Fine-grain parallelism with minimal hardware support: A compiler-controlled threaded abstract machine. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 164-175, SIGPLAN Notices, ACM, 11 W. 42nd St., New York, NY 10036, April 1991. ACM.

[15] Eliezer Dekel and Sartaj Sahni. Binary trees and parallel scheduling algorithms. IEEE Transactions on Computers, 32(3):307-315, March 1983.

[16] John T. Feo, David C. Cann, and Rodney R. Oldehoeft. A report on the SISAL language project.
Technical Report UCRL-102440, Lawrence Livermore National Laboratory, Livermore, CA 94550, Jul 1990. Rev. 1.

[17] J.-L. Gaudiot and L. Lee. Occamflow: A methodology for programming multiprocessor systems. Journal of Parallel and Distributed Computing, August 1989.

[18] Apostolos Gerasoulis and Tao Yang. On the granularity and clustering of directed acyclic task graphs. IEEE Transactions on Parallel and Distributed Systems, 4(6):686-701, June 1993.

[19] D. H. Grit. A distributed memory implementation of SISAL. In Proceedings of the Fifth Distributed Memory Computing Conference, Apr 1990. Extended Abstract.

[20] Dale H. Grit. A distributed memory implementation of SISAL. Technical report, Lawrence Livermore National Laboratory, Livermore, CA 94550, Oct 1993.

[21] Matt Haines and Wim Bohm. Towards a distributed memory implementation of SISAL. Technical Report CS-91-123, Computer Science Department, Colorado State University, Fort Collins, CO 80523, Nov 1991.

[22] Matthew Haines and Wim Bohm. A comparison of explicit and implicit programming styles for distributed memory multiprocessors. Technical Report CS-93-104, Colorado State University, March 1993.

[23] Matthew Dennis Haines. Distributed runtime support for task and data management. PhD Dissertation CS-93-110, Colorado State University, August 1993.

[24] John P. Hayes. Computer Architecture and Organization. Computer Organization and Architecture. McGraw-Hill, Inc., second edition, 1988.

[25] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Mateo, CA, 1990.

[26] S. Hiranandani, K. Kennedy, and C. W. Tseng. Compiling Fortran D for MIMD distributed-memory machines. CACM, 35(8):66-80, Aug 1992.

[27] Susan F. Hummel, Edith Schonberg, and Lawrence E. Flynn.
Factoring, a method for scheduling parallel loops. CACM, 35(8):90-101, Aug 1992.

[28] Kai Hwang and Faye A. Briggs. Computer Architecture and Parallel Processing. Computer Organization and Architecture. McGraw-Hill, Inc., 1984.

[29] C. Lee. Experience of implementing applicative parallelism on Cray X-MP. In Proc. CONPAR '88, pages 19-25, Manchester, England, Sep 1988. British Computer Society. Technical Report UCRL-98303, Lawrence Livermore National Laboratory, May 1988.

[30] C. Lee, S. K. Skedzielewski, and J. T. Feo. On the implementation of applicative languages on shared-memory, MIMD multiprocessors. In Proceedings of the Symposium on Parallel Programming: Experience with Applications, Languages and Systems, pages 188-197, New Haven, CT, Sep 1988. Available in SIGPLAN Notices 23(9).

[31] Liang-Teh Lee. Occamflow: Programming a Multiprocessor System in a High-Level Data-Flow Language. PhD thesis, University of Southern California, August 1989.

[32] E. S. Lowry and C. W. Medlock. Object code optimization. Communications of the ACM, 12(1):13-22, Jan 1969.

[33] James McGraw, Stephen Skedzielewski, Stephen Allan, Rod Oldehoeft, John Glauert, Chris Kirkham, Bill Noyce, and Robert Thomas. SISAL: Streams and Iteration in a Single Assignment Language: Language Reference Manual, Version 1.2, Manual M-146, Rev. 1. Lawrence Livermore National Laboratory, Livermore, CA 94550, Mar 1985.

[34] Dan I. Moldovan. Modern Parallel Processing. Department of Electrical Engineering Systems, University of Southern California, Los Angeles, CA 90089-0871, 1986.

[35] R. R. Oldehoeft and S. J. Allan. Execution support for HEP SISAL. In J. Kowalik, editor, Parallel MIMD Computation: The HEP Supercomputer and Its Applications, pages 151-180. MIT Press, Cambridge, MA, 1985.

[36] R. R. Oldehoeft and D. C. Cann.
Applicative parallelism on a shared-memory multiprocessor. IEEE Software, pages 62-70, Jan 1988.

[37] R. R. Oldehoeft, D. C. Cann, and S. J. Allan. SISAL: Initial MIMD performance results. In Proceedings of the 1986 Conference on Algorithms and Hardware for Parallel Processing, pages 120-127, Aachen, Federal Republic of Germany, Sep 1986.

[38] D. Padua, D. Kuck, and D. Lawrie. High-speed multiprocessors and compilation techniques. IEEE Transactions on Computers, C-29(9):763-776, Sep 1980.

[39] D. Padua and M. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184-1201, Dec 1986.

[40] Cherri M. Pancake. Multithreaded languages for scientific and technical computing. In Proceedings of the IEEE, pages 288-304. IEEE, Feb 1993. Vol. 81, No. 2.

[41] Santosh S. Pande, Dharma P. Agrawal, and Jon Mauney. Mapping functional parallelism on distributed memory machines. In John T. Feo, Christopher Frerking, and Patrick J. Miller, editors, Proceedings of the Second Sisal Users Conference, pages 139-159, Livermore, CA 94550, Dec 1992. Lawrence Livermore National Laboratory.

[42] Constantine D. Polychronopoulos and David J. Kuck. Guided self-scheduling: a practical scheduling scheme for parallel supercomputers. IEEE Transactions on Computers, C-36(12):1425-1439, Dec 1987.

[43] William Pugh. A practical algorithm for exact array dependence analysis. CACM, 35(8):102-114, Aug 1992.

[44] John E. Ranelletti. Graph Transformation Algorithms for Array Memory Optimization in Applicative Languages. PhD thesis, University of California, Davis, Livermore, CA 94550, Nov 1987. Report number UCRL-53832, Lawrence Livermore National Laboratory.

[45] Anurag Sah. Parallel language support on shared memory multiprocessors.
Technical Report UCB/CSD 91/631, University of California, Berkeley, Berkeley, CA 94720, May 1991.

[46] V. Sarkar and J. Hennessy. Compile-time partitioning and scheduling of parallel programs. In Proceedings of the SIGPLAN 1986 Symposium on Compiler Construction, pages 17-26, Palo Alto, CA, Jun 1986. ACM.

[47] V. Sarkar and J. Hennessy. Partitioning parallel programs for macro-dataflow. In Proceedings of the 1986 ACM Conference on Lisp and Functional Programming, pages 202-211, Aug 1986.

[48] Vivek Sarkar. Partitioning and scheduling parallel programs for execution on multiprocessors. PhD Thesis CSL-TR-87-328, Stanford University, Stanford, CA 94305-2192, Apr 1987.

[49] Vivek Sarkar, Stephen Skedzielewski, and Patrick Miller. An automatically partitioning compiler for SISAL. Technical Report UCRL-98289, Lawrence Livermore National Laboratory, Livermore, CA 94550, Dec 1988.

[50] S. K. Skedzielewski and M. L. Welcome. Data flow graph optimization in IF1. In Jean-Pierre Jouannaud, editor, Functional Programming Languages and Computer Architecture, pages 17-34. Springer-Verlag, New York, NY, Sep 1985. Also in: Lecture Notes in Computer Science Number 201 (Functional Programming Languages and Computer Architecture), Springer-Verlag, Berlin, pp. 17-34, 1985. Also as a Technical Report, Lawrence Livermore National Laboratory, number UCRL-92122, Rev. 1, December 2, 1987.

[51] Stephen Skedzielewski and John Glauert. IF1: An Intermediate Form for Applicative Languages, Version 1.0, Manual M-170. Lawrence Livermore National Laboratory, Livermore, CA 94550, Jul 1985.

[52] Stephen Skedzielewski and Robert Kim Yates. FIBRE: An external format for SISAL and IF1 data objects, Version 1.1, Revision 1. Technical Report M-154, Lawrence Livermore National Laboratory, Livermore, CA 94550, Apr 1988.

[53] Stephen K. Skedzielewski and Rea J. Simpson. A simple method to remove reference counting in applicative programs.
In Proceedings of the ACM SIGPLAN '89 Conference on Programming Language Design and Implementation, Portland, OR, Jun 1989. ACM. Also available as Technical Report UCRL-100156, Lawrence Livermore National Laboratory, November 23, 1988.

[54] Michael Welcome, Stephen Skedzielewski, Robert Kim Yates, and John Ranelletti. IF2: An Applicative Language Intermediate Form with Explicit Memory Management, Manual M-195. Lawrence Livermore National Laboratory, Livermore, CA 94550, Dec 1986.

[55] Paul G. Whiting. Compilation of a functional programming language for the CSIRAC II dataflow computer. Master's Thesis TR DB-91-11, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Division of Information Technology, 723 Swanston Street, Carlton, 3053, Australia, Oct 1991. Version 1.0.

[56] R. Wolski, J. Feo, and D. C. Cann. A prototype functional language implementation for hierarchical-memory architectures. Technical Report UCRL-JC-107437, Lawrence Livermore National Laboratory, Livermore, CA 94550, Jun 1991.

[57] Richard Wolski and John Feo. Program partitioning for NUMA multiprocessor computer systems. In John T. Feo, Christopher Frerking, and Patrick J. Miller, editors, Proceedings of the Second Sisal Users Conference, pages 111-137, Livermore, CA 94550, Dec 1992. Lawrence Livermore National Laboratory.

[58] Tao Yang. Scheduling and Code Generation for Parallel Architectures. PhD thesis, Rutgers University, New Brunswick, New Jersey, May 1993.

[59] Tao Yang and Apostolos Gerasoulis. DSC: Scheduling parallel tasks on an unbounded number of processors. IEEE Transactions on Parallel and Distributed Systems, 5(9):951-967, September 1994.

[60] Hans P. Zima and Barbara Mary Chapman. Compiling for distributed-memory systems. In Proceedings of the IEEE, pages 264-287. IEEE, Feb 1993. Vol. 81, No. 2.
Asset Metadata
Creator Ayed, Moez (author) 
Core Title Automatic code partitioning for distributed-memory multiprocessors (DMMs) 
Contributor Digitized by ProQuest (provenance) 
Degree Doctor of Philosophy 
Degree Program Computer Engineering 
Publisher University of Southern California (original), University of Southern California. Libraries (digital) 
Tag Computer Science, engineering, electronics and electrical, oai:digitallibrary.usc.edu:usctheses, OAI-PMH Harvest
Language English
Advisor Gaudiot, Jean-Luc (committee chair), Gupta, Sandeep (committee member), Horowitz, Ellis (committee member) 
Permanent Link (DOI) https://doi.org/10.25549/usctheses-c17-537703 
Unique identifier UC11350079 
Identifier 9720180.pdf (filename),usctheses-c17-537703 (legacy record id) 
Legacy Identifier 9720180.pdf 
Dmrecord 537703 
Document Type Dissertation 
Rights Ayed, Moez 
Type texts
Source University of Southern California (contributing entity), University of Southern California Dissertations and Theses (collection) 
Access Conditions The author retains rights to his/her dissertation, thesis or other graduate work according to U.S. copyright law. Electronic access is being provided by the USC Libraries in agreement with the au... 
Repository Name University of Southern California Digital Library
Repository Location USC Digital Library, University of Southern California, University Park Campus, Los Angeles, California 90089, USA