INFORMATION TO USERS

This manuscript has been reproduced from the microfilm master. UMI films the text directly from the original or copy submitted. Thus, some thesis and dissertation copies are in typewriter face, while others may be from any type of computer printer. The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleedthrough, substandard margins, and improper alignment can adversely affect reproduction. In the unlikely event that the author did not send UMI a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion. Oversize materials (e.g., maps, drawings, charts) are reproduced by sectioning the original, beginning at the upper left-hand corner and continuing from left to right in equal sections with small overlaps.

ProQuest Information and Learning
300 North Zeeb Road, Ann Arbor, MI 48106-1346 USA
800-521-0600

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.

COMBINING COMPILE-TIME AND RUN-TIME PARALLELIZATION

by

Sungdo Moon

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

December 2001

Copyright 2002 Sungdo Moon

UMI Number: 3073818

UMI Microform 3073818
Copyright 2003 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346

UNIVERSITY OF SOUTHERN CALIFORNIA
The Graduate School
University Park
LOS ANGELES, CALIFORNIA 90089-1695

This dissertation, written by Sungdo Moon, under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of

DOCTOR OF PHILOSOPHY

Date: May 10, 2002

DISSERTATION COMMITTEE

Acknowledgements

First of all, I would like to thank my advisor, Mary W. Hall. I was very fortunate to meet her while I was going nowhere. Without her support, enthusiasm, and motivation, this dissertation would not have been possible. I would also like to thank Rafael H. Saavedra, who guided me during an early part of my graduate life. I have worked with many talented people, including Jacqueline Chame, Pedro Diniz, Weihua Mao, Pablo Moisset, Daeyeon Park, Joonseok Park, Jaewook Shin, and Heidi Ziegler. Chapter 3 is joint work with Byungro So, who has had to put up with me in the same office from the day he joined the group. I also thank Junsup Lee, Joohaeng Lee, and their families, who made my family's life in the US more enjoyable and enlightening. My parents and parents-in-law have encouraged and supported me in pursuing my interests during many difficult circumstances. I deeply appreciate their understanding and belief in me. Special thanks to my late father-in-law, who would be proud of me in heaven. Lastly, I wish to express my deepest love to my lovely wife, Soyoung, and my daughter, Chaehyun, for their patience during my long academic life and for sharing great moments with me.
Contents

Acknowledgements ii
List Of Tables vi
List Of Figures vii
Abstract ix

1 Introduction 1
1.1 Problem Statement 4
1.1.1 Limitation of Traditional Data-Flow Analysis 7
1.1.2 Run-Time Parallelization 10
1.2 Our Approach 11
1.3 Contributions 13
1.4 Organization of Dissertation 14

2 Background on Parallelization 16
2.1 Parallelization Analysis Techniques 16
2.1.1 Data-Dependence Analysis 16
2.1.2 Array Privatization 18
2.1.3 Array Reduction 20
2.1.4 Symbolic Analysis 21
2.2 Array Data-Flow Analysis in SUIF 21
2.2.1 Data-Flow Values: Summaries 22
2.2.2 Key Operators 24
2.2.3 Array Data-Flow Analysis Algorithm 26
2.2.4 Dependence and Array Privatization Tests 26
2.3 Chapter Summary 29

3 The Effectiveness of Automatic Parallelization in SUIF 30
3.1 Overview 32
3.2 Instrumentation Analysis 34
3.2.1 Instrumentation Analysis for a Single Loop 34
3.2.2 Interprocedural Instrumentation Analysis 35
3.3 Instrumentation Transformation 36
3.3.1 Instrumenting a Single Loop - LPD Test 38
3.3.1.1 Run-Time Updates To Shadow Arrays 39
3.3.1.2 Reducing Space Requirements 41
3.3.2 Instrumenting Multiple Loops in a Nest - ELPD 42
3.3.3 Instrumentation Across Procedures 44
3.4 Experimental Results 44
3.4.1 Benchmark Programs 44
3.4.2 Experimental Methodology 45
3.4.3 Categorization of Loops 47
3.4.4 Dynamic Performance Measurements 49
3.4.5 Detailed Consideration of Results 52
3.4.5.1 Specfp95 Benchmarks 57
3.4.5.2 Nas Benchmarks 58
3.4.5.3 Perfect Benchmarks 59
3.4.5.4 Summary 61
3.4.6 Evaluating ELPD 62
3.5 Chapter Summary 66

4 Predicated Array Data-Flow Analysis 69
4.1 Overview 71
4.2 Capabilities of Array Data-Flow Analysis 74
4.2.1 Improving Compile-Time Analysis 74
4.2.1.1 MOFP Solution 74
4.2.1.2 Predicate Embedding 74
4.2.2 Deriving Low-Cost Run-Time Tests 78
4.2.2.1 Breaking Conditions on Data Dependences 78
4.2.2.2 VPS Solution 78
4.2.2.3 Predicate Extraction for Optimistic Assumptions 79
4.3 Predicated Array Data-Flow Analysis 80
4.3.1 Predicate Domain 80
4.3.2 Deriving Predicates 83
4.3.2.1 Control-Flow Predicates 83
4.3.2.2 Predicate Extraction 86
4.3.3 Extensions to Array Data-Flow Analysis Operations 88
4.3.3.1 Meet Function 89
4.3.3.2 Union Operations 89
4.3.3.3 Subtraction Operation 91
4.3.4 Predicated Dependence and Privatization Tests 92
4.4 A Refinement of Predicated Data-Flow Values 95
4.4.1 Predicate Embedding 97
4.4.2 Predicate Extraction 99
4.5 Implementation Issues 101
4.5.1 Managing Predicates 102
4.5.2 Scoping Rule 105
4.6 Code Generation 106
4.7 Chapter Summary 110

5 Experimental Results 111
5.1 Increased Number of Parallel Loops 111
5.2 Speedup Improvements 118
5.3 Case Studies of Missed Loops 122
5.4 Chapter Summary 127

6 Related Work 128
6.1 Parallelization Experiments 129
6.2 Run-Time Parallelization Techniques 131
6.3 Parallel Programming Tools 133
6.4 Analysis Exploiting Predicates 134
6.4.1 Guarded Array Regions 136
6.4.2 Constraint-Based Array Dependence Analysis 140
6.5 Chapter Summary 144

7 Conclusion 145
7.1 Summary of the Dissertation 145
7.2 Future Work 147
7.3 Final Remarks 148

Reference List 150

Appendix A
Detailed Results of Instrumented Loops 156

List Of Tables

3.1 Benchmark programs. 46
3.2 Requirements of remaining parallel loops in Specfp95, Nas, and Perfect. 54
3.3 Characteristics of non-parallel loops in Specfp95, Nas, and Perfect. 63
5.1 Additional loops in Specfp95 parallelized by predicated array data-flow analysis. 113
5.2 Additional loops in Perfect and arc3d parallelized by predicated array data-flow analysis. 114
A.1 Detailed result of instrumented loops. 151

List Of Figures

1.1 Array privatization example. 8
1.2 Traditional data-flow analysis. 9
2.1 Array privatization with initialization. 19
2.2 Array reduction example. 20
2.3 Previous array data-flow analysis algorithm. 27
2.4 Dependence and array privatization test in SUIF. 28
3.1 Sketch of run-time parallelization testing system. 33
3.2 Instrumentation analysis algorithm for a single loop. 35
3.3 Interprocedural instrumentation analysis algorithm. 37
3.4 Run-time updates to shadow arrays in a single loop. 40
3.5 Run-time updates to shadow arrays in a nested loop. 43
3.6 Static categorization of loops in parallelized regions, comparing compiler-parallelized loops to total parallelizable loops. 48
3.7 Coverage and granularity comparison. 53
4.1 Improving compile-time analysis with predicated data-flow analysis. 75
4.2 Using predicated array data-flow analysis in run-time parallelization. 76
4.3 Parallelized versions of examples from Figure 4.2. 77
4.4 Predicate representation example. 82
4.5 Algorithm for deriving control-flow predicates. 84
4.6 Predicate propagation across procedure boundary. 86
4.7 Reshape examples benefiting from predicate extraction. 87
4.8 PredMerge operation examples. 89
4.9 Dependence and privatization test on predicated data-flow values. 92
4.10 Predicate tables. 103
4.11 Example parallel code skeleton generated by SUIF. 107
4.12 Example parallel code for conditionally parallel loop. 109
5.1 Coverage and granularity improvements with predicated array data-flow analysis. 117
5.2 Speedups due to predicated array data-flow analysis. 119
5.3 Simplified setall-4048 loop in apsi. 124
5.4 Simplified main-126 loop in fftpde. 126
6.1 An example showing the difference of merge operations in guarded array regions and predicated array data-flow analysis. 139
6.2 Examples showing the difference between constraint-based conditional data-dependence analysis and predicated array data-flow analysis. 142

Abstract

Multiprocessor systems have become an important computing platform to meet the increasing demands for high performance, but the difficulty of writing parallel programs has been a major obstacle to their widespread use. Parallelizing compiler technology has been an active research area, as it eases programmers' efforts by automatically translating sequential programs into a parallel form. The early 90's experimental results on the effectiveness of parallelizing compilers pushed compiler researchers and developers to incorporate more advanced analysis techniques.
Over the last decade, significant advances in parallelizing compiler technology have been made, and several research compilers have demonstrated some successes. In this dissertation, we investigate how good today's parallelizing compilers are, what opportunities remain to improve them, and what technology is needed to exploit the remaining opportunities. To answer these questions, we perform experiments that measure the safety of parallelization at run time for loops left unparallelized by the Stanford SUIF compiler's automatic parallelization system. The experimental results demonstrate that significant improvements to automatic parallelization technology require that existing systems be extended in two ways: (1) they must combine high-quality compile-time analysis with low-cost run-time testing; and (2) they must take control flow into account during analysis.

In order to exploit these remaining opportunities, we have developed a new compile-time analysis technique that can be used to parallelize most of these remaining parallelizable loops. This technique is designed not only to improve the results of compile-time parallelization, but also to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run time when safety cannot be proven statically. We call this approach predicated array data-flow analysis. We augment array data-flow analysis, which the compiler uses to identify independent and privatizable arrays, by associating a predicate with each array data-flow value. Predicated array data-flow analysis allows the compiler to derive "optimistic" data-flow values guarded by predicates; these predicates can be used to derive a run-time test guaranteeing the safety of parallelization.
We demonstrate the effectiveness of predicated data-flow analysis by implementing it in the Stanford SUIF compiler and performing experiments on three benchmark suites and one additional program. Experimental results with a prototype implementation show that predicated array data-flow analysis is promising at finding additional parallel loops, as it parallelizes more than 40% of the remaining inherently parallel loops left unparallelized by the SUIF compiler. We demonstrate improved speedups with negligible run-time overhead for 5 of the 6 programs in our benchmark suite where significant speedup improvement is possible.

Chapter 1

Introduction

Advances in semiconductor technology have continued to improve the performance of microprocessors at an amazing rate over the last several decades, but they have not been able to catch up with continuously increasing demands for higher computing power. Thus, to extract more performance out of microprocessors, the current trend in today's high-performance microprocessors is to exploit instruction-level parallelism (ILP) with a variety of architectural features such as pipelining, wider superscalar issue, and VLIW (very long instruction word). However, as microprocessors become more complex with additional hardware, fewer programs can utilize all the available resources effectively. In fact, many programs exhibit only a limited degree of instruction-level parallelism. Also, higher processor complexity can increase a machine's cycle time.

The complexity of today's microprocessors and the limit of available instruction-level parallelism in programs make the use of multiple processors a particularly attractive alternative, as the computing power can be scaled with the demand by just adding more processors.
More importantly, additional hardware in a multiprocessor is not wasted, since it can also run multiple programs simultaneously to deliver better overall system throughput. As a result, many types of multiprocessors have been developed and used in the scientific computing community. In recent years, small to moderate-scale shared-memory multiprocessor systems have become popular, as they can be built using affordable but powerful commodity microprocessors. In addition, they provide a much easier programming model compared to that of message-passing multiprocessors, where processors communicate with each other by sending explicit messages.

In order to effectively utilize multiprocessors, the user has to either write explicitly parallel programs or use a parallelizing compiler to automatically extract parallel computations from ordinary sequential applications. As writing explicitly parallel programs is a tedious and difficult process, parallelizing compilers have become increasingly important to the success and widespread use of multiprocessors. Automatic parallelization by a compiler is an attractive approach to software development for multiprocessors, as it enables existing sequential programs to make use of the multiprocessor hardware without any user involvement. In an effort to simplify the programmer's job, there has been a significant body of research in parallelizing compilers over the last two decades.

In the early 90's, several researchers performed a series of experiments to evaluate state-of-the-art commercial parallelizing compilers [6, 13, 46]. They showed
The results of these experiments have motivated researchers and developers of parallelizing compilers to begin incorporat ing techniques for locating coarse-grain parallelism, such as array privatization and interprocedural analysis, that have significantly enhanced the effectiveness of auto matic parallelization. As a result, parallelizing compilers are becoming increasingly successful at exploiting coarse-grain parallelism in scientific computations, as evi denced by recent experimental results from both the Polaris system at University of Illinois and from the Stanford SUIF compiler [5. 25]. While these results are impressive overall, some of the programs presented achieve little or no speedup when executed in parallel. This observation raises again ques tions that have been previously addressed by experiments in the early 90!s. Is the compiler exploiting all of the inherent parallelism in a set of pro grams? I f not. can we identify the techniques needed to exploit remaining paral lelism opportunities? Now that the recently identified techniques are performed automatically by some research and commercial parallelizing compilers, it is an appropriate time to re visit the questions that were raised in the early 90!s to determine whether further improvements are possible. In this dissertation, we give answers to these questions. 3 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. First, we demonstrate that significant improvements to automatic parallelization technology are possible, requiring existing systems to be extended in two ways: (1) they must take control flow into account during analysis, and (2) they must combine high-quality compile-time analysis with low-cost run-time testing. We support this claim with the results of an experiment that measures the safety of parallelization at run time for loops left unparallelized by the Stanford SUIF compiler’s automatic parallelization system. 
Based on the results of this experiment, we have developed a new compile-time analysis technique that can be used to parallelize most of these remaining loops. This technique is designed not only to improve the results of compile-time parallelization, but also to produce low-cost, directed run-time tests that allow the system to defer binding of parallelization until run time when safety cannot be proven statically. We call this approach predicated array data-flow analysis.

In the next section, we identify the limitations of current parallelization techniques based on our experiments and describe some existing solutions to these problems. Then, in the subsequent section, we briefly overview our approach. The contributions and organization of this dissertation follow.

1.1 Problem Statement

Traditionally, loop nests in scientific applications are the main target of parallelizing compilers. They are the most time-consuming parts of programs, and relatively easy to partition into the same amount of computation across processors. A parallelizing compiler locates loops whose iterations can be executed in parallel by analyzing accesses to scalar and array variables within the loop nests. A requirement for safe parallelization is that the memory locations accessed (read or written) within one iteration are independent of those written in another iteration. Analysis to determine if accesses are dependent across iterations is called data-dependence analysis.

The first-generation parallelizing compilers, which were based on vectorization technology, were successful at finding fine-grain computations (e.g., inner loops in a loop nest). However, such fine-grain parallelism prohibits obtaining good parallel performance in shared-memory multiprocessors, as multiprocessors need to perform expensive synchronization operations after executing each parallel region.
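The cost asymmetry between fine-grain and coarse-grain parallelism can be made concrete with a toy cost model (my own simplification, not a model from the dissertation): one barrier synchronization is paid each time a parallel region ends.

```python
def barriers(n_outer, parallel_loop):
    """Barriers executed for a doubly nested loop under the toy model:
    a parallel region ends once per launch, and a parallel inner loop
    is re-launched on every iteration of the outer loop."""
    return 1 if parallel_loop == 'outer' else n_outer
```

With `n_outer = 1000`, parallelizing the inner loop costs 1000 barriers against a single barrier for the outer loop, which is why locating outermost parallel loops dominates on shared-memory machines.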
The run-time overhead of synchronization grows as the number of processors increases, and synchronization requires all the processors to be idle while waiting for the slowest processor, which might result in poor processor utilization. Thus, frequent synchronization operations have to be avoided. With the popularity of shared-memory multiprocessors, extracting coarse-grain parallelism, that is, locating outermost parallel loops that can perform a significant amount of work in between synchronization points, has become the main focus of today's parallelizing compilers. Detecting coarse-grain parallelism requires much more sophisticated analysis, such as interprocedural analysis, array privatization analysis, array reduction recognition, and symbolic analysis, in addition to standard analyses.

Motivated by the early 90's study, over the last decade many researchers developed and implemented all of these new analyses in research parallelizing compilers. Recent experimental results showed that these analyses played an important role in improving the quality of parallelizing compilers, especially at finding coarse-grained parallelism. However, automatic parallelization analysis in current practice is severely limited because it is static in nature, basing optimization decisions solely on knowledge provable at compile time. Because compilers typically produce only a single optimized version of a computation, each optimization is performed based on conservative data-flow analysis results; i.e., an optimization is performed only when it is guaranteed to be safe for all control flow paths taken through a program and all possible program inputs.

To support our arguments, we perform a series of experiments to evaluate the effectiveness of the Stanford SUIF compiler, a state-of-the-art research parallelizing compiler.
We empirically evaluate the Stanford SUIF compiler to identify the remaining parallelism opportunities using an automatic run-time parallelization testing system. For our system, we define the extended-LPD test (ELPD) based on the Lazy Privatizing Doall (LPD) test [43], which tests whether a loop contains data dependences and, further, whether such dependences can be safely eliminated with privatization. ELPD extends LPD to test all loops in a loop nest simultaneously, rather than a single loop in a nest at a time, including when loop nests cross procedure boundaries. We use ELPD to instrument and test whether any of the candidate unparallelized loops in the program can be safely parallelized at run time. For 29 programs in three benchmark suites, the ELPD test was executed at run time for each candidate loop left unparallelized by the SUIF compiler to identify which of these loops could safely execute in parallel for the given program input.

The results of this experiment point to two main requirements for improving the effectiveness of parallelizing compiler technology: incorporating control flow tests into analysis and extracting low-cost run-time parallelization tests from analysis results. In the next two sections, we will examine these two requirements in detail.

1.1.1 Limitation of Traditional Data-Flow Analysis

While data-dependence analysis suggests that loops carrying data dependences are unparallelizable, in some cases they can be parallelized by transforming array data structures. Array privatization is one of the crucial transformation techniques in automatic parallelization; it provides each processor with a local copy of the array, so that each processor accesses a private copy that is thus independent of the locations accessed by other processors.
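The effect of the transformation can be sketched in a few lines of Python (a hypothetical stand-in for the Fortran loops the compiler actually handles; the computed values are invented for illustration). Giving each iteration its own copy of the temporary array removes every shared written location, so the iterations could be distributed across processors in any order:

```python
def original(c, d):
    help_ = [0] * d                    # one shared temporary array
    out = []
    for i in range(c):
        for j in range(d):             # write phase: defines help_[0..d-1]
            help_[j] = i + j
        out.append(sum(help_))         # read phase: uses only this iteration's writes
    return out

def privatized(c, d):
    def iteration(i):
        help_ = [0] * d                # private copy: no location shared across iterations
        for j in range(d):
            help_[j] = i + j
        return sum(help_)
    # iterations are now independent; a real compiler would run them in parallel
    return [iteration(i) for i in range(c)]
```

Both versions compute the same result, which is the safety argument privatization rests on: every value read was produced within the same iteration.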
In general, array privatization can be applied if all locations read by an iteration are first written within the same iteration, as in the code in Figure 1.1. Finding privatizable arrays requires array data-flow analysis, which extends scalar data-flow analysis to individual array elements.

for i = 1 to c
  for j = 1 to d
    help[j] = ...
  endfor
  for j = 1 to d
    ... = help[j]
  endfor
endfor

Figure 1.1: Array privatization example.

Traditional data-flow analysis produces data-flow values based on the conservative assumption that all control-flow paths through a program may be taken. In addition, in traditional analyses, two data-flow values at a merge point are combined into a single data-flow value that conservatively approximates the two values. These two ways in which traditional data-flow analysis conservatively approximates the true data-flow values in the program may significantly limit the precision of program analysis.

For example, array data-flow analysis in SUIF conservatively assumes that reads of array help in the loop given in Figure 1.2(a) may access the values defined outside of the loop, since there exists a possible control-flow path (containing the dotted lines in Figure 1.2(b)) that references array help but bypasses the preceding assignment to it. This conservative approximation prohibits privatization of array help.

for i = 1 to c
  for j = 1 to d
    if (...) then help[j] = ...
  endfor
  for j = 1 to d
    if (...) then ... = help[j]
  endfor
endfor

(a) Sample code

[Control-flow graph not reproduced: loop i encloses the two j loops, with conditional branches, shown dotted, that may bypass the write to help and the read of help.]

(b) Control-flow graph

Figure 1.2: Traditional data-flow analysis.

Obviously, the loop can be executed in parallel with array privatization under certain conditions of the if statements (for instance, when both conditions are the same, reads of array
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. help always reference the values defined in the preceding assignments within the same iteration). However, traditional data-flow analysis does not take path information into account during analysis, and thus misses some optimization opportunities that can be exploited with path-sensitive information. As conditional control-flow statements prohibit a precise analysis of a program, it is worthwhile to analyze and manipulate conditions associated with them. Especially in parallelizing compilers for shared-memory multiprocessors, analyzing control-flow conditions can be crucial since coarse-grained parallel loops have a higher probabil ity to contain conditional control-flow statements. However, handling control-flow conditions may dramatically increase the cost of analysis in terms of time and space, so it is important to design an efficient analysis framework. A few existing data-flow analyses incorporate control-flow information [7, 23, 27, 49, 51], particularly guarded array data-flow analysis by Gu, Li and Lee [20, 21, 22]. 1.1.2 R un-T im e Parallelization In some cases, it is safe to parallelize a loop only for certain conditions. For instance, suppose that the first condition of if statement in Figure 1.2(a) is (x > 5) and the second condition is (x > 2). Then when both conditions hold, which is equivalent to (x > 5), the loop cannot be parallelized as written as same location is read and written in different iterations of the loop. However, in this circumstance, array help is always written first before its uses within the same iteration. Therefore, array help 10 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. is privatizable if x > 5. As shown in this example, some loops can be parallelized only on certain input data or conditions. 
To take full advantage of these situations, a compiler should be able to generate multiple versions of the loop. Also, compile-time analysis should be able to derive run-time evaluable tests to guard execution of conditionally transformed code. These run-time tests should be simple, so as not to introduce too much overhead.

Some previous work in run-time parallelization uses specialized techniques not based on data-flow analysis. An inspector/executor technique inspects array accesses at run time immediately prior to execution of the loop [43, 44]. Then the inspector decides whether to execute a parallel or sequential version of the loop. A few researchers have considered how to derive breaking conditions on data dependences, conditions that would guarantee a data dependence does not hold at run time [17, 18, 41, 53].

1.2 Our Approach

The primary goal of this dissertation is to identify the limitations of current parallelizing compilers and suggest ways to improve their quality. The results of the experiment described in the previous section indicate that there is still some room for improving automatic parallelization technology, particularly in two areas:

• more precise compile-time analysis by taking conditional control flow into account

• derivation of focused, low-cost run-time parallelization tests from analysis results.

For these purposes, we propose a new analysis technique called predicated array data-flow analysis, which can be used to parallelize most of the remaining loops missed by the compile-time analysis in the SUIF compiler. Predicated array data-flow analysis associates a predicate with each data-flow value; the analysis interprets these predicates as describing a relationship on the data-flow value when the predicate evaluates to true.
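The core idea can be pictured with a toy representation of our own, purely for illustration (SUIF's actual data-flow values are systems of linear inequalities, and its predicates are program expressions, not strings): each fact carries a guard, and a run-time test falls out whenever a desired property holds only under some guard.

```python
# A predicated data-flow fact: (predicate, fact) means 'fact' holds
# whenever 'predicate' evaluates to true at run time. Both are kept as
# symbolic strings here for illustration only.
facts = [
    ("x > 5", "help[1..d] written before read in each iteration"),
    ("True",  "help[1..d] may be read in each iteration"),
]

def extract_guard(facts, wanted):
    """Predicate extraction: return the condition under which a desired
    property (e.g. privatizability) is known to hold, or None."""
    for pred, fact in facts:
        if fact == wanted:
            return pred
    return None   # property cannot be guaranteed under any known predicate

guard = extract_guard(facts, "help[1..d] written before read in each iteration")
# 'guard' ("x > 5") becomes the run-time test protecting the privatized,
# parallelized version of the loop.
```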
These predicates can be used to derive conditions under which dependences can be eliminated or privatization is possible. These conditions, which can consist of arbitrary program statements, can be used both to enhance compile-time analysis and to introduce run-time tests that guard safe execution of a parallelized version of a computation.

Predicated array data-flow analysis unifies in a single analysis technique several different approaches that combine predicates with array data-flow values. By folding predicates into data-flow values, which we call predicate embedding, we can produce more precise data-flow values by incorporating constraints derived from control-flow tests. By deriving predicates from operations on the data-flow values, which we call predicate extraction, we can obtain breaking conditions on dependences and for privatization, such that if the conditions hold, the loop can be parallelized.

1.3 Contributions

The primary contributions of this dissertation are the following:

• It presents the ELPD test, a significant extension to the LPD test. The ELPD test is a complete and efficient solution for run-time parallelization testing of whole programs, enabling instrumentation of multiple loops in a nest and loops containing procedure calls.

• It presents detailed experimental results on three benchmark suites, identifying remaining opportunities for automatic parallelization, using the ELPD test implemented in the SUIF compiler. It also evaluates the effectiveness of the ELPD test, considering whether loops that ELPD identified as not parallelizable also have inherent parallelism that could be exploited by the compiler.
• It proposes a new analysis for parallelizing compilers called predicated array data-flow analysis, whereby array data-flow analysis for parallelization and privatization is extended to associate predicates with data-flow values. These predicates can be used to derive conditions under which dependences can be eliminated or privatization is possible. These conditions, which can consist of arbitrary program statements, can be used both to enhance compile-time analysis and to introduce run-time tests that guard safe execution of a parallelized version of a computation.

• It presents a comprehensive evaluation of the effectiveness of predicated array data-flow analysis across 30 programs, demonstrating that it parallelizes 40% of the remaining parallelizable loops previously missed by the SUIF compiler.

1.4 Organization of Dissertation

The next chapter provides background on automatic parallelization. We briefly describe a series of analyses and techniques that are commonly used in parallelizing compilers, such as data-dependence analysis, array reduction, and array privatization. We also describe the existing parallelization analysis system in the Stanford SUIF compiler upon which we base our implementation.

Chapter 3 describes our instrumentation system based on the ELPD run-time parallelization test. Next we present the results of the instrumentation experiment on programs from the Specfp95, Nas sample programs, and Perfect benchmark suites. We discuss features of remaining parallelizable loops, and we suggest techniques that could enable the compiler (with run-time support) to parallelize them as well. In addition, we examine loops that ELPD identified as not parallelizable, some of which also have inherent parallelism.

Predicated array data-flow analysis is presented in Chapter 4.
We discuss how the existing analysis must be modified to support predicated array data-flow analysis, and describe some features of our implementation. To measure the impact of predicated array data-flow analysis at finding additional parallelism in programs, Chapter 5 presents extensive experimental results from applying predicated array data-flow analysis across three benchmark suites and one additional program. Chapter 6 presents related work and compares it with our approach. Finally, Chapter 7 summarizes the important results in this dissertation and discusses future directions for this research.

Chapter 2

Background on Parallelization

This chapter overviews a series of parallelization analysis techniques that are commonly used in today's parallelizing compilers, such as data-dependence analysis, array privatization, array reduction, and symbolic analysis. We also describe in detail the array data-flow analysis implemented in SUIF.

2.1 Parallelization Analysis Techniques

2.1.1 Data-Dependence Analysis

A parallelizing compiler transforms loops in a sequential program into collections of threads that can be executed in parallel while preserving the semantics observed by a linear order of statement execution of the original program. The semantics of the original sequential program can be characterized by dependence relations, which are determined by the way memory locations are read and written. Data dependence relations represent the essential ordering constraints among statements or operations in a program. Any execution order of a program which obeys the dependence relations of the original program is semantically valid.
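To make these ordering constraints concrete, the following small sketch (our illustration; the statement descriptions are hypothetical) classifies the dependence of a later statement on an earlier one from the sets of locations each reads and writes:

```python
def classify(s1, s2):
    """Classify the dependence of statement s2 on an earlier statement s1.
    Each statement is described by the sets of locations it reads/writes."""
    deps = []
    if s1["writes"] & s2["reads"]:
        deps.append("flow (true)")      # s1 defines a value s2 may use
    if s1["reads"] & s2["writes"]:
        deps.append("anti")             # s2 redefines a value s1 uses
    if s1["writes"] & s2["writes"]:
        deps.append("output")           # both write the same location
    return deps

# a = b + 1   followed by   b = a * 2
s1 = {"reads": {"b"}, "writes": {"a"}}
s2 = {"reads": {"a"}, "writes": {"b"}}
print(classify(s1, s2))   # ['flow (true)', 'anti']
```

Renaming the reused variable b would remove the anti dependence, but the flow dependence is inherent and would remain.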
Two references are said to be dependent when they access the same memory location and at least one of the references is a write. A dependence is a loop-carried data dependence if it occurs between different iterations of the same instance of the loop. There exist three kinds of data dependence: true, anti, and output dependence. Let s1 and s2 be two statements in a program, with s1 occurring before s2. We say s2 is flow-dependent on s1 if a variable defined in s1 is possibly later used in s2, anti-dependent if s1 uses a variable subsequently defined in s2, and output-dependent if both s1 and s2 write to the same variable. Anti dependences and output dependences arise from the reuse or reassignment of variables, and are sometimes called false dependences. Flow dependence is also called true dependence since it is inherent in the computation and cannot be eliminated by renaming variables [52].

Data-dependence analysis examines each pair of accesses within the loop to discover dependences among references. A loop can be parallelized if there are no loop-carried data dependences. Data-dependence analysis on array locations has been shown to be equivalent to integer linear programming, where constraints are formed with loop bounds and linear subscript expressions, that is, subscript expressions that are linear combinations of the induction variables. The data dependence problems found in practice are usually simple, and efficient methods exist to solve them exactly [41]. Even in the presence of some nonlinear subscript expressions, such as the common case of access expressions involving multi-loop induction variables, approaches based on integer linear programming can detect independence with assistance from symbolic analysis [26].

2.1.2 Array Privatization

Array privatization is one of the most effective transformations in extracting coarse-grain parallelism.
It eliminates memory-related dependences by replicating arrays across processors [49]. As one example, if all locations read by an iteration are first written within the same iteration, as in the code in Figure 1.1, it may be possible to privatize the variable so that each processor accesses a private copy that is independent of the locations accessed by other processors.

Array privatization is possible only if there are no read accesses within an iteration of the loop that are upwards exposed to the beginning of the iteration and are written within other iterations. A read access is called upwards exposed if there is a possible control-flow path from the beginning of the loop body to the read access that contains no definition of the accessed location.

Array privatization is usually only applied to arrays where each iteration of the loop in which the array appears first defines the values in the array before they are used. However, it is also applicable to loops whose iterations use values computed outside the loop. In such cases, the private copies must be initialized with these values before parallel execution begins. An example of array privatization with initialization is shown in Figure 2.1.

for i = 1 to c
  for j = 1 to 5
    help[j] = ...
  endfor
  for j = 1 to 10
    ... = help[j]
  endfor
endfor

Figure 2.1: Array privatization with initialization.

Here, part of array help, the first half, is modified before being referenced in the second j loop. The remainder of the array is not modified at all in the loop. Array help is privatizable in the outer loop by giving each processor a private copy with the second half initialized with the original values.
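Under our reading of Figure 2.1 (bounds 5 and 10 as shown), the copy-in step can be sketched as follows; the Python names and the specific values written are illustrative only:

```python
def iteration(i, help_global):
    # Private copy initialized from the original array, so the unwritten
    # second half (indices 5..9 here) still carries values computed outside.
    priv = list(help_global)
    for j in range(5):                 # first j loop writes only the first half
        priv[j] = i
    return sum(priv[j] for j in range(10))   # second j loop reads all of it

help_global = [100] * 10               # values computed outside the loop
results = [iteration(i, help_global) for i in range(4)]
print(results)   # [500, 505, 510, 515]
```

Each iteration sees its own writes to the first half plus the original values in the second half, so the iterations are independent and can run in parallel.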
When the privatized array is used after the completion of the parallel loop, we also need to finalize the array at the end of the parallel region by copying the values defined within the loop from the private copies to the original array.

While testing for independence can be performed by pairwise comparison of each access in a loop, privatization testing requires that the compiler perform data-flow analysis on array accesses in the loop so that the relative ordering of read and write accesses within an iteration of the loop can be determined [4, 12, 14, 15, 16, 19, 41]. The SUIF automatic parallelization system performs a single array data-flow analysis to test for both independence and privatization [26], as discussed in Section 2.2.

for i = 1 to c
  xi = x[i]
  for j = 1 to d
    sum[j] = sum[j] + a[j] * xi
  endfor
endfor

Figure 2.2: Array reduction example.

2.1.3 Array Reduction

A reduction occurs when a location is updated on each loop iteration, where a commutative and associative operation is applied to that location's previous contents and some data values. The SUIF compiler implements a simple, yet powerful approach to recognizing reductions, in response to common cases such as the example in Figure 2.2. In the example, the reduction can be transformed to a parallel form by creating a private copy of array sum for each processor, initialized to 0. Each processor updates its private copy with the computation for the iterations of the i loop assigned to it, and following execution of the parallel loop, automatically adds the values of its private copy to the global sum.

The reduction recognition algorithm in SUIF searches for computations that are commutative updates to a single memory location A of the form A = A op ..., where op is one of the commutative operations such as +, *, MIN, and MAX. At the same time, in the loop, the only other reads and writes to the location referenced by A should also be commutative updates of the same type described by op, and
At the same time, in the loop, the only other reads and writes to the location references by A should be also commutative updates of the same type described by op, and 20 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. there should be no remaining dependences that cannot be eliminated either by a privatization or reduction transformation. 2.1.4 Symbolic Analysis Parallelizing compilers incorporate a host of scalar symbolic analyses, including con stant propagation, value numbering, induction and loop-invariant variable recogni tion, and common subexpression recognition. These can eliminate some scalar de pendences which cannot be removed by simple privatization. They are also used to simplify subscripts as needed for an array analysis since array analysis is more effec tive when the array subscript expression are phrased as affine expressions in terms of loop indexes and loop invariants. The SUIF combines the effect of such analyses in a single interprocedural symbolic analysis. This symbolic analysis determines, for each variable appearing in an array access or loop bound, a symbolic value: an ex pression describing its value in terms of constants, loop-invariant symbolic constants, and loop indices. 2.2 Array Data-Flow Analysis in SUIF The Stanford SUIF compiler system is an automatic parallelization system that is fully interprocedural. The system incorporates all the standard analyses included in today’s parallelizing compilers, such as data dependence analysis, analyses of 21 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. scalar variables including constant propagation, value numbering, induction vari able recognition and scalar dependence and reduction recognition. In addition, the system employs analyses for array privatization and array reduction recognition. 
The SUIF compiler performs single interprocedural array data-flow analysis to determine which loops access independent memory locations, or for which privatization elimi nates remaining dependences. A more detailed treatment of SUIF:s array data-flow analysis can be found in Amarasinghe’ s dissertation [1]. 2.2.1 Data-flow Values: Summaries The array data-flow analysis computes data-flow values for each program region, where a region is either a basic block, a loop body, a loop, a procedure call, or a pro cedure body. The data-flow value at each region consists of a 4-tuple (R , E. W. M ), with the four components of the tuple defined as follows: • The read set R describes the portions of arrays that may be read inside the program region. • The exposed read set E describes the portions of arrays that may have upwards exposed reads inside the program region, i.e., reads with no preceding writes in the program region, so that they use values computed prior to execution of the region. (E C R). • The write set W describes the portions of arrays that may be written inside the program region. 22 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. • The must write set M describes the portions of arrays that must be written inside the program region (M C W). The R, E. W, and M sets consist of summaries that describe the array regions accessed by a program component. The summary for a given array contains a list of regions, with each region represented by a system of integer linear inequalities representing the constraints on the boundaries of the region. For example, consider the following loop nest. for i = 1 to c for j = 1 to d A [ i+ 1 , 2*1] The region of array A written by a single execution of the statement is represented by a set containing one system of inequalities, parameterized by the program variables c and d. and normalized loop index variables i and j : ( s i, s 2) 0 < j < d — 1. si = j + 2, 0 < i < c — 1. 
s2 = 2 i -f 2 The list of regions is used instead of a single region to avoid loss of information when multiple, very different accesses to an array appear in the same loop: in previ ous work by the SUIF group, they have found this feature of the implementation to be very important to the precision of the result in practice and have not found the number of regions to grow unmanageably. 23 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. The use of summaries to represent the indices of all accesses to an array in a program region is an important feature. It eliminates the need to perform 0(rr) pairwise dependence and privatization tests for a loop containing n array accesses, which becomes prohibitively expensive for very large loops. 2.2.2 Key Operators Array data-flow analysis relies on four operators on array summaries. • Union: AU B = {c \ c € A or c € B}. The union of two summaries A and B combines systems in B with the set A. Since the union of two convex regions can result in a non-convex region, a set is necessary to maintain the precision of the union operator. • Intersection: An£? = { a n 6 |a € .4 and b € B and a(~)b 0}. The intersection of two sets of systems is the set of all non-empty pairwise intersections of their elements. • Subtraction: A — B subtracts ail systems of set B from each system in A, using a subtraction operation Sub tract (a, b) to subtract each array section b from an array section a. Since a precise subtraction of two convex array sections might result in a non-convex section, the implementation conservatively approximates the result. The subtraction of each system b E B from a single system a is performed iteratively, since system a may be contained by the combination of systems in B. 24 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
• Projection: Project(A, l) eliminates the set of variables l from the constraints of all the systems in set A by applying the Fourier-Motzkin elimination technique to each system.

The principal operators rely on a number of ancillary operators to simplify and reduce the number of systems in a summary following the above principal operators.

• Merge(A): Merge is applied following each of the four main operators to simplify the resulting set of systems of inequalities, reducing the number of systems in the set while avoiding loss of precision.

• IsEmpty(A): (A = ∅) ⟺ ∀a ∈ A, (a = ∅). An array summary is empty if and only if every system in the summary is empty; a system is empty if there is no integer solution for the system of inequalities. A Fourier-Motzkin elimination technique with branch-and-bound is used to check for the existence of an integer solution.

• IsContained(A, B): A ⊆ B ⟺ (a ∈ A → a ∈ B). The Merge operator relies on the containment test to determine when merging two array sections will not result in a loss of precision. A set of systems is contained in another if and only if each system in the first set is contained in a single system in the other set. This test is conservative, as it may return a false negative.

2.2.3 Array Data-Flow Analysis Algorithm

The array data-flow analysis algorithm is presented in Figure 2.3. In the algorithm, the Project function is applied at loop boundaries to summarize the region accessed by the loop. Project uses Fourier-Motzkin projection to eliminate from the array regions constraints that include scalar variables that vary within the loop body. For example, the loop index variable is eliminated from the region, and replaced by its lower and upper bounds. To summarize the array regions accessed by a procedure, the Reshape function maps arrays and other variables in their regions across the call.
In some cases, the Reshape function must modify the array region to support the common practice in Fortran programs of adding or dropping dimensions across procedure boundaries. The Reshape calculation also uses Fourier-Motzkin projection to eliminate variables representing the dimensions of the formal parameter and replace them with corresponding values for variables representing the dimensions of the actual parameter [1].

for each procedure P from bottom to top over call graph:
  for each region R from innermost to outermost:
    if R is a basic block,
      T_R = ({Read(R)}, {Exposed(R)}, {Write(R)}, {Must(R)})
    if R is a loop with body R', with constraints on loop-varying variables l,
      T_R = (Project(R_T_R', l), Project(E_T_R', l), Project(W_T_R', l), Project(M_T_R', l))
    if R is a call at site s to a procedure with body R',
      T_R = Reshape_s(T_R')
    if R is a loop body or procedure body,
      for each immediate subregion R':
        T_(R,R') = ⋀_{p ∈ pred(R')} T_p ∘ T_(R,p)
      where a ∘ b = (R_a ∪ R_b, E_a ∪ (E_b − M_a), W_a ∪ W_b, M_a ∪ M_b)
      and a ⋀ b = (R_a ∪ R_b, E_a ∪ E_b, W_a ∪ W_b, M_a ∩ M_b)
      T_R = T_Exit(R) ∘ T_(R,Exit(R))

Figure 2.3: Previous array data-flow analysis algorithm.

2.2.4 Dependence and Array Privatization Tests

The dependence and privatization tests, and the derivation of the regions in privatizable arrays requiring initialization, are presented in Figure 2.4. Each array region is described in terms of its loop index variable i. In the tests, the notation W_L|_i1 refers to replacing i with some other index i1 in the iteration space.

Independent_L: ∀ i1, i2 ∈ I, (i1 ≠ i2) ⟹ (W_L|_i1 ∩ R_L|_i2 = ∅) ∧ (W_L|_i1 ∩ W_L|_i2 = ∅)

Privatizable_L: ∀ i1, i2 ∈ I, (i1 ≠ i2) ⟹ (W_L|_i1 ∩ E_L|_i2 = ∅)

Initialize_L = E_L if Privatizable_L, ∅ otherwise

Figure 2.4: Dependence and array privatization tests in SUIF.
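As an executable illustration of these definitions (ours, with array regions simplified to explicit sets of indices rather than SUIF's systems of linear inequalities), the sequential composition operator of Figure 2.3 and the tests of Figure 2.4 can be sketched as:

```python
def compose(a, b):
    """Sequential composition a ∘ b of two (R, E, W, M) tuples,
    where region a executes before region b (Figure 2.3)."""
    Ra, Ea, Wa, Ma = a
    Rb, Eb, Wb, Mb = b
    return (Ra | Rb,            # may-read
            Ea | (Eb - Ma),     # b's exposed reads not covered by a's must-writes
            Wa | Wb,            # may-write
            Ma | Mb)            # must-write

def independent(per_iter):
    """Figure 2.4 dependence test: no write of one iteration may overlap
    a read or write of a different iteration."""
    n = len(per_iter)
    return all(not (per_iter[i1][2] & (per_iter[i2][0] | per_iter[i2][2]))
               for i1 in range(n) for i2 in range(n) if i1 != i2)

def privatizable(per_iter):
    """Figure 2.4 privatization test: no write of one iteration may
    overlap an exposed read of a different iteration."""
    n = len(per_iter)
    return all(not (per_iter[i1][2] & per_iter[i2][1])
               for i1 in range(n) for i2 in range(n) if i1 != i2)

# One iteration: write help[1..2], then read help[1..2].
writes = (set(), set(), {1, 2}, {1, 2})
reads = ({1, 2}, {1, 2}, set(), set())
it = compose(writes, reads)      # the exposed reads cancel: E is empty
print(independent([it, it]), privatizable([it, it]))   # False True
```

The loop is dependent as written (writes of one iteration overlap reads of another), but privatizable, exactly the situation the Figure 1.1 example exhibits.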
These tests refer to the R, W, and E components of the data-flow value of a loop for accesses to a particular array. The dependence test verifies that read and write regions, or two write regions, do not overlap for any two different iterations in iteration space I. The privatization test verifies that exposed reads do not overlap with writes for any two different iterations. The initialization test collects the upwards-exposed read regions of a privatizable array; for correctness, these regions will require initialization of the private copy of the array.

The SUIF compiler limits privatization to those cases where every iteration in the loop writes to exactly the same region of data. So the analysis performs the following test to finalize a loop whose index i has upper bound ub: if W = M|_ub, the last loop iteration is peeled, and this final iteration writes to the original array.

2.3 Chapter Summary

This chapter has described a series of analysis techniques that are essential to automatic parallelization. These include data-dependence analysis, array privatization, array reduction, and symbolic analysis. We have also described the interprocedural array data-flow analysis implemented in the SUIF compiler, which combines data-dependence analysis with data-flow analysis to track data-flow properties of individual array elements.

Chapter 3

The Effectiveness of Automatic Parallelization in SUIF

This chapter presents an evaluation of automatic parallelization in the Stanford SUIF compiler, which exploits loop-level parallelism on shared-memory multiprocessor architectures.
Unlike previous experiments in the early 1990s, which involved hand parallelization of programs to identify the requirements for the compiler, this experiment empirically evaluates the remaining parallelism opportunities using an automatic run-time parallelization testing system. We augment the Lazy Privatizing Doall (LPD) test [30, 43] to instrument and test whether any of the candidate unparallelized loops in the program can be safely parallelized at run time. The LPD test is based on an inspector/executor model, where an inspector tests array subscript expressions in the loop at run time and an executor executes the loop in parallel if the inspector determines it is safe [39, 43]. Our Extended-LPD (ELPD) test instruments multiple loops in a nest and performs
We should point out that successful automatic parallelization involves not just identifying parallelism, but also managing data locality on each processor: considering parallelizable loops in light of the impact on data locality should be a subject for future work. We present measurements on programs from the S p e c f p 9 5 , N as sample pro grams and P e r f e c t benchmark suites. Overall, SUIF has been successful at par allelizing these programs. We find remaining parallelizable loops in eighteen of the 29 programs, but we show that this additional parallelism is only significant for ten 31 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. of the programs. We discuss features of these remaining loops, and we suggest tech niques that could enable the compiler (with run-time support) to parallelize them as well. In addition, we examine loops that ELPD identified as not parallelizable, some of which also have inherent parallelism. There are two key implications of these results. We demonstrate that there is still some room for improving autom atic paral lelization technology, particularly in two areas: incorporating control flow tests into analysis and extracting low-cost run-time parallelization tests from analysis results. We also find that an inspector/executor run-time test like LPD may be overkill for most of these loops, and we suggest that the loops could instead be parallelized with less expensive, focused run-time tests derived by compile-time analysis. 3.1 Overview The overview of our experiment is illustrated in Figure 3.1. The compiler takes a program and attaches instrumentation codes. Then, combined with a particular input at run-time, the instrumented code evaluates loops missed by the compiler to determine whether they are parallelizable by performing run-time parallelization and privatization test. 
The instrumentation system uses the results of array data-flow analysis and de pendence and privatization tests to decide which loops and which variables in each loop should be instrumented. An initial instrumentation analysis phase, described in the next section, designates loops and arrays for instrumentation. A subsequent 32 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Program Input 4 instrumented codr evaluation Figure 3.1: Sketch of run-time parallelization testing system. transformation phase, described in Section 3.3. actually inserts the instrumentation code. The instrumentation analysis is implemented in the interprocedural framework that is part of the SUIF system [26]. The analysis phase uses the results of array data-flow analysis to significantly limit the amount of work performed by instrumentation. The system reduces in strumentation overhead by not instrumenting loops nested inside already parallelized loops. This feature is important because the SUIF run-time system only exploits a single level of parallelism, so the inner loop would not be parallelized even if proven to be parallel. In a previous publication, we also eliminated from the list of candi dates those loops with compile-time provable dependences [47]. In this dissertation, we include such loops so that they may be considered in our evaluation of the ef fectiveness of the ELPD test in Section 3.4.6. A further distinguishing feature of our system is that it may instrument multiple loops in a nest. When instrumenting accesses to an array within an inner loop, it may be simultaneously keeping track of accesses to the same array in an outer enclosing scope. 33 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 3.2 Instrum entation Analysis Instrumentation analysis derives three results at each loop L in the program. 
• DoInstr(L) is a boolean indicating whether the current loop should be instrumented.

• Instr(L) is the set of arrays to be instrumented in the current loop (empty if ¬DoInstr(L)).

• GInstr(L) is the set of arrays in the current loop that must be instrumented to test some outer enclosing loop.

3.2.1 Instrumentation Analysis for a Single Loop

Within a single loop L, we compute initial values for DoInstr and Instr based on examining only loop L. We also introduce the boolean IsCandidate, which is true only for loops that should be instrumented (that is, for which it is safe to parallelize them if all dependences are eliminated). IsCandidate(L) is false in our system if the loop is not parallelized because it contains an internal exit or return, or it contains read I/O statements. The boolean IsPar denotes whether the compiler was able to statically parallelize the loop. Obviously, there is also no need to instrument parallel loops. Figure 3.2 describes the calculation of DoInstr and Instr for a single loop.

    DoInstr(L) = IsCandidate(L) ∧ ¬IsPar(L)
    Instr(L) = ∅
    for each array A accessed in L
        if (Dependent_L(A) ∧ NotPrivatizable_L(A))
            Instr(L) = Instr(L) ∪ {A}
    if (¬DoInstr(L))
        Instr(L) = ∅

Figure 3.2: Instrumentation analysis algorithm for a single loop.

3.2.2 Interprocedural Instrumentation Analysis

After deriving initial values for DoInstr and Instr for each loop in the program, we can perform interprocedural propagation to obtain the final DoInstr, Instr and GInstr sets for each loop. For this purpose, we introduce one additional boolean, OuterPar, which is true at a particular loop or procedure if and only if the current loop is already parallelized by the compiler or the current loop or procedure is always executed inside an already parallelized loop.
In such a case, it is not necessary to instrument the current loop because the compiler is only interested in exploiting a single level of parallelism and would prefer to parallelize the outer loop instead of this one. The interprocedural propagation algorithm also makes use of a mapping function μ_c that maps a set of arrays across a particular call site c, replacing actual parameters with formal parameters and eliminating local variables of the caller. The implementation applies the same interprocedural analysis framework for non-recursive programs used for array data-flow analysis, as discussed in Chapter 2. Here, the only interesting program regions are loops, procedure bodies and procedure calls. The interprocedural algorithm is defined in Figure 3.3. The calculation of GInstr(P) derives which global variables and formal parameters must be instrumented in enclosing loops for all possible calls to procedure P. This set of variables may be a superset of the variables needing to be instrumented for a particular call to P if P is invoked from different calling contexts. Similarly, OuterPar is true only if the procedure is always invoked inside an outer parallel loop. To behave correctly in the presence of different calling contexts, analysis can employ procedure cloning to replicate the procedure body and tailor it to the set of instrumented variables and OuterPar for a particular calling context [10, 34]. Alternatively, the procedure body can examine flags passed to it as parameters at run time to decide whether instrumentation is required for the current calling context. For expediency, our implementation currently uses the latter solution.
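As a concrete illustration, the per-loop initialization and the loop-nesting part of this propagation might be sketched as follows. The Loop class, its field names, and the recursive traversal are hypothetical simplifications for this sketch, not SUIF's actual representation, and the procedure-call cases (including the μ_c mapping across call sites) are omitted.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the analysis in Figures 3.2 and 3.3, restricted
# to loop nests within a single procedure.

@dataclass
class Loop:
    name: str
    is_candidate: bool = True          # safe to parallelize if independent
    is_par: bool = False               # statically parallelized already
    dep_arrays: set = field(default_factory=set)  # dependent, unprivatizable
    children: list = field(default_factory=list)
    do_instr: bool = False
    instr: set = field(default_factory=set)
    g_instr: set = field(default_factory=set)

def analyze(L, outer_par=False, outer_g=frozenset(), outer_instr=frozenset()):
    # Figure 3.2: initial DoInstr and Instr, filtered by OuterPar (Fig. 3.3).
    L.do_instr = L.is_candidate and not L.is_par and not outer_par
    L.instr = set(L.dep_arrays) if L.do_instr else set()
    # Figure 3.3: arrays some enclosing loop still needs instrumented here.
    L.g_instr = set(outer_g) | set(outer_instr)
    for child in L.children:
        analyze(child, outer_par or L.is_par, L.g_instr, L.instr)

# A non-parallel outer loop with a dependent array A, containing a loop the
# compiler already parallelized: only the outer loop is instrumented, but
# the inner loop must still track A on behalf of the outer one.
outer = Loop("L1", dep_arrays={"A"}, children=[Loop("L2", is_par=True)])
analyze(outer)
```

Note how the inner loop's own instrumentation is suppressed (it is already parallel), yet array A propagates into its GInstr set, mirroring the multiple-loops-in-a-nest feature described above.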
    for each procedure P:
        DoInstr(P) = false
        GInstr(P) = ∅
        Instr(P) = ∅
        OuterPar(P) = false
    for each procedure P from top to bottom over the call graph:
        let R0 be the region for the procedure body of P
        GInstr(R0) = GInstr(P)
        Instr(R0) = Instr(P)
        OuterPar(R0) = OuterPar(P)
        for each region R (≠ R0) in P from outermost to innermost:
            let R' be the region immediately enclosing R
            if R is a loop:
                DoInstr(R) = DoInstr(R) ∧ ¬OuterPar(R')
                GInstr(R) = GInstr(R') ∪ Instr(R')
                OuterPar(R) = OuterPar(R') ∨ IsPar(R)
                if (¬DoInstr(R)) Instr(R) = ∅
            if R is a procedure call c invoking procedure P':
                DoInstr(P') = DoInstr(P') ∨ DoInstr(R')
                GInstr(R) = GInstr(R') ∪ Instr(R')
                GInstr(P') = GInstr(P') ∪ μ_c(GInstr(R))
                OuterPar(P') = OuterPar(P') ∧ OuterPar(R')

Figure 3.3: Interprocedural instrumentation analysis algorithm.

3.3 Instrumentation Transformation

Following the analysis phase, the compiler uses DoInstr and the Instr and GInstr sets described above to insert shadow arrays - auxiliary arrays to record accesses to arrays in Instr(L) - and uses the shadow arrays to perform run-time dependence and privatization tests. For this purpose, we extend the Lazy Privatizing DOALL (LPD) test defined by Rauchwerger and Padua [43] and subsequently refined by Lawrence [30]. Our Extended-LPD (ELPD) test simultaneously tests for independence and privatization across multiple loops in a nest, which we will describe after initially presenting the ELPD test within a single loop. To simplify the presentation, in the following we assume normalized loop bounds that start at 1 with a step size of 1, and array dimensions declared with 1 as their first element.

3.3.1 Instrumenting a Single Loop — LPD Test

We begin by presenting how instrumentation is performed within a single loop.
For some array A[1:d1, 1:d2, ..., 1:dn] in Instr(L) for loop L with bounds [1:bu], the system introduces four shadow arrays:

• Sw marks elements written within L.

• Sr marks elements read but not written within at least one iteration of L.

• Snp marks elements that are read only, or read before written, within at least one iteration of L, for use in the privatization test.

• Srf marks elements read first before any writes for all iterations of L.

These arrays are of the same dimensionality as A but are integer or boolean only. We also introduce a boolean O that is set if the loop contains an output dependence. Let I = (i1, ..., in) refer to a subscript expression for array A appearing in the code for loop L, which we call an access function. To clarify the algorithms presented below, we first define the shadow arrays and a set of properties of the values of shadow array elements for the location described by access function I at completion of loop L's execution.

• Sw[I] = b ⇔ b is the last iteration of L that writes location A[I].

• Sr[I] = b ⇔ b is the first iteration of L that reads and does not write location A[I].

• Snp[I] = true ⇔ A[I] has an upwards exposed read in some iteration of L.

• Srf[I] = true ⇔ the first access to A[I] in L is a read.

• O = true ⇔ write accesses to some location A[I] occur in different iterations of L.

The shadow array elements for Sr and Sw are initialized to 0. The elements for Snp and Srf, and the boolean O, are initialized to false.

3.3.1.1 Run-Time Updates To Shadow Arrays

During program execution, the system performs the added computations described in Figure 3.4 in response to read and write accesses of A. The term bc in Figure 3.4 refers to the current iteration of loop L.
Upon exit of the loop, the LPD test examines the shadow arrays to determine whether the array is independent, or if not, whether dependences can be safely eliminated with privatization.

    (a) Write A[I]:
        if (Sw[I] ≠ bc)
            if (Sw[I] ≠ 0)
                O = true
            if (Sr[I] = bc)
                Sr[I] = 0
            Sw[I] = bc

    (b) Read A[I]:
        if (Sw[I] ≠ bc)
            if (Sr[I] = 0)
                Sr[I] = bc
            if (Sw[I] = 0)
                Srf[I] = true
            Snp[I] = true

Figure 3.4: Run-time updates to shadow arrays in a single loop.

We now show how to reformulate the dependence and privatization tests from Chapter 2 as run-time tests on the shadow arrays, for a particular array accessed in L. The dependence test is defined as follows.

    Independent_L ⇔ (∀I ∈ [1:d1, 1:d2, ..., 1:dn], (Sr[I] = 0) ∨ (Sw[I] = 0)) ∧ (O = false)

The first term of the above test determines whether there are loop-carried true or anti-dependences, while the second term identifies loop-carried output dependences. If there are output dependences but no true or anti-dependence, we can apply the following test to determine whether the remaining dependences could be eliminated by privatization.

    Privatizable_L ⇔ ∀I ∈ [1:d1, 1:d2, ..., 1:dn], (Snp[I] = false) ∨ (Sw[I] = 0)

Given that we have already proven that there is an output dependence and no true or anti-dependences, the privatization test determines whether there is a read upwards exposed to the beginning of some iteration that is also written in some other iteration of the loop. As compared to the LPD test, this formulation of the single-loop ELPD test introduces the additional shadow array Srf to recognize whether the first access to an array element in a loop is a read.
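As an illustration, the updates of Figure 3.4 and the two tests above can be sketched for a one-dimensional array as follows. The Shadow class and its method names are inventions for this sketch, not part of the actual instrumentation library.

```python
# Minimal sketch of the single-loop shadow arrays (Figure 3.4) and the
# Independent/Privatizable tests, for a one-dimensional array of size n.

class Shadow:
    def __init__(self, n):
        self.Sw = [0] * n        # last iteration writing each element
        self.Sr = [0] * n        # first iteration reading but not writing
        self.Snp = [False] * n   # upwards-exposed read in some iteration
        self.Srf = [False] * n   # first access overall is a read
        self.O = False           # loop-carried output dependence

    def write(self, i, bc):      # record a write of A[i] in iteration bc
        if self.Sw[i] != bc:
            if self.Sw[i] != 0:
                self.O = True
            if self.Sr[i] == bc:
                self.Sr[i] = 0
            self.Sw[i] = bc

    def read(self, i, bc):       # record a read of A[i] in iteration bc
        if self.Sw[i] != bc:
            if self.Sr[i] == 0:
                self.Sr[i] = bc
            if self.Sw[i] == 0:
                self.Srf[i] = True
            self.Snp[i] = True

    def independent(self):
        return (not self.O and
                all(r == 0 or w == 0 for r, w in zip(self.Sr, self.Sw)))

    def privatizable(self):
        return all((not np) or w == 0
                   for np, w in zip(self.Snp, self.Sw))

# Example: a loop whose iterations each write A[0] and then read it back.
# The writes collide across iterations (O = true), but every read is
# covered by a same-iteration write, so the array is privatizable.
s = Shadow(1)
for bc in (1, 2, 3):
    s.write(0, bc)
    s.read(0, bc)
```

By contrast, a loop that reads A[0] before writing it in each iteration would set Snp, failing both tests.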
This shadow array is needed only to derive the solution at an outer loop based on accesses in the inner loop, as described in the next section. This shadow array can be omitted for the outermost loop in a nest for which the ELPD test is performed.

3.3.1.2 Reducing Space Requirements

As compared to previous work, our approach can be implemented such that it reduces the space requirements for shadow arrays. Our formulation uses a single boolean O to test for output dependences, replacing a full integer shadow array used in previous work [30, 43]. The shadow array is not necessary because we are using a time stamp of the iteration of the write access for the write shadow array Sw, rather than the boolean used by Rauchwerger and Padua [43]. To further reduce the space requirements, we can employ a space-saving technique suggested by Lawrence [30]. We can represent the boolean shadow array elements for Snp and Srf as signs on the time stamps for the corresponding write and read shadow array elements Sw and Sr, respectively (positive integer for true, negative for false, distinguishing between positive and negative 0, and assuming normalized loop bounds). Using signs on read and write shadow arrays reduces the total space requirements to two integer shadow arrays plus a scalar boolean for output dependence, rather than the three boolean and one integer shadow arrays used by Rauchwerger and Padua [43] or the four integer shadow arrays used by Lawrence [30].

3.3.2 Instrumenting Multiple Loops in a Nest — ELPD

Suppose loop Lk is nested inside loop Lj and, for both loops, array A must be instrumented (i.e., A ∈ Instr(Lk) ∧ A ∈ Instr(Lj)). Then each loop will have its own shadow arrays for array A. In this case, we need to update the values of the shadow arrays of the outer loop following completion of each invocation of the inner loop.
Given shadow arrays S^k_w, S^k_r, S^k_np, and S^k_rf for the inner loop, and shadow arrays S^j_w, S^j_r, S^j_np, and S^j_rf and boolean O^j for the outer loop, the updated values for the outer loop's shadow arrays are defined in Figure 3.5. When shadow arrays for the outer loop appear on the right-hand side of equations in Figure 3.5, we are referring to the value prior to performing these updates. The term bj refers to the current iteration of loop Lj, and ba indicates any iteration value ≠ 0.

    O^j       = true       if (S^j_w[I] = ba) ∧ (S^k_w[I] ≠ 0) ∧ (ba ≠ bj)
                O^j        otherwise
    S^j_w[I]  = bj         if (S^k_w[I] ≠ 0)
                S^j_w[I]   otherwise
    S^j_r[I]  = 0          if (S^j_r[I] = bj) ∧ (S^k_w[I] ≠ 0)
                bj         if (S^j_r[I] = 0) ∧ (S^j_w[I] ≠ bj) ∧ (S^k_r[I] ≠ 0) ∧ (S^k_w[I] = 0)
                S^j_r[I]   otherwise
    S^j_np[I] = true       if (S^j_w[I] ≠ bj) ∧ (S^k_np[I] = true)
                S^j_np[I]  otherwise
    S^j_rf[I] = true       if (S^j_w[I] = 0) ∧ (S^k_rf[I] = true)
                S^j_rf[I]  otherwise

Figure 3.5: Run-time updates to shadow arrays in a nested loop.

The shadow array S^j_w[I] sets its value to the current iteration if the element is written in the inner loop. The shadow array S^j_r[I] sets its value to the current iteration if the element was previously unread in the outer loop, was not written in the current iteration of the outer loop, and was read but not written in the inner loop; its value is set to 0 if the element was previously read in the current iteration and it is written by the inner loop. The shadow array S^j_np[I]'s value is only set to true if the element was not written in the current iteration of the outer loop and is read in the inner loop before possibly being written. Similarly, S^j_rf[I] is set to true if the element has never been written in the outer loop and is read in the inner loop before possibly being written. In general, updating shadow arrays for an enclosing loop is needed whenever A ∈ GInstr(L) ∧ A ∈ Instr(L).
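Continuing the sketch, one completed invocation of an inner loop's shadow arrays can be folded into the enclosing loop's state as follows. The dict-based representation and helper names are hypothetical; the branches mirror the cases of Figure 3.5 for a one-dimensional array.

```python
# Hypothetical sketch of the Figure 3.5 merge: fold inner loop k's shadow
# state into outer loop j's, where bj is the current iteration of loop j.

def fresh(n):
    return {"Sw": [0] * n, "Sr": [0] * n,
            "Snp": [False] * n, "Srf": [False] * n, "O": False}

def merge(j, k, bj):
    for I in range(len(j["Sw"])):
        # O^j: written in an earlier outer iteration and again by loop k.
        if j["Sw"][I] not in (0, bj) and k["Sw"][I] != 0:
            j["O"] = True
        # S^j_r: cancel a same-iteration read now covered by a write, or
        # record a read in loop k that remains exposed at level j.
        if j["Sr"][I] == bj and k["Sw"][I] != 0:
            j["Sr"][I] = 0
        elif (j["Sr"][I] == 0 and j["Sw"][I] != bj
              and k["Sr"][I] != 0 and k["Sw"][I] == 0):
            j["Sr"][I] = bj
        # S^j_np / S^j_rf: exposure at level j requires no covering write.
        if j["Sw"][I] != bj and k["Snp"][I]:
            j["Snp"][I] = True
        if j["Sw"][I] == 0 and k["Srf"][I]:
            j["Srf"][I] = True
        # Update S^j_w last, so the tests above saw its pre-update value.
        if k["Sw"][I] != 0:
            j["Sw"][I] = bj

# Two outer iterations whose inner loop writes element 0 each time: the
# merge exposes the loop-carried output dependence at the outer level.
j = fresh(1)
for bj, inner_iter in ((1, 3), (2, 5)):
    k = fresh(1)
    k["Sw"][0] = inner_iter
    merge(j, k, bj)
```

Updating S^j_w last is one way to honor the rule that right-hand sides in Figure 3.5 refer to pre-update values.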
The shadow arrays to be updated are either from the immediately enclosing loop or, if L is the outermost instrumented loop in the procedure, those passed into the procedure as parameters.

3.3.3 Instrumentation Across Procedures

Whenever GInstr(P) for some procedure P is not empty, a procedure must instrument accesses to some of its arrays for the purposes of testing a loop in a calling procedure. In this case, the shadow arrays are passed as parameters, as is the index of the loop to be tested.

3.4 Experimental Results

We have implemented the instrumentation approach described previously in the Stanford SUIF compiler. This section presents a series of results on three benchmark suites.

3.4.1 Benchmark Programs

The benchmark programs used in the experiment appear in Table 3.1. The entries in the table include the program name, the number of lines of code, a brief description, and a count of the loops in the program that are actually executed. (This number is smaller than previous static loop counts [1, 49], because some programs contain a significant number of loops that are never executed.) Our experiments consider two benchmark suites (Specfp95 and Nas) for which the SUIF compiler was already mostly successful at achieving good speedups, and the Perfect benchmark suite for which SUIF was less successful. We omit from our experiment fpppp (Specfp95) and spice (Perfect) because they have so many type errors and parameter number mismatches in the original Fortran code that they would require significant modification to pass through SUIF; these programs are widely considered not to have significant loop-level parallelism. We used the reference inputs for Specfp95, the small inputs for Nas, and the standard inputs for Perfect. In previous publications, SUIF achieved a speedup on 7 of the 10 Specfp95 programs; of these, su2cor achieved a speedup of only 4 on 8 processors of a Digital AlphaServer 8400 [25]. Six programs obtained a speedup of more than 6. The programs apsi, wave5, and fpppp were the only three not to obtain a speedup. In the Nas benchmark suite, only buk and fftpde failed to achieve a speedup. The compiler was less successful in parallelizing the Perfect benchmark suite, obtaining a speedup on only 5 of the 12 programs in the experiment [24].

3.4.2 Experimental Methodology

To obtain the results presented below, we performed three different executions of each program. First, we executed the instrumented program to locate the ELPD-proven parallel loops. Second, we executed the original program without instrumentation, performing the measurements described below for just the loops parallelized by the compiler. Recall that we are only interested in exploiting a single level of parallelism. As a result, the measurements only consider the impact of parallelizing the outermost possible loop. The final run measures the impact of parallelizing both the compiler-parallelized loops and the ELPD-proven parallelizable loops. Dynamic measurements were gathered on a single processor of an SGI Origin 2000 with 195 MHz R10000 processors.

    Program    Length  Description                                  # of loops
    Specfp95
      applu      3868  parabolic/elliptic PDEs                             155
      apsi       7361  mesoscale hydrodynamic model                        122
      hydro2d    4292  Navier-Stokes                                       151
      mgrid       484  multigrid solver                                     50
      su2cor     2332  quantum physics                                     112
      swim        429  shallow water model                                  24
      tomcatv     190  mesh generation                                      13
      turb3d     2100  isotropic, homogeneous turbulence                    54
      wave5      7764  2-D particle simulation                             124
    Nas
      appbt      4457  block tridiagonal PDEs (12^3 x 5^2 grid)            182
      applu      3285  parabolic/elliptic PDEs (12^3 x 5^2 grid)           156
      appsp      3516  scalar pentadiagonal PDEs (12^3 x 5^2 grid)         188
      buk         305  integer bucket sort (65,536 elts)                    10
      cgm         855  sparse conjugate gradient (1,400 elts)               31
      embar       135  random number generator (256 iters)                   7
      fftpde      773  3-D FFT PDE (64^3 grid)                              38
      mgrid       676  multigrid solver (32^3 grid)                         56
    Perfect
      adm        6105  pseudospectral air pollution model                  124
      arc2d      3965  2-D fluid flow solver                               172
      bdna       3980  molecular dynamics of DNA                           195
      dyfesm     7608  structural dynamics                                 197
      flo52      1986  transonic inviscid flow                             174
      mdg        1238  molecular dynamics of water                          44
      mg3d       2812  depth migration                                     140
      ocean      4343  2-D ocean simulation                                130
      qcd        2327  quantum chromodynamics                              134
      spec77     3889  spectral analysis weather simulation                369
      track      3735  missile tracking                                     71
      trfd        485  2-electron integral transform                        33

Table 3.1: Benchmark programs.

3.4.3 Categorization of Loops

We begin with measurements that compare the percentage of a program's loops that are in parallelized regions. In the graph in Figure 3.6, the first bar for each program represents only compiler-parallelized loops, while the second bar also includes loops proven parallelizable by the ELPD test.
The loops are broken down into three categories: (1) loops that are always executed as outermost parallel loops; (2) loops that are always executed in parallel regions, but for at least some of their invocations are nested inside parallel loops; and (3) loops that are sometimes or always executed sequentially. This categorization is needed since a loop may be invoked through different procedures; it may be executed in parallel in some invocations but not others, or as an outermost parallel loop in some invocations but nested within a parallel loop in others. These loops are represented as a percentage of the total loops that are actually executed at run time, taken from Table 3.1.

Figure 3.6: Static categorization of loops in parallelized regions, comparing compiler-parallelized loops to total parallelizable loops.

We observe from the figure that the percentage of loops always executed in parallel (loops in the first two categories) is fairly high for the compiler-parallelized version of the program, but there are some sequential loops in almost every single program. By comparing the compiler-parallelized result with the ELPD result, we see that ELPD finds additional parallel loops in over half of the 29 programs. Of particular interest are the programs which show a decrease in the number of outermost parallel loops with ELPD, but for which the total number of loops executed within parallel regions increases.
This effect occurs when ELPD parallelizes an outer loop in which several compiler-parallelized loops are nested; the compiler-parallelized loops appear as loops nested inside a parallel program region rather than as outermost parallel loops. These static results would suggest that there are some remaining opportunities to enhance the effectiveness of automatic parallelization, although perhaps restricted to a few programs. The results in the remainder of the chapter evaluate this initial hypothesis, considering the impact of parallelizing these additional loops on dynamic performance metrics, pinpointing the characteristics of the ELPD-proven loops, and then considering the loops in the sequential category to determine their potential for parallelization.

3.4.4 Dynamic Performance Measurements

Static counts of loops such as those presented above can be somewhat misleading, because parallelizing a single coarse-grain loop can have a significant impact on a program's performance, while parallelizing dozens of fine-grain loops can actually degrade performance. To more accurately assess remaining parallelism opportunities, we compared dynamic measurements of three kinds for the original compiler-parallelized program and a version of the program where the ELPD-proven loops are parallelized.

Parallelism Coverage: The overall percentage of sequential execution time spent in parallelized regions is the parallelism coverage. Coverage is an important metric for measuring the effectiveness of parallelization analysis, because by Amdahl's law, programs with low coverage cannot possibly get good parallel speedup. Coverage measurements were taken by measuring both overall execution time and time spent in the parallelized regions under consideration.
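For instance, coverage and the speedup bound it implies might be computed as follows; the timings are invented numbers for illustration, not measurements from this chapter.

```python
# Illustrative sketch: parallelism coverage and the Amdahl's-law bound it
# implies. The timings below are made-up example values.

def coverage(total_time, parallel_region_times):
    """Fraction of sequential execution time spent in parallelized regions."""
    return sum(parallel_region_times) / total_time

def max_speedup(cov):
    """Amdahl's law: the sequential fraction (1 - cov) bounds speedup."""
    return float("inf") if cov >= 1.0 else 1.0 / (1.0 - cov)

cov = coverage(100.0, [40.0, 35.0, 5.0])  # 80% coverage
bound = max_speedup(cov)                  # at most about 5x on any machine
```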
Parallelism Granularity: A program with high coverage is not guaranteed to achieve parallel speedup due to a number of factors. The granularity of parallelism is a particularly important factor, as frequent synchronizations introduce significant overhead that can slow down, rather than speed up, a fine-grain parallel computation. For this experiment, we compute a weighted granularity calculated by the following formula:

    wgran_L = (time(L) / invoc(L)) * coverage(L)

This calculation provides an average execution time for each invocation of the loop (computed by the first term), weighted by the coverage of loop L. The granularity for the program P is then calculated as the sum of weighted granularities across all loops.

An additional dynamic measurement of importance is the parallel speedup comparison with and without the additional ELPD loops. Gathering meaningful speedup measurements involves a significant amount of work beyond identifying parallel loops: particularly, performance tuning to reduce parallelization overhead and sequentialize loops that are unprofitable to execute in parallel, and locality optimization such as data placement, loop transformations and page coloring. Rather than gather speedup measurements, which is beyond the scope of this work, we reference existing speedup measurements wherever possible.

The coverage and granularity measurements are obtained in a straightforward manner, and only one aspect of their measurement requires further description. Each loop invocation must be timed, and the overhead of invoking the timer can skew the results, particularly when comparing a version of a program with fine-grain parallelism versus one with coarser-grain parallelism. We found it essential to account for this timer overhead, so we performed a series of measurements to determine its impact on our overall results.
By measuring 10,000 executions of the timer and the associated code surrounding it, we found it introduced an average overhead of 4.4 μsecs per loop invocation, where 1.9 μsecs of this overhead was charged to the loop time. These amounts were subtracted from the execution time for the program and for each loop invocation, respectively. We validated this measurement by comparing it against the difference between the program's execution with and without timer calls, and found our estimate to be accurate.

The coverage and granularity results are presented in Figure 3.7. In each graph, the first bar for each program is for the compiler-parallelized version, and the second bar is for both compiler-parallelized and ELPD-proven parallelizable loops. Because of the wide divergence in granularity measurements, the granularity graph uses a log scale to represent time.

Figure 3.7: Coverage and granularity comparison.

3.4.5 Detailed Consideration of Results

Overall, the compiler was already doing a good job of parallelizing the Specfp95 and Nas benchmarks. Only eight of the 17 programs in these two benchmark suites had any additional parallelizable loops, and most programs had just a few such loops. On the other hand, in the Perfect benchmark suite, ten of the twelve programs had remaining parallelizable loops.

Once we identified the eighteen programs with remaining parallelism opportunities, we examined the ELPD-proven parallelizable loops in these programs to evaluate how a parallelizing compiler might exploit this additional parallelism automatically. We characterized the requirements of these additional loops as presented in Table 3.2. (The detailed categorization of individual loops is included in the Appendix.) The number of ELPD-proven loops each program contains appears in the last column.

    Program    ST   CF   BC   CF+BC   OD   IE   DD   Total
    Specfp95
      apsi      1   10    0      6     0    0    0      17
      mgrid     0    0    0      0     0    1    0       1
      su2cor    0    9    5      0     4    1    0      19
      wave5     0    1    9      2     1    0    2      15
    Nas
      buk       0    0    0      0     0    1    0       1
      cgm       0    0    0      0     0    0    2       2
      fftpde    0    3    0      0     0    0    0       3
      mgrid     0    0    0      0     0    1    0       1
    Perfect
      adm       1   11    0      6     0    0    0      18
      arc2d     0    0    0      2     0    1    0       3
      bdna      0    0    0      2     5    3    1      11
      dyfesm    0    0    1      0    11    1    0      13
      flo52     0    0    0      0     4    0    0       4
      mdg       1    0    0      0     0    2    0       3
      ocean    10    0    0      0     1    0    0      11
      qcd       0    0    0      3     0    2    0       5
      spec77    2    3    6      0     0    0    0      11
      track     1    0    0      0     2    0    0       3
    Total      16   37   21     21    28   13    5     141

Table 3.2: Requirements of remaining parallel loops in Specfp95, Nas, and Perfect.

The remaining columns provide a count of how many of the loops could potentially be parallelized with a particular technique, defined as follows¹:

¹The difference in these results as compared to a previously published version [47] is mostly due to eliminating loops that only execute a single iteration. Parallelizing such loops is obviously not going to improve performance, and counting them skews the results. Further, the compiler found a few more parallel loops by turning on support for floating point induction variable recognition, and we corrected a few misclassifications.

• ST: This category includes loops that could potentially be parallelized with existing static analysis techniques. Two loops could be parallelized with range analysis [7]. The remaining loops contain nonlinear array subscript expressions that might be parallelized by the range test [7] or access region analysis using the LMAD [38].

• CF: Identifies loops for which parallelization analysis fails because of control flow within the loop. The control flow tests in these loops are based only on scalar, loop-invariant variables, so the control flow paths that would result in a dependence can potentially be ruled out at compile time by associating predicates with array data-flow values during analysis and comparing predicates on reads and writes. While not in common practice, a few techniques refine their array data-flow analysis results in this way [21, 22, 35, 36, 50].

• BC: Identifies certain loops whose safe parallelization depends on values of variables not known at compile time. For the loops in this category, it is straightforward to derive breaking conditions by extracting constraints on dependences directly from the dependence and privatization tests [17, 18, 41, 42].

• CF+BC: Identifies loops that require both breaking conditions and analysis taking control flow into account. In some cases, derivation of breaking conditions is not as simple as extracting them directly from the dependence test. In two loops, run-time testing is needed because dependences arise only from certain control flow paths and the control flow test involves array variables.

• OD: Identifies loops with only output dependences, for which SUIF's analysis was unable to determine how to finalize their values. The loops are parallelizable with some run-time assistance to determine the iteration of the last write of each location.

• CI: Identifies loops that carry dependences only along certain control paths, but where the control flow test is based on array values.
Such loops can be parallelized with a simple inspector/executor that only determines the control flow paths taken through the loop (a control inspector).

• IE: Identifies loops that can probably only be parallelized with an inspector/executor model [39, 43]. These loops contain potential dependences on arrays with subscript expressions that include other arrays (i.e., index arrays).

• DD: Identifies loops where an inspector/executor model is probably not even suitable, because they contain dependences that occur only under certain control flow paths through the loop, and the control flow tests are based on loop-varying variables within the loop. The only approach we know of that could parallelize such loops is a speculative inspector/executor model, where the loop is parallelized speculatively and the inspector is run concurrently with executing the loop [43].

The eighteen programs contain a total of 141 additional parallelizable loops found by the ELPD test. (Note that this number includes only loops that were executed at run time and that were not nested inside SUIF-proven parallel loops. Also, some of the loops found by the ELPD test are nested inside other ELPD-proven parallelizable loops; that is, not all loops found are outermost coarse-grain loops.) Because of the significant differences in the results for the three benchmark suites, we consider each one individually.

3.4.5.1 Specfp95 Benchmarks

For three of the four programs with remaining parallel loops, the additional parallelism appears to be significant. For mgrid, the additional parallelism found by the compiler has little effect on either coverage or granularity, suggesting that parallelizing these loops will have little effect on performance. In apsi,
the increased coverage is about 22%, increasing the maximum possible speedup from 4.5 to 500, according to Amdahl's Law; granularity also increases. Apsi has 17 additional loops; the keys to parallelizing this program are to take control flow tests into account during analysis and to derive simple run-time tests. All but one of its 17 additional loops are in the CF and CF+BC categories. Many of the loops in the CF column have compiler-assumed dependences on scalar variables only. Su2cor, which demonstrates a three-order-of-magnitude increase in granularity, has 19 loops spanning several of the applicable techniques. More than half of the loops in su2cor require run-time testing for correct parallelization, and nearly half can be parallelized by taking control flow into account. Wave5 has two large loops that require analysis that incorporates control-flow tests and introduces some run-time testing. In the work presented in Chapter 4, we have extended the analysis in SUIF such that most of the loops in these three programs are now parallelized automatically [35, 36]. We have demonstrated significant improvement in their performance on 4 processors of an SGI Origin 2000 and 4 and 8 processors of a Digital AlphaServer 8400 [35].

3.4.5.2 Nas Benchmarks

Three of the four programs with remaining parallelism opportunities show little impact on coverage or granularity. Only fftpde appears to benefit from parallelizing its additional loops. The dramatic increase in coverage and granularity for the fftpde benchmark - a 9% increase in coverage and more than an order of magnitude increase in granularity - can be realized by parallelizing three additional loops, all of which require taking control flow into account. Two of the loops also include nonlinear subscript expressions.
Hand-parallelization experiments show impressive speedups when parallelizing the outermost of these loops [3].

3.4.5.3 Perfect Benchmarks

The results for Specfp95 and Nas show only a few programs with remaining parallelism opportunities, and most of the significant remaining loops fall into just a few categories. The Perfect benchmarks exhibit some remaining parallelism opportunities in most of the programs, and suggest that a broader range of techniques is required.

For the ten programs in Perfect with additional parallelism opportunities, coverage increases significantly for three of the programs, adm, mdg, and ocean. Granularity increases by an order of magnitude or more in only three programs, arc2d, mdg, and qcd. Reasonably significant improvements in coverage and granularity are possible in dyfesm. For four of the ten programs, bdna, flo52, spec77, and track, it appears the additional parallelism will have little or no impact on performance.

We consider first the three programs with dramatic increases in coverage. Adm is the same program as apsi, with one additional loop that is parallel under the run-time values arising in adm but not in apsi. The main loop in mdg requires an inspector/executor. Tu describes how to parallelize this with a pattern-matching approach [49], but we know of no technique to parallelize the loop that is both generally applicable and practical. Most of the loops in ocean have nonlinear subscript expressions. To exploit the more than four orders of magnitude increase in granularity for qcd requires techniques to take control flow into account.
The increased granularity for arc2d involves two loops that could be parallelized by taking control flow into account, and one loop that we categorize as requiring an inspector/executor because subscript expressions access index arrays; however, a special-case array element symbolic analysis could parallelize this loop, as done in Polaris [49]. To obtain the more modest improvements in coverage and granularity, dyfesm requires parallelizing the largest loop with an inspector/executor. Many other loops can be parallelized with output dependence transformations, and one loop requires a simple breaking condition.

Even though there are six programs in Perfect that appear to have significant remaining parallelism opportunities, we observe that improved speedups on a coarse-grain multiprocessor such as our SGI Origin 2000 are only likely in at most three programs: adm, arc2d, and mdg. The coverage for qcd is so low that its ideal speedup is at most 1.7. The reason for qcd's low coverage is that the inherently sequential portion of the computation is an initialization phase using a random number generator; results on qcd would improve if a parallel random number generator were used. Although they have reasonably high coverage, the granularity measurement for both ocean and dyfesm is too low for profitable parallelization. Speedups on arc2d are limited to 3 or 4 on these outer coarse-grain loops because they have just this many iterations; only by exploiting multiple levels of parallelism can the compiler achieve higher speedups for arc2d.

3.4.5.4 Summary

If we look at the seven categories, two describe loops that could be parallelized statically, ST and CF, making up 53 of the total loops (and this number is only so large because apsi and adm are the same program).
For loops in the five remaining categories, making up more than 60% of the unparallelized loops, some sort of run-time testing is required to verify the safety of parallelization, or, in the case of the OD category, to correctly transform the parallel loop. But rather than always reverting to a potentially expensive inspector/executor model, we see that in 70 of the 88 loops requiring run-time testing, a less expensive and more directed run-time test can potentially be derived. We also observe that taking control flow tests into account in analysis is very important, required for seven of the ten programs for which remaining parallelism appears to be significant.

These results indicate that there is still some room for improvement in automatic parallelization in two areas: incorporating control flow tests into analysis and extracting low-cost run-time tests wherever applicable instead of using an inspector/executor. Inspector/executor still appears to be a necessary tool for Perfect: it is needed to parallelize mdg and dyfesm.

3.4.6 Evaluating ELPD

To understand how well the ELPD test pinpoints available parallelism, we now consider the set of candidate loops that ELPD indicated were not parallelizable. Using ELPD to identify which variables were involved in dependences, we derived the results in Table 3.3. The table presents a count of the loops that fit into each of the following set of categories:

• CD: Identifies loops that have complex dependences, and cannot be parallelized in any straightforward way.

• DA: Identifies loops with doacross parallelism, which we define as loops where all dependences have a distance in iterations that is a compile-time constant. Some of these loops could be parallelized with parallel prefix operations.
Others require communication of values from one iteration to the next and are unlikely to benefit from parallelization.

• AD: Identifies loops that only contain anti-dependences. Such loops can easily be parallelized (although not necessarily profitably) by making a copy of the array involved in the anti-dependence prior to the loop's execution, and using the copy of the array to provide input values during the loop's execution.

• RED: Identifies loops with reduction computations not recognized by SUIF. The reductions in this category are all performing MIN or MAX reductions, but what needs to be returned from the operation is not just the minimum or maximum value, but also some other data such as the element's index. This additional data is what prevents SUIF from recognizing these reduction operations. By combining reduction recognition with analysis of control flow related to the reduction computation, the compiler should be able to parallelize such operations by sequencing the final updates of local values.

• CI: This category considers any loop that is parallelizable under some control flow path, and the desirable control flow path occurs for some but not all of the loop's invocations. A control inspector is needed to determine which control flow path is taken at run time.

• IND: Identifies loops with induction variables that are not recognized by SUIF's induction variable analysis. Such induction variables, if not privatized and rewritten by the compiler, will be reported as scalar dependences. There were two common cases in this category. The first case involves induction variables that are incremented only under certain control flow paths. The second case occurs in multiply nested loops, where the induction variable is incremented in an innermost loop, and the upper bound for the innermost loop varies for different invocations of the loop. In either case, such loops might be parallelized using an inspector to produce the initial value of the induction variable for each processor prior to execution of the loop.

• BC: Similar to breaking conditions in the previous subsection, this category represents a loop where a breaking condition can be derived from dependence testing, and the loop is parallel under some but not all invocations.

• LS: Identifies loops for which some of the iterations can be parallelized using loop splitting. Opportunities for parallelizing such loops might be detected by analyzing the control flow tests inside the loop or considering the effect of peeling the first or last iteration of the loop.

    Program    CD   DA   AD  RED   CI  IND   BC   LS  Total
    Specfp95
    applu       0   12    0    0    0    0    0    0     12
    apsi        2    9    0    2    0    0    0    0     13
    hydro2d     0    2    0    2    0    1    0    0      5
    mgrid       3    0    2    3    0    0    0    0      8
    su2cor      7    9    1    0    0    0    0    2     19
    tomcatv     0    2    0    0    0    0    0    0      2
    wave5       8    5    0    0    2    0    0    1     16
    Nas
    applu       0   12    0    0    0    0    0    0     12
    buk         2    2    0    0    0    0    0    0      4
    cgm         3    5    0    0    0    1    0    0      9
    fftpde      0    2    0    0    0    1    0    0      3
    mgrid       6    3    2    3    0    0    0    0     14
    Perfect
    adm         2    9    0    2    0    0    0    0     13
    arc2d       0    9    0    0    0    0    0    0      9
    bdna        1   10    0    0    0    7    0    0     18
    dyfesm      5    7    1    0    0    1    0    0     14
    flo52       0    4    0    2    0    0    0    0      6
    mdg         2    3    0    0    0    0    0    0      5
    mg3d        5    1    1    0    0    0    0    2      9
    ocean       6    1    0    1    0    9    0    0     17
    qcd         0   26    0    0    0    5    0    4     35
    spec77      2   14    0    2    0    0    0    1     19
    track       0    2    1    0    0    0    1    0      4
    trfd        0    0    0    0    0    0    0    3      3
    Total      54  149    8   17    2   25    1   13    269

Table 3.3: Characteristics of non-parallel loops in Specfp95, Nas, and Perfect.

The first category CD represents loops that would be difficult to parallelize.
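By contrast, the AD transformation described above is mechanical. The following hypothetical Python sketch (an illustration, not SUIF output) shows how copying the array before the loop breaks an anti-dependence:

```python
def serial(a):
    # anti-dependence: iteration i reads a[i+1], which iteration i+1 overwrites
    for i in range(len(a) - 1):
        a[i] = a[i + 1] * 2
    return a

def transformed(a):
    # copy the array before the loop; all reads are served from the copy,
    # so the iterations no longer depend on each other and may run in parallel
    old = list(a)
    for i in range(len(a) - 1):
        a[i] = old[i + 1] * 2
    return a
```

Because the dependence is only an anti-dependence (a read that an earlier-scheduled iteration must complete before a later write), the pre-loop copy preserves the sequential semantics; the extra array is the overhead that can make this transformation unprofitable.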
For all remaining categories, we believe it is possible for the compiler to generate parallel code with extensions to its existing analysis, although in many cases, the code transformation required may result in slower code than the sequential version. We have organized the categories according to what we anticipate would introduce the most overhead, with the categories at the end representing the more promising opportunities to extend the compiler. We find 269 instrumented loops that are reported as not always parallelizable by ELPD.

This table reflects two important results. First, we can further extend compilation techniques to parallelize even some loops that are identified as not parallelizable by the ELPD test. Again we see that combining run-time support with compile-time analysis could benefit in many of these cases, applying to the categories AD, RED, CI, IND, BC, and some of the loops in DA. Unlike our previous categorization, we have no dynamic information about these loops, so we cannot discuss the relative importance of these techniques.

Second, the results in the table point to some limitations in the ELPD test, particularly in identifying as dependences certain induction variables and reductions missed by SUIF's analysis. We can also consider whether the LRPD test (which incorporates some testing for reductions) would be more successful at identifying parallelizable loops. Because they involve multiple variables, LRPD would not be able to identify the reductions in the RED category. It is possible that LRPD is applicable to some of the loops in the CD category.

3.5 Chapter Summary

This chapter has presented the results of an important experiment that addresses the question of whether there are remaining opportunities for improving automatic parallelization systems for a set of 29 programs from three benchmark suites.
The model of available parallelism recognized by the experiment requires that a loop's iterations access independent memory locations, possibly after privatization of array or scalar data structures. Using this model, we have identified all the remaining loops not parallelized by the SUIF compiler for which parallelization is safe.

To perform this experiment, we developed an implementation extending the automatic parallelization system that is part of the Stanford SUIF compiler. We have augmented the LPD test, a run-time parallelization testing technique that is similar to an inspector/executor run-time approach. Our ELPD test allows run-time parallelization on entire programs, across procedure boundaries and multiple loops in a nest. We have developed interprocedural instrumentation analyses to avoid unnecessary instrumentation of loops nested inside already parallelized loops.

The results of this experiment have several key implications. We observe that there is still some room for improvement in automatic parallelization, particularly in two areas: incorporating control flow tests into analysis and extracting low-cost run-time tests from analysis results. Our experiments suggest that high-overhead inspector/executor run-time tests, which perform work and introduce extra storage on the order of the size of the arrays accessed in the loop, are unnecessary for most of the loops missed by the compiler. Instead, lower-cost run-time tests that focus directly on the cause of the dependence should be derived automatically by the compiler and used to guard conditionally parallel versions of the program. The results of this experiment motivated us to develop a new analysis technique called predicated data-flow analysis [35, 36], which will be presented in the next chapter.
As part of this experiment, we also considered the loops that were identified as not being parallelizable, ones that do not fit the model of parallel loops assumed by the ELPD test. Several of these loops are inherently parallel, pinpointing some of the limitations of the ELPD test. Of particular significance are loops containing induction variables that are conditionally incremented, and reductions involving multiple variables. Through this exercise, we have identified further opportunities to enhance automatic parallelization technology, most of which also require focused run-time testing or better analysis support in the presence of control flow.

Chapter 4

Predicated Array Data-Flow Analysis

As evidenced by the results of the experiments presented in the previous chapter, the static nature of current parallelization analysis severely limits its effectiveness, basing optimization decisions solely on knowledge provable at compile time. As traditional compilers perform optimization based on conservative data-flow analysis results, which are guaranteed to be safe for all control flow paths taken through a program and all possible program inputs, they typically produce only a single optimized version of a computation.

In this chapter, we refine traditional parallelization analysis to instead derive optimistic data-flow values guarded by predicates. Our approach, which we call predicated array data-flow analysis [35, 36], associates a predicate with each array data-flow value. Analysis interprets these predicate-value pairs as describing a relationship on the data-flow value when the predicate evaluates to true. This analysis is specifically formulated to increase the effectiveness of parallelizing compilers in the following two distinct ways.
• Increases precision of compile-time analysis. It enables the compiler to rule out data-flow values arising from control flow paths that cannot possibly be taken, or to derive more precise data-flow values by combining them with their associated predicates.

• Derives low-cost run-time tests. It enables optimizations that cannot be proven safe at compile time, either because safety is input dependent or simply because of limitations in compile-time analysis. The compiler transforms a code segment optimistically, producing multiple versions of a computation with each version's execution guarded by an inexpensive run-time test derived from analysis. Thus, this technique can potentially parallelize all loops in the CF, BC, and CF+BC categories in Table 3.2, the bulk of the remaining unparallelized loops.

The next two sections briefly overview our approach and present several examples that show the capabilities of predicated array data-flow analysis. We describe how SUIF's existing interprocedural array data-flow analysis must be modified to support predicated array data-flow analysis. We also provide detailed treatment of predicate embedding and extraction, distinguishing features of our approach. We have implemented predicated array data-flow analysis in the Stanford SUIF compiler by extending its previous array data-flow analysis implementation. We discuss some features of our implementation. The experimental evaluation of our implementation is presented in the next chapter.

4.1 Overview

Traditional data-flow analysis computes what is called the meet-over-all-paths (MOP) solution, which approximates data-flow values based on the conservative assumption that all control-flow paths through a program may be taken.
On the other hand, predicated data-flow analysis integrates an existing data-flow analysis with an analysis on a predicate lattice to eliminate the need for many of these approximations. At the same time, analysis maintains optimistic data-flow values guarded by predicates. Analysis uses these predicated data-flow values to derive run-time tests guaranteeing the safety of parallelization for loops that cannot be safely parallelized at compile time.

To make these points more concrete, consider an analysis that derives predicates holding at a statement s by forming the conjunction of control-flow tests along all paths reaching s from the beginning of the program (or from program end, if solving a problem that analyzes backward data flow). Predicates incorporating control-flow tests allow the compiler to improve data-flow information by ruling out certain control flow as impossible. This point is illustrated by the example in Figure 4.1(a). The two control-flow tests are identical, so any path through the loop body that includes execution of the statement help[j] = ... must also execute the statement ... = help[j]. Incorporating the control-flow information into the data-flow analysis, the compiler can prove at compile time that data-flow values corresponding to executing one but not the other statement within an iteration of the loop are infeasible.

Considering only feasible data-flow values, our analysis can produce results of two kinds:

• A meet-over-feasible-predicates (MOFP) solution eliminates data-flow values from the traditional meet-over-all-paths (MOP) solution that the compiler can prove are infeasible. Through the MOFP solution, the compiler can improve the precision of static, compile-time analysis.
• A value-to-predicate (VPS) solution of a given analysis, when given a particular (optimistic) desirable data-flow value for the original problem, produces a predicate that, when true at run time, guarantees the correctness of the value. The VPS solution is derived from the MOFP solution, and can be applied by the compiler to derive run-time tests guarding the safety of executing a particular optimized body of code.

Related to these solutions are operators that use values from the predicate domain to enhance the data-flow values in the original problem domain, and vice versa.

• Predicate embedding applies a predicate to the data-flow value it guards to produce a refined data-flow value.
The power of the predi cated analysis paradigm, when applied to parallelization analysis, is that it unifies these existing techniques in a single framework, and at the same time, introduces an im portant new capability - deriving low-cost run-time parallelization tests - to significantly increase the parallelism exposed by the compiler. 73 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.2 Capabilities of Array Data-Flow Analysis Before presenting the predicated array data-flow analysis framework, we first illus trate how predicated array data-flow analysis uses the MOFP and VPS solutions and predicated embedding and extraction with examples loosely based on code from our experiments. The MOFP solution and predicate embedding technique are used in our analysis to improve the precision of compile-time information, while the VPS solution and predicate extraction derive predicates that guard refined compile-time information and subsequently can be used as run-time tests. 4.2.1 Improving Com pile-Tim e Analysis 4.2.1.1 M O FP Solution For the example code in Figure 4.1(a), traditional data-flow analysis determines that there may be an upwards-exposed use of array help because there is a possible control-flow path through the loop that references help but bypasses the preceding assignment to it. An MOFP solution could discover that the predicates for the assignment and reference of help axe equivalent; thus, none of array help is upwards exposed and the loop can be privatized statically. 4.2.1.2 Predicate Em bedding In examples such as in Figure 4.1(b), most compilers would assume that the ele ment help[0] is upwards exposed because the loop assigns to only help[l : d] but 74 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. for i = 1 to c for j = 1 to d if (x > 5) th e n help{j] = ... for i = 1 to c for j = 1 to d help[j] = ... 
en d fo r for j = 1 to d if (x > 5) th e n ... = helpfj] endfor for j = 1 to d if (j = 1) th e n ... = help[j] else en d fo r ... = help[j-l] endfor endfor endfor (a) Benefits from MOFP solution (b) MOFP solution benefits from predicate embedding Figure 4.1: Improving compile-time analysis with predicated data-flow analysis. it possibly references help[j — 1] and j ranges from 1 to d. But observe that the data dependence and array data-flow analyses, particularly if based on integer linear programming techniques, make use of integer constraints on scalar variables used in array accesses to determine whether two accesses in different iterations can refer to the same memory location. The compiler can utilize the constraint (j > 1) inside the else branch of the second j loop to prove that help[0] is not accessed by the loop, and, as a result, help can be safely privatized and the loop parallelized. Predicated data-flow analysis that incorporates a predicate embedding operator can derive this result and parallelize the loop at compile time. 75 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. for i = 1 to c help[i] = help[i+m] en d fo r for i = 1 to c for j = 1 to d if (x > 5) th e n helpjj] = ... en d fo r for j = 1 to d if (x > 2) th e n • • • = help[j] endfor endfor (a) Breaking condition (b) Benefits from VPS solution for i = 1 to c for j = 2 to d helpp-1] = . . . helpQ] = . . . endfor for j = 1 to d . . . = help[j] endfor endfor (c) Benefits from predicate extraction Figure 4.2: Using predicated array data-flow analysis in run-time parallelization. 76 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
(a) From Figure 4.2(a):

    if (m > c or m = 0) then
        forall i = 1 to c
            help[i] = help[i+m]
    else
        for i = 1 to c
            help[i] = help[i+m]

(b) From Figure 4.2(b):

    if (x < 5) then
        forall i = 1 to c
            /* privatization not needed */
    else
        forall i = 1 to c
            local priv_help[]

(c) From Figure 4.2(c):

    if (d > 2) then
        forall i = 1 to c
            local priv_help[]
    else
        for i = 1 to c

Figure 4.3: Parallelized versions of examples from Figure 4.2.

4.2.2 Deriving Low-Cost Run-Time Tests

4.2.2.1 Breaking Conditions on Data Dependences

In some cases, it is safe to parallelize a loop only for certain input data. A few researchers have considered how to derive breaking conditions on data dependences, conditions that would guarantee a data dependence does not hold at run time [18, 42]. In the example in Figure 4.2(a), the loop is parallelizable under the run-time condition ((m > c) or (m = 0)), as illustrated by the parallelized version in Figure 4.3(a). Predicated data-flow analysis that incorporates a predicate extraction operator can derive this predicate and use it to guard execution of a conditionally parallelized loop at run time. Further, breaking conditions derived from predicate extraction can be propagated during analysis and used in conjunction with computing the MOFP solution described above to refine the precision of compile-time analysis.

4.2.2.2 VPS Solution

The VPS solution can go beyond previous work and be used in a more aggressive way, to enable optimizations that may only be performed conditionally. The VPS solution can provide run-time evaluable tests to guard execution of conditionally transformed code, as would be useful in Figure 4.2(b). Here, help is upwards exposed for certain values of x. Deriving predicates during dependence and privatization testing on predicated data-flow values leads to the appropriate run-time test for this loop.
First, we consider whether a dependence exists on help, which occurs if there is an intersection of read and write array regions and their predicates both hold. Thus, there is a dependence only if both (x > 2) and (x > 5) hold, which is equivalent to (x > 5). Thus, if x < 5, the loop can be parallelized as written. Second, we compare the upwards-exposed read array regions and the write array regions to determine if privatization is safe, and discover that array help is privatizable if x > 5, the only condition where an upwards-exposed read intersects with a write. These two cases enable the compiler to parallelize all possible executions of the loop. The two parallelized versions of the loop are shown in Figure 4.3(b).

4.2.2.3 Predicate Extraction for Optimistic Assumptions

A distinguishing feature of our approach is the use of predicate extraction to introduce reasonable assumptions by the compiler to guard much more precise (i.e., optimistic) data-flow values. An example illustrating this point is the code excerpt in Figure 4.2(c). In this example, help[1] may be upwards exposed if d < 2, since in this case the first loop containing the writes to help would not execute. Predicate extraction can be used to derive the condition (d < 2) as a run-time test to determine whether to privatize help or leave it as written. Both versions of the loop can then be parallelized, as shown in Figure 4.3(c).

4.3 Predicated Array Data-Flow Analysis

Predicated array data-flow analysis extends the previous analysis by associating boolean-valued predicates with array data-flow values. At a given program point, the pair (p, v) describes a relationship on the data-flow value v that holds when the predicate p evaluates to true at the current program point.
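As a concrete illustration, the dependence reasoning from Figure 4.2(b) can be expressed over such predicate-value pairs. The following is a hypothetical Python sketch (predicates as evaluable strings, array regions as index sets; not the SUIF representation):

```python
# guarded array regions for the loop body of Figure 4.2(b), taking d = 10:
# help[1:d] is written when x > 5 and read when x > 2
writes = [("x > 5", set(range(1, 11)))]
reads = [("x > 2", set(range(1, 11)))]

def dependence_tests(writes, reads):
    # a dependence needs an intersecting write/read region pair whose
    # guarding predicates hold simultaneously, so conjoin the two guards
    return [f"({wp}) and ({rp})"
            for wp, wr in writes
            for rp, rr in reads
            if wr & rr]

def parallel_as_written(x):
    # run-time test: the loop is parallel as written when no
    # dependence predicate can evaluate to true
    return not any(eval(t, {"x": x}) for t in dependence_tests(writes, reads))
```

The single derived test, (x > 5) and (x > 2), simplifies to (x > 5), matching the two-version code of Figure 4.3(b): parallel_as_written(3) is true, while parallel_as_written(7) is false and the privatizing version must run instead.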
4.3.1 Predicate Domain

Data-flow problems formulated on a join-semilattice (union or may problems) require a predicate representation conservative towards true; thus, such problems incorporate a predicate lattice ordered such that true < false. Data-flow problems formulated on a meet-semilattice (intersection or must problems) require a predicate representation conservative towards false, where the predicate lattice is ordered such that false < true. The predicate representation PR has associated operations ∧, ∨, ¬, false, true, and ≤, where x ≤ x. Operations may be inexact, but only in the appropriate direction for the given problem.

Since predicated data-flow analysis is designed to support run-time testing of predicates, all predicates must be represented by variables visible in the scope of the predicated data-flow value's associated program point. This feature makes it straightforward for the compiler to generate conditionally optimized code. During propagation of predicated data-flow values, analysis rewrites predicate variables from one scope to the outer enclosing scope. In particular, rewriting from a loop body to a loop, variables that vary within the loop body are treated as unknown and others are rewritten in terms of values of variables on entry to the loop. Rewriting from a procedure body to a procedure call involves replacing formal parameters with actual parameters and removing local variables from the called procedure.

The predicate domain we use in predicated array data-flow analysis combines two types of predicates: integer linear inequality constraints and arbitrary symbolic predicates representing run-time evaluable tests.
• Integer linear inequality constraints: The integer linear inequality constraints fit nicely with the system of integer linear inequalities used by SUIF's summary representation, enabling straightforward incorporation of predicate embedding and predicate extraction, and also enabling combining of predicates for simplification. While very useful, the integer linear inequalities are too restrictive to encompass all predicates that arise in practice.

• Arbitrary symbolic predicates: When predicates are used to derive guards for conditional optimization, it is not important for analysis to interpret the predicates since they are not being used in compile-time comparisons. For this purpose, hard-to-represent predicates are retained in an unevaluated form (as a symbolic constant) and used directly as run-time tests. During interprocedural propagation, the projection operator renames the variables in these symbolic predicates across scopes, but no other manipulation of the predicates is performed.

    if ((x > 10) || (x < 0)) then
        if (y > 5) then
            if (z % 2 = 0) then
                for i = 1 to n

Figure 4.4: Predicate representation example.

In our implementation, we represent general predicates in a disjunctive normal form (DNF):

    p = t1 ∨ t2 ∨ ... ∨ tn

where each ti is a conjunction of simple predicates, that is,

    ti = li ∧ si = (li1 ∧ li2 ∧ ... ∧ li,mi) ∧ (si1 ∧ si2 ∧ ... ∧ si,ni)

where the lij are integer linear inequality predicates and the sij are arbitrary symbolic predicates. For example, within the loop body in Figure 4.4, the original predicate is ((x > 10) ∨ (x < 0)) ∧ (y > 5) ∧ (z mod 2 = 0), but it is represented in our representation as follows:

    p = ((x > 10) ∧ (y > 5) ∧ (z mod 2 = 0)) ∨ ((x < 0) ∧ (y > 5) ∧ (z mod 2 = 0))

Here, (z mod 2 = 0) is a symbolic predicate, as it cannot be represented by an integer linear inequality.
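The DNF manipulation above can be sketched in a few lines of Python (a hypothetical illustration: each term is a set of simple-predicate strings, standing in for SUIF's per-term linear-inequality system):

```python
def p_or(p, q):
    # disjunction of two DNF predicates: concatenate the term lists,
    # dropping duplicate terms
    return p + [t for t in q if t not in p]

def p_and(p, q):
    # conjunction distributes over the disjuncts:
    # (t1 v t2) ^ u = (t1 ^ u) v (t2 ^ u)
    out = []
    for t in p:
        for u in q:
            term = t | u  # merge the conjuncts of the two terms
            if term not in out:
                out.append(term)
    return out

# the predicate of the loop body in Figure 4.4:
# ((x > 10) v (x < 0)) ^ (y > 5) ^ (z mod 2 = 0)
branch = [frozenset({"x > 10"}), frozenset({"x < 0"})]
rest = [frozenset({"y > 5", "z mod 2 = 0"})]
dnf = p_and(branch, rest)  # two terms, as shown in the text
```

Distributing the conjunction is what produces the two terms of the representation above, each of which carries its own copy of (y > 5) and (z mod 2 = 0).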
The choice of DNF for the predicate representation is simply because SUIF provides an effective data structure to represent a conjunction of integer linear inequalities. Therefore, a conjunction of integer linear inequality predicates is represented with a single data structure for integer linear inequalities. In fact, each of ((x > 10) ∧ (y > 5)) and ((x < 0) ∧ (y > 5)) in the above example is represented by a single representation.

4.3.2 Deriving Predicates

The derivation of the predicates is in some sense independent of the array data-flow analysis, so the predicates may be derived any number of ways. Prior to performing the predicated array data-flow analysis, predicates can be computed from control-flow tests and from the solution of other data-flow problems. Additional predicates can be extracted during predicated array data-flow analysis.

4.3.2.1 Control-Flow Predicates

We derive control-flow predicates by solving a separate data-flow problem, shown in Figure 4.5, prior to the array data-flow analysis. Given an analysis problem and a region in the control-flow graph, the edge predicate mapping C gives, for each control-flow edge in the region, a conservatively bounded condition under which the edge will be taken if the edge's source is reached during program execution. The mapping C can be obtained for branch edges by performing a data-flow analysis on branch

for each procedure P:
    if P is the root procedure in the call graph, PR(P) = true
    else PR(P) = false
for each procedure P from top to bottom over the call graph:
    let R0 be the region for the procedure body of P.
    PR(R0) = PR(P)
    for each region R in P from outermost to innermost:
        let SR0 be the root node in the subregion graph of R.
        PR(SR0) = PR(R)
        for each subregion SR in the subregion graph of R from top to bottom:
            if SR ≠ SR0, PR(SR) = ⋁ { C(e) : e ∈ incoming edges of SR }
            for each outgoing control-flow edge e of SR:
                if e is an unconditional branch edge, C(e) = PR(SR)
                if e is the true branch of a condition p, C(e) = PR(SR) ∧ p
                if e is the false branch of a condition p, C(e) = PR(SR) ∧ ¬p
            if SR is a procedure call c invoking procedure P',
                PR(P') = PR(P') ∨ Translate_c(PR(R))

Figure 4.5: Algorithm for deriving control-flow predicates.

conditions; an edge contributes a true predicate if it is an unconditional branch edge, the predicate PR(p) if it is the true branch of a condition, or the predicate PR(¬p) if it is the false branch of a condition. Similarly, multiway branches such as case statements take on the predicate leading to their execution. Given the contribution of an individual edge, the analysis of predicates (for a forward data-flow problem) computes C(path) = ⋀ { C(edge) : edge ∈ path }.

Our implementation derives control-flow predicates through an interprocedural data-flow analysis. During the analysis, predicates that cannot be propagated across procedure boundaries due to scoping rules are conservatively approximated by the Translate function, which also takes care of parameter mappings. For example, suppose that a loop with a predicate ((x > 5) ∧ (y > 5)) has a call to a procedure proc, and x is passed to proc, where it is mapped to a formal parameter a, as shown in Figure 4.6. Then, (x > 5) is propagated to proc as a translated predicate (a > 5). However, since y is not visible in proc, (y > 5) is approximated to true by the Translate function. Therefore, the resulting predicate that is propagated to proc becomes ((a > 5) ∧ true) = (a > 5).
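The Translate approximation just illustrated can be sketched as follows. This is a hypothetical Python sketch: the function name, the string encoding of predicates, and the variable scan are invented for illustration, not SUIF's implementation.

```python
import re

def translate(conjuncts, param_map, callee_visible):
    """Rewrite a conjunction of predicates across a call boundary: caller
    variables are renamed to callee formals via param_map, and any conjunct
    mentioning a variable with no visible counterpart in the callee is
    approximated to true, i.e. simply dropped from the conjunction."""
    translated = []
    for c in conjuncts:
        mentioned = set(re.findall(r"[A-Za-z_]\w*", c))
        if mentioned <= set(param_map) | callee_visible:
            for actual, formal in param_map.items():
                c = re.sub(rf"\b{actual}\b", formal, c)
            translated.append(c)
        # otherwise the conjunct becomes true and contributes nothing
    return translated

# Figure 4.6: (x > 5) and (y > 5); x maps to formal a, y is not visible.
assert translate(["x > 5", "y > 5"], {"x": "a"}, set()) == ["a > 5"]
```

Dropping a conjunct weakens the whole conjunction toward true, which is the safe direction for the may problems that consume these control-flow predicates.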
When a procedure is invoked at several distinct call sites with different predicates, the resulting predicate of the callee should combine (∨ in our framework) the predicates from all callers. In this case, the callee's predicate can become too conservative. However, this approximation can be avoided using procedure cloning [10].

if ((x > 5) && (y > 5)) then
    call proc(x)

subroutine proc(a)

Figure 4.6: Predicate propagation across procedure boundary.

4.3.2.2 Predicate Extraction

In some cases, predicates are directly extracted from array data-flow values, such as with the derivation of breaking conditions described in Section 4.2.2. We have found this analysis technique particularly useful to deal with boundary cases such as the example in Figure 4.2(c). Here, the loop that writes A is only executed if d ≥ 2, while the loop that reads A is executed if d ≥ 1. The array data-flow value will contain the constraint (d ≥ 2) in the former case and (d ≥ 1) in the latter. By extracting this differing constraint from the array data-flow values for the read and write during subtraction, analysis can derive the predicate-value pair (1 ≤ d < 2, A[1]) for the exposed read.

Predicate extraction also gives the compiler an opportunity to introduce assumptions that increase the precision of the data-flow value. Whenever such assumptions can be described by run-time evaluable tests on variables in the current scope, predicate extraction can be used for this purpose.

double precision c(*), ch(*)
call radf4(1, n/4, c, ch, ...)

subroutine radf4(ido, l1, cc, ch, ...)
double precision cc(ido,l1,4), ch(ido,4,l1)

Figure 4.7: Reshape examples benefitting from predicate extraction.

The default data-flow values are also
introduced in the event that the predicate evaluates to false at run time. In addition to the example from Figure 4.2(c), another example arose in our experiments, involving a very complicated array reshape in apsi, shown in Figure 4.7. Here, 1-dimensional arrays c and ch are passed to 3-dimensional formals cc and ch. Given that radf4 always writes ch[1:ido, 1:l1, 1:4], the M region at the call can be translated to the predicate-value pair (n mod 4 = 0, ch[1:n]) (otherwise, ch[1:n] can be added to the W region).

These kinds of extracted predicates distinguish our work from any previous treatment of predicated data-flow values. Predicate extraction allows the compiler to assume the common case and greatly enhance the precision of its results, whenever the common case can be verified with inexpensive run-time tests.

4.3.3 Extensions to Array Data-Flow Analysis Operations

The formulation of array data-flow analysis is changed in only a few ways by the incorporation of predicated data-flow values. Given that the previous analysis already represents summaries with a set of array regions, it is straightforward to augment each array region with a predicate.

• In the set of array regions M, which represents the solution to an intersection problem, a pair (p, v) indicates that if p holds, the array region v is guaranteed to be written.

• In the sets R, E, and W, which represent solutions to union problems, a pair (p, v) indicates that if predicate p holds, array region v may possibly be accessed. In other words, if predicate p does not hold, array region v is guaranteed not to be accessed.

In addition to the introduction of predicates to the data-flow values discussed in the previous section, a few operations on the data-flow values must be redefined for predicated data-flow analysis.
PredMerge((x > 2, 0 ≤ d ≤ 50), (x > 2, 50 ≤ d ≤ 100)) = (x > 2, 0 ≤ d ≤ 100)
PredMerge((x > 2, 0 ≤ d ≤ 100), (x < 3, 50 ≤ d ≤ 100)) = {(true, 50 ≤ d ≤ 100), (x > 2, 0 ≤ d < 50)}
PredMerge((x > 2, 0 ≤ d ≤ 50), (x > 2, 60 ≤ d ≤ 100)) = {(x > 2, 0 ≤ d ≤ 50), (x > 2, 60 ≤ d ≤ 100)}

Figure 4.8: PredMerge operation examples.

4.3.3.1 Meet Function

The meet function (∧), as shown in Figure 2.3, is union for the R, E and W sets, and intersection for the M set. For predicated data-flow analysis, the meet function is modified to:

a ∧ b = (Ra ∪ Rb, Ea ∪ Eb, Wa ∪ Wb, Ma ∪ Mb)

We can safely use union even on the M set because each (p, v) pair guarantees that access v occurs if p holds.

4.3.3.2 Union Operations

The implementation of union concatenates two sets, and subsequently merges redundant regions in the set, similar to its previous implementation. The merge operation PredMerge augments the existing Merge operation. It relies on an auxiliary function CanMerge that returns true if and only if the two regions are either adjacent, one is contained in the other, or their bounds differ in only one dimension. PredMerge iteratively merges predicated data-flow values in two sets of array regions. PredMerge also eliminates any data-flow values of the form (false, v) and (p, ∅). The result of merging a pair of predicated data-flow values (p1, v1) and (p2, v2) is defined as follows:

PredMerge((p1, v1), (p2, v2)) =
    {(p1, Merge(v1, v2))},           if (p1 = p2) ∧ CanMerge(v1, v2)
    {(p1 ∨ p2, v2), (p1, v1 − v2)},  if v2 ⊆ v1
    {(p1 ∨ p2, v1), (p2, v2 − v1)},  if v1 ⊆ v2
    {(p1, v1), (p2, v2)},            otherwise

Figure 4.8 presents PredMerge operation examples. The first two examples show applications of the first two rules in the above equation. The third example corresponds to the last rule.
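The merge rules above can be sketched on one-dimensional regions as follows. This is a hedged, hypothetical Python illustration: regions are integer intervals (lo, hi), predicates are opaque strings, and predicate simplification is elided, so the disjunction in the second Figure 4.8 example stays unsimplified rather than collapsing to true.

```python
def can_merge(v1, v2):
    """True when the intervals are adjacent or overlapping."""
    return v1[1] >= v2[0] and v2[1] >= v1[0]

def contains(outer, inner):
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def interval_sub(v, w):
    """Pieces of v that lie outside w (0, 1 or 2 intervals)."""
    pieces = []
    if w[0] > v[0]:
        pieces.append((v[0], min(v[1], w[0])))
    if w[1] < v[1]:
        pieces.append((max(v[0], w[1]), v[1]))
    return pieces

def pred_merge(a, b):
    (p1, v1), (p2, v2) = a, b
    if p1 == p2 and can_merge(v1, v2):                       # rule 1
        return [(p1, (min(v1[0], v2[0]), max(v1[1], v2[1])))]
    if contains(v1, v2):                                     # rule 2
        return [(f"({p1}) or ({p2})", v2)] + [(p1, r) for r in interval_sub(v1, v2)]
    if contains(v2, v1):                                     # rule 3
        return [(f"({p1}) or ({p2})", v1)] + [(p2, r) for r in interval_sub(v2, v1)]
    return [a, b]                                            # otherwise
```

Running the three Figure 4.8 inputs through this sketch reproduces the shapes shown in the figure, modulo the unsimplified predicate in the second case.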
The PredMerge operation appears to sometimes yield more precise results than that of Gu, Li and Lee when the second or third term applies, i.e., when one of the regions is strictly contained in the other. Guarded array data-flow analysis does not merge unless the regions are identical [22]. Because predicates must sometimes be approximated during analysis, such as when summarizing information at a loop or crossing procedure boundaries, early merging avoids loss of information when analysis must approximate predicates before they are used in subtraction or dependence tests. As a simple example, consider the case where p2 = ¬p1 and v2 ⊆ v1. The result of PredMerge is then {(true, v2), (p1, v1 − v2)}. If p1 is later approximated, the compiler still retains that region v2 is unconditionally accessed.

4.3.3.3 Subtraction Operation

The composition function (∘) from Figure 2.3 performs a subtraction to calculate upwards exposed reads: a ∘ b = (Ra ∪ Rb, Ea ∪ (Eb − Ma), Wa ∪ Wb, Ma ∪ Mb). The existing subtraction operation in SUIF iterates over each read region e in Eb to determine if one or more written regions in Ma completely cover e. PredSubtract, which subtracts a single predicated region from another, is defined as follows:

PredSubtract((p1, v1), (p2, v2)) = (p1, s) ∪ (p1 ∧ ¬p2, v1 − s), where s = v1 − v2

For instance,

PredSubtract((x > 2, 0 ≤ d ≤ 100), (x < 4, 50 ≤ d ≤ 100)) = {(x > 2, 0 ≤ d < 50), (x ≥ 4, 50 ≤ d ≤ 100)}

PredSubtract((x > 2, 0 ≤ d ≤ 100), (x > 4, 40 ≤ d ≤ 60)) = {(x > 2, 0 ≤ d < 40), (x > 2, 60 < d ≤ 100), (2 < x ≤ 4, 40 ≤ d ≤ 60)}
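The PredSubtract definition can be sketched the same way. This is hypothetical Python, with one-dimensional regions as intervals and predicates as strings; the conjunction with the negated predicate is kept symbolic rather than simplified (e.g., to x ≥ 4 as in the first example of the text).

```python
def interval_sub(v, w):
    """s = v - w: pieces of v that lie outside w."""
    pieces = []
    if w[0] > v[0]:
        pieces.append((v[0], min(v[1], w[0])))
    if w[1] < v[1]:
        pieces.append((max(v[0], w[1]), v[1]))
    return pieces

def pred_subtract(a, b):
    """PredSubtract((p1,v1),(p2,v2)) = (p1, v1 - v2) u (p1 and not p2, v1 n v2)."""
    (p1, v1), (p2, v2) = a, b
    result = [(p1, piece) for piece in interval_sub(v1, v2)]
    overlap = (max(v1[0], v2[0]), min(v1[1], v2[1]))
    if overlap[0] < overlap[1]:
        result.append((f"({p1}) and not ({p2})", overlap))
    return result

# First example from the text: the part of the read outside the write stays
# exposed under p1; the overlapping part is exposed only when p2 fails.
out = pred_subtract(("x > 2", (0, 100)), ("x < 4", (50, 100)))
```

The overlapping piece carries the run-time testable predicate, which is exactly what the later dependence and privatization tests consume.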
ifV(p-.s) € S, Intersect(r, s) = false \ ~ 1 ( V (p,3 )£SA[ntersect(r% s) p ) ; O th erw ise = A(p,r) G H t (^P V FPS(r,W L)) A A < P tU ,)ewt h p V KP5(u;; WL)) = A (p .e ^ C -’ P V K P S M ^ ) ) _ [ U ^ i e ^ f e e ) , t/ PrivatizableL # false 1 0. otherwise Figure 4.9: Dependence and Privatization test on Predicated data-flow values. Following the subtraction operation, merge is performed to reduce the resulting number of regions. This calculation brings together predicates from an intersection problem (M set) and predicates from a union problem (E set). This combination is safe since predicates in M are conservative towards false, and we are negating them. Thus, the negation will be conservative towards true. 4.3.4 Predicated D ependence and Privatization Tests Figure 4.9 presents the predicated versions of the dependence and privatization tests from Section 2.2. We introduce a test Intersect related to the dependence and privatization tests presented previously to determine if a region intersects another region on different iterations. 92 Intersect(r, s) VPS(r, S) IndependentL Privatizablei Initializel Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. When modifying the tests for predicated array data-flow analysis, we make use of the Value-to-Predicate Solution ( VPS(u,S)) to compare a data-flow set S to a single array region r and provide a predicate guaranteeing that v does not intersect with any array regions in S. This predicate is derived from the predicates on all regions that intersect with r. If VPS(v,S) returns false, the analysis must assume that v always intersects with the array regions in S. If VPS(v,S) returns true, the compiler has proven v never intersects with the array regions in S. If neither true nor false, the result corresponds to a run-time evaluable predicate that may be either compared with other predicates at compile time or tested at run time. 
To improve the precision of the VPS solution, we may derive additional predicates in the form of breaking conditions from the constraints arising from intersecting two regions. When lower and/or upper bounds on the accesses to a dimension of an array cannot be compared, a breaking condition tests whether the lower bound of one access is greater than the upper bound of another access. For simplicity, consider a one-dimensional array example. Let [lb_r : ub_r] and [lb_w : ub_w] be read and write array access regions, and assume that more than one of the bounds is unknown at compile time. Then, the conventional parallelization test would conservatively assume possible overlapping of the two access regions. In predicated array data-flow analysis, during the parallelization test, the following run-time evaluable breaking condition is extracted that can guarantee non-overlapping of the two access regions:

BC = (lb_r > ub_w) ∨ (lb_w > ub_r) ∨ ((lb_r = lb_w) ∧ (ub_r = ub_w))

The third term in the above equation represents loop-independent dependences and requires additional checking that each iteration accesses the same element. Our combined compile-time and run-time approach to dependence and privatization testing presents a natural opportunity to incorporate breaking conditions into the other run-time tests derived by predicated array data-flow analysis. The conjunction of these breaking conditions with the predicates derived by the VPS solution during the intersection derives more precise run-time tests than with the VPS solution alone.

The dependence test uses the VPS solution to determine whether predicates guarding reads and writes of an array can be true simultaneously. As with the VPS solution, the dependence test returns either true, false, or a predicate that can be tested at run time to guarantee independence. The privatization test is similar.
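The breaking condition BC is cheap to evaluate once the bounds are known at run time. A minimal sketch (the function name is hypothetical, and the extra per-iteration check required by the third term is omitted):

```python
def breaking_condition(lb_r, ub_r, lb_w, ub_w):
    """True when the read region [lb_r:ub_r] and the write region
    [lb_w:ub_w] cannot carry a loop-carried dependence: disjoint in
    either direction, or with identical bounds (the loop-independent
    case, which still needs a per-iteration check)."""
    return (lb_r > ub_w) or (lb_w > ub_r) or (lb_r == lb_w and ub_r == ub_w)

assert breaking_condition(11, 20, 0, 10)      # read entirely above write
assert not breaking_condition(0, 15, 10, 19)  # genuine overlap: no guarantee
assert breaking_condition(5, 9, 5, 9)         # identical bounds
```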
The initialization computation returns the set of (p, e) elements from exposed reads where privatization for the current array is possible. The region described by e must be initialized if predicate p holds.

As an example, let's consider the loop in Figure 4.2(b). In this example,

R = (x > 2, 1 ≤ help1 ≤ d)
W = (x > 5, 1 ≤ help1 ≤ d)
E = (2 < x ≤ 5, 1 ≤ help1 ≤ d)

Then,

Independent = (¬(x > 2) ∨ ¬(x > 5)) ∧ (¬(x > 5) ∨ ¬(x > 5)) = x ≤ 5
Privatizable = ¬(2 < x ≤ 5) ∨ ¬(x > 5) = true
Initialize = (2 < x ≤ 5, 1 ≤ help1 ≤ d)

The dependence test compares R and W and determines that the loop is independent when x ≤ 5. After that, the privatization test proves that the array help is always privatizable by examining E and W. Although help is privatizable only when x > 5, the result is safe, as the privatization of array help is only performed when simple parallelization is not possible, that is, when x > 5. When applying array privatization, help[1 : d] needs to be initialized if 2 < x ≤ 5.

4.4 A Refinement of Predicated Data-Flow Values

Predicate embedding and predicate extraction involve translating between the predicate domain and the domain of data-flow values, with the goal of deriving more precise predicated data-flow values. To formalize this notion, let us first consider some predicate-value pair (p, v) that arises from predicated data-flow analysis as presented so far. Using predicate embedding or extraction instead derives a corresponding set of predicate-value pairs {(pi, vi)} where ∀i, vi ≥ v (the relationship between the pi and p is discussed below). The precision of the predicated data-flow value is refined by predicate embedding or extraction only if the following holds: ∃i, such that either vi > v or pi > p.

It is most desirable to use these operators to produce a more precise data-flow value (i.e., ∃i, vi > v). The intuitive explanation for this is that the dependence and privatization tests first perform an intersection between array regions, and examine predicates only if the intersection is non-empty; similarly, intermediate operations such as PredMerge and PredSubtract examine array regions first to determine if they overlap before considering predicates. Whenever a more precise data-flow value cannot be derived, it may be useful to improve the precision of the predicate (i.e., ∃i, pi > p). A more precise predicate can potentially lead to successful compile-time and run-time parallelization tests, as well as refining results of PredMerge and PredSubtract.

As a slight complication to this discussion, predicate embedding and extraction derive pi that are the same or closer to false than p. For most of the data-flow problems, R, E and W, this yields the desirable result ∀i, pi ≥ p (see Section 4.3.1). However, since M is conservative towards false, ∀i, pi ≤ p; thus for M, predicate embedding and extraction are only useful if ∃i, vi > v.

Predicate embedding and predicate extraction fit nicely with array region representations based on integer linear programming techniques such as in SUIF and the PIPS system [28]. In a framework that supports both predicate embedding and extraction, it is equivalent for integer constraints to appear either in the predicate or in the data-flow value. For predicate embedding, additional integer constraints are integrated into the integer linear inequalities representing an array region. For predicate extraction, constraints in the array regions can be combined with each of the integer linear inequality terms from the DNF representation of the predicate; additional symbolic terms can be added to the symbolic portion of the predicate.

4.4.1 Predicate Embedding

A predicate embedding operator ⇒ on a simple predicated data-flow value (l, v) is defined as follows:

⇒(l, v) = (l, v')

where l is an integer linear inequality and v' ≥ v. Since our predicates may be a disjunction of multiple terms, the result of embedding may return a set of regions. For p = t1 ∨ ... ∨ tn = (l1 ∧ s1) ∨ ... ∨ (ln ∧ sn), predicate embedding is defined as follows:

⇒(p, v) = {(t1, v1), ..., (tn, vn)} = {((l1 ∧ s1), ⇒(l1, v)), ..., ((ln ∧ sn), ⇒(ln, v))}

where ∀i, vi ≥ v. Since ti is a single DNF term of p, ti is the same or closer to false than p. Thus, the embedding operator translates portions of the predicate from the predicate domain to the domain of the data-flow value and combines it with the existing data-flow value v. The integer linear inequality portion of each term in DNF is combined with the array region representation.

While it may appear from this definition that predicate embedding introduces many more predicated data-flow values, in fact, it does not fundamentally increase the amount of work for analysis. Predicate embedding is applied very infrequently, only when translating from one scope to another, such as during projection out of a loop or across procedure boundaries. Also, the PredMerge operation is invoked immediately after embedding. If the constraints from a predicate do not refer to variables in the array regions, predicate embedding has no effect: in such a case, the data-flow values vi will be equivalent to v, and thus PredMerge will merge the regions derived by predicate embedding back to a single region.

As an example, in Figure 4.1(b), the region of help representing the else branch of the second inner loop with and without embedding looks like the following:

(p : (j > 1), v : (help1 = j − 1, 1 ≤ j ≤ d))
⇒(p, v) = (p : (j > 1), v : (help1 = j − 1, 1 ≤ j ≤ d, j > 1))

In both, help1 is a dimension variable representing the first dimension of array help. When applying Fourier-Motzkin projection to this array region on exit from the inner loop to replace j with its upper and lower bounds, the additional constraint on j leads to the desired result that only help[1 : d] is accessed by the loop.

4.4.2 Predicate Extraction

Related to predicate embedding, predicate extraction involves extracting predicates from operations on predicated data-flow values to produce more precise results. To present this more formally, let us first define the meaning of an operation on a predicated data-flow value in the absence of predicate extraction. Since our base compiler implementation represents an array data-flow value as a set of systems of inequalities, operations on two such sets involve a series of pairwise operations on the set elements, each of which may return a set as a result. (A few unary operations perform translation at loop and procedure boundaries, but these are similar to the binary operations.) Thus, we define a data-flow operation on predicated data-flow values as follows:

(p1, v1) op (p2, v2) = {(pi, vi)}

Then, in the presence of a predicate extraction operator, ⇐, the corresponding result is:

⇐((p1, v1) op (p2, v2)) = {(pj, vj)}

For each predicate-value pair (pj, vj) resulting from extraction, there is a corresponding pair (pi, vi) from which it is derived, where vj ≥ vi and predicate pj is the same or closer to false than pi. That is, we derive constraints from the data-flow values and combine them with the predicates.

Let pv = (t'1 ∨ ... ∨ t'm) be the predicate extracted during the operation on predicated data-flow values. Then, pj is defined as follows:

pj = pi ∧ pv = (t1 ∨ ... ∨ tn) ∧ (t'1 ∨ ... ∨ t'm) = ⋁i,j (li ∧ l'j ∧ si ∧ s'j)

Thus, extracted predicates are combined with existing predicates to add more constraints on the refined data-flow values.

For the example in Figure 4.2(c), the predicated data-flow value resulting from accesses in the two inner loops is as follows:

w = (p : (true), v : (1 ≤ help1 ≤ d, d ≥ 2))
e = (p : (true), v : (1 ≤ help1 ≤ d, d ≥ 1))

When composing the effects of the two loops for the outer loop, analysis performs a subtraction of the read and written regions using the PredSubtract operation. The results of subtraction with and without predicate extraction are as follows:

PredSubtract(e, w) = (p : (true), v : (help1 = 1, d = 1))
⇐(PredSubtract(e, w)) = (p : (d = 1), v : (help1 = 1))

Predicate extraction produces the more constrained predicate; now this predicate can be used to derive the run-time test for initializing upwards exposed reads as defined in Figure 4.9.

Currently, extraction is used at only two points in our analysis. It is performed during PredSubtract, as in the example above. It is also used during translation across procedure boundaries, where arrays are delinearized from a calling procedure to a called procedure, such as in the example shown in Figure 4.7. In this latter case, an entire array is written if the problem size is divisible by one of the dimension sizes in the callee. The Reshape operation returns two predicated data-flow values: an optimistic data-flow value guarded by a symbolic predicate ensuring this property, and the default value.
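The hoisting step performed by extraction can be sketched as follows. The encoding (a constraint string paired with the set of variables it mentions) and all names here are hypothetical, not SUIF's systems-of-inequalities representation.

```python
DIM_VARS = {"help1"}  # dimension variables must stay inside the region

def extract(pred, region):
    """Constraints in the region that mention no dimension variable are
    hoisted out and conjoined onto the guarding predicate; region entries
    are (constraint, variables-mentioned) pairs."""
    stays, hoisted = [], []
    for constraint, mentioned in region:
        (stays if mentioned & DIM_VARS else hoisted).append(constraint)
    new_pred = [c for c in pred if c != "true"] + hoisted
    return (new_pred or ["true"], stays)

# Figure 4.2(c): PredSubtract leaves (true, {help1 = 1, d = 1});
# extraction refines it to (d = 1, {help1 = 1}).
pred, region = extract(["true"], [("help1 = 1", {"help1"}), ("d = 1", {"d"})])
assert pred == ["d = 1"] and region == ["help1 = 1"]
```

The hoisted constraint mentions only the scope variable d, so it becomes a run-time testable guard, while the dimension-variable constraint remains in the region.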
It is also potentially useful to perform predicate extraction as part of the PredMerge operation, but this is not included in our current implementation.

4.5 Implementation Issues

Theoretically, incorporating control-flow information into data-flow values may exponentially increase the number of data-flow values and the analysis time. Also, operations on linear inequality predicates can be compute- and memory-intensive, and an implementation must therefore carefully manage predicate operations. The implementation of guarded array regions by Gu, Li, and Lee showed that path-sensitive array data-flow analysis can be performed efficiently [22]. Based on their experience, the main focus of our implementation efforts was put on the easy incorporation of predicates into SUIF's existing array data-flow analysis framework. However, we needed to employ several cost-saving techniques to perform the analysis in reasonable time.

4.5.1 Managing Predicates

Two mechanisms for managing predicates are similar to what is used by Gu, Li, and Lee [22]. First, demand-driven predicate translation only performs translation to the predicate domain when a predicate is needed in an operation, such as when comparing predicates in merge operations or dependence and privatization testing. Second, memoization saves DNF predicates resulting from predicate operations in a table (i.e., performs value numbering), and rather than repeatedly recomputing the same complex predicate operations, reuses the cached values. Our experience with the implementation showed that these two mechanisms significantly reduced the analysis time and the memory requirement.

Our implementation maintains two predicate tables, a predicate mapping table and a predicate operation caching table, shown in Figure 4.10. During predicated array data-flow analysis, every time a new predicate is created, we also create a unique number representing the corresponding predicate, which we call a "value-numbered" predicate. The predicate mapping table maps this value-numbered predicate to the original predicate. Instead of propagating original complex DNF predicates, value-numbered predicates are mostly used throughout the analysis.

(a) Predicate mapping table

    Value-numbered Predicate | Original Predicate | Negated Predicate
    p1                       | x > 2              | x ≤ 2
    p2                       | x < 5              | x ≥ 5
    p3                       | y > x + c          |

(b) Predicate operation caching table

    Value-numbered Predicate | Original Predicate | Negated Predicate
    p1 ∧ p2                  | 2 < x < 5          | (x ≤ 2) ∨ (x ≥ 5)
    p1 ∨ p2                  | true               | false
    ¬p1 ∧ p2                 | x ≤ 2              |

Figure 4.10: Predicate tables.

The original predicates are referenced only when their real values are needed, for instance, during the subtraction operation and parallelization tests, while simple operations such as comparison and negation can be performed with value-numbered predicates. When a new predicate is created from control-flow tests or predicate operations, the following happens.

• First, the predicate mapping table is looked up to check whether the same predicate already exists. Both the original predicate and its negated form are looked up, as they share the same entry in the table, as shown in Figure 4.10. The negated predicate column in the table stores the negated DNF form of the original predicate so that the expensive negation operation is performed only once.

• If the table contains the predicate, the corresponding value-numbered predicate will be used instead.

• If the table does not have an entry for the predicate, a new uniquely-numbered value-numbered predicate is created and its mapping is entered into the table.
Again this newly-created value-numbered predicate will be used instead of the original predicate. Since predicates are attached to each summary and loops usually contain a num ber of array accesses, the same predicate operations may be performed many times. To avoid repeated computations of the same predicate operation, we memoize the result of predicated operations in the predicate operation caching table. During the analysis, predicated operations are performed with value-numbered predicates and retain the results in DNF of value-numbered predicates. The operation between original predicates, which is expensive, is delayed until the real value of the opera tion result is needed. When the real value of DNF of value-numbered predicates is needed such as when comparing predicates in merge operations or dependence and privatization testing, the following happens. • The predicate operation caching table is first looked up to check whether the same operation was already performed. If so, the corresponding result stored in the table will be used without recomputation. 104 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. • When the table does not have an entry for the necessary operation, the actual predicate operation is performed using the original integer linear inequality and symbolic predicates. The result of this operation is memoized in the table. Negation operations are frequently used in our analysis, but the negation of DNF form is very expensive. Thus, both tables have an entry for the negated predicate which can be reused without recomputation. In our implementation, negation op erations are deferred until needed, and the entry is filled only when the negation is actually necessary. 
4.5.2 Scoping Rule In our approach, we utilize scoping rule of the variables used in predicates to greatly reduce the possibility of explosive growth of the number of predicated data-flow values without significantly affecting our ability to derive run-time tests. At loop boundaries, loop-varying variables that appear in the predicates are replaced with the unknown value (either false or true depending on the data-flow problem). At procedure boundaries, local variables are replaced by the unknown value. In the case of M. if scoping yields a unknown predicate, the region is moved into the W set with a true predicate, to signify the write may occur. Such approximations match well to the goal of deriving run-time tests that execute prior to the loop body: if a predicate is loop-varying, it will not be usable directly as a run-time test. 105 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.6 Code Generation SUIF employs a master/worker paradigm of parallelism, where a single master thread initiates the program, executes sequential portions of the program, awakens multi ple worker threads when a parallel loop is encountered, schedules the iterations of parallel loop on all the threads including itself, and waits for all the worker threads to complete before continuing execution. After completion of each parallel loop, the worker threads spin or go to sleep, waiting for additional work to be assigned from the master thread. While this paradigm is flexible and can easily handle sequen tial portions of the computation, it imposes two synchronization events per parallel loop, a broadcast barrier before the parallel loop body and another barrier after the loop body. This synchronization overhead can significantly degrade performance, especially when the size of a loop body or a loop iteration count is small. 
To address this issue, the SUIF parallel code generator produces both sequential and parallel code for each parallel loop, guarded by a run-time workload test. The run-time workload test checks whether the amount of work to be performed is large enough to compensate for the overall parallelization overhead. When this test fails, the loop is executed sequentially, even though it could be executed in parallel. Figure 4.11 illustrates how a parallel loop is translated by the SUIF parallel code generator. When a loop body and a loop iteration count are large enough, each thread will execute a portion of the loop iterations by executing the procedure parallel_loop. Scalars and arrays that are not visible to parallel_loop are passed as parameters so that the workers can access and modify them. If a variable is privatizable, a local variable (such as priv) is used, after initialization if required. When a privatizable variable needs finalization, the processor that executes the last iteration of the loop will copy the value back to the original variable.

    main() {                                 parallel_loop(double parallel_array[], ...) {
      double array[N];                         double private_array[N];

      if (enough work is available) {          initialize private variables;
        broadcast();                           NN = ceil(N/no_of_processors);
        parallel_loop(array, ...);             for (i = (my_pid-1)*NN; i < min(N, my_pid*NN); i++)
        barrier();                               loop body
      } else {                                 if (my_pid == LAST_PID)
        for (i = 0; i < N; i++)                  finalize private variables;
          loop body                          }
      }
    }

Figure 4.11: Example parallel code skeleton generated by SUIF.

This conditional parallel code generation capability in SUIF provides a straightforward platform for integrating code generation of conditionally parallel loops identified by predicated data-flow analysis. Basically, we can add a run-time parallelization test to the existing run-time workload test to guard the safety of parallel execution of the loop.
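The iteration partitioning in Figure 4.11 can be checked with a small sketch. This is illustrative Python, not SUIF output; `N` and `no_of_processors` follow the figure's naming, and `my_pid` is 1-based as in the figure.

```python
import math

def block_schedule(N, no_of_processors):
    """Each thread my_pid (1-based) runs the half-open iteration range
    [(my_pid-1)*NN, min(N, my_pid*NN)) with NN = ceil(N/P), mirroring
    the loop bounds in Figure 4.11."""
    NN = math.ceil(N / no_of_processors)
    return [((my_pid - 1) * NN, min(N, my_pid * NN))
            for my_pid in range(1, no_of_processors + 1)]

def all_iterations(N, P):
    # Concatenating every thread's range recovers 0..N-1 exactly once,
    # i.e. the schedule neither drops nor duplicates iterations.
    out = []
    for lo, hi in block_schedule(N, P):
        out.extend(range(lo, hi))
    return out
```

For N = 10 and 4 threads, the ranges are (0,3), (3,6), (6,9), (9,10); a thread whose start index reaches N simply gets an empty range.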
The real challenge is how to deal with so many different versions of a parallel loop. Each variable involved in a data dependence within a loop can potentially be either dependent, independent, or privatizable under a certain condition. Assuming the conditions are all different from each other, the number of different versions of a conditionally parallel loop can potentially grow exponentially with the number of variables that carry dependences across the loop iterations. Therefore, a straightforward extension of the existing code generation scheme can cause a huge code size increase and thus degrade performance.

Our solution to this issue is to produce a single parallel version of a loop which dynamically selects either the independent variables (either globals or passed as arguments) or the privatizable variables (local variables), depending on the run-time test results. Then all the conditions that let each variable be either independent or privatizable are gathered and combined to guard this single parallel version of the loop. An example of worker code for a conditionally parallel loop is shown in Figure 4.12. In this example, an array is independent when the condition p1 holds and is privatizable when the condition p2 holds. The execution of this parallel loop will be guarded by the condition (p1 ∨ p2). A local pointer tmp_array dynamically selects the desired array depending on the run-time condition and is used within the loop body in place of the original array. Initialization and finalization are performed, if needed, only when the condition p2 holds.

While our code generation scheme greatly reduces code size compared to the straightforward approach, it obviously increases the run-time overhead slightly, as additional run-time tests need to be performed by each slave. However, we believe this
overhead is relatively small and, in fact, our experiments with benchmark programs indicated that it is negligible.

    main() {
      double array[N];

      if ((p1 || p2) && enough work is available) {
        broadcast();
        parallel_loop(array, ...);
        barrier();
      } else {
        for (i = 0; i < N; i++)
          loop body
      }
    }

    parallel_loop(double parallel_array[], ...) {
      double private_array[N];
      double *tmp_array;

      tmp_array = (p1) ? parallel_array : private_array;
      if (p2)
        initialize private_array with parallel_array;
      NN = ceil(N/no_of_processors);
      for (i = (my_pid-1)*NN; i < min(N, my_pid*NN); i++)
        loop body /* tmp_array is used */
      if (p2 && (my_pid == LAST_PID))
        finalize private_array to parallel_array;
    }

Figure 4.12: Example parallel code for conditionally parallel loop.

4.7 Chapter Summary

This chapter has presented a new analysis technique, called predicated array data-flow analysis, that extends array data-flow analysis to associate predicates with data-flow values. We incorporate into our analysis predicate embedding and extraction operators that translate between the domain of predicates and data-flow values to produce more precise predicated data-flow values. This technique not only improves the results of compile-time parallelization analysis, but it can also be used to derive low-cost run-time tests that guarantee the safety of parallelization for loops that cannot be parallelized statically. With this analysis, the bulk of the remaining unparallelized loops identified in the previous chapter can potentially be parallelized. We have described how this analysis is incorporated into the SUIF compiler. Implementation techniques, such as memoization and demand-driven predicate evaluation, which significantly reduce the analysis time, have been described. Our code generation scheme for conditionally parallel loops, which produces only a single parallel version of a loop, has also been introduced.

Chapter 5

Experimental Results

This chapter summarizes a series of experimental results that measure the impact of predicated array data-flow analysis at finding additional parallelism. We applied our implementation to programs from three benchmark suites, SPECFP95, the NAS sample benchmarks, and PERFECT, and one additional program, arc3d from NASA. The programs in the three benchmark suites appear in Table 3.1. For the same reason explained in Section 3.4, we omit from our experiment fpppp (SPECFP95) and spice (PERFECT). For the run-time measurements, we used the reference inputs for SPECFP95, the small inputs for NAS, the standard inputs for PERFECT, and 128 iterations for arc3d.

5.1 Increased Number of Parallel Loops

As one measure of the value of predicated array data-flow analysis, we first consider how many more of the loops in our benchmark programs with inherent loop-level parallelism can be parallelized. By inherent loop-level parallelism, we refer to loops that either have no data dependences at run time, or for which privatization eliminates remaining dependences. To determine the remaining inherently parallel loops, we performed the experiment described in Section 3.4, where accesses to all arrays reported by the compiler as being involved in a dependence were instrumented with the ELPD test. Our system instruments 430 candidate loops that execute at run time and have more than a single iteration per invocation. Of these, 150 are found to be parallel by the ELPD test. (These numbers are different from those reported in Section 3.4, as we added one more program, arc3d.)
Of these 150 parallelizable loops that were missed by the base SUIF system, our predicated array data-flow analysis implementation parallelizes 64 of them, across 9 programs, as shown in Table 5.1 and Table 5.2. In each table, the outer column lists the programs: the three in Table 5.1 are from SPECFP95, the first five in Table 5.2 are from PERFECT, and arc3d is the final program. The next column gives the subroutine and line number for the loop. The third column provides the number of lines of code in the loop body. Many of these loops span multiple procedure boundaries, so this line count is the sum of the number of lines of code in each procedure body invoked inside the loop. Even if a procedure is invoked more than once in the loop, it is counted only once.

The column labeled Coverage is a measure of the percentage of sequential execution time of the program spent in this loop body. The column labeled Granularity provides a per-invocation execution time of the loop in milliseconds, indicating the
    Program   Loop          # of Lines   Coverage (%)   Granularity (ms)   Category     Requirement
    apsi      run-909        1288          4.3            10.0             CF+BC        <=, R
              run-953        1288          4.3            10.0             CF+BC        <=, R
              run-1015       1288          4.3            10.0             CF+BC        <=, R
              run-1044       1218          3.2             7.5             CF+BC        <=, R
              run-1083       1287          4.3            10.0             CF+BC        <=, R
              run-1155       1268          3.5             8.0             CF+BC        <=, R
              dcdtz-1331      235          7.0            16.1             CF           =>
              dtdtz-1448      281         10.4            24.2             CF           =>
              dudtz-1642      258         10.2            23.7             CF
              dvdtz-1784      261          9.9            22.8             CF
              setall-4128      14          0.0005          1.2             CF (scalar)
              setall-4130      10                                          CF (scalar)
              topo-4539        30          0.0006          0.6             CF (scalar)
              dkzmh-6265      218          7.6            17.6             CF           <=, R
    su2cor    sweep-420       237         29.0            22.9             CF           <=
              loops-1557      185          0.9           100.2             CF           <=
              loops-1558      184                                          CF           <=
              loops-1559      183                                          CF           <=
              loops-1613      265          2.6           292.7             CF           <=
              loops-1614      264                                          CF           <=
              loops-1659      573         22.5          2522.5             CF           <=
              loops-1660      572                                          CF           <=
              loops-1661      571                                          CF           <=
    wave5     trngv-2182        3          0.2             0.01            BC           R
              trngvl-2266       3          0.001           0.009           BC           R
              field-3087        4          0.002           0.1             BC           R
              field-3118        4          0.002           0.09            BC           R
              field-3367       26          0.4            22.9             BC           R
              field-3396        4          0.001           0.07            BC           R
              field-3420       27          0.5            25.8             BC           R
              field-3450        4          0.001           0.06            BC           R
              field-3465       25          0.3            18.4             CF
              field-3493        4          0.0005          0.03            BC           R
              fftf-5064      1154          1.5            16.4             CF+BC        <=, R
              fftb-5082      1147          2.3            24.7             CF+BC        <=, R

Table 5.1: Additional loops in SPECFP95 parallelized by predicated array data-flow analysis.
    Program   Loop          # of Lines   Coverage (%)   Granularity (ms)   Category     Requirement
    adm       run-862         913          7.4             1.3             CF+BC        <=, R
              run-902         913          5.2             0.9             CF+BC        <=, R
              run-960         913          5.1             0.9             CF+BC        <=, R
              run-985         852          3.5             0.6             CF+BC        <=, R
              run-1020        912          5.2             0.9             CF+BC        <=, R
              run-1084        893          3.8             0.7             CF+BC        <=, R
              dcdtz-1226      181          6.7             1.2             CF
              dtdtz-1321      219          8.3             1.4             CF
              dudtz-1484      196          8.9             1.5             CF
              dudtz-1506       12                                          CF
              dvdtz-1604      199          8.6             1.5             CF
              setall-3397      14          0.01            1.3             CF (scalar)
              setall-3399      10                                          CF (scalar)
              topo-3708        30          0.002           0.2             CF (scalar)
              dkzmh-4891      188          7.2             1.2             CF           <=, R
    arc2d     stepfx-2159     666         29.4           163.0             CF+BC        =>, R
              stepfy-2314     249         27.9           154.5             CF+BC        =>, R
    dyfesm    matinv-4289      10          0.0003          0.003           CF           =>
    qcd       linkbr-773      444          0.3             0.01            CF+BC        R
    spec77    druk-499        280          0.02            0.7             BC           R
              fixtet-1056       2          0.0005          0.0002          BC           R
              fixtet-1068       2          0.0007          0.0003          BC           R
              iminv-2296        6          0.01            0.001           BC           R
              iminv-2305        6          0.01            0.001           BC           R
    arc3d     stepf3d-2290    363         10.3          5864.7             CF           =>
              stepf3d-2317    332                                          CF           =>
              stepf3d-2504    310          6.3            58.3             CF           =>
              stepf3d-2654    362          9.3          5294.4             CF
              stepf3d-2672    340                                          CF

Table 5.2: Additional loops in PERFECT and arc3d parallelized by predicated array data-flow analysis.

granularity of the parallelism.¹ (In our experience, granularities on the order of 1 millisecond are high enough to yield speedup.) Both coverage and granularity results were obtained on a single processor of a 195MHz SGI Origin 2000. In the tables, granularity and coverage numbers are omitted for loops nested inside other loops also parallelized by predicated analysis, as SUIF only exploits a single level of parallelism. The next column in the tables specifies the requirement for parallelizing the loop, corresponding to the classifications defined in Section 3.4.
CF refers to loops that can be parallelized statically by taking control flow into account, BC refers to loops with simple run-time tests that can be derived from constraints on the results of intersection during dependence testing, and CF+BC refers to loops that require both, where in this case the breaking conditions may also include those derived from predicate extraction. A few of the CF loops have scalar dependences only. The final column describes which components of the analysis are needed to parallelize each loop. The symbols => and <= refer to whether predicate embedding or extraction is required, and R refers to loops that are only parallelizable under certain run-time values and require a run-time test. All additional loops are automatically parallelized by our implementation of the analysis and parallel code generation, except the first six loops in apsi and the last two loops in wave5, which required transformations to the original source before performing our analysis: forward substitution, and cloning and loop peeling to enable forward substitution.

¹Coverage and granularity are slightly lower than previously reported results in [36] due to elimination of overheads inside parallel loops.

We point to a few key results in the tables. Embedding, extraction, and run-time tests, the distinguishing features of our analysis, are common requirements for these programs. At least one is needed in 57 of the 64 loops. Further, many of the loops parallelized by predicated array data-flow analysis have high granularity. Similarly, many of these loops have high coverage, and are quite large, particularly in apsi, su2cor, arc2d, and arc3d. Figure 5.1 shows the overall impact on program coverage and granularity after parallelizing the additional loops found by predicated array data-flow analysis.
In the first graph, for each program, the first bar represents coverage for the program parallelized by the base SUIF system, and the second bar shows the improvement after predicated array data-flow analysis. Coverage improvements are dramatic for apsi and adm, two versions of the same program. Even the smaller improvements in coverage for su2cor, wave5, and arc2d still have the potential to improve the programs' performance since, by Amdahl's law, speedup is limited by the sequential portion of the program.

The second graph shows the difference in granularity between the base system and the results of predicated array data-flow analysis. In this figure, we use a weighted granularity such that the granularity of each loop is multiplied by its coverage. Note that the Y-axis is a log scale to represent the wide range of granularities. We see that granularity is improved by more than an order of magnitude for apsi and su2cor, nearly an order of magnitude for arc2d, and by smaller but still significant amounts for wave5, adm, and arc3d.

Figure 5.1: Coverage and granularity improvements with predicated array data-flow analysis. (Bar charts compare Base SUIF against Predicated Analysis; the granularity axis is logarithmic, ranging from 1 μs to 10 sec.)

5.2 Speedup Improvements

The tables of results and graphs in the previous section showed that nine programs have increased numbers of parallel loops due to predicated array data-flow analysis. We find that for five of these programs, apsi, su2cor, wave5, arc2d, and arc3d, the additional parallel loops have high granularity and high coverage, and thus have the potential to yield speedup improvements on a moderate-scale multiprocessor system.
While adm also has significant improvements due to predicated array data-flow analysis, its low coverage (90%) and relatively low granularity were not adequate to yield a speedup. For the five programs, we measured the speedup on the SGI Origin 2000 as well as on a 300MHz Digital AlphaServer 8400, as presented in Figure 5.2. Each graph contains four lines: for each of the two machines, there is a line for the base SUIF system and a line for the predicated analysis version. Speedups are compared against a sequential version of the program. For apsi, su2cor, and arc3d, we report 2-, 4-, and 8-processor speedups on the DEC Alpha, and 2- and 4-processor speedups on the SGI. For the remaining two programs, we omit 8-processor speedups as they are roughly the same as for 4 processors; for arc2d, we also report 3-processor speedups, as explained below.

Figure 5.2: Speedups due to predicated array data-flow analysis. (One panel per program — apsi, su2cor, wave5, arc3d, arc2d — plotting speedup against the number of processors for DEC-Base, DEC-Pred, SGI-Base, and SGI-Pred.)

On the DEC Alpha system, we obtained solid speedups for apsi, su2cor, and arc3d, and more modest speedups for the other two. On the SGI, we see that the speedups are not nearly as high as on the DEC, but predicated array data-flow analysis yields improvements on every program.² Su2cor does not yield an improvement on the DEC due to a more restrictive stack size limitation than on the SGI, which requires us to allocate privatized arrays on the heap rather than on the stack.
Certainly, the most dramatic speedup improvement comes from apsi. This program has 14 loops parallelized by predicated array data-flow analysis, comprising roughly 70% of the program's execution time. The base version does not speed up at all, while speedups for the optimized version scale to 8 processors. Improvements for arc3d are also significant. This program is the 3-dimensional version of the PERFECT benchmark arc2d. Arc3d yields more speedup improvement than arc2d due to its large number of iterations on outer loops and its larger input size. Results for predicated array data-flow analysis show improvement on the DEC Alpha at 4 processors, but improvements are more significant at 8 processors, as the optimized version continues to scale up to larger numbers of processors while the speedup for the base version remains the same as the 4-processor result.

In su2cor, a single parallel loop with a granularity of 2.5 seconds per invocation is primarily responsible for the improved speedup on the SGI. On wave5, improvements are more modest because the loops parallelized by predicated array data-flow analysis comprise only about 4% of the program's execution time. Nevertheless, the program goes from little or no speedup to modest speedups on both systems. We report 3-processor speedups on arc2d because the two loops parallelized by predicated array data-flow analysis, comprising more than 55% of the program's execution time, only have 3 iterations; speedup is nearly 2 on 3 processors, and could be improved by exploiting multiple levels of parallelism in this loop nest rather than just parallelizing the outermost loops.

²These speedups are lower than our previously reported results in [36] due to aggressive tuning of both the sequential and parallel versions of the programs.
Overall, we observed speedups much lower than what we expected, for both the base and the optimized versions, considering the high coverage and granularity of these programs. Other than the program-specific problems described above, the key factor limiting speedup is the overhead of standard parallelization in SUIF. Parallel code is generated in SUIF by encapsulating a parallel loop in a separate function and making parallel function calls. In previous experiments with SUIF, the function representing a parallel loop has been implemented by C code where the variables referenced by the function are accessed through pointers [24], and by Fortran code where the variables in the parallel loop are accessed directly [25]. We chose the former approach because it is more straightforward, but we found that the resulting parallel programs ran much slower on 1 processor than their sequential counterparts. The major source of this overhead is the increase in instructions and associated pipeline stalls due to indirect accesses to frequently accessed array variables and integers used in subscript expressions.

We should also mention that these results were obtained without the benefit of the locality optimizations available in SUIF that have been used in other experiments: unimodular loop transformations, data and computation co-location, data transformations, and compiler-directed page coloring. Previous experience suggests that these locality optimizations could improve the speedup results significantly. While we did not find cache misses to be a significant problem overall, we did observe a few loops with lower than expected speedup due to cache misses, particularly in apsi.
5.3 Case Studies of Missed Loops

Among the 79 loops categorized as CF, BC, and CF+BC in Table 3.2, the current implementation of predicated array data-flow analysis is able to automatically parallelize 59 of them.³ In this section, we investigate a couple of the remaining loops in the CF, BC, and CF+BC categories that predicated array data-flow analysis could not parallelize.

Out of the 20 remaining loops missed by predicated array data-flow analysis, 7 should have been automatically identified as parallel loops. However, there is some interference between the existing symbolic analysis in SUIF and our predicated array data-flow analysis implementation, which makes it difficult for our analysis to identify them as parallel. SUIF's symbolic analysis translates some uses of a loop index variable into a representation different from the original one, which results in different representations for the same index variable uses. This happened in some of the loops in our benchmark programs that should be parallelizable with breaking conditions, but our analysis was not able to derive breaking conditions for those loops since it does not understand the different representations for the same index variable used by symbolic analysis. We believe that minor modifications to SUIF's symbolic analysis implementation will enable our analysis to parallelize those loops. All the other remaining loops are too complicated to analyze or require more powerful symbolic analysis than what is in SUIF.

Ignoring loops missed due to the reason described above, the only loop missed by our analysis in the SPECFP95 benchmark suite is setall-4048 in apsi (refer to the table in the Appendix). A simplified version of this loop is shown in Figure 5.3.

³Note that the results for arc3d are not included in this section.
The loop contains 6 variables that the original SUIF determines to be dependent. Our analysis proves 4 of them independent or privatizable, and arrays C and Q dependent. In the code, arrays C and Q are privatizable when M = 1 and M ≠ 1, respectively, and our analysis is able to extract these conditions. However, since the conditions contain the loop index variable, they cannot be used as a run-time guard outside the loop. Therefore, our analysis determines that they are not privatizable.

          DO 110 M=1,2
          DO 100 L=1,N
            IF(IN.EQ.0.AND.M.EQ.1) THEN
              Q(LX,LY,LZ)=VAL
            ELSE
              C(LX,LY,LZ)=VAL
            ENDIF
            IF(IN.GT.0) THEN
              DO 60 K=K1,K2
              DO 50 J=J1,J2
              DO 40 I=I1,I2
                IF(M.EQ.1) THEN
                  Q(I,J,K)=Q1
                ELSE
                  C(I,J,K)=Q1
                ENDIF
       40     CONTINUE
       50     CONTINUE
       60     CONTINUE
              DO 90 K=K1,K2
              DO 80 J=J1,J2
              DO 70 I=I1,I2
                IF(M.EQ.1) THEN
                  Q(I,J,K)=VAL*Q(I,J,K)/TOT
                ELSE
                  C(I,J,K)=VAL*C(I,J,K)/TOT
                ENDIF
       70     CONTINUE
       80     CONTINUE
       90     CONTINUE
            ENDIF
      100 CONTINUE
      110 CONTINUE

Figure 5.3: Simplified setall-4048 loop in apsi.

When a run-time test has a simple condition on a loop index variable, it is possible to extend our implementation so that the loop is split in two according to the derived condition. But in this particular case, loop splitting is not beneficial, as the loop has only two iterations. Furthermore, its coverage turned out to be very small (less than 0.001%), so missing this loop is not critical to the speedup improvement of apsi.

The program fftpde in the NAS benchmark suite has 3 loops in the CF category, but all of them are missed by our analysis. Among them, loop main-126, partially shown in Figure 5.4, is the most important loop for improving the speedup of the program, as it comprises more than 85% of the program's execution time.
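The loop-splitting extension mentioned above can be sketched as follows. This is a schematic Python illustration, not the SUIF implementation; `pred` stands for an extracted condition on the loop index (for setall-4048, M = 1 for array Q).

```python
def split_iterations(lo, hi, pred):
    """Index-set splitting (a sketch): partition the iteration space
    [lo, hi] by a predicate on the loop index, so that within each piece
    the condition is invariant and the piece can be guarded (and possibly
    parallelized) on its own."""
    holds = [i for i in range(lo, hi + 1) if pred(i)]
    fails = [i for i in range(lo, hi + 1) if not pred(i)]
    return holds, fails
```

For setall-4048 the outer loop runs M = 1, 2, so splitting on M = 1 yields two one-iteration pieces, which is why splitting buys nothing in this particular case.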
This loop contains 3 variables that cannot be identified as independent or privatizable by our analysis. Here, we examine one of them, array X2, as an example. In the first loop nest, all the array elements from X2(1,1,1) to X2(N1,N2,2*N3) are written, and the second loop nest reads and writes the same set of elements. Therefore, if the compiler can prove that all of the array elements read by the third loop nest are already written in the previous loop nests, the array X2 can be privatized. To prove this, the compiler needs to know the value ranges of the subscripts, which requires using the properties of the MOD operation. In other words, the analysis needs symbolic information about the 3 subscripts; that is, that the value ranges of II, JJ, and KK are from 1 to N1, from 1 to N2, and from 1 to N3, respectively. With this information, the array data-flow analysis can prove that the third loop nest reads only elements that are already written. However, the current implementation of symbolic analysis in SUIF does not provide this kind of information, so our analysis could not privatize this array.

          DO 270 KT = 1, NT
            DO 220 K = 1, N3
              DO 210 J = 1, N2
                DO 200 I = 1, N1
                  T1 = X3(I,J,K) ** KT
                  X2(I,J,K) = T1 * X1(I,J,K)
                  X2(I,J,K+N3) = T1 * X1(I,J,K+N3)
      200       CONTINUE
      210     CONTINUE
      220   CONTINUE
            DO 250 K = 1, N3
              DO 240 J = 1, N2
                DO 230 I = 1, N1
                  X2(I,J,K) = RN * X2(I,J,K)
                  X2(I,J,K+N3) = RN * X2(I,J,K+N3)
      230       CONTINUE
      240     CONTINUE
      250   CONTINUE
            T1 = 0.D0
            T2 = 0.D0
            DO 260 I = 1, 1024
              II = I - 1
              II = MOD (II, N1) + 1
              JJ = MOD (3 * II, N2) + 1
              KK = MOD (5 * II, N3) + 1
              T1 = T1 + X2(II,JJ,KK)
              T2 = T2 + X2(II,JJ,KK+N3)
      260   CONTINUE
      270 CONTINUE

Figure 5.4: Simplified main-126 loop in fftpde.
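The value-range fact the analysis would need can be checked directly: the MOD-based subscripts in loop 260 always stay inside the region written by the earlier nests. This is an illustrative Python check; the sizes used below are arbitrary sample values, not taken from the benchmark input.

```python
def read_indices(n1, n2, n3, iters=1024):
    """Enumerate (II, JJ, KK) as computed in loop 260 of Figure 5.4.
    By the properties of MOD, II lies in [1, N1], JJ in [1, N2], and
    KK in [1, N3] -- i.e. within the elements written by loops 200-250."""
    out = set()
    for i in range(1, iters + 1):
        ii = (i - 1) % n1 + 1          # II = MOD(I-1, N1) + 1
        jj = (3 * ii) % n2 + 1         # JJ = MOD(3*II, N2) + 1
        kk = (5 * ii) % n3 + 1         # KK = MOD(5*II, N3) + 1
        out.add((ii, jj, kk))
    return out
```

The second read, X2(II,JJ,KK+N3), is likewise covered, since KK+N3 lies in [N3+1, 2*N3] and the first loop nest writes all elements up to X2(N1,N2,2*N3).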
5.4 Chapter Summary

This chapter has presented results from an implementation in the Stanford SUIF compiler demonstrating that predicated array data-flow analysis parallelizes many more loops than the baseline implementation and that it yields speedup improvements: 64 of the 150 parallelizable loops missed by the base SUIF system, in programs from three benchmark suites and arc3d, were parallelized by predicated array data-flow analysis. The current implementation is not able to parallelize 20 loops which could be parallelizable by predicated array data-flow analysis. Most of these are missed due to either interference with symbolic analysis or insufficient symbolic information.

Chapter 6

Related Work

This chapter presents previous work on automatic parallelization that is closely related to our study. One way compiler researchers discover new optimization opportunities is through evaluating existing compilers on many benchmark programs. Much of the recent progress in automatic parallelization resulted from the early 90's experiments on the effectiveness of parallelizing compilers. Some of these earlier experiments are the topic of the next section.

Since then, inspired by the discoveries from those experimental studies, many new parallelization techniques have been developed and implemented, and their effectiveness has been demonstrated. As parallelizing compiler technology has matured, the remaining opportunities in parallelization are either too difficult or impossible to exploit at compile time. So compiler researchers started to investigate techniques that defer parallelization until after gathering run-time information.

Section 6.2 presents some previous run-time parallelization techniques, upon which our ELPD test is based. Several parallel programming tools are presented
which may utilize instrumentation results from run-time parallelization tests like the ELPD test. We also present previous work on parallelization related to predicated array data-flow analysis. Among this work, two techniques, guarded array regions [20] and constraint-based array dependence analysis [53], are comparable to our predicated array data-flow analysis, so we discuss and compare them with our approach in detail.

6.1 Parallelization Experiments

A number of experiments in the early 90's performed parallelization of benchmark programs to identify opportunities to improve the effectiveness of parallelizing compilers [6, 13, 33, 46]. In some cases, these experiments compared hand-parallelized programs to compiler-parallelized versions, pointing to the large gap between the inherent parallelism in the programs and what commercial compilers of the time were able to exploit.

Blume and Eigenmann performed such a comparison with a modified version of the Kap restructurer and the Vast restructurer for 13 programs from the PERFECT benchmark suite on the Alliant/FX80 computer [6]. They further measured the performance gains of individual restructuring techniques. Their most important findings were that commercial compilers often achieve insignificant performance gains on real programs, and that only a few restructuring techniques contribute to this gain. They also cited the need for compilers to incorporate array privatization and interprocedural analysis, among other things, to exploit a coarser granularity of parallelism.

Eigenmann, Hoeflinger, and Padua applied a set of advanced parallelization techniques by hand to the programs from the PERFECT benchmark suite, and measured performance gains on the Cedar multiprocessor compared to commercial parallelizing compilers [13].
The most important techniques they found were array privatization, parallel reductions, generalized induction variable recognition, and symbolic or run-time data-dependence tests, all with powerful interprocedural analysis capabilities. They noted that many of these techniques could be implemented in a new generation of parallelizing compilers and required the development of more powerful analysis techniques. These early studies focused developers of commercial and research compilers on incorporating these techniques, and now they are beginning to make their way into practice. As stated earlier, our experiment goes beyond these previous studies because it measures parallelism potential empirically using run-time testing. Further, now that these previously missing techniques are performed automatically by SUIF and other parallelizing compilers, a new experiment can identify the next set of missing analysis techniques, such as the predicated array data-flow analysis proposed in this dissertation.

6.2 Run-Time Parallelization Techniques

The literature describes run-time parallelization techniques based on an inspector/executor model, where an inspector tests subscript expressions in the loop at run time and an executor executes the loop in parallel if the inspector determines it is safe [39, 43, 44]. Rauchwerger and Padua's definition of the LPD test, extended by our approach, was subsequently refined and implemented in Polaris by Lawrence [30]. The LPD test was originally designed for speculative parallelization, where it is used for run-time checking of the correctness of speculative parallel execution of loops. In this dissertation, we extended and used the LPD test to find parallelizable loops. While the LPD test only tests a single loop in a loop nest, the ELPD test presented in Chapter 3 is able to test multiple loops in a nest simultaneously.
It is also able to test loops that contain function calls. Although the ELPD test uses one additional shadow array for simultaneous testing of nested loops, it reduces space requirements through the space-saving techniques explained in Section 3.3.1.2.

Rauchwerger and Padua also describe an extended version called the LRPD test that locates array reductions [43]. We ignore array reductions beyond the ones already parallelized by SUIF. Since SUIF already has a powerful reduction recognition implementation, the LRPD test would only locate additional reductions where an array's subscript expression on the left-hand side is different from that on the right-hand side but the expressions refer to the same location. We anticipate that such reductions, if located, will be rare.

Previous work on inspector/executor techniques provides optimizations to reduce the overhead of introducing extra memory and computation into the program. Rauchwerger and Padua suggest executing the inspector in parallel, and even executing the entire loop in parallel speculatively and restoring the original state if it turns out not to be parallelizable [43]. Ponnusamy et al. avoid re-computation of the inspector on subsequent executions of a loop when values of variables accessed by the inspector have not changed [39]. Our implementation of the ELPD test does not use these techniques, but it does introduce new analyses to avoid unnecessary instrumentation by examining the results of interprocedural array data-flow analysis. Incorporation of other optimizations in the literature is complementary to what we have done.

Previous work in run-time parallelization uses specialized techniques not based on data-flow analysis. An inspector/executor technique inspects array accesses at run time immediately prior to execution of the loop.
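The marking phase of such an inspector can be sketched as follows. This is a simplified illustration in the spirit of the LPD family of shadow-array tests, not the dissertation's ELPD implementation; the `accesses` callback and all names are invented for the example, and output dependences and reduction patterns handled by the full tests are omitted.

```python
def inspector_is_parallel(num_iters, accesses, n):
    """Shadow-array marking over an n-element array.

    accesses(it) yields ('read' | 'write', index) pairs for iteration
    `it` of the candidate loop, in program order."""
    write_iter = [None] * n    # shadow: an iteration that wrote the element
    exposed_iter = [None] * n  # shadow: an iteration with an upward-exposed read
    for it in range(num_iters):
        written_here = set()   # elements written so far in this iteration
        for kind, idx in accesses(it):
            if kind == 'write':
                write_iter[idx] = it
                written_here.add(idx)
            elif idx not in written_here:
                # Read not covered by an earlier write in the same
                # iteration: upward-exposed, cannot be privatized away.
                exposed_iter[idx] = it
    # A cross-iteration dependence exists if some element has an
    # upward-exposed read in one iteration and a write in another.
    return all(e is None or w is None or e == w
               for e, w in zip(exposed_iter, write_iter))
```

For instance, a loop whose iteration i reads and writes only element i passes this test, while a loop that reads element i and writes element i+1 carries a dependence and fails it.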
The inspector decides whether to execute a parallel or sequential version of the loop. An inspector/executor introduces several auxiliary arrays per array possibly involved in a dependence, and run-time overhead on the order of the aggregate size of the arrays. By comparison, predicated data-flow analysis instead derives run-time tests based on values of scalar variables that can be tested prior to loop execution. Thus, our approach, when applicable, leads to much more efficient tests than inspecting all of the array accesses within the loop body (for example, evaluating control-flow paths that will be taken or testing for boundary conditions).

6.3 Parallel Programming Tools

A number of parallel programming tools, including ParaScope [9] and PIPS [29], allow a user to examine the data dependences in loops in their program and assert whether the dependences should be ignored by the compiler. Recently, the SUIF Explorer has been developed to provide a programmer interface that exports the compiler's knowledge about why it failed to parallelize a loop, and allows the user to make higher-level assertions, such as whether arrays are privatizable or can be transformed to parallel reductions [32, 31]. The SUIF Explorer incorporates run-time parallelization testing to pinpoint loops that are potentially parallelizable by tracing memory accesses to dependent arrays; the instrumentation resulting from the ELPD test could provide this functionality at lower cost.

ELPD instrumentation could also be useful to users in developing and debugging parallel programs. For example, the !HPF$ INDEPENDENT directive in High Performance Fortran (HPF) allows users to assert to the compiler that a loop is parallelizable. A system such as ours could be used to verify the correctness of these assertions at run time.
6.4 Analysis Exploiting Predicates

Analysis techniques exploiting predicates have been developed for specific data-flow problems including constant propagation [51], type analysis [48], symbolic analysis for parallelization [23, 7], and array data-flow analysis [22, 49]. Tu and Padua present a limited sparse approach on a gated SSA graph that is demand-based, only examining predicates if they might assist in loop bounds or subscript values for parallelization [49]. Related to these array analysis techniques are approaches to enhance scalar symbolic analysis for parallelization. Haghighat describes an algebra on control-flow predicates [23], while Blume presents a method for combining control-flow predicates with ranges of scalar variables [7]. The PIPS system incorporates integer constraints into its array region representations, as we do with predicate embedding, but it does not use predicates in its analysis in any other way [28].

As compared to this previous work, our predicated data-flow analysis approach is distinguished in several ways: (1) it is capable of deriving low-cost run-time tests, consisting of arbitrary program expressions, to guard conditionally optimized code; (2) it incorporates predicates other than just control-flow tests, particularly those derived from the data-flow values using predicate extraction; and (3) it unifies a number of previous approaches in array data-flow analysis.

There are some similarities between our approach and much earlier work on data-flow analysis frameworks. Holley and Rosen describe a construction of qualified data-flow problems, akin to the MOFP solution, but with only a fixed, finite, disjoint set of predicates [27].
Cousot and Cousot describe a theoretical construction of a reduced cardinal power of two data-flow frameworks, in which a data-flow analysis is performed on the lattice of functions between the two original data-flow lattices; this technique has been refined by Nielson [11, 37]. Neither of the latter two prior works was designed with predicates as one of the data-flow analysis frameworks, and none of the three techniques derives run-time tests.

Recently, additional approaches that, in some way, exploit control-flow information in data-flow analysis have been proposed [2, 8, 45]. Ammons and Larus's approach improves the precision of data-flow analysis along frequently taken control-flow paths, called hot paths, by using profile information [2]. Bodik et al. describe a demand-driven interprocedural correlation analysis that eliminates some branches by path specialization [8]. Both approaches utilize code duplication to sharpen data-flow values, but are only applicable if the information is available at compile time. Deferred data-flow analysis, proposed by Sharma et al., attempts to partially perform data-flow analysis at run time, using control-flow information derived during execution [45].

Predicate extraction is also related to deriving breaking conditions, conditions that would guarantee a data dependence does not hold at run time, from dependence testing, but breaking conditions have never been obtained in the context of general array data-flow analysis [17, 41].

A significant portion of our work has involved defining an approach that safely combines data-flow values with predicates in a way that can derive predicates suitable for run-time tests. Array data-flow analysis is only one of the applications of our framework, albeit an important one, as our framework can easily incorporate other data-flow problems.
In the following subsections, we describe in detail two recent approaches that utilize predicates during the analysis of arrays for parallelization, guarded array regions [20] and constraint-based array dependence analysis [53], and compare them with our approach.

6.4.1 Guarded Array Regions

The interprocedural array data-flow analysis scheme proposed by Gu, Li, and Lee summarizes array access information using guarded array regions and propagates such regions over a hierarchical control-flow graph. The use of guards incorporates the information in IF conditions to make array data-flow analysis more accurate. A guarded array region (GAR) is a tuple [P, R] which contains a regular array region R and a guard P, where P is a predicate that specifies the condition under which R is accessed. A GAR is similar to a predicate-value pair in our analysis. They represent predicates in conjunctive normal form (CNF), such as C1 ∧ C2 ∧ ... ∧ Cn, where each Ci, called a conjunct, is further represented as re1 ∨ re2 ∨ ... ∨ rem. Each rei is a relational expression.

Difference operations, used primarily for computing the upwards-exposed read set, are expensive as they involve the negation of CNF. They may cause an exponential increase in the number of GARs and in the number of conjuncts of guards. Moreover, if the GARs of a difference operation contain symbolic variables, the result of the difference may be unknown, which causes loss of information. To overcome these problems, they only perform the difference operation if it can be solved exactly and its result is a single GAR. Otherwise, they keep the difference and propagate it during the summarization. The delayed difference is not performed until it becomes necessary.
For this purpose, they introduce a new data structure, called GARs with a difference list (GARWD), to represent a difference operation whenever it cannot be easily performed. A GARWD is a set defined by two components: a source GAR and a difference list. The source GAR is an ordinary GAR, while the difference list is a list of GARs. The set denotes all the members of the source GAR which are not in any GAR of the difference list. Since the difference operator is only involved in the computation of upwards-exposed read sets, only those sets are represented by GARWDs.

They also use a two-level hierarchical approach to predicate handling to reduce expensive negation operations. At the higher level, a predicate is represented by a predicate tree, PT(V, E, r), where V is the set of nodes, E is the set of edges, and r is the root of PT. The internal nodes of V are NAND operators, except for the root, which is an AND operator. The leaf nodes are divided into regular leaf nodes and negative leaf nodes. A regular leaf node represents a predicate such as an IF
Both use similar implementation techniques to efficiently manage predi cates. However, there are several important differences in our technique. One major difference is the ability to derive run-time tests to guard conditionally parallel loops. O ur analysis is able to generate multiple versions of a loop, each guarded by a simple run-time test derived during data dependence and array privatization tests. Since predicates in GAR are used only during compile-time analysis, they are restricted to some domain of values that the compiler understands. By contrast, our approach can derive more general predicates, run-time evaluable predicates consisting of ar bitrary program statements. Further, our analysis framework uniquely supports 138 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Guarded array regions M = {(p, [1 : 50]). <-.p, [1 : 100])} R = (true. [1 : 50]) E — {(“’Pi [1 : 50]). (p, [1:50])} Predicated array data-flow analysis M = {(true, [1 : 50]), (-.p, [50 : 100])} R = (true, [1 : 50]) E = 0 Figure 6.1: An example showing the difference of merge operations in guarded array regions and predicated array data-flow analysis. operations on predicated data-flow values that greatly enhance its precision: pred icate embedding folds predicates into the data-flow value, and predicate extraction derives predicates from operations on data-flow values. The definition of the merge operation in guarded array regions is slightly different from that of our analysis as explained in Chapter 4.3.3. The example in Figure 6.1 illustrates this difference. In guarded array data-flow analysis, array regions are not merged unless they are identical, so it cannot prove that read accesses of array help are not upwards exposed. On the other hand, our merge operation produces more precise results when one of the regions is strictly contained in the other such as the example in Figure 6.1. 
for i = 1 to c
  if (p) then
    for j = 1 to 50
      help[j] = ...
    endfor
  else
    for j = 1 to 100
      help[j] = ...
    endfor
  endif
  for j = 1 to 50
    ... = help[j]
  endfor
endfor

The main emphasis of the work by Gu, Li, and Lee was to show that an effective path-sensitive array data-flow analysis can be efficiently implemented. To demonstrate that, they applied their implementation to the programs from the Perfect benchmark suite to show its efficient compilation time. They also presented a list of loops that can be parallelized by their analysis. However, they did not show the overall performance improvement of the programs obtained using their analysis. On the other hand, in Chapter 5, we presented a comprehensive evaluation of our analysis, including improvements in granularity, coverage, and speedup across programs from 3 benchmark suites.

6.4.2 Constraint-Based Array Dependence Analysis

Traditional array data dependence analysis is a memory-based dependence analysis which attempts to determine whether or not two array references access the same element of memory. In order to perform array privatization analysis, however, a compiler requires analysis of the flow of values in arrays, which is called value-based dependence analysis.

Pugh and Wonnacott [42] describe dependences in terms of sets of constraints that define a mapping between integer tuples, and perform value-based data dependence analysis using the extended Omega test [40], which is based on an extension of Fourier's method of variable elimination to integer programming. The constraints
They also perform conditional data dependence analysis to derive conditions under which a data dependence can be eliminated; such conditions can be extracted from a dependence relation by existentially quantifying the variables that correspond to the loop index variables. These conditions can be either proved at compile time or used as a run-time test to guard a conditionally parallel loop. However, investigating every dependence that can be eliminated in this fashion is impractical due to too many false alarms, so they use a gist operation to determine only the interesting conditions under which a dependence exists. The gist of a set of constraints p with respect to a set of given constraints q is a simple system of constraints such that (gist p given q) ∧ q = p ∧ q. The given conditions are either inferred from the program or added by user assertions.

The most fundamental difference between our work and theirs is that our approach is based on a data-flow analysis framework using array summaries, while constraint-based array dependence analysis is based on a data-dependence analysis framework which performs pairwise comparison of all the array accesses within a loop. It is well known that there are accuracy/performance tradeoffs associated with these two different approaches, particularly in the presence of interprocedural analysis and arbitrary control flow.

(a)
for i = 1 to c
  for j = 1 to d
    if (p(i) > 0) then
      help = ...
    else if (x > 5) then
      help = ...
    endif
    ... = help
  endfor
endfor

(b)
for i = 1 to c
  for j = 1 to d
    if (x > 5) then
      help[j] = ...
  endfor
  for j = 1 to d
    ... = help[j]
  endfor
endfor

Figure 6.2: Examples showing the difference between constraint-based conditional data-dependence analysis and predicated array data-flow analysis.

Although both approaches have similar capabilities,
we believe that a data-flow analysis approach which can summarize multiple accesses with a single array region is a more efficient solution for array privatization, especially when the analysis is to be performed interprocedurally.

Figure 6.2 presents two examples which show the difference between the two approaches. In constraint-based dependence analysis, non-affine terms can be treated as "unknowns" or uninterpreted functions to improve the analysis quality for certain cases. When they simplify a Presburger formula containing function symbols, the structure of the formula can affect the accuracy of the result, which results in a conservative approximation for the flow dependence involving the zero-dimensional array help in Figure 6.2 (a) [42]. Predicated data-flow analysis easily handles this case by treating a scalar variable as a one-element array. In Figure 6.2 (b), the if condition inside the loop must be true once the loop is entered, which can be induced from the surrounding context of the loop. Their approach could parallelize the loop with a run-time test, but not statically as our approach can. In our approach, this kind of surrounding context (e.g., if conditions) is represented with predicates and propagated using interprocedural data-flow analysis.

The key advantage of their dependence analysis framework is that it can represent non-affine regions, which is not present in the SUIF system, and extract these to derive useful predicates in some cases. In fact, we think the best of both worlds would be to combine this representation of non-affine regions with our data-flow framework. Another difference is that we have performed fairly extensive experimentation with an interprocedural implementation in a complete compiler. In their work, they only implemented a prototype of their analysis to measure the accuracy and speed of their approach.
Again, as the predicated array data-flow analysis is implemented as part of a complete parallelizing compiler system, our system compiles and analyzes whole programs and has the ability to generate working parallel code with guarded conditionally parallel loops. It also enables us to measure the impact of our analysis on the overall speedup of applications, which is the main goal of parallelization.

6.5 Chapter Summary

In this chapter, we have presented several related works on parallelization. There exist many data-flow analysis techniques that gather path-sensitive data-flow values. Among these, two previous works that are closely related to our approach, guarded array regions and constraint-based array dependence analysis, have been compared in detail with our predicated array data-flow analysis. Our approach not only produces more precise data-flow information in many cases, but also provides an effective framework to derive run-time conditions. We have also presented a comprehensive evaluation of our approach with a complete implementation in a fully functional parallelizing compiler.

Chapter 7

Conclusion

This dissertation has presented the results of timely studies that address the same questions raised in the early 90's, whose answers have driven the advances in automatic parallelization over the last decade. In this chapter, we summarize the important results of our studies and discuss future research directions in parallelizing compilers.

7.1 Summary of the Dissertation

First, in Chapter 3, we performed a series of experiments to evaluate the effectiveness of the Stanford SUIF parallelizing compiler.
For these experiments, we developed the ELPD test, which augments the LPD test, a run-time parallelization testing technique that is similar to an inspector/executor run-time approach. The ELPD test allows run-time parallelization testing on entire programs, across procedure boundaries and multiple loops in a nest. We instrumented loops in 29 programs from three benchmark suites that are left unparallelized by the SUIF compiler to identify the remaining loops for which parallelization is safe. Among 410 instrumented loops, 141 loops are found to be parallel by our test. In addition, these potentially parallel loops missed by SUIF have a significant impact on the parallelism coverage and granularity of many benchmark programs. The result of this experiment indicates that there is still some room for improvement in automatic parallelization. In particular, incorporating control-flow information into analysis and extracting low-cost run-time tests from analysis to guard conditionally parallel versions of loops are the two major capabilities needed to exploit the remaining opportunities.

These two requirements can be met with a single new analysis technique, the predicated array data-flow analysis presented in Chapter 4, whereby predicates are associated with data-flow values and are used to refine the original data-flow values. This technique not only improves the results of compile-time parallelization analysis, but it can also be used to derive low-cost run-time parallelization tests that guarantee the safety of parallelization for loops that cannot be parallelized statically. We incorporate into our analysis predicate embedding and extraction operators that translate between the domain of predicates and data-flow values to produce more precise predicated data-flow values.
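The predicated merge that underlies this precision gain can be illustrated on the pattern of Figure 6.1, where one branch's write region strictly contains the other's. The sketch below uses one-dimensional integer sections and symbolic predicate strings; the type and function names are invented for illustration, and the actual analysis operates on systems of linear inequalities rather than simple intervals.

```python
from typing import NamedTuple

class PredRegion(NamedTuple):
    pred: str   # symbolic guard: 'true', 'p', 'not p', ...
    lo: int
    hi: int     # inclusive array section [lo:hi]

def merge_branches(pred, then_sec, else_sec):
    """Merge the write sections of an if/else whose branches execute
    under `pred` and its negation (simplified: both sections are assumed
    to start at the same lower bound).  The part written on both paths
    gets guard 'true'; any excess keeps the guard of its branch."""
    (a1, b1), (a2, b2) = then_sec, else_sec
    lo, hi = max(a1, a2), min(b1, b2)   # written on both paths
    merged = []
    if lo <= hi:
        merged.append(PredRegion('true', lo, hi))
    if b1 > hi:
        merged.append(PredRegion(pred, hi + 1, b1))
    if b2 > hi:
        merged.append(PredRegion('not ' + pred, hi + 1, b2))
    return merged
```

For the loop of Figure 6.1, merging the write set [1:50] under p with [1:100] under not p yields [1:50] under true plus [51:100] under not p, so a subsequent read of [1:50] is provably covered on every path and is not upwards exposed.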
The predicated array data-flow analysis is fully implemented in the Stanford SUIF compiler, and the experimental results from the implementation have been presented in Chapter 5. They demonstrate that the predicated array data-flow analysis parallelized many more loops than SUIF's base implementation (64 loops across 9 programs out of 150 parallelizable loops missed by SUIF), and yielded speedup improvements on 5 programs.

7.2 Future Work

In our experiments across three benchmark suites, we see additional opportunities that might be parallelized with extensions to our current definition of the predicated data-flow analysis. In particular, it could be used to conditionally parallelize certain loops with exits on error conditions, currently not considered candidates for parallelization, with run-time tests guaranteeing the error condition is not met. Other opportunities for increasing the number of parallelized loops arise if we derive a predicated version of other data-flow analyses in a parallelizing compiler, particularly scalar induction variable recognition.

Another interesting extension is parallelizing loops with run-time tests involving certain classes of loop-varying variables. Such a run-time test would execute a small portion of the loop body prior to the loop's execution. For example, this would allow the compiler to verify that a nonlinear subscript expression was monotonically increasing, even if this could not be proven at compile time. Loops requiring a control inspector could also be parallelized in this way. Neither of these techniques requires the use of shadow arrays, so this type of inspector can be significantly more efficient than the more general inspectors used by the LPD test.
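Such a test might be generated as a lightweight pre-loop inspector. The sketch below, with an invented subscript array `idx` and loop body `f`, checks strict monotonicity before the loop and then dispatches to a parallel or sequential version; it is an illustration of the idea, not generated compiler output.

```python
def can_run_parallel(idx):
    """Pre-loop inspector sketch: if the subscript sequence is strictly
    increasing, every iteration writes a distinct element, so the loop
    iterations are independent."""
    return all(idx[k] < idx[k + 1] for k in range(len(idx) - 1))

def run_loop(a, idx, f):
    """Two-version loop: a[idx[i]] = f(i) for each iteration i."""
    if can_run_parallel(idx):
        # Parallel version: iteration order no longer matters (executed
        # sequentially in this sketch as a placeholder).
        for i in range(len(idx)):
            a[idx[i]] = f(i)
    else:
        # Sequential fallback preserves the original execution order.
        for i in range(len(idx)):
            a[idx[i]] = f(i)
    return a
```

Unlike the shadow-array inspectors, this test touches only the subscript values, so its cost is proportional to the number of iterations rather than the aggregate size of the arrays.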
Another interesting area of future study is how to integrate run-time tests derived from predicated array data-flow analysis with an inspector/executor approach. We observed a few inherently parallel loops in our experiments where predicated array data-flow analysis is not applicable, such as when arrays are used in subscript expressions. The run-time tests arising from predicated analysis and an inspector/executor approach are complementary, and a parallelization system that combines the two techniques and uses array data-flow analysis to optimize run-time tests as much as possible is desirable to exploit all of the remaining inherent parallelism in these programs.

7.3 Final Remarks

Over the last decade, automatic parallelization technology has become increasingly successful at finding loop-level parallelism in many benchmark programs. These dramatic advances in parallelization technology can be attributed to the earlier work on identifying parallelization opportunities. This dissertation provides similar information on future parallelization opportunities based on today's state-of-the-art technologies. We anticipate that the results of this dissertation will provide useful information to compiler researchers and engineers in the future development of new parallelization technologies.

Each compiler optimization can be viewed as a two-step process: an analysis phase to identify the opportunities and a transformation phase to make them happen. Automatic parallelization also involves an analysis phase, which finds regions of a program that can be safely executed in parallel, and a transformation phase, which modifies the original program to be efficiently executed on a given parallel machine.
While this dissertation focused mainly on how to improve the quality of the former phase of automatic parallelization, there are many important issues that require further investigation in the latter phase, such as exploiting nested parallelism, data transformations for better cache behavior and less inter-processor communication, reducing synchronization overhead, and so on.

Finally, in this dissertation, we have concentrated on the application of predicated data-flow analysis to array data-flow analysis. Array data-flow analysis is an important analysis technique not only for automatic parallelization of loops but also for many other optimization problems, such as communication optimization, array data partitioning, and so on. Therefore, it is critical for those optimizations to have more precise information using predicated data-flow analysis.

Reference List

[1] Saman P. Amarasinghe. Parallelizing Compiler Techniques Based on Linear Inequalities. PhD thesis, Dept. of Electrical Engineering, Stanford University, January 1997.

[2] Glenn Ammons and James R. Larus. Improving data-flow analysis with path profiles. In Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, pages 72-84, Montreal, Canada, June 1998.

[3] David Bailey, Tim Harris, William Saphir, Rob van der Wijngaart, Alex Woo, and Maurice Yarrow. The NAS parallel benchmarks 2.0. Technical Report NAS-95-020, NASA Ames Research Center, December 1995.

[4] Vasanth Balasundaram and Ken Kennedy. A technique for summarizing data access and its use in parallelism enhancing transformations. In Proceedings of the ACM SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 41-53, Portland, Oregon, June 1989.

[5] William Blume, Ramon Doallo, Rudolf Eigenmann, John Grout, Jay Hoeflinger,
Thomas Lawrence, Jaejin Lee, David Padua, Yunheung Paek, Bill Pottenger, Lawrence Rauchwerger, and Peng Tu. Parallel programming with Polaris. IEEE Computer, 29(12):78-82, December 1996.

[6] William Blume and Rudolf Eigenmann. Performance analysis of parallelizing compilers on the Perfect Benchmark programs. IEEE Transactions on Parallel and Distributed Systems, 3(6):643-656, November 1992.

[7] William J. Blume. Symbolic Analysis Techniques for Effective Automatic Parallelization. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, June 1995.

[8] Rastislav Bodík, Rajiv Gupta, and Mary Lou Soffa. Interprocedural conditional branch elimination. In Proceedings of the ACM SIGPLAN '97 Conference on Programming Language Design and Implementation, pages 146-158, Las Vegas, Nevada, June 1997.

[9] Keith D. Cooper, Mary W. Hall, Robert T. Hood, Ken Kennedy, Kathryn S. McKinley, John M. Mellor-Crummey, Linda Torczon, and Scott Warren. The ParaScope parallel programming environment. Proceedings of the IEEE, 81(2):244-263, February 1993.

[10] Keith D. Cooper, Mary W. Hall, and Ken Kennedy. A methodology for procedure cloning. Computer Languages, 19(2):105-117, April 1993.

[11] Patrick Cousot and Radhia Cousot. Systematic design of program analysis frameworks. In Conference Record of the Sixth Annual ACM Symposium on Principles of Programming Languages, pages 269-282, San Antonio, Texas, January 1979.

[12] Béatrice Creusillet and François Irigoin. Interprocedural array region analyses. In Proceedings of the 8th International Workshop on Languages and Compilers for Parallel Computing, pages 1-15, Columbus, Ohio, August 1995.

[13] Rudolf Eigenmann, Jay Hoeflinger, and David Padua. On the automatic parallelization of the Perfect Benchmarks. IEEE Transactions on Parallel and Distributed Systems, 9(1):5-23, January 1998.
[14] Paul Feautrier. Array expansion. In Proceedings of the 1988 ACM International Conference on Supercomputing, pages 429-441, Saint Malo, France, July 1988.

[15] Paul Feautrier. Parametric integer programming. RAIRO Recherche Opérationnelle, 22:243-268, September 1988.

[16] Paul Feautrier. Dataflow analysis of scalar and array references. International Journal of Parallel Programming, 20(1):23-53, October 1991.

[17] Gina Goff. Practical techniques to augment dependence analysis in the presence of symbolic terms. Technical Report TR92-194, Dept. of Computer Science, Rice University, October 1992.

[18] Gina Goff. Practical Techniques to Augment Dependence Analysis in the Presence of Symbolic Terms. PhD thesis, Dept. of Computer Science, Rice University, April 1998.

[19] Elana D. Granston and Alexander V. Veidenbaum. Detecting redundant accesses to array data. In Proceedings of Supercomputing '91, Albuquerque, New Mexico, November 1991.

[20] Junjie Gu. Interprocedural Array Data-Flow Analysis. PhD thesis, Dept. of Computer Science, University of Minnesota, December 1997.

[21] Junjie Gu, Zhiyuan Li, and Gyungho Lee. Symbolic array dataflow analysis for array privatization and program parallelization. In Proceedings of Supercomputing '95, San Diego, California, December 1995.

[22] Junjie Gu, Zhiyuan Li, and Gyungho Lee. Experience with efficient array data flow analysis for array privatization. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, pages 157-167, Las Vegas, Nevada, June 1997.

[23] Mohammad R. Haghighat. Symbolic Analysis for Parallelizing Compilers. PhD thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, August 1994.

[24] Mary W. Hall, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, and Monica S. Lam.
Detecting coarse-grain parallelism using an interprocedural parallelizing compiler. In Proceedings of Supercomputing ’ 95, San Diego, Cali fornia, December 1995. [25] Mary W. Hall, Jennifer M. Anderson, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, Edouard Bugnion, and Monica S. Lam. Maximizing multipro cessor performance with the SUIF compiler. IEEE Computer, 29(12) :84-89, December 1996. [26] Mary W. Hall, Brian R. Murphy, Saman P. Amarasinghe, Shih-Wei Liao, and Monica S. Lam. Interprocedural analysis for parallelization. In Proceedings of the 8th International Workshop on Languages and Compilers for Parallel Computing, pages 61-80, Columbus, Ohio, August 1995. 152 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [27] L. Howard Holley and Barry K. Rosen. Qualified data flow problems. In Con ference Record of the Seventh Annual ACM Symposium on Principles of Pro gramming Languages, pages 68-82. Las Vegas, Nevada, January 1980. [28] Frangois Irigoin. Interprocedural analyses for programming environments. In Proceedings of the NSF-CNRS Workshop on Environment and Tools for Parallel Scientific Programming, September 1992. [29] Frangois Irigoin, Pierre Jouvelot, and Remi Triolet. Semantical interprocedural parallelization: An overview of the PIPS project. In Proceedings of the 1991 ACM International Conference on Supercomputing, pages 244-251, Cologne, Germany, June 1991. [30] Thomas Lawrence. Implementation of run time techniques in the Polaxis For tran restructurer. Master’s thesis, Dept, of Computer Science, University of Illinois at Urbana-Champaign, 1996. [31] Shih-Wei Liao. SUIF Explorer: an interactive and interprocedural parallelizer. PhD thesis, Dept, of Electrical Engineering, Stanford University, August 2000. [32] Shih-Wei Liao, Amer Diwan, Jr. Robert P. Bosch, Anwar Ghuloum, and Mon ica S. Lam. SUIF Explorer: an interactive and interprocedural parallelizer. 
In Proceedings of the Seventh ACM SIGPLAN Symposium on Principles & Prac tice of Parallel Programming, pages 37-48, Atlanta, Georgia, May 1999. [33] Kathryn S. McKinley. Automatic and Interactive Parallelization. PhD thesis, Dept, of Computer Science, Rice University, April 1992. [34] John Mellor-Crummey. Compile-time support for efficient data race detection in shared-memory parallel programs. In Proceedings of ACM /O NR Workshop on Parallel and Distributed Debugging, pages 129-139, May 1993. [35] Sungdo Moon and Mary W. Hall. Evaluation of predicated array data-flow analysis for automatic parallelization. In Proceedings of the Seventh ACM SIG P LA N Symposium on Principles & Practice of Parallel Programming, pages 84-95, Atlanta, Georgia, May 1999. [36] Sungdo Moon, Mary W. Hall, and Brian R. Murphy. Predicated array data-flow analysis for run-time parallelization. In Proceedings of the 1998 ACM Interna tional Conference on Supercomputing, pages 204-211. Melbourne, Australia, July 1998. 153 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [37] Flemming Nielson. Expected forms of data flow analysis. In Harald Ganzinger and Neil D Jones, editors, Programs as Data Objects, volume 217 of Lecture Notes on Computer Science, pages 172-191. Springer-Verlag, October 1986. [38] Yunheung Paek, Jay Hoeflinger, and David Padua. Simplification of array ac cess patterns for compiler optimizations. In Proceedings of the ACM SIG PLAN ’ 98 Conference on Programming Language Design and Implementation, pages 60-71, Montreal, Canada, June 1998. [39] Ravi Ponnusamy, Joel Saltz, Alok Choudhary. Yuan-Shin Hwang, and Geoffrey Fox. Runtime support and compilation methods for user-specified irregular data distributions. IEEE Transactions on Parallel and Distributed Systems, 6(8):815-831, August 1995. [40] William Pugh. A practical algorithm for exact array dependence analysis. Com munications of the ACM, 35(8): 102-114, August 1992. 
[41] William Pugh and David Wonnacott. Eliminating false data dependences using the Omega test. In Proceedings of the ACM SIGPLAN ’ 92 Conference on Pro gramming Language Design and Implementation, pages 140-151, San Francisco, California, June 1992. [42] William Pugh and David Wonnacott. Constraint-based array dependence anal ysis. ACM Transactions on Programming Languages and Systems, 20(3) :635- 678, May 1998. [43] Lawrence Rauchwerger and David Padua. The LRPD test: Speculative run time parallelization of loops with privatization and reduction parallelization. In Proceedings of the ACM SIG PLAN ’ 95 Conference on Programming Language Design and Implementation, pages 218-232, La Jolla, California, June 1995. [44] Joel H. Saltz. Ravi Mirchandaney, and Kay Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, 40(5):603-612, May 1991. [45] Shamik D. Sharma, Anurag Acharya, and Joel Saltz. Deferred data-flow analy sis: Algorithms, proofs and applications. Technical Report UM D-CS-TR-3845, Dept, of Computer Science, University of Maryland, November 1997. 154 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. [46] Jaswinder Pal Singh and John L. Hennessy. An empirical investigation of the effectiveness and limitations of automatic parallelization. In Proceedings of the International Symposium on Shared Memory Multiprocessing, April 1991. [47] Byoungro So, Sungdo Moon, and Mary W. Hall. Measuring the effectiveness of automatic parallelization in SUIF. In Proceedings o f the 1998 ACM Inter national Conference on Super computing, pages 212-219, Melbourne, Australia, July 1998. [48] Robert E. Strom and Daniel M. Yellin. Extending typestate checking using conditional liveness analysis. IEEE Transactions on Software Engineering, 19(5):478-485, May 1993. [49] Peng Tu. Automatic Array Privatization and Demand-driven Symbolic Anal ysis. 
Appendix A: Detailed Results of Instrumented Loops

In this appendix, we provide detailed results for the individual loops, from programs in three benchmark suites, that were instrumented in the experiments presented in Chapter 3. These are the loops that SUIF could not parallelize and that are not nested inside other parallel loops. Since none of the loops from swim and turb3d from SPECFP95, or from appbt, appsp, and embar from NAS, were instrumented, those programs are not included here. In each of the following tables, the first column gives the subroutine and line number of the loop. The second column indicates whether the loop is parallelizable: D indicates that the loop contains variables that carry dependences across loop iterations that cannot be removed with privatization, while I indicates that all the variables within the loop are either independent or privatizable. The next three columns list the variables that are instrumented and proven to be dependent, independent, or privatizable by the ELPD test. The final column gives a category for the loop corresponding to the classification defined in Chapter 3.
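To make the three per-variable outcomes concrete, the following sketch classifies one array from per-iteration access traces in the spirit of LRPD-style marking tests. This is a hypothetical illustration only: the function name `classify_array` and the trace representation are ours, and the actual ELPD instrumentation described in Chapter 3 differs in its mechanics (and additionally handles cases such as reductions).

```python
# Hypothetical sketch of an LRPD-style marking test (not the actual
# ELPD implementation): classify one array's cross-iteration behavior.
def classify_array(array_len, iterations):
    """iterations: one trace per loop iteration, each a list of
    ("read"|"write", index) tuples in program order."""
    write_iters = [set() for _ in range(array_len)]    # iterations writing each element
    exposed_reads = [set() for _ in range(array_len)]  # reads not preceded by a same-iteration write
    for it, trace in enumerate(iterations):
        written_here = set()
        for op, idx in trace:
            if op == "write":
                written_here.add(idx)
                write_iters[idx].add(it)
            elif idx not in written_here:              # upward-exposed read
                exposed_reads[idx].add(it)
    for idx in range(array_len):
        # an element read before being written in one iteration but
        # written in another carries a genuine cross-iteration dependence
        if any(w != r for r in exposed_reads[idx] for w in write_iters[idx]):
            return "dependent"
    if any(len(w) > 1 for w in write_iters):
        # only multiple iterations writing: removable by privatization
        return "privatizable"
    return "independent"
```

For example, a loop in which each iteration writes then reads its own element is independent; one in which every iteration writes then reads the same element is privatizable; and one in which an iteration reads an element written by an earlier iteration is dependent.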
Loop D/I Dependent Independent Privatizable Category
SPECFP95 applu
blts-551 D v DA
blts-553 D v DA
blts-555 D v DA
blts-580 D v,tmat DA
blts-604 D v DA
blts-606 D v DA
buts-657 D v DA
buts-659 D v DA
buts-661 D v DA
buts-686 D tv,tmat DA
buts-710 D tv DA
buts-712 D tv DA
apsi
run-909 I c,savey help,helpa,savex CF+BC
run-953 I pott,savey help,helpa,savex CF+BC
run-1015 I ux,savey help,helpa,savex CF+BC
run-1044 I help,savex CF+BC
run-1083 I vy,savey help,helpa,savex CF+BC
run-1155 I savey help,savex CF+BC
dcdtz-1331 I helpa CF
dtdtz-1452 I helpa,cn,an,dks CF
hyd-1531 D p DA
dudtz-1656 I unu,helpa,cn,an,dks,dum CF
dvdtz-1798 I unv,helpa,cn,an,dks,dvm CF
trid-3182 D an,fn DA
trid-3195 D gn DA
tridc-3221 D fn DA
tridc-3234 D gn DA
strech-3384 D h DA
smth-3436 D f,fl DA
smthf-3462 D f,fl DA
setall-4048 I c,q,cmass,zn,yn,xn CF(+loop split)
setall-4097 D tot cmass,zn,lz,xn,lx yn,iy RED
setall-4099 D tot cmass,zn,lz,xn,lx yn,ly RED
setall-4128 I cmass CF(scalar)
setall-4130 I cmass CF(scalar)
topo-4539 I height,rufnes CF(scalar)
rffti1-5030 D l1,is DA
rfftb1-5084 D c,ch,iw,l1,na CD
rfftf1-5668 D c,ch,l2,na,iw CD
dkzmh-6269 I u,t CF
dkzmh-6347 I u,t ST
topbl-6432 I klev CF(scalar)
Table A.1: Detailed results of instrumented loops (continued on the next pages).
SPECFP95 hydro2d
gridco-934 D zb DA
gridco-966 D rb DA
ismin-3954 D smin imin RED
ismax-4005 D smax imax RED
wnfle-4109 D nval IND
mgrid
MAIN-75 D r,u CD
mg3p-108 D r CD
mg3p-117 D u r CD
norm2u3-357 D rnmu RED
norm2u3-358 D rnmu RED
norm2u3-359 D rnmu RED
setup-376 D mm AD
setup-382 D ir AD
zran3-435 I z IE
su2cor
MAIN-243 D u,ndat,f,h,w,sum,sav,ireg,ifreq,isa,ipr w11sum,w11int,w22int,w22sum CD
sweep-407 D u,ireg c2,w2,w1,w0,r1,r2 CD
sweep-411 D u,ireg c2,w2,w1,w0,r1,r2 CD
sweep-420 I w1,w2 CF
sweep-440 D ireg w0,lacpt,r1,r2 CD
geom-592 D ibase DA
bestab-950 D rd DA
bestab-961 D s1 DA
bestab-971 D rd,f DA
corr-1344 I h,f,ndat,sum,sav BC
int4v-1092 D chidm1,chid bes1,bes2,bes3 DA
int2v-1190 D phid,chid c1,c2 DA
loops-1426 D lli2 LS
loops-1427 D lli2 LS
loops-1444 I wloop1,wloop2,w1,w2,w,r,s,wp OD
loops-1452 I s,r wloop1,w1,w2,wp OD
loops-1453 I s,r wloop1,w1,w2,wp OD
loops-1463 I s,r wloop1,w1,w2,wp OD
loops-1557 I wloop2,w1,w2,wp CF
loops-1558 I wloop2,w1,w2,wp CF
loops-1568 I wloop2,w1,w2,wp CF
loops-1613 I r,s w1,w2 CF
loops-1615 I r,s w1,w2 CF
loops-1636 I r,s IE
SPECFP95 su2cor (continued)
loops-1659 I wloop1,wloop2,w1,w2,w,wp CF
loops-1660 I wloop1,wloop2,w1,w2,w,wp CF
loops-1670 I wloop1,wloop2,w1,w2,w,wp CF
accum2-1918 D f AD
stat2-1978 D iflag,u,s2,s1 DA
trngv0-2161 D ireg DA
trngv0-2172 I ireg BC
trngv0-2182 I ireg BC
init-2225 D ireg CD
init-2226 D ireg CD
init-2227 D ireg CD
trngv1-2250 D ireg DA
trngv1-2261 I ireg BC
trngv1-2266 I ireg BC
tomcatv
MAIN-141 D d,rx,ry DA
MAIN-152 D rx,ry DA
wave5
init-550 D dny,yn,xn DA
init-552 D yn,xn CI
init-612 D dnx,ysav jsx,yn,xn,jsy,nply,nplxa,ix1,ix0 DA
init-625 D dnx yn,xn CI
init-628 D dnx LS
init-681 I yn,xn ksp DD
init-683 I yn,xn ksp DD
bcnd-1281 D y epinb,epinl,ix0,ix1,lb,epinr,epint,xh,yh,x,epoutb,epoutl,elost,epoutr,epoutt,istr,qlostb,qlostl,lstr,qlostr,qlostt,scr CD
field-3087 I w BC
field-3118 I ss BC
field-3367 I q BC
field-3396 I q BC
field-3420 I q BC
field-3450 I q BC
field-3465 I q CF
field-3493 I q BC
field-3543 I ex,ey,ez,bx,by,bz BC
field-3564 I ex,ey,ez,bx,by,bz BC
Loop D /I D ep en d en t In d ep en d en t P rivatizable C ateg o ry Specfp95 wave5 (continued) vslvlp-4600 D tmpl,tm p2,q DA vslvlp-4648 D q DA fftf-5064 I temp,coef CF+BC fftb-5082 I temp,coef CF+BC smooth-5143 D q,tem p j2 jl CD smooth-5156 D q ,tem p j2 jl CD smooth-5197 D q,temp,i2,il CD smooth-5210 D q,temp,i2,il CD parmvr-5504 D jb,ib CD rfftb1-6323 D c,ch,iw,ll,na CD rfftfl-6395 D ch,c,12,na,iw CD rfftil-6495 D 11, is DA radf3-7352 I ch OD Nas applu blts-525 D V DA blts-527 D V DA blts-529 D V DA blts-553 D v,tmat DA blts-575 D V DA blts-577 D V DA buts-622 D V DA buts-624 D V DA buts-626 D V DA buts-650 D tv,tmat DA buts-672 D tv DA buts-674 D tv DA buk MAIN-41 D t23,r23,t46,r46, DA ks,seed MAIN-56 D key rank,keyden CD MAIN-81 I key2 IE bucksort_-131 D keyden DA bucksort_-137 D keyden CD cgm MAIN-246 D x,zeta,zetal r,q,zetapr,resid DA makea-381 D size,tran,nnza arow,acol,aelt,ks, iv,v,rowidx DA t46,t23,r46,r23 sparse-458 D colstr DA sparse-472 D colstr CD sparse-484 D colstr DA sp arse-500 D a,rowicbc,mark, nzloc CD xjajpl,nza sparse-507 D mark,x,nzcol nzloc CD sparse-521 D nza mark,x END Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
Loop D /I D ependent In d ep en d en t P riv atizab le C ateg o ry Nas cgm (continued) spmvc-596 I mark OD vecset-629 I set OD cgsol-693 D Pir rho q DA fftpde MAIN-45 D t46,t23,r46,r23,ks,tl DA MAIN-54 I t46,t23,r46, CF r23,ks,x0 MAIN-126 I u x2,y CF vranlc-314 D X DA cffts-380 I u y CF cfftz-453 D ln,kn,ku IND mgrid MAIN-62 D r,u CD mg3p-97 D r CD rag3p-106 D u r CD norm2u3-343 D rnmu RED norm2u3-344 D rnmu RED norm2u3-345 D rnmu RED setup-362 D mm AD setup-368 D ir AD zran3-405 D t23,r23,t46,r46,ks,x DA zran3-406 D t23,r23,t46,r46,ks,x DA zran3-415 D jlj2 j3 ,te n CD zran3-416 D jlj2 J3 ,te n CD zran3-417 D j l j2 j3,ten CD zran3-446 I z IE 1 vranl-663 D X DA P e r f e c t adm run-862 I c,savey help, helpa, savex CF+BC run-902 I pott,savey help,helpa,savex CF+BC run-960 I ux,savey help,helpa,savex CF+BC run-985 I help,savex CF+BC run-1020 I vy,savey help,helpa,savex CF+BC run-1084 I savey help,savex CF+BC dcdtz-1226 I helpa CF dtdtz-1325 I helpa, cn, an, dks CF hyd-1395 D P DA dudtz-1498 I unu,helpa,cn, CF an,dks,dum dudtz-1506 I conv CF dvdtz-1618 I unv,helpa,cn, CF an,dks,dvm trid-2680 D an,fn DA trid-2689 D g n DA Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
PERFECT adm (continued)
tridc-2707 D fn DA
tridc-2716 D gn DA
strech-2796 D h DA
smth-2835 D f,fl DA
smthf-2852 D f,fl DA
setall-3317 I c,q,cmass,zn,yn,xn CF(+loop split)
setall-3366 D tot cmass,zn,lz,xn,lx yn,iy RED
setall-3368 D tot cmass,zn,lz,xn,lx yn,ly RED
setall-3397 I cmass CF(scalar)
setall-3399 I cmass CF(scalar)
topo-3708 I height,rufnes CF(scalar)
rffti1-4022 D l1,is DA
rfftb1-4058 D c,ch,iw,l1,na CD
rfftf1-4467 D c,ch,l2,na,iw CD
dkzmh-4895 I u,t CF
dkzmh-4969 I u,t ST
topbl-5045 I klev CF(scalar)
arc2d
clustr-1646 D r DA
filerx-1967 I work IE
stepfx-2159 I s work,sn CF+BC
stepfy-2314 I s work,sn CF+BC
xpenta-3148 D x,y,f DA
xpenta-3186 D f DA
xpent2-3230 D x,y,f DA
xpent2-3277 D f DA
ypenta-3726 D x,y,f DA
ypenta-3764 D f DA
ypent2-3808 D x,y,f DA
ypent2-3855 D f DA
bdna
index1-930 D ix DA
index1-932 D ix DA
index1-934 D ix DA
index1-936 D ix DA
index1-938 D ix DA
index1-940 D ix DA
index1-961 D jx DA
index1-963 D jx DA
index1-970 D kx DA
mpoles-1330 I k CF+BC
water-1361 D n IND
cnvert-1508 I efact,kolle OD
cnvert-1513 I rfact,kolll OD
predic-1611 D t CD
PERFECT bdna (continued)
correc-1730 D i IND
correc-1731 D i IND
correc-1764 D i IND
correc-1778 I t IE
correc-1875 I t IE
correc-1929 D ns DA
correc-1945 I ks,js,is CF+BC
jacobi-2133 I a,v,m OD
jacobi-2135 I a,v,m OD
input-2363 D k IND
restar-2774 I icheck OD
restar-2800 I t IE
restar-2811 D i IND
actfor-3049 I foxp,foyp,fozp,f1xp,f1yp,f2xp,f2yp,f2zp,fpxp,fpyp,f1zp,fpzp,ind,xdt,ydt,zdt DD
actfor-3084 D ind,l IND
dyfesm
abc-233 I block OD
assemm-427 I suu OD
assemm-432 I suu OD
chosol-861 D b DA
chosol-877 D b AD
compl-968 D ht as,tr,els,cl DA
drassm-1166 D p0 pe,m0,ma CD
drbc-1305 I m0 OD
drelmn-1466 D idedon pe,fe,qdwght wtdet,ndx,ndy,xy,p CD
formr0-2183 D r0 CD
hop-3119 I xdplus,xplus,xd OD
hop0-3227 I xdplus,xplus OD
mass-4079 D ht DA
matinv-4289 I a CF
matinv-4302 I a IE
mxmult-4601 D mx CD
save-5177 I ivalid OD
sethm-5377 D node iwherd IND
sethm-5414 D iptr DA
sethm-5422 D iptr DA
setup-5510 D m1 DA
setup-5518 D id DA
solvh-6433 I he xe OD
solxdd-6713 I r OD
solxdd-6748 I z OD
solxdd-6793 I xdd OD
tranf-7364 D cl CD
PERFECT flo52
MAIN-83 I ny,nx OD
MAIN-131 D mstage DA
MAIN-166 D x,j2,i2,j1,i1 DA
geom-478 D ny,nx,kx x DA
mesh-542 I t,angl d,c,v,u OD
mesh-560 D angl v,u DA
euler-984 D rtmax,hmax jrt,irt,jh,ih RED
euler-985 D rtmax,hmax jrt,irt,jh,ih RED
grid-1803 I iflag OD
graph-1869 I jmax OD
mdg
mdmain-295 D var,tvir,tkin,ttmv,elpst vm CD
initia-709 D r3,r1,i2 iw,s,rmc t DA
initia-710 D r1,i2 r3,iw,s,rmc t DA
rand-785 D r1,ir DA
predic-818 D v CD
predic-819 I v ST
interf-982 I gg,ff,xl,yl,zl,rl IE
interf-986 I gg,ff,xl,yl,zl,rl IE
mg3d
migrat_-120 D sin45,sin60,sin72,sn3672,sq54,iflag,sin36 pm1,cdpm1,aw,work LS
march_-247 D pp1,pm1,aw,work CD
cfft9x_-550 D pp1,cdpp1 ar,ai,la work CD
cfft9x_-576 D ar,ai,la work CD
fft991_-1415 D sin45,sin60,sin72,sin36,sn3672,sq54,iflag vin work LS
rfft_-1448 D c,a,ii,la,ma,sin60,iflag,sin36,sin45,sq54,sn3672,sin72 CD
rfft_-1478 D c,a,ii,la,ma,sq54,sin45,sin60,iflag,sin72,sn3672 sin36 CD
fax_-2717 D ifax DA
fax_-2719 D ifax AD
ocean
MAIN-424 D nxxxin,nxxout lfirst,txxxin,txxout equiv,cwork IND
MAIN-537 D nxxxin,nxxout,iexshf lfirst,txxxin,txxout,lxtime,npts,tftvmt,txacac,exacac,rm,log,eftvmt equiv,lxstep IND
MAIN-569 D nxxxin,nxxout,iexshf lfirst,txxxin,txxout,lxtime,npts,tftvmt,txacac,exacac,rm,log,eftvmt equiv,lxstep IND
PERFECT ocean (continued)
MAIN-647 D nxxxin,nxxout lfirst,txxxin,txxout equiv,cwork IND
MAIN-711 D nxxxin,nxxout nn,lfirst,txxxin,txxout equiv IND
MAIN-744 D nxxxin,nxxout rm,lfirst,txxxin,txxout equiv,ca IND
MAIN-802 D equiv,ca,nxxxin,nxxout nn,lfirst,txxxin,txxout CD
MAIN-830 D nxxxin,nxxout rm,lfirst,txxxin,txxout equiv IND
MAIN-854 D nxxxin,nxxout rm,lfirst,txxxin,txxout equiv,ca IND
MAIN-874 D equiv,ca,nxxxin,nxxout rm,lfirst,txxxin,txxout CD
MAIN-928 D cwork2,nxxxin,nxxout rm,lfirst,txxxin,txxout equiv,cwork CD
MAIN-959 D nxxxin,nxxout lfirst,txxxin,txxout equiv,cwork IND
MAIN-1037 I valid OD
ftrvmt-4223 D exj data DA
ftrvmt-4226 I data ST
ftrvmt-4227 I data ST
ftrvmt-4239 I data ST
ftrvmt-4240 I data ST
ftrvmt-4248 I data ST
ftrvmt-4249 I data ST
ftrvmt-4266 D data,jrev CD
ftrvmt-4270 D data CD
ftrvmt-4278 D data CD
rcs-3856 I a ST
rcs-3861 I a ST
csr-3930 I a ST
csr-3935 I a ST
scsc-4023 D y RED
qcd
choos-331 D mat,irand1,irand0 plaq DA
getpos-495 D j DA
hit-544 D cnt DA
init-607 D pu IND
legmak-716 D site,plegs DA
legmak-718 D site,plegs DA
legmak-720 D site DA
linkbr-767 D pt mat2,st,pstr,irand1,irand0 coord IND
linkbr-769 D pt mat2,st,pstr,irand1,irand0 coord IND
linkbr-771 D pt mat2,st,pstr,irand1,irand0 coord IND
linkbr-773 I mat2,st,pstr,irand1,irand0 coord CF+BC
PERFECT qcd (continued)
measur-993 I top,down,legs,bottom,up,strb,stru,strt,coord,stemp IE
measur-995 I top,down,legs,bottom,up,strb,stru,strt,coord,stemp IE
measur-1010 D coord,site,plegs up,top,stemp,down,bottom DA
measur-1012 D coord,site,plegs up,top,stemp,down,bottom DA
measur-1014 D site stemp DA
qqqlpc-1346 D coord,site bits,u1,u2,u3,temp,pstr DA
qqqlpc-1348 D coord,site bits,u1,u2,u3,temp,pstr DA
qqqlpc-1350 D site bits,u1,u2,u3,temp,pstr DA
qqqlpc-1358 D cnt DA
qqqlpc-1371 D cnt DA
qqqlpc-1385 D cnt temp,u3,bits DA
qqqlpc-1386 D cnt DA
qqqmea-1654 I legs,lpstr CF+BC
qqqmea-1658 I lpstr CF+BC
qqqlps-1564 D ctr IND
rotmea-1864 D lpstr down,top,bot,up,legs,pstr,stemp,wloop,coord DA
rotmea-1866 D lpstr down,top,pstr,bot,up,wloop,stemp,legs DA
rotmea-1889 D coord,site,plegs up,top,stemp,down,bottom DA
rotmea-1891 D coord,site,plegs up,top,stemp,down,bottom DA
rotmea-1893 D site stemp DA
setcol-1940 D index4,iptr,nblack,nred lisblk,lisred LS
setcol-1942 D index3,iptr,nblack,nred lisblk,lisred LS
setcol-1944 D index2,iptr,nblack,nred lisblk,lisred LS
setcol-1946 D index1,nblack,nred lisblk,lisred LS
update-2301 D u1,irand1,irand0 DA
update-2302 D u1,irand1,irand0 DA
update-2306 D u1,coord,irand1,irand0 DA
update-2313 D u1,irand1,irand0 DA
update-2317 D u1,coord,irand1,irand0 DA
PERFECT spec77
conkuo-254 D tt,a,b,c,d,lm,qfmthe,icall,thetae,tlast,tfmthe,last es press,tin,qin,esat,tmst,qrast,dqkuo,dtkuo DA
druk-499 I ps BC
fixtet-1019 D teta DA
fixtet-1028 I tet,q BC
fixtet-1056 I q BC
fixtet-1068 I tet BC
fixtet-1090 D tg DA
glats-1334 D rad DA
gwater-1875 D tt,a,b,c,d,lm,qfmthe,icall,thetae,tlast,tfmthe,last es b,eef,uf,tau,qf,af,rtg,cg,ps,a DA
iminv-2209 D biga RED
iminv-2211 D biga RED
iminv-2225 I a ST
iminv-2237 I a CF
iminv-2258 I a CF
iminv-2261 I a CF
iminv-2296 I a BC
iminv-2305 I a BC
lrgscl-2377 D prec,rhsat DA
MAIN-2571 D rq,te,tt,icall,tfmthe,last,q,di,ze,dt,a,b,c,d,x,lm,ifir,thetae,qfmthe,tlast es,rm g,f,b,relvor,plnwcs,dum,dvp,dvm,absvor,dpdlam,rmsdot,ekin,rqf,tef,tf,eef,uf,tau,qf,dphif,dlamf,af,rtg,z,w,ef,gf,vf,ff,bf,zef,vqf,uqff,vqff,uqf,t2,u2,q2,v2,dphi,t1,u1,q1,v1,dot,dup,dpdphi,rt,y,a,dlam,cg,ps,dif,rqff CD
MAIN-2608 D tem,te,a,b,x,di,ze,qm,zem,tlast,last,xhour,rm,q,dim,ifir,rq es,tt,c,d,tfmthe,qfmthe,thetae,lm,icall g,f,b,relvor,plnwcs,dum,dvp,dvm,absvor,dpdlam,rmsdot,ekin,rqf,tef,tf,eef,uf,tau,qf,dphif,dlamf,af,rtg,z,w,ef,gf,vf,ff,bf,zef,vqf,uqff,vqff,uqf,t2,u2,q2,v2,dphi,t1,u1,q1,v1,dot,dup,dpdphi,rt,y,a,dlam,cg,ps,dif,rqff,chour,thour CD
PERFECT spec77 (continued)
mstadb-2733 D t,tlast,last tt,a,b,c,d,icall,lm DA
mstadb-2754 D tlast,last tt,a,b,c,d,icall,lm t DA
mstadb-2759 D tlast,last tt,a,b,c,d,icall,lm t DA
pln2-2897 D prod,b,a DA
pln2-2909 D alp2,alp1 DA
poly-2925 D y2,y1 y3 DA
satvap-3099 D tlast,last DA
setsig-3227 D ci DA
sicdkd-3433 D dm,lll z,y,x mv,lv,em LS
sicdkd-3451 I z,y,x ST
track
finit4-418 D fil4cm tm1 DA
filtr4-717 D ihits AD
fptrak0-942 I itrold OD
extend0-1325 D isprec,nmhits,score1,score2,covtrk,prmtrk,iddsit,ihits,xh1,xh2 BC
nlfilt0-1623 D isprec,nmhits,score1,score2,kt,covtrk,prmtrk,iddsit,ihits,xh1,xh2 DA
truthp-3494 I dp,pinv,pp OD
matmlt-3548 I m3 ST
trfd
olda_-220 D mijkl LS
olda_-221 D mijkl LS
olda_-241 D lmax,lmin,mijkl LS