COMPILER OPTIMIZATIONS FOR ARCHITECTURES SUPPORTING SUPERWORD-LEVEL PARALLELISM

by

Jaewook Shin

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the
Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER SCIENCE)

August 2005

Copyright 2005 Jaewook Shin

Dedication

This dissertation is dedicated to my father Jungsik and mother Bokhee for their patience, prayer and encouragement.

Acknowledgments

I would like to give my special thanks to my advisor, Dr. Mary W. Hall, for her guidance, enthusiasm, and support, which made the completion of this dissertation possible. Her insight and suggestions proved invaluable in this work. I was fortunate to have Dr. Ulrich Neumann and Dr. Timothy Pinkston as members of the dissertation committee. I would like to thank Dr. Jacqueline Chame, who has been guiding this research together with my advisor. Dr. Saman Amarasinghe and Sam Larsen at MIT deserve special thanks: not only has their work on SLP been the basis of this research, but they also provided their implementation and expertise. My office mate Chun Chen was friendly and insightful, and he helped me with my math questions. I have also enjoyed working with Dr. Pedro Diniz, Tim Barrett, Heidi Ziegler, Yoonju Lee, Nastaran Baradaran and Spundun Bhatt, who have supported me both technically and emotionally. Finally, I would like to thank my wife Eunlim for her love, support and encouragement, without which I would not be writing this at this moment.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 INTRODUCTION
  1.1 SLP Compilation Technology
  1.2 SLP vs. Vectorizing Compilers
  1.3 Opportunities to Improve SLP
  1.4 Target Applications
  1.5 Target Architectures
    1.5.1 Multimedia Extension: AltiVec
    1.5.2 Processing-In-Memory: DIVA
  1.6 Motivating Examples
    1.6.1 SLP in the Presence of Control Flow
    1.6.2 Superword-Level Locality
  1.7 Contributions

2 BACKGROUND
  2.1 Data Dependence
  2.2 Data Reuse
  2.3 The SLP Algorithm
    2.3.1 Alignment Analysis
    2.3.2 Distance Analysis
    2.3.3 Packing
    2.3.4 Summary
  2.4 Predicate Analysis
    2.4.1 Predicated Execution
    2.4.2 If-conversion: RK-Algorithm
    2.4.3 Predicate Hierarchy Graph (PHG)
    2.4.4 Mutually Exclusive
    2.4.5 Predicate Covering
    2.4.6 Predicate CFG Generator

3 SUPERWORD-LEVEL PARALLELISM IN THE PRESENCE OF CONTROL FLOW
  3.1 Overview of the Algorithm
  3.2 Eliminating Superword Predicates
  3.3 Unpredicate
  3.4 Branch-On-Superword-Condition-Code (BOSCC)
    3.4.1 The Characteristics of BOSCC
    3.4.2 BOSCC Model
    3.4.3 Profiling Support to Compute PAFS
    3.4.4 Identifying BOSCC Predicates
    3.4.5 Inserting BOSCC Instructions

4 SUPERWORD-LEVEL LOCALITY
  4.1 Background and Motivation
  4.2 Overview of the Superword-Level Locality Algorithm
  4.3 Modeling Register Requirements & Number of Memory Accesses
    4.3.1 Computing the Superword Footprint
      4.3.1.1 Superword Footprint of a Single Reference
      4.3.1.2 Superword Footprint of a Group of References
    4.3.2 Registers for Reuse Across Iterations
    4.3.3 Putting It All Together
  4.4 Determining Unroll Factors
  4.5 Code Transformations
    4.5.1 Index Set Splitting
    4.5.2 Superword Replacement
    4.5.3 Packing in Superword Registers
    4.5.4 An Example: Shifting for Partial Reuse

5 CODE GENERATION
  5.1 Type Size Conversion
  5.2 Reduction
  5.3 Alignment Optimization
  5.4 Prepacking to Optimize Parallelization Overhead
  5.5 Summary

6 EXPERIMENTS
  6.1 Benchmarks
  6.2 Implementation
  6.3 Experimental Methodology
  6.4 Overall Performance
  6.5 Packing for Low Parallelization Overhead
  6.6 SLP in the Presence of Control Flow
    6.6.1 Branch-On-Superword-Condition-Code (BOSCC)
  6.7 Superword-Level Locality
  6.8 Summary

7 DIVA AND PIM-SPECIFIC OPTIMIZATIONS
  7.1 The DIVA ISA
  7.2 Page-Mode Memory Access
    7.2.1 Motivation
    7.2.2 The Page-Mode Memory Access Algorithm
    7.2.3 Experiments for the Page-Mode Memory Access Algorithm
  7.3 DIVA-Specific Code Generation
  7.4 Preliminary Bandwidth Demonstration
  7.5 Summary

8 RELATED WORK
  8.1 Exploiting SLP in the Presence of Control Flow
  8.2 Superword-Level Locality
  8.3 DIVA-specific Optimizations
  8.4 Summary

9 CONCLUSION
  9.1 Contributions
    9.1.1 SLP in the Presence of Control Flow
    9.1.2 Compiler Controlled Caching in Superword Registers
    9.1.3 Implementation and Evaluation of the Proposed Techniques
    9.1.4 DIVA-Specific Optimizations
  9.2 Future Work

Reference List

List of Tables

  1.1 Differences between multimedia extensions and vector architectures.
  2.1 Transfer functions for the dataflow analysis to find alignments when a1*n + b1 and a2*m + b2 are given.
  2.2 Transfer functions for distance analysis when two terms T1 = (X, c1, b1) and T2 = (X, c2, b2) of two linear expressions for the same variable X are given.
  4.1 Number of array accesses under different optimization paths.
  6.1 Benchmark programs.
  6.2 Runtime percentage of three functions from UCLA MediaBench.
  6.3 Input data size.
  7.1 Memory latency computation.
  7.2 DIVA simulation parameters.
  7.3 Benchmark programs.
  7.4 Experimental environments.

List of Figures

  1.1 Example: Parallelization by the SLP compiler.
  1.2 AltiVec register file.
  1.3 DIVA node architecture.
  1.4 Example: SLP in the presence of control flow.
  1.5 Motivating example for exploiting superword-level locality.
  2.1 An example code to show dependence vectors and unroll-and-jam.
  2.2 An example showing the packing algorithm.
  2.3 An example showing construction of a PHG.
  3.1 Overview of the algorithm to exploit SLP in the presence of control flow.
  3.2 Example illustrating steps of SLP compilation in the presence of control flow.
  3.3 Merging two superwords using a select instruction.
  3.4 Merging two superword definitions.
  3.5 An algorithm to generate select instructions.
  3.6 Restoring control flow.
  3.7 Unpredicate algorithm.
  3.8 Run time of synthetic kernels.
  3.9 Automatic instrumentation to compute PAFS in the profiling phase.
  3.10 Algorithm to identify a predicate for instructions.
  3.11 BOSCC insertion algorithm.
  4.1 Example code for SLL.
  4.2 Reuse across iterations.
  4.3 Superword footprint of a single reference.
  4.4 Superword footprint of a group of references.
  4.5 Code generation example.
  4.6 Operations used for packing in registers.
  4.7 Shifting.
  5.1 Parallelization of type size conversions.
  5.2 Parallelization of reduction sum.
  5.3 Parallelization of unaligned memory references.
  5.4 Parallelization by prepacking.
  5.5 Data dependence graphs for the loop body of Figure 5.4(b).
  5.6 Multiple packing choices generated by unrolling multiple loops.
  6.1 Implementation.
  6.2 Experimental flow.
  6.3 Overall speedup breakdown (large data).
  6.4 Overall speedup breakdown (small data).
  6.5 Effect of prepacking.
  6.6 An SLP-based compiler that supports BOSCC.
  6.7 Example: BOSCCs generated in EPIC.
  6.8 Speedups over scalar version for real data.
  6.9 TM: % taken BOSCCs.
  6.10 Speedups over scalar version for randomly generated data.
  6.11 Speedups over MIT-SLP.
  7.1 The superword data flow.
  7.2 Unroll-and-jam and reordering.
  7.3 The page-mode memory access algorithm.
  7.4 Sorting offset addresses.
  7.5 Experimental flow for page-mode memory access.
  7.6 SLP versions of VMM and MMM.
  7.7 Normalized execution time.
  7.8 Percentage of page-mode accesses.
  7.9 Speedup breakdown.
  7.10 Code generation for conditional execution in DIVA.
  7.11 StreamAdd.
  7.12 Run time of floating point StreamAdd.

Abstract

The increasing importance of multimedia applications in embedded and general-purpose computing environments has led to the development of multimedia extensions in most commercial microprocessors. At the core of these extensions is support for single instruction multiple data (SIMD) operations on superwords, that is, aggregate data objects larger than a machine word.

Several compilers have been developed to generate the SIMD instructions for multimedia extensions automatically. However, most are based on conventional vectorization technology. More recently, a technique called superword-level parallelization (SLP) was developed to exploit unique features of multimedia extensions, such as short vectors and single-cycle instruction latency. Instead of finding parallelism in loops, SLP finds parallelism between instructions, making this approach simpler and more robust than the vectorization technique.

We propose a new compiler framework based on SLP in which a number of optimizations are performed in a seamless fashion. First, we describe how to extend the concept of SLP in the presence of control flow constructs to increase its applicability. A key insight is that we can use techniques related to optimizations for architectures supporting predicated execution, even for multimedia instruction sets that do not provide hardware predication. Second, we treat the large superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses by exploiting reuse in superword registers. This approach also targets a research prototype, the DIVA processor-in-memory (PIM) device. We describe DIVA-specific optimizations, including a technique to exploit a DRAM memory characteristic automatically.

We implemented the new techniques in a complete compiler that generates SIMD instructions automatically from sequential programs. We describe the evaluation of our implementation on a set of 14 benchmarks. The speedups range from 1.05 to 19.22 over sequential performance.

Chapter 1

INTRODUCTION

The increasing importance of multimedia applications has led to the development of multimedia extensions in most commercial microprocessors.
At the core of these extensions is support for short single instruction multiple data (SIMD) operations on superwords, that is, aggregate data objects larger than a machine word.

Initially, the conventional wisdom was that the appropriate compiler technology for multimedia extensions would borrow heavily from automatic vectorization [64, 15, 19]. More recently, Larsen and Amarasinghe at MIT developed a new technique to parallelize codes specifically targeting multimedia extensions [39]. To distinguish the parallelism in multimedia extensions from vector parallelism, they define superword-level parallelism (SLP) as fine-grained SIMD parallelism within a superword. The new technique is simple and robust compared to existing vectorization techniques. Still, there remain open issues for multimedia extensions: how to exploit parallelism across basic block boundaries and how to exploit locality in superword registers. This thesis research was initiated as part of the Data IntensiVe Architecture (DIVA) project [31, 21]. DIVA employs processing-in-memory (PIM) technology by combining processing logic and DRAM on a single chip. To exploit the high internal memory bandwidth, the DIVA PIM devices support SIMD operations on 256-bit superword registers. Since DIVA is a new architecture, new compiler optimization opportunities exist. In this thesis, we describe our approach to the two open issues and several DIVA-specific optimizations.

The remainder of this chapter is organized as follows. In the next section, we give an overview of SLP compilation technology. In Section 1.2, we compare this technology with conventional vectorization techniques. To motivate the approaches taken in this thesis, we present the remaining opportunities to improve SLP in Section 1.3. Our target applications and target architectures are described in Section 1.4 and Section 1.5, respectively. In Section 1.6, we use simple examples to explain our approaches to the two open issues. The contributions of this thesis are summarized in the last section.

1.1 SLP Compilation Technology

The MIT SLP compiler finds parallelism within a basic block, a block of sequentially executed instructions. Given a basic block, it first identifies isomorphic statements, that is, statements with the same corresponding operations. The isomorphic statements are then packed into a superword statement, i.e., collected and replaced by a single superword statement, unless dependences prevent doing so. Packing memory references must satisfy further restrictions imposed by the hardware: the data elements to be referenced must be contiguous in memory, and the address of the first element must be aligned to a superword boundary, i.e., its runtime addresses must be congruent with respect to the superword width. While the steps described so far are enough to exploit parallelism within a basic block, for loop nests the innermost loop is unrolled to convert loop-level parallelism into basic-block-level parallelism. The unroll amount is determined by dividing the superword width by the smallest data type size, so that even the operations with the smallest operands can fully exploit SLP when packed into a superword. Figure 1.1 illustrates these steps using the example loop in (a). First, the loop is unrolled by 4, as shown in (b), assuming four array elements fit in a superword register. Then, the four isomorphic statements are packed into a superword statement, as shown in (c). Here, a[i:i+3] denotes the four array elements from a[i] to a[i+3].

    (a) Original:
        for (i = 0; i < 16; i++)
            a[i] = b[i] + c[i];

    (b) Unrolled:
        for (i = 0; i < 16; i += 4) {
            a[i+0] = b[i+0] + c[i+0];
            a[i+1] = b[i+1] + c[i+1];
            a[i+2] = b[i+2] + c[i+2];
            a[i+3] = b[i+3] + c[i+3];
        }

    (c) Parallelized:
        for (i = 0; i < 16; i += 4)
            a[i:i+3] = b[i:i+3] + c[i:i+3];

    Figure 1.1: Example: Parallelization by the SLP compiler.
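To make the result of Figure 1.1(c) concrete, the following is a minimal sketch of what the parallelized loop looks like at the AltiVec intrinsic level. It is our own illustration under stated assumptions, not the compiler's literal output: the function name add16 is ours, and the arrays are assumed 16-byte aligned so that vec_ld and vec_st are legal.

    #include <altivec.h>

    /* Hedged sketch of Figure 1.1(c) in AltiVec C intrinsics; offsets are in
       bytes, and each vec_ld/vec_st moves one superword (four ints). */
    void add16(int a[16], const int b[16], const int c[16])
    {
        for (int i = 0; i < 16; i += 4) {
            vector signed int vb = vec_ld(i * 4, b);   /* b[i:i+3] */
            vector signed int vc = vec_ld(i * 4, c);   /* c[i:i+3] */
            vec_st(vec_add(vb, vc), i * 4, a);         /* a[i:i+3] */
        }
    }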
1.2 SLP vs. Vectorizing Compilers

To understand the differences between vector and SLP compilers, we first consider the architectural differences between vector architectures and multimedia extensions. While there are many similarities, multimedia extensions differ from vector architectures in several respects, as listed in Table 1.1: they do not support strided memory accesses, vector length is short, instruction latency is roughly one cycle per instruction, and memory accesses are usually required to be aligned to superword boundaries.

                               Multimedia extensions      Vector architectures
    Strided memory access      Not supported              Supported
    Vector length              < 32                       > 64
    Instruction latency        ~1 cycle / instruction     ~1 cycle / data element
    Aligned memory accesses    Required                   Not required

    Table 1.1: Differences between multimedia extensions and vector architectures.

From the compiler's perspective, these differences mean that superword instructions mix better with scalar instructions than vector instructions do, but strided or unaligned memory accesses are more costly. Consequently, compared to vector architectures, parallelization overhead for multimedia extensions is less a function of vector length and more a function of alignment requirements and the cost of packing data elements.

Vectorizing compilers have long been used to generate vector instructions automatically for vector supercomputers. A set of loop transformations is applied to expose SIMD parallel operations suitable for vectorization [43]. Because such transformations are not always applicable, vectorization technology is fragile: small changes in application code greatly affect the compiler's ability to recognize vector operations. Compared to vectorization technology, the transformations used in SLP compilers, as discussed in Section 1.1, are quite simple and always applicable. Instead of complex loop transformations, SLP compilers apply only unrolling, scalar renaming, and packing of data for isomorphic statements. For SLP, failure to parallelize a single statement need not affect the parallelization of other statements.

1.3 Opportunities to Improve SLP

The SLP compiler developed by Larsen and Amarasinghe identifies parallelism only within a basic block. As a result, the simple and inherently parallel loop in Figure 1.4(a) would not be parallelized. Superword-level parallelization in the presence of control flow is still an open issue. Yet support for parallelizing control flow is important to multimedia applications. As one data point, control flow appears in key computations in 6 of the 11 codes in the UCLA MediaBench [41], comprising on average over 40% of their execution time.
Parallelizing computation using SLP can stress the memory system, since compute-bound programs can become memory bound when computation costs are reduced [57]. Thus, an additional optimization opportunity involves reducing the cost of memory accesses. An important feature of all multimedia extension architectures is a register file supporting SIMD operations (e.g., each register 128 bits wide in AltiVec), sometimes in addition to the scalar register file. A set of 32 such superword registers represents a not insignificant amount of storage close to the processor. Accessing data from superword registers, versus a cache or main memory, has two advantages. The most obvious is the lower latency of accesses: even a hit in the L1 cache has at least a 1-cycle latency, and accesses to other caches in the hierarchy or to main memory carry much higher latencies. The other is the elimination of memory access instructions, which reduces the number of instructions to be issued.

While the two optimization opportunities described above can also be exploited on DIVA, DIVA offers further architecture-specific compiler optimization opportunities. One is to exploit DIVA ISA features such as conditional execution and the permutation instruction. Another is to exploit DRAM device characteristics to further reduce memory latency; since DRAM access time is a large factor in the memory latency of DIVA, this is an opportunity for significant performance gain.

1.4 Target Applications

As described above, multimedia extension architectures are specially designed for multimedia application characteristics such as abundant data parallelism, short iteration counts and small data types [20]. Therefore, multimedia applications are the primary targets for our compiler technology. Scientific applications are the main focus of vector architectures, and for the purpose of comparison they form another important class of target applications for our compiler. Since our locality algorithm uses array subscript expressions in computing register requirements, its main target is array-based loops. However, our extension to exploit SLP in the presence of control flow is effective on some pointer-based applications as well, since the SLP algorithm can find the alignment and adjacency of pointer-based memory accesses. Nevertheless, applications with regular memory accesses of stride 1 are the best candidates for the SLP algorithm to satisfy the underlying hardware requirements. Our locality algorithm is most effective on data-intensive applications with a large amount of data reuse.

1.5 Target Architectures

The approach described in this thesis can be applied to any architecture supporting SLP. For evaluation purposes, however, our compiler implementation targets two machines: the PowerPC AltiVec and the DIVA processing-in-memory architecture.

1.5.1 Multimedia Extension: AltiVec

While most commercial microprocessors have multimedia extensions, the majority exploit SIMD parallelism within a machine word, called subword parallelism [42, 67, 56].
However, a few multimedia extension architectures have a separate SIMD register file whose width is larger than a machine word, including Intel SSE and PowerPC AltiVec. Figure 1.2 shows the AltiVec register file: AltiVec has 32 separate 128-bit superword registers in addition to the scalar register file, and each superword register can be used as sixteen 8-bit operands, eight 16-bit operands, or four 32-bit operands.

    Figure 1.2: AltiVec register file. Each of the 32 128-bit registers (SR0, SR1, ...)
    holds sixteen 8-bit, eight 16-bit, or four 32-bit operands.

In addition to the common features of multimedia extensions listed in Table 1.1, several details of AltiVec impact our compiler approach. First, there are 162 instructions beyond the standard PowerPC ISA [49]. However, some instructions perform very specialized operations, and not all general operations are supported for all data types. As a result, certain operations cannot be parallelized and must be executed in the scalar functional units. Second, AltiVec forces memory accesses to be aligned to superword boundaries by ignoring the last four bits of the address operands of memory accesses. Because of this requirement, we need an analysis to find the alignments of memory references, and additional operations are generated for superwords that are unaligned in memory. Third, AltiVec does not support data movement between the scalar register file and the superword register file. To move data from a scalar register to a superword register, the scalar data must be written into the memory address range of a superword, which is, in turn, loaded into a superword register. Because of these architectural features, automatic generation of superword instructions by a compiler is not easy and sometimes incurs high overhead.
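To make the third point concrete, the following is a minimal sketch of one common idiom for moving a scalar into a superword register by storing and reloading through memory. The function name, staging buffer, and alignment attribute are our own choices; this is one possible code sequence, not the only one a compiler might emit.

    #include <altivec.h>

    /* Hedged sketch: AltiVec has no direct move between the scalar and
       superword register files, so a scalar is stored to an aligned buffer,
       reloaded as a superword, and replicated across all four fields. */
    vector float splat_scalar(float x)
    {
        float buf[4] __attribute__((aligned(16)));  /* superword-aligned staging area */
        buf[0] = x;                                 /* scalar store to memory */
        vector float v = vec_ld(0, buf);            /* superword load */
        return vec_splat(v, 0);                     /* replicate field 0 to all fields */
    }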
1.5.2 Processing-In-Memory: DIVA

The increasing gap between processor and memory speeds is a well-known problem in computer architecture. Processing-in-memory (PIM) has been suggested as one solution to bridge the gap. Because PIM internal processors can be connected directly to the memory banks, memory bandwidth is dramatically increased (by up to two orders of magnitude). Latency to on-chip logic is also reduced, down to as little as one-fourth that of a conventional memory system, because internal memory accesses avoid the delays associated with communicating off chip.

The Data-IntensiVe Architecture (DIVA) project is developing a system, from VLSI design through system architecture, systems software, compilers and applications, to take advantage of this technology for applications of growing importance to the high-performance computing community [31, 21]. DIVA combines PIM memory chips with one or more external host processors and a PIM-to-PIM interconnect (see Figure 1.3). To exploit the high data bandwidth effectively, DIVA contains 32 256-bit superword registers in addition to 32 32-bit scalar registers. DIVA contains in-order execution processor cores. Because of the short memory latency, data caches are not included in DIVA; however, a small instruction cache is added so that instruction streams do not interfere with data streams [31].

    Figure 1.3: DIVA node architecture: scalar registers (32 x 32b), scalar
    functional unit, wide functional unit, wide registers (32 x 256b), a 4KB
    I-cache, off-chip communication logic, and a DRAM array (~32 MB).

DIVA is a memory coprocessor that requires separate host processors to run the main operating system. As a result, it can be viewed by host applications as a standard DRAM. Due to its low memory latency and high data bandwidth, data-intensive applications are its main targets [14].

In many ways, the DIVA ISA is similar to that of AltiVec, but there are several differences. First, DIVA allows data movement between the register files; as a result, packing and unpacking scalar values to and from superword registers is cheaper than on AltiVec. Second, DIVA allows conditional execution for most superword instructions: the result of an operation on a field of the source registers is committed to the corresponding field of the destination register conditionally, depending on the value of the corresponding bit of the specified condition register. The research in this thesis focuses on developing a DIVA parallelizing compiler that exploits the various features of the DIVA processor.

1.6 Motivating Examples

As shown in Section 1.1, the MIT SLP compiler unrolls loops to increase the amount of parallelism within a loop body and packs isomorphic statements. In this section, we use simple examples to illustrate how we exploit the opportunities described in Section 1.3.

1.6.1 SLP in the Presence of Control Flow

In this thesis, we describe how to extend SLP to parallelize computations across basic block boundaries. When a loop body has control flow, unrolling may not increase the basic block size and therefore may not expose opportunities for SLP parallelization as described by Larsen and Amarasinghe, since the MIT SLP compiler parallelizes statements within a basic block. Consider the example loop in Figure 1.4(a). When the loop is unrolled, as in (b), the basic block size does not increase because of the if-statements, preventing the SLP compiler from parallelizing the loop. However, such a loop can be parallelized as shown in (c): both the comparison and the statement guarded by the if-statement are parallelized, and the old values of b[i:i+3] are then combined with the new values in Vtemp according to the results of the parallel comparison.

In the original scalar code, the statement guarded by the if-statement is bypassed whenever the conditional expression is false. In the parallel version, the instructions on all control flow paths are always executed. In some cases, such as when the condition usually evaluates to false, this overhead leads to performance degradation relative to sequential execution. An optimization that sometimes reduces this overhead is shown in Figure 1.4(d): we can bypass the parallel code when all fields of the parallel comparison are false.

    (a) Original with control flow:
        for (i = 0; i < 16; i++)
            if (a[i] != 0) b[i]++;

    (b) Unrolled:
        for (i = 0; i < 16; i += 4) {
            if (a[i+0] != 0) b[i+0]++;
            if (a[i+1] != 0) b[i+1]++;
            if (a[i+2] != 0) b[i+2]++;
            if (a[i+3] != 0) b[i+3]++;
        }

    (c) Parallelized:
        for (i = 0; i < 16; i += 4) {
            Vcond = a[i:i+3] != (0, 0, 0, 0);
            Vtemp = b[i:i+3] + (1, 1, 1, 1);
            b[i:i+3] = combine b[i:i+3] and Vtemp according to Vcond;
        }

    (d) Overhead reduced:
        for (i = 0; i < 16; i += 4) {
            Vcond = a[i:i+3] != (0, 0, 0, 0);
            branch to L1 if Vcond is all false;
            Vtemp = b[i:i+3] + (1, 1, 1, 1);
            b[i:i+3] = combine b[i:i+3] and Vtemp according to Vcond;
        L1: ;
        }

    Figure 1.4: Example: SLP in the presence of control flow.
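On AltiVec, the "combine ... according to Vcond" of Figure 1.4(c) maps to a select, and the "all false" bypass of (d) maps to a branch on the aggregate condition code. The following is a minimal sketch under stated assumptions (16-byte-aligned int arrays; the function name and intrinsic-level coding are ours, not the compiler's literal output):

    #include <altivec.h>

    /* Hedged sketch of Figure 1.4(c)/(d) in AltiVec C intrinsics. */
    void cond_incr(int a[16], int b[16])
    {
        const vector signed int zero = (vector signed int){0, 0, 0, 0};
        const vector signed int one  = (vector signed int){1, 1, 1, 1};
        for (int i = 0; i < 16; i += 4) {
            vector signed int va = vec_ld(i * 4, a);
            if (vec_all_eq(va, zero))     /* Fig. 1.4(d): bypass when Vcond is all false */
                continue;
            vector bool int eq = vec_cmpeq(va, zero);   /* true where a[i] == 0 */
            vector signed int vb = vec_ld(i * 4, b);
            vector signed int vt = vec_add(vb, one);    /* Vtemp = b[i:i+3] + 1 */
            /* select: keep old b where a[i] == 0, take Vtemp where a[i] != 0 */
            vec_st(vec_sel(vt, vb, eq), i * 4, b);
        }
    }

The vec_all_eq predicate is how AltiVec exposes the aggregate condition code to C; it corresponds to the branch-on-superword-condition-code (BOSCC) idea evaluated in Chapter 3.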
1.6.2 Superword-Level Locality

Reducing memory references is even more important when computations are parallelized, as discussed in Section 1.3. In this section, we use a simple example to illustrate our approach to reducing memory references by storing data in superword registers. In the sequential code shown in Figure 1.5(a), A[i][j] and A[i-1][j] access the same data in memory after 32 iterations. Also, B[j] accesses the same memory address after 32 iterations. If we can keep a data element in a register until it is used again, the later memory access can be eliminated. To exploit superword registers, the loop is first parallelized as shown in (b). Assuming four array elements fit in one superword register, the number of memory accesses is reduced by 4X when the j-loop is parallelized. This reduction is the result of accessing four adjacent array elements in one superword memory access. Still, we can reduce memory accesses further by keeping the superword written by A[i][j:j+3] in a superword register until it is read by A[i-1][j:j+3]. Since the two superword memory accesses are one iteration of the outer loop (i-loop) apart, we apply unroll-and-jam as shown in (c), so that the two superword memory references access the same data within the same iteration. Most existing compilers fail to remove the redundant memory accesses because they do not allocate registers for array references. For this reason, we replace the redundant superword memory accesses with superword variables; a backend compiler will subsequently allocate the superword variables to superword registers. The number of memory accesses is consequently reduced, as shown in (d). The loop body in (c) has six memory references, which are reduced to four in (d), a further 1.5X reduction on top of the earlier 4X. Overall, a 6X reduction in memory accesses is achieved from (a) to (d).

    (a) Original:
        for (i = 1; i <= 32; i++)
            for (j = 0; j < 32; j++)
                A[i][j] = A[i-1][j] + B[j];

    (b) SLP exploited:
        for (i = 1; i <= 32; i++)
            for (j = 0; j < 32; j += 4)
                A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];

    (c) Unroll-and-jam applied:
        for (i = 1; i <= 32; i += 2)
            for (j = 0; j < 32; j += 4) {
                A[i][j:j+3] = A[i-1][j:j+3] + B[j:j+3];
                A[i+1][j:j+3] = A[i][j:j+3] + B[j:j+3];
            }

    (d) Memory accesses reduced:
        for (i = 1; i <= 32; i += 2)
            for (j = 0; j < 32; j += 4) {
                SV1 = B[j:j+3];
                SV2 = A[i-1][j:j+3] + SV1;
                A[i+1][j:j+3] = SV2 + SV1;
                A[i][j:j+3] = SV2;
            }

    Figure 1.5: Motivating example for exploiting superword-level locality.
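In intrinsic terms, the superword variables SV1 and SV2 of Figure 1.5(d) simply become vector-typed locals that the backend can keep in superword registers. A minimal sketch, assuming float data, 16-byte-aligned rows, and a function name of our own choosing:

    #include <altivec.h>

    /* Hedged sketch of Figure 1.5(d): B[j:j+3] and the value stored to
       A[i][j:j+3] are held in vector locals, eliminating two redundant
       loads per iteration of the jammed loop body. */
    void smooth(float A[33][32], const float B[32])
    {
        for (int i = 1; i <= 32; i += 2)
            for (int j = 0; j < 32; j += 4) {
                vector float sv1 = vec_ld(j * 4, B);                     /* SV1 */
                vector float sv2 = vec_add(vec_ld(j * 4, A[i-1]), sv1);  /* SV2 */
                vec_st(vec_add(sv2, sv1), j * 4, A[i+1]);
                vec_st(sv2, j * 4, A[i]);
            }
    }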
1.7 Contributions

The contributions of this thesis are new optimizations for architectures supporting superword-level parallelism (SLP), as follows.

An algorithm to exploit SLP in the presence of control flow. SLP exploits parallelism within a basic block, limiting its applicability. We describe how to extend the concept of SLP in the presence of control flow constructs. A key insight is that we can use techniques related to optimizations for architectures supporting predicated execution, even for multimedia ISAs that do not provide hardware predication. We derive large basic blocks with predicated instructions to which SLP can be applied, and we describe how to minimize overheads for superword predicates and re-introduce control flow for scalar operations. We observe speedups on 8 multimedia codes ranging from 1.97 to 15.00 as compared to both sequential execution and SLP alone.

As an optimization on the code parallelized for control flow, we also evaluate the costs and benefits of exploiting branches on the aggregate condition codes associated with the fields of a superword, such as the branch-on-any instruction of AltiVec. Branch-on-superword-condition-code (BOSCC) instructions allow fast detection of aggregate conditions, an optimization opportunity often found in multimedia applications such as image processing and pattern matching. Our experimental results show speedups of up to 1.40 on 8 multimedia kernels when BOSCC instructions are used.

Compiler-controlled caching in superword registers. Accessing data from superword registers, versus a cache or main memory, has two advantages: it removes memory access instructions and it removes their latencies. We treat the large superword register file as a compiler-controlled cache, thus avoiding unnecessary memory accesses by exploiting reuse in superword registers. This research is distinguished from previous work on exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. As compared to optimizations that exploit reuse in cache, the compiler must also manage replacement and thus explicitly name registers in the generated code. In a study of 14 benchmarks, our results show speedups ranging from 1.40 to 8.69 as compared to the original SLP compiler, and we eliminate the majority of memory accesses.

Implementation and evaluation of the proposed techniques. The techniques presented in this thesis are fully implemented and evaluated on 14 benchmarks. Our implementation includes additional code generation techniques not supported by the original SLP compiler. The automatically generated parallel C programs are compiled by the backend compiler and run on the PowerPC G4. The overall speedups achieved by our implementation combining all optimizations range from 1.05 to 19.22.

DIVA-specific optimizations. We developed a compiler algorithm and several optimization techniques to exploit a DRAM memory characteristic (page mode) automatically. A page-mode memory access exploits a form of spatial locality in which the data item is in the same row of the memory buffer as the previous access; access time is reduced because the cost of row selection is eliminated.
The algorithm increases the frequency of page-mode accesses by reordering data accesses, grouping together accesses to the same memory row. We implemented this algorithm, and we present speedup results for four multimedia kernels ranging from 1.25 to 2.19 over the SLP algorithm alone for a PIM embedded-DRAM device called DIVA.

The remainder of this thesis is organized as follows. The next chapter provides definitions and background on the existing techniques we build upon in this work; these include the algorithm to exploit SLP and the predicate analysis used in our approach to exploit SLP in the presence of control flow. In Chapter 3, we describe our approach to exploit SLP in the presence of control flow. Our technique to exploit superword registers as a compiler-controlled cache is described in Chapter 4. Several optimizations related to code generation are presented in Chapter 5. The implementation of the techniques in the previous chapters and its evaluation are described in Chapter 6. Chapter 7 describes the DIVA ISA and DIVA-specific optimizations. Related work is described in Chapter 8, followed by our conclusion in Chapter 9.

Chapter 2

BACKGROUND

The techniques presented in this thesis leverage a large body of work on analyses for parallelizing compilers. In this chapter, we describe the existing analyses and code transformations used in our approach. Data dependence information, described in Section 2.1, is crucial for applying code transformations, which are not always legal; both the superword-level parallelization (SLP) and superword-level locality (SLL) algorithms require data dependence analysis. In Section 2.2, we describe data reuse, a core concept in the SLL algorithm. In Section 2.3, Larsen and Amarasinghe's SLP compiler [39] is presented; all the techniques in this thesis build on it. In the last section of this chapter, we describe a predicate analysis necessary for our extension of SLP in the presence of control flow.

2.1 Data Dependence

A data dependence exists between two instructions if the two instructions access the same data and at least one of them writes to the data. Given two instructions I1 and I2, I2 cannot be executed before I1 if there is a dependence from I1 to I2. Three kinds of data dependences can prevent the reordering of two instructions:

True dependence: I1 writes a data item that is read by I2.

Anti-dependence: I1 reads a data item that is written by I2.

Output dependence: I1 writes a data item that is also written by I2.

Input dependence exists between two instructions when both read the same datum; however, input dependence does not impose ordering constraints among instructions.
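As a toy illustration of our own (not from the thesis), all four kinds of dependence can be seen in a four-statement straight-line fragment:

    void dependence_kinds(void)
    {
        int a = 1, b = 2, x, y;
        /* I1 */ x = a + b;  /* writes x, reads a */
        /* I2 */ y = x + a;  /* reads x: true dependence I1 -> I2;
                                reads a: input dependence with I1 */
        /* I3 */ a = 0;      /* writes a, which I1 and I2 read:
                                anti-dependence I1 -> I3 and I2 -> I3 */
        /* I4 */ x = 2;      /* rewrites x: output dependence I1 -> I4 */
        (void)x; (void)y; (void)a;
    }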
    (a) Data dependence:
        for (i = 1; i <= 32; i++)
            for (j = 0; j < 32; j++)
                A[i][j] = A[i-1][j] + B[j];

    (b) Unroll-and-jam on the i-loop by 2:
        for (i = 1; i <= 32; i += 2)
            for (j = 0; j < 32; j++) {
                A[i][j] = A[i-1][j] + B[j];
                A[i+1][j] = A[i][j] + B[j];
            }

    Figure 2.1: An example code to show dependence vectors and unroll-and-jam.

The iteration space of an n-deep loop nest is an n-dimensional polyhedron in which the value on each dimension represents the value of the loop index variable of the corresponding loop. Each point in the iteration space represents an iteration of the loop nest, whose loop indices are given by the position vector of the corresponding point. A data dependence between two distinct array references in an n-deep loop nest can be represented in the form of a dependence vector, d = (d1, d2, ..., dn) [4]. A dependence vector captures the vector distance, in terms of loop iterations, at which the two references may map to the same memory location. Each vector element di may be a constant integer, + (a positive direction where the distance is not fixed), - (a negative direction), or * (the direction and distance are unknown). We say a dependence vector is lexicographically positive if the first non-zero di is + or a positive integer. A dependence vector is consistent if the dependence distance in the iteration space is constant. Figure 2.1(a) shows an example loop nest containing three array references. There is a true dependence from A[i][j] to A[i-1][j], and the dependence vector is (1,0): one iteration of the i-loop after A[i][j] accesses a data element in memory, A[i-1][j] accesses the same data.

A data dependence is loop-independent if the associated pair of instructions access the same data in the same iteration; otherwise, the data dependence is loop-carried. All loop-carried dependence vectors are lexicographically positive. A code transformation preserves the semantics of a program if the ordering constraints imposed by the dependence vectors are not violated.

Data dependences for a set of instructions can be represented by a directed graph called the data dependence graph. In this graph, a node represents an instruction, and a directed edge from one node to another represents a data dependence between the associated instructions.

2.2 Data Reuse

A datum in memory is said to be reused if it is used multiple times. Reuse distance is defined as the number of iterations between two uses of the same data. Data dependence and data reuse are similar in nature, because both look for instructions that use the same datum. However, not all data dependences translate to data reuse, and vice versa. On one hand, an anti-dependence prevents code reordering but is not a reuse opportunity. On the other hand, an input dependence does not prevent code reordering but is a reuse opportunity. In an output dependence between two instructions, the datum itself is not reused; however, we consider output dependences as reuse opportunities, since we can eliminate the earlier store instruction.

Reuse can be categorized in two different ways. The first concerns whether the same or distinct parts of a datum are reused: if distinct data elements are used from a superword register, it is called spatial reuse in the superword register, and if the same datum is used repeatedly from a superword register, we call it temporal reuse. The other categorization concerns whether two accesses to the same datum originate from the same or different static instructions: if two dynamic accesses to the same datum are from one static instruction, it is called self reuse, and otherwise group reuse. The two orthogonal categorizations combine to give four more detailed reuse types: self-spatial, self-temporal, group-spatial, and group-temporal.
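A toy example of our own (not from the thesis) showing all four reuse types, assuming four array elements fit in one superword:

    /* Hedged illustration of the four reuse types; assumes A has at least
       n + 1 elements. */
    void reuse_kinds(float *x, float *y, const float *A, int n)
    {
        for (int i = 0; i < n; i++) {
            x[i] = A[i]   + A[0];  /* A[i]: self-spatial (consecutive i hit the
                                      same superword); A[0]: self-temporal */
            y[i] = A[i+1] + A[0];  /* A[i+1] vs. A[i]: group-spatial (distinct
                                      elements, often in the same superword);
                                      A[0] vs. A[0]: group-temporal */
        }
    }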
Our analysis is for array references whose array subscript expressions are affine functions of the loop index variables; that is, each array subscript expression is a linear function f(L1, ..., Ln) = a1 L1 + a2 L2 + ... + an Ln + b, where the ai and b are constants and the Li are loop index variables. For array references with non-affine subscript expressions (e.g., A[B[i]]), we make conservative estimates. Two array references with affine subscript expressions are uniformly generated if each subscript expression of one reference differs from the corresponding subscript expression of the other only by a constant term [69]. In Figure 2.1(a), A[i][j] and A[i-1][j] are uniformly generated.

Related data structures are the use-definition (UD) chain and the definition-use (DU) chain. A UD chain is a list of the definitions of a variable that reach a particular use of the variable; similarly, a DU chain is a list of the uses of a variable reached by the same definition. UD-chains and DU-chains are used extensively in the SLP algorithm described in the next section.

2.3 The SLP Algorithm

The SLP algorithm finds SIMD parallelism within a basic block. In Section 1.1, we presented the SLP algorithm using an example; in this section, we describe the algorithm in detail. In the next subsection, we present alignment analysis, which the algorithm uses to find the alignment offsets of memory accesses. In Section 2.3.2, we describe distance analysis, which is used to find adjacency between memory references. The information obtained from these two analyses is used in the main algorithm, presented in Section 2.3.3.

2.3.1 Alignment Analysis

Architectures supporting SLP either require memory accesses to be aligned or allow unaligned memory accesses at a higher cost. As a result, finding the alignment offsets of memory references is at least a performance issue and, on some targets, a correctness problem. The alignment analysis described in this section finds the alignment offset of each memory reference. For aligned memory references, the result is a constant alignment offset representing all runtime addresses; other memory references are either known to be unaligned, or their alignment offset is unknown for lack of the necessary information. The SLP compiler uses the result of this analysis to pack only aligned memory references; consequently, only aligned memory references are parallelized.

To determine the alignment offsets of memory references, an iterative dataflow analysis is used [40]. Variables and constants are associated with a linear expression a*n + b to represent the set of values they can have; here, a is a stride, b is an offset, and n ranges over the non-negative integers. Initially, a constant D is associated with M*n + d, where M is the superword width and d = D mod M. Since we have a way to allocate array objects on superword boundaries, array base addresses are initialized with M*n + 0. All other variables are initialized with top. To propagate element values, the transfer functions listed in Table 2.1 are used; the meet operator is used to merge control flow. All operations not listed in the table result in n + 0, which is equivalent to bottom.

    Operation    Result
    meet         a = gcd(a1, a2, |b1 - b2|),           b = b1 mod a
    +            a = gcd(a1, a2),                      b = (b1 + b2) mod a
    -            a = gcd(a1, a2),                      b = (b1 - b2) mod a
    x            a = gcd(a1*a2, a1*b2, a2*b1, M),      b = (b1*b2) mod a

    Table 2.1: Transfer functions for the dataflow analysis to find alignments,
    given inputs a1*n + b1 and a2*m + b2.

2.3.2 Distance Analysis

Two memory references must be adjacent to each other to be packed into one superword memory reference; however, adjacency cannot be determined by the alignment analysis. Distance analysis employs an iterative dataflow analysis to find the distances in memory address among memory references. The result is a partitioning of memory references into groups, within each of which the memory references are assigned a constant representing an offset from a reference address. Two memory references in a group are adjacent if their assigned constants differ by the type size of the operand.

Each variable used in address computations is represented by a linear expression over all such variables. Each term in the linear expression is a tuple of three values: a variable symbol, a coefficient, and a basis indicating the initialization point of the variable. Initially, each variable has only one nonzero term, the variable itself with a coefficient of 1 and a basis of zero; in other words, each variable X is initialized with the tuple (X, 1, 0). As the instructions are processed, the linear expression of the destination variable is replaced by the linear expression resulting from evaluating the right-hand side of the instruction. Table 2.2 shows the transfer functions used in this analysis. After each instruction is processed based on the transfer functions, every variable associated with bottom is reassigned the variable itself, a coefficient of 1, and the current instruction number as a new basis.

    Operation    Result
    meet         T1 == T2 ?           T1                  : bottom
    +            b1 == b2 ?           b = b1, c = c1 + c2 : bottom
    -            b1 == b2 ?           b = b1, c = c1 - c2 : bottom
    x            T1 is a constant N ? b = b2, c = c2 x N  : bottom

    Table 2.2: Transfer functions for distance analysis, given two terms
    T1 = (X, c1, b1) and T2 = (X, c2, b2) of two linear expressions for the
    same variable X.

At the end of the dataflow analysis, the address operands of memory references are associated with a linear expression consisting of variables that cannot be replaced further by other variables. All memory references whose address operands differ by a constant term are grouped together and assigned a unique group ID. Each instruction in a group is annotated with a pair of the group ID and the constant term in the linear expression for the address operand. Two memory references in a group are guaranteed to keep a constant distance throughout run time.

2.3.3 Packing

The SLP algorithm starts packing instructions from memory references [39]. Two memory references are packed together if both are aligned and adjacent and their alignment offsets do not cross a superword boundary. Next, UD-chains and DU-chains are followed to pack the instructions defining the source operands, or using the destination operands, of already-packed instructions; the instructions packed at this point inherit the alignment offsets of the instructions packed earlier. Since there is a limit on the number of data elements that can be packed into a superword register, to represent the maximum amount of parallelism we define the superword size (SWS) as the number of data elements that fit in a superword. The SLP algorithm packs at most SWS instructions into each superword instruction.

As a last step, the algorithm schedules instructions sequentially. At this point, an instruction may either be packed into a superword instruction or left as a scalar instruction. In either case, all instructions on which the current instruction depends must be scheduled before the current instruction. Sometimes, cyclic dependences prevent further scheduling of the unscheduled instructions; to break the dependence cycles, the first unscheduled packed instructions are unpacked into scalar instructions. Figure 2.2 shows the example code after each step of the algorithm.

    a1 = b1 + x[0];
    a2 = b2 + x[1];
    a3 = b3 + x[2];
    a4 = b4 + x[3];

    Figure 2.2: An example showing the packing algorithm: (a) original code;
    (b) after packing aligned memory references; (c) after following UD-chains
    and DU-chains; (d) after instruction scheduling.

2.3.4 Summary

The SLP algorithm presented in this section works within a basic block. To generate SIMD parallel instructions, it looks for isomorphic scalar instructions that can be replaced by a superword instruction. Packing memory references must satisfy further constraints imposed by the target architecture, namely alignment to superword boundaries and data elements packed contiguously in memory. Alignment analysis and distance analysis provide the information necessary to satisfy these requirements.

2.4 Predicate Analysis

Predicates are introduced when if-conversion is applied to control flow constructs. The predicates guard the execution of the instructions that were previously guarded by conditional statements. However, CFG-based analyses cannot extract, from a basic block of predicated instructions, necessary information such as data dependences and reaching definitions. Instead, we borrow analyses developed for architectures supporting predicated execution.

In this section, we describe predicated execution and if-conversion in Section 2.4.1 and Section 2.4.2, respectively. After if-conversion is applied, instructions may be guarded by predicates. For the subsequent passes to extract the necessary information, we use Scott Mahlke's predicate analysis based on the predicate hierarchy graph (PHG) [44], described in Section 2.4.3. Mutual exclusion and predicate covering are two important relations among predicates, described in Section 2.4.4 and Section 2.4.5, respectively. Since our algorithm to restore control flow from a sequence of predicated instructions is based on Mahlke's predicate CFG generator, it is described in Section 2.4.6.
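Returning to the alignment analysis of Section 2.3.1, the following is a hedged C sketch of the Table 2.1 lattice; the struct and function names are ours, and only the meet and + transfer functions are shown:

    /* Hedged sketch of the Table 2.1 alignment lattice: the value set
       { a*n + b : n >= 0 } is carried as the pair (a, b). */
    typedef struct { long a, b; } Align;

    static long gcd(long x, long y)
    {
        if (x < 0) x = -x;
        if (y < 0) y = -y;
        while (y) { long t = x % y; x = y; y = t; }
        return x;
    }

    /* meet: merge two alignment values at a control-flow join */
    Align align_meet(Align p, Align q)
    {
        long a = gcd(gcd(p.a, q.a), p.b - q.b);   /* gcd takes |b1 - b2| */
        return (Align){ a, a ? p.b % a : p.b };
    }

    /* +: alignment of a sum, e.g., base address plus offset */
    Align align_add(Align p, Align q)
    {
        long a = gcd(p.a, q.a);
        return (Align){ a, a ? (p.b + q.b) % a : p.b + q.b };
    }

For example, with superword width M = 16, a superword-aligned base address is (16, 0) and the constant 4 is initialized as (16, 4); align_add gives (16, 4), proving the access is always 4 bytes past a 16-byte boundary and hence unaligned.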
2.3.2 Distance Analysis

Two memory references must be adjacent to each other to be packed into a superword memory reference. However, the adjacency of two memory references cannot be determined by the alignment analysis. Distance analysis employs an iterative dataflow analysis to find the distances in memory address among memory references. The result of this analysis is a partitioning of the memory references into groups, within each of which every memory reference is assigned a constant representing its offset from a reference address. Two memory references in a group are adjacent to each other if their assigned constants differ by the type size of the operand.

Each variable used in address computations is represented by a linear expression over all such variables. Each term in the linear expression is a tuple of three values: a variable symbol, a coefficient, and a basis indicating the initialization point of the variable. Initially, every variable has only one nonzero term, which is the variable itself associated with a coefficient of 1 and a basis of zero; in other words, each variable X is initialized with the tuple (X, 1, 0). As the instructions are processed, the linear expression of the destination variable is replaced by the linear expression resulting from evaluating the right-hand side of the instruction. Table 2.2 shows the transfer functions used in this analysis; each operates on two terms T1 = (X, c1, b1) and T2 = (X, c2, b2) of two linear expressions for the same variable X. After each instruction is processed based on the transfer functions, every variable associated with bottom (⊥) is reassigned the variable itself, a coefficient of 1, and the current instruction number as a new basis.

    ⊓ :  T1 == T2 ? T1 : ⊥
    + :  b1 == b2 ? (b = b1, c = c1 + c2) : ⊥
    − :  b1 == b2 ? (b = b1, c = c1 − c2) : ⊥
    × :  T1 is a constant N ? (b = b2, c = c2 × N) : ⊥

Table 2.2: Transfer functions for distance analysis, given two terms T1 = (X, c1, b1) and T2 = (X, c2, b2) of two linear expressions for the same variable X.

At the end of the dataflow analysis, the address operand of each memory reference is associated with a linear expression over variables that cannot be replaced further by other variables. All memory references whose address operands differ only by a constant term are grouped together and assigned a unique group ID. Each instruction in a group is annotated with a pair consisting of the group ID and the constant term in the linear expression for its address operand. Two memory references in a group are guaranteed to keep a constant distance throughout run time.

2.3.3 Packing

The SLP algorithm starts packing instructions from memory references [39]. Two memory references are packed together if both are aligned, adjacent, and their alignment offsets do not cross a superword boundary.
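The packing criterion can be read directly off the two analyses. The sketch below (field and function names are ours) tests whether two memory references may seed a superword pack, assuming each reference carries its alignment offset and its (group, distance) annotation from distance analysis, with offsets and distances in bytes.

    typedef struct {
        int aligned;       /* alignment analysis proved a constant offset */
        int align_offset;  /* that offset, modulo the superword width     */
        int group;         /* distance-analysis group ID                  */
        int distance;      /* constant byte offset within the group       */
        int type_size;     /* size of the accessed element                */
    } memref_t;

    #define SWS_BYTES 16

    /* r2 may be packed immediately after r1 in one superword access if
       both are provably aligned, they are adjacent (same group, distance
       differing by exactly one element), and the pair does not straddle
       a superword boundary. */
    static int can_pack_pair(const memref_t *r1, const memref_t *r2) {
        if (!r1->aligned || !r2->aligned)
            return 0;
        if (r1->group != r2->group ||
            r2->distance - r1->distance != r1->type_size)
            return 0;                               /* not adjacent */
        return r1->align_offset + 2 * r1->type_size <= SWS_BYTES;
    }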
Next, UD-chains and DU-chains are followed to pack the instructions that define the source operands, or use the destination operands, of already packed instructions. The instructions packed at this time inherit the alignment offsets of the instructions packed already. Since there is a limit on the number of data elements that can be packed into a superword register, we define the superword size (SWS) as the number of data elements that fit in a superword, representing the maximum amount of parallelism. The SLP algorithm packs at most SWS instructions into each superword instruction.

As a last step, the algorithm schedules instructions sequentially. At this point, each instruction is either packed into a superword instruction or left as a scalar instruction. In either case, all instructions on which the current instruction depends must be scheduled before the current instruction is scheduled. Sometimes, cyclic dependences prevent further scheduling of the unscheduled instructions. To break the dependence cycles, the first unscheduled packed instruction is unpacked into scalar instructions. Figure 2.2 shows an example code after each step of the algorithm.

    a1 = b1 + x[0];
    a2 = b2 + x[1];
    a3 = b3 + x[2];
    a4 = b4 + x[3];

Figure 2.2: An example showing the packing algorithm: (a) the original code above; (b) after packing the aligned memory references x[0]..x[3]; (c) after following the UD-chains and DU-chains; (d) after instruction scheduling.

2.3.4 Summary

The SLP algorithm presented in this section works within a basic block. To generate SIMD parallel instructions, it looks for isomorphic scalar instructions that can be replaced by a superword instruction. Packing memory references must satisfy further constraints imposed by the target architecture, namely alignment to superword boundaries and data elements packed contiguously in memory. Alignment analysis and distance analysis provide the information necessary to check these requirements.

2.4 Predicate Analysis

Predicates are introduced when if-conversion is applied to control flow constructs. The predicates guard the execution of the instructions that used to be guarded by conditional statements. However, analyses based on the CFG cannot be used to extract the necessary information, such as data dependences and reaching definitions, from a basic block of predicated instructions. Instead, we borrow analyses developed for architectures supporting predicated execution.

In this section, we describe predicated execution and if-conversion in Sections 2.4.1 and 2.4.2, respectively. After if-conversion is applied, instructions may be guarded by predicates. For the subsequent passes to extract the necessary information, we use Scott Mahlke's predicate analysis based on the predicate hierarchy graph (PHG) [44]. The predicate hierarchy graph is described in Section 2.4.3. Mutual exclusion and predicate covering are two important relations among predicates, described in Sections 2.4.4 and 2.4.5, respectively. Since our algorithm to restore control flow from a sequence of predicated instructions is based on Mahlke's predicate CFG generator, it is described in Section 2.4.6.
2.4.1 Predicated Execution

Recently, several architectures supporting predicated execution have been developed [36]. Since predicated execution is one of the core concepts in our approach, its notation and semantics are described in this subsection. In architectures supporting predicated execution, instructions are first executed, and the result is then committed if the guarding predicate is true, or nullified otherwise. In this thesis, an instruction guarded by a predicate pred is denoted as follows.

    dst = operation; <pred>

If pred is true, dst is updated with the operation's result. If pred is false, however, dst remains unchanged. pset is the predicate defining instruction, with the syntax shown below.

    pT, pF = pset(cond); <pred>

pset takes one source operand and two destination operands. The source operand is the result of a previous comparison operation, and the two destination operands are predicate variables that can be used to guard subsequent instructions. A pset can itself be guarded by another predicate, just like any other instruction. The semantics of pset are that pT = cond and pF = !cond when pred is true; if pred is false, both pT and pF remain unchanged.
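For readers without access to a predicated ISA, the semantics above can be emulated with ordinary code. The following sketch (our illustration, not compiler output) mirrors the commit-or-nullify behavior of a guarded instruction and of pset.

    #include <stdbool.h>

    /* dst = operation; <pred> : the result is committed only if pred holds. */
    static inline int guarded_assign(int dst, int result, bool pred) {
        return pred ? result : dst;   /* dst unchanged when pred is false */
    }

    /* pT, pF = pset(cond); <pred> : both targets are written only if pred
       holds; otherwise they keep their previous values. */
    static inline void pset(bool *pT, bool *pF, bool cond, bool pred) {
        if (pred) {
            *pT = cond;
            *pF = !cond;
        }
    }

For instance, the predicated instruction a = b + c; <p1> behaves like a = guarded_assign(a, b + c, p1).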
2.4.2 If-conversion: RK-Algorithm

If-conversion is the process of removing control flow by attaching predicates to instructions. For if-conversion, we use Park and Schlansker's RK-algorithm [55]. Their algorithm consists of two main functions, R and K: the R function associates each node in the CFG with a predicate, and the K function finds the locations at which to insert the predicate defining operations for each predicate. The detailed explanation of the R function requires the following definition.

Definition 1 Let (X, Y, label) be an edge in a CFG such that Y does not postdominate X. The nodes control dependent on this edge are those, and only those, on the unique path (excluding the first node) from the immediate postdominator of X to Y in the postdominator tree.

The R function is best described as a partitioning of basic blocks under an equivalence relation: two basic blocks x and y are in the same equivalence class if they are control dependent on the same set of basic blocks. The R function is obtained as follows. First, for each basic block b, the set of basic blocks on which b is control dependent is computed. Then, the basic blocks that are control dependent on the same set of basic blocks are grouped into an equivalence class, and a unique predicate variable is assigned to each equivalence class.

A control dependence set for a basic block is the set of edges on which the basic block is control dependent. For each predicate, the K function defines a control dependence set, on whose element edges the equivalence class of basic blocks represented by the predicate is control dependent. Predicate defining operations for the associated predicate are generated for each edge in the control dependence set.

Because of the semantics of their predicate defining operations, predicates may not be defined along all possible paths. To ensure that predicates are always defined before being used, the predicates that may have undefined paths are initially set to false. This algorithm is optimal in terms of the number of predicates used and the number of predicate defining instructions.

2.4.3 Predicate Hierarchy Graph (PHG)

The predicate analysis starts by building a predicate hierarchy graph (PHG), defined as follows.

Definition 2 A predicate hierarchy graph (PHG) is a directed acyclic graph representing the nesting relations among the predicates in a predicated basic block.

A PHG consists of two types of nodes, predicate nodes and condition nodes, and is constructed as follows. Starting with a single predicate node representing the constant true, each instruction is examined in textual order. For each instruction that defines predicates, such as pT, pF = pset(comp) <pParent>;, at most one condition node is created; for this example, a condition node for comp would be created. An edge is inserted from the predicate node for the predicate guarding the instruction to the condition node just created; for the example, an edge from predicate node pParent to condition node comp is added. Two predicate nodes for pT and pF are also created if they do not already exist; they may have been introduced into the PHG by a prior definition, in cases where multiple control flow paths merge. Then, edges are inserted from the condition node to the two predicate nodes; in the example, two edges would be inserted from condition node comp to predicate nodes pT and pF, representing the true and false values of the comparison. This process is repeated for each instruction that defines predicates. The resulting PHG permits analysis to reason about the relationships among predicates. Figure 2.3 shows an example of predicated instructions and the PHG built from the code sequence.

Figure 2.3: An example showing the construction of a PHG, with condition nodes C1 and C2 guarded by TRUE and C3 guarded by pT1, and predicate nodes TRUE, pT1, pF1, pT2, pF2, pT3, and pF3.
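The construction just described is mechanical. The sketch below (data-structure and helper names are ours) builds the PHG in one pass over the pset instructions of a basic block.

    #include <string.h>

    #define MAX_NODES 64

    typedef enum { PRED_NODE, COND_NODE } kind_t;

    typedef struct node {
        kind_t kind;
        const char *name;         /* e.g., "pT1" or "C1"                  */
        struct node *succ[8];     /* for a condition: succ[0] = true      */
        int nsucc;                /* outcome, succ[1] = false outcome     */
    } node_t;

    static node_t pool[MAX_NODES];
    static int npool;

    static node_t *create_node(kind_t k, const char *name) {
        node_t *n = &pool[npool++];
        n->kind = k;
        n->name = name;
        n->nsucc = 0;
        return n;
    }

    /* Reuse a predicate node introduced by a prior definition, as happens
       where multiple control flow paths merge. */
    static node_t *lookup_or_create(kind_t k, const char *name) {
        for (int i = 0; i < npool; i++)
            if (pool[i].kind == k && strcmp(pool[i].name, name) == 0)
                return &pool[i];
        return create_node(k, name);
    }

    /* Process one "pT, pF = pset(comp); <pParent>" in textual order. */
    static void build_phg_step(node_t *pParent, const char *comp,
                               const char *pT, const char *pF) {
        node_t *c = create_node(COND_NODE, comp);        /* condition node  */
        pParent->succ[pParent->nsucc++] = c;             /* guard -> cond   */
        c->succ[0] = lookup_or_create(PRED_NODE, pT);    /* cond true -> pT */
        c->succ[1] = lookup_or_create(PRED_NODE, pF);    /* cond false-> pF */
        c->nsucc = 2;
    }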
2.4.4 Mutually Exclusive

The mutually exclusive relation between two predicates is defined as follows.

Definition 3 Two predicates p1 and p2 are mutually exclusive if they are never simultaneously true, i.e., p1 ∧ p2 = false.

Information about the mutually exclusive relation is useful in various situations. For example, it can be used to discern dependences more accurately. Consider the following instructions defining a variable; there is no output dependence between them if the two guarding predicates, p1 and p2, are mutually exclusive.

    a = 5; <p1>
    a = 7; <p2>

To find whether two given predicates p1 and p2 are mutually exclusive, the PHG is traversed backward along all paths from the predicate nodes for p1 and p2. A set of merge nodes is then obtained by picking the nodes where the two backward traversals first meet. p1 and p2 are mutually exclusive if the two backward traversals from p1 and p2 merge from complementary edges at every merge node. In the example of Figure 2.3, pT3 and pF1 are mutually exclusive because there is only one merge node, condition node C1, and the backward traversals merge from its two complementary edges.

2.4.5 Predicate Covering

Predicate covering is a relation between a predicate and a set of predicates. This information is used both in restoring control flow and in inserting select instructions. Predicate covering is defined as follows.

Definition 4 A predicate p is said to be covered by a set of predicates G if p = true ⇒ ∃p' ∈ G such that p' = true.

The covering relation between a set of predicates G and a predicate p is determined as follows. For each predicate in G, mark the corresponding predicate node in the PHG as covered. Then, apply the above definition repeatedly to propagate coverage to adjacent nodes until no further changes can be made. If the predicate node for p in the PHG is marked as covered, p is covered by G. A related notion, the predicate-covering predecessor, is used to restore control flow and is defined as follows.

Definition 5 An instruction I guarded by a predicate p is a predicate-covering predecessor of a later instruction I' guarded by a predicate p' iff p and p' are not mutually exclusive and neither p nor p' is covered by the set of predicates associated with the instructions between I and I'.

For a given instruction I associated with a predicate p, its predicate-covering predecessor instructions are obtained by scanning the given instruction sequence backward. An instruction I' associated with a predicate p' in the instruction sequence is a predicate-covering predecessor of I if p and p' are not mutually exclusive and p' is not already marked as covered in the PHG. After placing I' into the predicate-covering predecessor set of I, the node for p' is marked as covered in the PHG, and the newly covered predicate is propagated through the PHG to mark other covered predicates. The backward scan stops when the predicate node for p is covered.

Based on the mutually exclusive relation among predicates and the definition of predicate covering, reaching definitions can be found as follows.

Definition 6 A definition d guarded by a predicate p reaches a later use u guarded by a predicate p' in the same basic block if p and p' are not mutually exclusive and neither p nor p' is covered by the set of predicates associated with the instructions, defining the same variable, between d and u.

2.4.6 Predicate CFG Generator

Given a sequence of predicated instructions, the predicate CFG generator is used to restore the embedded control flow. The key idea is to find the set of predicate-covering predecessors of each predicated instruction in turn, and to add connections in the CFG from each of the predicate-covering predecessors to the current instruction being processed. While Mahlke's original predicate CFG generator scans forward to find predicate-covering successors, we modify it to scan backward. This modification was necessary for our improvement described in the next chapter.

Chapter 3

SUPERWORD-LEVEL PARALLELISM IN THE PRESENCE OF CONTROL FLOW

Many multimedia applications have control flow in their key computations, as discussed in Chapter 1. However, the SLP algorithm cannot exploit parallelism when the loop body has control flow, because it works only within a basic block. Thus, exploiting SLP in the presence of control flow is an important issue that still needs to be addressed. In this chapter, we describe the approach we employ to address this issue.
A key insight is that we can borrow heavily from optimizations developed for architectures supporting wide-issue instruction-level parallelism and predicated execution, such as the Itanium family of processors [36], even for architectures such as the AltiVec that do not support predicated execution. There are two reasons why similar optimization techniques can be used for these two distinct classes of architectures:

• SLP and ILP optimizations operate within basic blocks. Control flow limits the size of basic blocks, and thus limits optimization opportunities. We derive large basic blocks of predicated instructions to which SLP can be applied.

• A commonality among multimedia extension ISAs is what we will call a select operation for merging the results of different control flow paths. Based on the value of a boolean superword, individual fields from two different inputs are combined and committed to a final result. Thus, select instructions appear similar to predicated instructions, even though the underlying hardware mechanisms that implement the two are very different.

We derive a large basic block of predicated instructions by applying if-conversion. Then, the SLP algorithm is applied to parallelize isomorphic instructions. Following if-conversion and parallelization, the resulting basic block may contain both scalar and superword instructions, and in some cases the instructions are predicated. The compiler's job is to remove these predicates. We discuss how superword predicates are removed by inserting select operations in Section 3.2, and how scalar predicates are removed through an algorithm we call unpredicate in Section 3.3. In Section 3.4, we describe how parallelization overhead can be reduced by introducing Branch-On-Superword-Condition-Code (BOSCC) instructions. Prior to these descriptions, we outline our approach in the next section.

Figure 3.1: Overview of the algorithm to exploit SLP in the presence of control flow: if-conversion, parallelization, removal of scalar predicates (unpredicate), removal of superword predicates (SELECT), and reduction of parallelization overheads (BOSCC).

3.1 Overview of the Algorithm

The algorithm to exploit SLP in the presence of control flow consists of five steps, applied in sequence as shown in Figure 3.1. Each of the five steps is summarized below. We use the example of Figure 3.2(a) to show the changes made at each step.

Step 1: Applying if-conversion. As in the original SLP algorithm, the innermost loop is first unrolled by the superword size. When there is control flow in the loop body, however, unrolling alone does not increase the basic block size because of the if-statements. To apply if-conversion, we look for the largest acyclic single-entry, single-exit control flow structure in the innermost loop body. As shown in Figure 3.2(b), the code is unrolled by a factor of four, based on the assumption that the superword register width is sixteen bytes and the array element sizes are four bytes. Next, if-conversion using Park and Schlansker's algorithm [55] is applied to convert control dependences into data dependences. Now, associated with each instruction is a predicate, shown in parentheses at the end of the instruction, that captures the conditions that must be true for the instruction to execute. The pset instruction initializes the values of the predicates pT1 and pF1 based on the value of the condition represented by comp1.
Step 2: Parallelization. After if-conversion, the loop body becomes one basic block of predicated instructions. A modified version of the SLP parallelizer, which packs together isomorphic instructions along with their predicates, derives a mix of predicated scalar and superword instructions. This modification includes predicate analysis to find dependences among instructions. The resulting code is shown in Figure 3.2(c). While some instructions are parallelized, several scalar statements remain unparallelized because of data dependences. Since our target architectures do not support predicated execution, both the superword and the scalar predicates must be removed.

    for(i=0; i<1024; i++){
      if(fblue[i] != 255){
        bblue[i] = fblue[i];
        bred[i+1] = bred[i];
      }
    }

    (a) Original

    for(i=0; i<1024; i+=4){
      comp1 = fblue[i] != 255;
      pT1, pF1 = pset(comp1);
      bblue[i] = fblue[i];      (pT1)
      bred[i+1] = bred[i];      (pT1)
      ... /* remaining unrolled iterations for i+1, i+2, i+3 */
    }

    (b) Unrolled and if-converted

    for(i=0; i<1024; i+=4){
      vc = fblue[i:i+3] != (255,255,255,255);
      v_pT, v_pF = v_pset(vc);
      bblue[i:i+3] = fblue[i:i+3];    (v_pT)
      pT1, pT2, pT3, pT4 = unpack(v_pT);
      bred[i+1] = bred[i];            (pT1)
      bred[i+2] = bred[i+1];          (pT2)
      bred[i+3] = bred[i+2];          (pT3)
      bred[i+4] = bred[i+3];          (pT4)
    }

    (c) Parallelized

    for(i=0; i<1024; i+=4){
      vc = fblue[i:i+3] != (255,255,255,255);
      v_pT, v_pF = v_pset(vc);
      bblue[i:i+3] = select(bblue[i:i+3], fblue[i:i+3], v_pT);
      pT1, pT2, pT3, pT4 = unpack(v_pT);
      bred[i+1] = bred[i];            (pT1)
      bred[i+2] = bred[i+1];          (pT2)
      bred[i+3] = bred[i+2];          (pT3)
      bred[i+4] = bred[i+3];          (pT4)
    }

    (d) Select applied

    for(i=0; i<1024; i+=4){
      vc = fblue[i:i+3] != (255,255,255,255);
      v_pT, v_pF = v_pset(vc);
      bblue[i:i+3] = select(bblue[i:i+3], fblue[i:i+3], v_pT);
      pT1, pT2, pT3, pT4 = unpack(v_pT);
      if(pT1) bred[i+1] = bred[i];
      if(pT2) bred[i+2] = bred[i+1];
      if(pT3) bred[i+3] = bred[i+2];
      if(pT4) bred[i+4] = bred[i+3];
    }

    (e) Unpredicated

    for(i=0; i<1024; i+=4){
      vc = fblue[i:i+3] != (255,255,255,255);
      v_pT, v_pF = v_pset(vc);
      branch to L1 if v_pT is all false;
      bblue[i:i+3] = select(bblue[i:i+3], fblue[i:i+3], v_pT);
    L1:
      pT1, pT2, pT3, pT4 = unpack(v_pT);
      if(pT1) bred[i+1] = bred[i];
      if(pT2) bred[i+2] = bred[i+1];
      if(pT3) bred[i+3] = bred[i+2];
      if(pT4) bred[i+4] = bred[i+3];
    }

    (f) Overhead reduced

Figure 3.2: Example illustrating steps of SLP compilation in the presence of control flow.

Step 3: Eliminating superword predicates. In Figure 3.2(d), we show how a superword select operation can be used to select individual fields from two superword definitions according to the value of a superword predicate variable. Concretely, the effect of the select operation dst = select(src1, src2, mask) is to assign src2 to dst in the fields where the corresponding mask bit is 1; otherwise, src1 is assigned to dst. Figure 3.3 shows this graphically using superwords of four scalar elements.

Figure 3.3: Merging two superwords using a select instruction.
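The select semantics map directly onto hardware merge operations. The sketch below (our illustration; names are invented) gives an element-wise C rendering, together with the AltiVec intrinsic that performs the same merge in one instruction.

    #define SWS 4

    /* dst = select(src1, src2, mask): per-field merge of two superwords. */
    static void select4(int dst[SWS], const int src1[SWS],
                        const int src2[SWS], const int mask[SWS]) {
        for (int f = 0; f < SWS; f++)
            dst[f] = mask[f] ? src2[f] : src1[f];
    }

On the PowerPC AltiVec target, the same merge is a single instruction, e.g., d = vec_sel(s1, s2, m), where m is a vector bool mask produced by a vector comparison.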
Note that the effect of this transformation is to execute both control flow paths and select, for each field, the value from the path that would have executed in the scalar version of the code. Thus, the parallelization overhead includes the select instructions and the cost of executing both paths. In Section 3.2, we describe how to minimize the number of select instructions to reduce this overhead.

Step 4: Eliminating scalar predicates. Next, we restore the control flow for the predicated scalar operations, as shown in Figure 3.2(e). While it is straightforward to insert control flow corresponding to the predicate on each instruction, this strategy could result in an enormous number of additional branches compared to the original scalar code. Thus, another important optimization is minimizing the branches, attempting to recover, as closely as possible, the control flow of the original scalar code, as described in Section 3.3.

Step 5: Reducing parallelization overheads. The code in Figure 3.2(e), produced by applying the previous four steps, can be compiled and run successfully to generate correct results. However, it suffers from the cost of always executing both control flow paths and the extra select instruction, which may offset the benefits of parallelism. The code in Figure 3.2(f) takes advantage of a common instruction supported by multimedia extensions, branch-on-superword-condition-code (BOSCC), which checks the aggregate value of the condition codes associated with the fields of a superword predicate. For example, a branch-on-none instruction branches when none of the condition codes of the fields of a superword is true. The parallelization overheads may be significantly reduced if the expression associated with the BOSCC is false most of the time.

Discussion. The approach described above has been heavily influenced by features of the ISAs of the target architectures, as well as by the current organization of the SLP compiler, in which we treat the SLP pass as a black box and feed it large basic blocks for parallelization. If the target architecture supported masked superword operations [62] and predicated scalar execution [55, 35], the code in Figure 3.2(c) would need no further transformation for SLP. The DIVA ISA supports masked superword operations but not predicated execution, and the PowerPC AltiVec, the other platform for our work, supports neither. Thus the compiler must eliminate the predicates on scalar instructions by restoring control flow and, for architectures including the AltiVec, replace the predicated superword instructions with select instructions that achieve the same effect.

If the architecture combined SLP support and predication, we could adapt recently developed algorithms by Chuang et al. to generate phi-instructions from the CFG of scalar code, resolving the multiple-definition problem on architectures that support predicated execution [16]. Their phi-instruction is a scalar analog of the superword select instruction. While it is possible to use the phi-predicated code as input to SLP, some scalar phi-instructions would remain, and scalar control flow may nevertheless need to be restored on architectures such as the AltiVec.
3.2 Eliminating Superword Predicates

In this section, we show how to remove superword predicates while preserving the semantics of the original program through the use of select operations. Figure 3.4(a) shows an example of sequential code. After if-conversion and parallelization, the control flow is removed and some instructions are guarded by the superword predicates shown in parentheses, as in Figure 3.4(b). The first instruction defines a superword predicate Vp and its complement Vnp: a field of Vp is set to true if the result of the comparison is true, and the fields of Vnp are set to the complements of the corresponding fields of Vp. To generate the final code, it is incorrect to simply remove the superword predicates; for example, the first definition of Va would be killed by the second definition. Instead, we rename the second definition and use a select instruction to merge their values into one superword variable, as shown in Figure 3.4(c).

    (a) scalar:
    if(b < 0){
      a = 1;
    }else{
      a = 0;
    }
    ... = a;

    (b) parallelized intermediate form:
    Vp, Vnp = v_pset(Vb < V0)
    Va = V1        (Vp)
    Va = V0        (Vnp)
    ... = Va

    (c) naive generation of select:
    Vp, Vnp = Vb < V0
    Va1 = V1
    Va = select(Va, Va1, Vp)
    Va2 = V0
    Va = select(Va, Va2, Vnp)
    ... = Va

Figure 3.4: Merging two superword definitions.

Figure 3.5 presents the algorithm that generates the minimum number of select instructions required to preserve the original program's behavior. A select instruction is required for some, but not all, definitions of superword variables, as discussed below. Given parallelized code with instructions guarded by predicates, we first build a predicate hierarchy graph (PHG) as defined in Section 2.4.3 [44]. At this stage, instructions guarded by scalar predicates and instructions guarded by superword predicates can be intermixed.

Algorithm SEL: Given a sequence of predicated instructions IN, remove superword predicates from all superword instructions by generating select instructions.

    Build a predicate hierarchy graph (PHG)
    Build DU-chains and UD-chains based on Definition 6, using IN and the PHG
    for each definition d : V = s1 op s2 (P)
        NeedSelect ← false
        for each use u ∈ DU-chain(d)
            if (∃ definition d1 ∈ UD-chain(u) such that d1 precedes d in the basic block)
                NeedSelect ← true
                remove the predicate of d1
        if (NeedSelect == true)
            rename V to r in d so that d : r = s1 op s2
            remove the predicate P of d
            insert "dnew : V = select(V, r, P)" after d
            replace d and d1 with dnew in the UD-chains and DU-chains

Figure 3.5: An algorithm to generate select instructions.

For clarity, the reader can assume that the PHG discussed in this section contains only superword predicates. Our implementation actually has separate PHGs for superword and scalar predicates, with connections between the two graphs. The algorithm relies on both the PHG and UD-chains [2], extended in Definition 6 to account for the effects of predication. Using the PHG and the notion of reaching definitions (Definition 6), we build DU-chains for the superword definitions and UD-chains for the corresponding uses, as shown in Algorithm SEL of Figure 3.5. Although the PHG involves both scalar and superword predicate variables, only superword variables are included in the DU-chains and UD-chains.
To correctly handle upward-exposed uses, all variables are assumed to be defined on entry to the basic block, and these definitions are included when appropriate in the DU-chains and UD-chains. In this way, the compiler can generate a select instruction when there is an upward-exposed use.

The main loop of Algorithm SEL examines each instruction in textual order. An instruction with definition d needs a select instruction if d reaches at least one use u that is also reached by an earlier definition d1. If a definition d is the only definition reaching all its reachable uses, it need not be combined with anything. Figure 3.4(c) illustrates this point: the first select instruction is not necessary because no earlier definition reaches any of its uses.

Excluding store instructions, this algorithm generates the minimal number of select instructions. Given n definitions to be combined, it generates n − 1 select instructions. Minimality can be proven by reducing the definitions to the leaf nodes of a full binary tree.

3.3 Unpredicate

After the superword predicates are removed and replaced with select instructions, the code may still contain predicated scalar operations. The simplest way of removing scalar predicates is to convert each predicated instruction into an if-statement containing one statement, as in the example code of Figure 3.6(b). While correct, the code contains numerous redundant conditional branches, six in this case.

    (a) Predicated scalar code:
    bred[i] = fred;       (p)
    bred[i] = 100;        (!p)
    bgreen[i] = fgreen;   (p)
    bgreen[i] = 100;      (!p)
    bblue[i] = fblue;     (p)
    bblue[i] = 100;       (!p)

    (b) Naive unpredicate applied:
    if(p == 1) bred[i] = fred;
    if(p == 0) bred[i] = 100;
    if(p == 1) bgreen[i] = fgreen;
    if(p == 0) bgreen[i] = 100;
    if(p == 1) bblue[i] = fblue;
    if(p == 0) bblue[i] = 100;

    (c) Improved:
    if(p){
      bred[i] = fred;
      bgreen[i] = fgreen;
      bblue[i] = fblue;
    }else{
      bred[i] = 100;
      bgreen[i] = 100;
      bblue[i] = 100;
    }

Figure 3.6: Restoring control flow.

Figure 3.7 presents our algorithm that, given the input instruction sequence IN, generates the control flow graph (CFG) representing the improved code shown in Figure 3.6(c). The main algorithm, called UNP, is shown in Figure 3.7(a).

Algorithm UNP: Given a sequence of predicated instructions IN, introduce control flow into the instruction sequence after removing predicates.

    PHG ← build a predicate hierarchy graph
    DG ← build a data dependence graph
    CFG ← new basic block(P0)   // root node
    for each instruction I ∈ IN in textual order
        B ← {basic block b | ∀ basic block b' ∈ CFG, (b' is reachable from b in CFG) ⇒ (∄ an instruction I' ∈ b' such that I is dependent on I')}
        if (B == ∅)
            B ← { NBB(CFG, PHG, I, IN) }
        else
            move I in IN to the position just after the last instruction of the earliest basic block in B
        insert I at the end of the earliest basic block b ∈ B
    return CFG

(a) UNPredicate main

Algorithm NBB: Given an instruction I, the predicate hierarchy graph PHG, the current control flow graph CFG, and the predicated input code IN, generate a new basic block in the CFG.

    P ← predicate of I
    b ← new basic block(P)
    B ← PCB(P, PHG, CFG, IN, I)
    for each b' ∈ B
        generate an edge from b' to b
    return b

(b) Create a new basic block

Figure 3.7: Unpredicate algorithm.
Algorithm PCB: Given a predicate P, the predicate hierarchy graph PHG, the current control flow graph CFG, the predicated input code IN, and an instruction I, return the set of basic blocks that are predecessors of I.

    RET ← ∅
    PHG' ← PHG
    I' ← I.previous
    while I' ≠ NULL
        P' ← I'.predicate
        if (does_cover(P', P, PHG') == TRUE)
            RET ← RET ∪ { I'.block }
            PHG' ← mark(PHG', P')
            if (is_covered(PHG', P) == TRUE)
                return RET
        I' ← I'.previous
    RET ← RET ∪ {ROOT}
    return RET

(c) Predicate-covering basic blocks

Figure 3.7: Unpredicate algorithm (continued).

In addition to deriving the final control flow graph, UNP derives as an intermediate result a reordered instruction sequence IN. UNP starts by building a predicate hierarchy graph, PHG. By this point the superword predicates have been eliminated and replaced with select operations; however, both superword and scalar predicates must be considered, to account for the scalar predicates unpacked from the superword predicates. UNP also constructs a data dependence graph for the instruction sequence IN, capturing the ordering constraints on the instructions. Subsequently, UNP initializes the CFG with a root node associated with the constant-true predicate P0.

The main loop iterates through the input instruction sequence IN. First, we find the set of existing basic blocks into which it is safe to insert the given instruction. An instruction I guarded by predicate P can be inserted into a basic block B associated with predicate P' if P = P' and no data dependence prevents inserting I into B. If the set is not empty, the instruction I is inserted at the end of the earliest such basic block B. Also, in the input instruction sequence IN, I is moved next to the instruction I' that immediately precedes it in B. Although we have already processed I, moving it in the instruction sequence facilitates finding predicate-covering basic blocks in Algorithm PCB for subsequent instructions in the stream. If the instruction cannot be inserted into any existing basic block, we create a new basic block B' and place I into B'.

When a new basic block B' is created by Algorithm NBB, the predicate-covering basic block algorithm (PCB) is used to find the set of predecessors of B'. Whereas Mahlke's predicate CFG generator scans forward to find a set of successors, we scan the input instruction sequence backward to find predecessor instructions whose predicates cover the predicate of the given instruction. Since the instructions in the input sequence are processed sequentially, all chosen predecessor instructions must already have been inserted. By keeping a pointer from each inserted instruction to its basic block, the predecessor basic blocks for the new basic block are identified. We create a copy PHG' of the predicate hierarchy graph PHG so that we may mark covering predicates during the search for the appropriate basic blocks to connect to the new basic block in the intermediate CFG. The function does_cover(P', P, PHG') checks whether P' covers P in PHG': if P' is not yet marked in PHG' and P' is not mutually exclusive with P, the function returns true. The function mark(PHG', P') marks the predicate node P' in PHG' as covered and checks whether the predecessor and successor nodes of P' are also covered as a result of the marking; if a node is newly marked, this process is applied recursively to the node's neighbors. The function is_covered(PHG', P) examines PHG' and returns true if P is marked as covered.
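The covering propagation inside mark can be pictured as a recursive pass over the PHG. The sketch below is one plausible realization (our interpretation, not the compiler's code), under a node representation where each condition records its guard predicate and its two outcome predicates: an outcome is covered once its guard is, since the outcome can only be true when the guard is true, and a guard is covered once both outcomes of one of its conditions are.

    #include <stdbool.h>

    typedef struct pred pred_t;
    typedef struct cond cond_t;

    struct cond { pred_t *parent; pred_t *on_true; pred_t *on_false; };
    struct pred { bool covered; cond_t **conds; int nconds; };

    static void mark_covered(pred_t *p, cond_t *conds[], int ncond) {
        if (p->covered)
            return;
        p->covered = true;
        /* downward: both outcomes of each child condition of p */
        for (int i = 0; i < p->nconds; i++) {
            mark_covered(p->conds[i]->on_true, conds, ncond);
            mark_covered(p->conds[i]->on_false, conds, ncond);
        }
        /* upward: a guard whose condition has both outcomes covered */
        for (int i = 0; i < ncond; i++) {
            if (conds[i]->on_true->covered && conds[i]->on_false->covered)
                mark_covered(conds[i]->parent, conds, ncond);
        }
    }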
3.4 Branch-On-Superword-Condition-Code (BOSCC)

Inserting BOSCC instructions is not always profitable. The benefits of BOSCC instructions depend on properties such as the density of true or false branches, the number of instructions within a branching construct, and the data set size. In the remainder of this section, we describe the tradeoff space in selecting between the two approaches to SLP in the presence of control flow: in one approach BOSCCs are not used, whereas in the other they are. The next subsection shows the results of a synthetic benchmark that illustrates this tradeoff space, followed by the compiler analysis and code generation techniques used to exploit BOSCC. We assume that parallelization has been performed and that select instructions have been inserted where control flow paths merge, and we focus on using BOSCC to reduce the overheads introduced by parallelizing multiple control flow paths. The main components of the algorithm are: a profitability model for BOSCC instructions (Section 3.4.2); a profiling phase that collects data for the BOSCC model (Section 3.4.3); identification of the regions of code and predicates associated with a BOSCC instruction (Section 3.4.4); and code generation for inserting BOSCC instructions (Section 3.4.5).

3.4.1 The Characteristics of BOSCC

To gain insight into the factors influencing the profitability of BOSCC instructions, we performed a series of experiments using the following synthetic benchmark.

    for(i=0; i<datasize; i++){
      temp = A[i];
      if (temp == B[i])
        C[i] = temp + D[i];
    }

In this code, whenever the condition (temp == B[i]) evaluates to false, the code following the conditional is bypassed. Thus, a BOSCC branch is most profitable when the condition evaluates to false. Profitability therefore depends on the true density of the predicate, the frequency of true values for the branch test. We expect low true densities to correspond to more benefit from BOSCC instructions.

We present the results of a set of experiments in the three graphs of Figure 3.8. In each graph, the horizontal axis corresponds to the true density of the input data set, from 0% to 100%. We used a random number generator to create data sets with true densities from 0% to 100%. Each graph shows the execution time of four versions of the code as a function of true density. The scalar curve represents the execution time of the original scalar code. The other three versions were hand-coded in C extended with the Motorola AltiVec programming model.

Figure 3.8: Run time of synthetic kernels as a function of true density: (a) SWS is 4 and the data set size is 16 KB; (b) SWS is 4 and the data set size is 128 MB; (c) SWS is 16 and the data set size is 4 KB.
The select version corresponds to what would be generated by the default approach in our compiler, as shown in Figure 3.2(e) and described in [60]. The BON version was derived by adding a branch-on-none (BON) instruction to the assembly code of the select version, bypassing the code guarded by the conditional when the test on all fields evaluates to false, similar to the example in Figure 3.2(f). Finally, the BON+BOA version was derived by adding a branch-on-all (BOA) to the BON version. The branch-on-all permits an additional optimization that avoids the select operation: if all fields are known to evaluate to true, then the values of all fields of the corresponding superword of C are the result of the operation guarded by the conditional.

In Figures 3.8(a) and (b), the superword size (SWS) is four; that is, each superword holds four integer array elements, so the amount of available parallelism in a superword operation is four. The two graphs show the run times of the benchmark for two data sizes: in (a) the data fit in the L1 cache, and in (b) the data are larger than the L2 cache.

First, consider the results of Figure 3.8(a). The scalar curve is consistently slower than the various parallel versions. It performs best when the true density is either very low or very high, because the G4's branch prediction is most effective when the branching behavior is consistent. In the select version, the branch is eliminated and replaced with a merge of fields across the different control flow paths; for this reason, its execution time is the same regardless of the true density. It has the best performance among the four versions for true densities at or above 20%. The performance of the BON version is best for true densities near 0% and matches the select version for true densities above 40%. Interestingly, its slowest performance occurs at a true density of 16%, which is also related to branch prediction accuracy. This worst-case density is lower than 50% because the branch-on-none is taken only when the conditions of all four consecutive scalar comparisons are false: for a superword size of four and a true density of D, the probability that all four conditions are false is (1 − D)^4, and at D = 0.16 this is (1 − 0.16)^4 ≈ 0.50, so the branch is taken about half the time, the worst case for the predictor. When two BOSCC instructions are used, as in the BON+BOA version, the overhead of the additional branch overcomes any benefit.

The results of Figure 3.8(b) show how the tradeoff space is affected when the data footprint exceeds the L2 cache size. As the computation becomes memory bound, the benefits of parallelization become less significant, so the performance gap between the scalar and parallel versions shrinks. For true densities below 40%, the scalar version is actually the fastest. The BON version behaves like the scalar version for low true densities and like the select version for higher true densities. The BON+BOA version has the best performance for very high true densities.

To evaluate the effect of increasing the amount of available parallelism, Figure 3.8(c) shows the impact of changing the data type to char, which increases the superword size to 16. This change widens the performance gap between the scalar version and the parallel versions for all true densities, and the various parallel versions exhibit very similar behavior.
From the experiments shown in Figure 3.8, we draw the following conclusions. The BOSCC versions incur an overhead, relative to the select version, due to the added branches, and sometimes this overhead makes them unprofitable. For this reason, we have decided to use just one BOSCC instruction in our compiler, comparable to the BON version. We have also determined that low true density can be used as one predictor of profitability. In addition, the profitability of the BON version over the select version increases as the cost of the instructions in the branch body increases. Also, as parallelism increases, the profitable true-density range of the BON version actually decreases. While not shown in these experiments, a related profitability criterion is how many instructions appear in the code bypassed by the branch; more instructions lead to greater benefit. Finally, the cost of memory access instructions can dwarf the benefits of parallelizing the computation, but the BON version performs comparably to the best version for all true densities. In general, while not always the best-performing version, the BON version behaves comparably to the best version in all of the experiments, whereas the scalar and select versions are sometimes much slower than the others. Based on the insights presented in this section, we build a model that can be used to guide the generation of BOSCC instructions only when they are profitable.

3.4.2 BOSCC Model

The BOSCC model determines the profitability of using a BOSCC instruction to bypass code, allowing the compiler to decide whether or not to generate a BOSCC instruction. The model uses two key properties of the code to determine profitability. The first, PAFS (percentage of all-false superwords), is the percentage of superword predicates in which all fields are false; it indicates how frequently a BOSCC branch is taken. Determining the PAFS value associated with a particular superword predicate must be done dynamically, and it is computed in a separate profiling phase as discussed in Section 3.4.3. The second, NBI (number of bypassed instructions), is the number of instructions bypassed when a BOSCC branch is taken, which represents the number of instructions for a single execution of the parallelized code; NBI can be computed statically by the compiler. The instruction counts of the select and BOSCC versions are computed by Equations 3.1 and 3.2, respectively, and a BOSCC instruction is profitable whenever NI(select) > NI(BOSCC).

    NI(select) = NBI                              (3.1)
    NI(BOSCC)  = NBI + 1 − PAFS × NBI             (3.2)

In Equation 3.2, we add one instruction for the BOSCC branch itself and subtract the expected number of instructions skipped by the BOSCC branch (PAFS × NBI). In reality, the cost of executing a BOSCC instruction may be higher or lower than that of other instructions, depending on how the branch predictor performs. An additional weight on the BOSCC instruction could be introduced to improve the precision of the model, but since it is machine-specific, we omit it here.
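In code, the profitability decision reduces to a one-line comparison. The helper below (our naming) mirrors Equations 3.1 and 3.2.

    /* Equations 3.1 and 3.2: estimated instruction counts of the select
       and BOSCC versions of a region, and the resulting decision.
       pafs is the profiled fraction of all-false superword predicates
       (0.0 .. 1.0); nbi is the statically counted number of bypassed
       instructions. */
    static int boscc_is_profitable(double pafs, int nbi) {
        double ni_select = nbi;                      /* Equation 3.1 */
        double ni_boscc  = nbi + 1 - pafs * nbi;     /* Equation 3.2 */
        return ni_boscc < ni_select;
    }

Rearranging the two equations, the test is simply PAFS × NBI > 1: the branch pays off as soon as it is expected to skip more than the one instruction it adds.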
However, it ignores locality effects, which m ust be addressed separately. To provide intuition as to why parallelization using BOSCC is more profitable than scalar execution of the equivalent code, let us assume th a t a scalar instruction is m apped to a single equivalent superword instruction and th at the run tim e is com puted as the 45 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. A lgorithm IN ST R U M E N T : Given a basic block B P < — find superword predicates (B) if (P = = 0) return Insert a basic block counter to B for each superword predicate pred € P Insert a counter for pred (a) Algorithm vec = vec_ld(i_0, ptr); *(_basicblock + 0) = *(_basicblock + 0) + 1; v e c ll8 = vecJd(i_0, ptrl33); vecll9 = vec_ld(i_0, ptrl34); vecl21 = vec_cmpeq(vec, v ecl20); vecl23 = vec_cm peq(vecll8, vecl20); vecl25 = vec_cmpeq(vecll9, vecl24); vecl26 = vec_and(vecl21, vecl23); vecl27 = vec_and(vecl26, vecl25); vecl29 = vec_cmpeq((vector unsigned char)vecl27, vecl20); vecl30 = vecl29; sel = vecJd(i_0, p tr 135); v ecl3 8 = (vector bool char)vec_splat_u8(0); instrum ent = vec_all_eq(vecl30, vecl38); if (instrum ent = = 1) { *(_superword_predicates + 0) = *(_superword_predicates + 0) + 1 } sel = vec_sel(sel, vec, vecl30); (b) Exam ple Figure 3.9: Autom atic instrum entation to compute PAFS in profiling phase. Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. number of executed instructions. In this specific situation, we can have a parallelized code using a BOSCC where each instruction is the superw ord counterpart of the scalar instruction in the original. The BOSCC can be thought of as the counterpart of the original scalar branch. If the branch body is executed in the scalar version more than once out of SWS iterations, the branch body in the BOSCC version will be executed exactly once for SWS scalar iterations. In this case, the version using BOSCC will run faster than the scalar version because of less loop overhead. If the branch body is not executed in the scalar version for SWS iterations, the branch body in the BOSCC version also will not be executed and will run faster because of less loop overhead. 3.4.3 P ro filin g S u p p o rt to C o m p u te P A F S The PAFS value in the previous model is determ ined using autom atic instrum entation in a separate profiling phase 1. Figure 3.9(a) shows the simple algorithm for inserting instrum entation code. First, for each basic block, all superword predicates are identified. Next, for each basic block th a t contains superword select instructions, we m easure the total num ber of tim es the block is executed and, for each predicate, the num ber of BOSCC’s taken. To increment the counter only when the superword predicate contains false values in all the fields, we also use a BOSCC instruction. Use of BOSCC expedites the profile run as compared to checking the individual fields in a sequential loop. An example of instrum ented code is shown in Figure 3.9(b). The instructions in bold font are added for profiling. 1 While pro filing has limitations in deriving dynamic information, particularly when a d i f f e r e n t input data s e t i s used than was used in the profil ing stage, we forgo more elaborate approaches f o r deriving dynamic information o n-t he-fly, since is sues of deriving dynamic information axe orthogonal to the focus of t h i s work. Other approaches could also be used to derive the value of PAFS. 
3.4.4 Identifying BOSCC Predicates

Prior to code generation, the compiler locates the predicates associated with select instructions and identifies the set of instructions guarded by each predicate. The third operand of each superword select instruction, as shown in Figure 3.3, represents a predicate. The algorithm that extends these predicates to other instructions is shown in Figure 3.10. Initially, the constant true predicate is associated with all instructions. The algorithm in Figure 3.10(a) scans the code to locate select instructions. For each select instruction whose first source operand and destination operand are the same, it associates the predicate found in the third source operand with the select instruction, and then follows UD- and DU-chains to locate other instructions with which this predicate can be associated. Two sets of instructions are considered, as shown in Figures 3.10(b) and (c).

The goal of the algorithm in Figure 3.10(b) is to identify the set of instructions that are executed only when the predicate evaluates to true. The result of a superword select instruction is its first operand (src1) when the predicate pred contains all false values. We can therefore bypass any instructions that define the value of the second operand src2 whenever all fields of pred are false. This set of instructions can be thought of as the branch body of the original program, although it may include an even larger set of instructions. The algorithm IdentifyBranchBody recursively follows the definitions of the variables contributing to the value of src2. Those that have a single definition reaching a single use can be guarded by the predicate pred, and can thus be bypassed by the BOSCC instruction.

The goal of the algorithm in Figure 3.10(c) is to eliminate unnecessary memory accesses that occur when all fields of pred evaluate to false. If a load into src1 and a store of dst occur in the code, the value is not modified between the load and the store, and no other instructions depend on this load and store, then both memory accesses can be predicated with pred. The algorithm in Figure 3.10 guarantees that at most one predicate is associated with each superword instruction.

Algorithm ISP(B): Given a basic block B

    // Initially, all instructions are associated with the constant true predicate
    for each select instruction I : "dst = select(src1, src2, pred)" ∈ B where dst == src1
        // src2 is associated with the 'true' value of pred,
        // src1 with the 'false' value
        predicate(I) ← pred;
        IdentifyBranchBody(src2, I, pred);
        IdentifyMemoryAccesses(src1, dst, pred);

(a) Identifying superword predicates

Algorithm IdentifyBranchBody(src, I, pred): Given an operand src, an instruction I, and a predicate pred

    rd ← reaching definitions of src;
    if (rd is not a single reaching definition ∨ I is not the only use of rd) return;
    predicate(rd) ← pred;
    for each source operand src' of rd
        IdentifyBranchBody(src', rd, pred);

(b) Identifying the branch body

Algorithm IdentifyMemoryAccesses(src, dst, pred):

    rd ← reaching definitions of src
    u ← uses of dst
    if (rd is a single reaching definition ∧ rd is a load ∧ u is the only use ∧ u is a store ∧ rd and u access the same address)
        predicate(rd) ← pred
        predicate(u) ← pred

(c) Identifying unnecessary memory accesses

Figure 3.10: Algorithm to identify a predicate for instructions.
T he algorithm in Figure 3.10 guarantees th a t at most one predicate is associated with each superword instruction. 49 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. A lgorithm FBR(B): Given a basic block B n < — 0 Region [0] < — new region (NULL) current < — prev < — NULL for each instruction I G B pred < — predicate(I) if (current ^ pred) Region [n] .end « — prev n-|— h Region[n] < — new region (pred) Region [n] .moved < — false Region [n] .begin < — I current < — pred prev < — I Region[n++].end < — I for (i= l; i< n; i+ + ) for (j= i+ l; j< n ; j+ + ) if (Region[i],predicate ^ NULL A Region[i].moved = = false A Region [i]. predicate = = Region [j].predicate) if (Region [j] can be moved after Region [i] .end) move instructions in Region [j] after Region [i] .end Region [j]. moved < — true else if (Region[i] can be moved before Region [j],begin) move instructions in Region[i] before Region [j],begin Region[i],moved true return Region, n (a) Form BOSCC regions A lgorithm Insert-BO SCC(B): Given a basic block B B ' <- ISP(B) R, n FB R (B ') for (i= l; i<n; i+ + ) if (R[i].moved = = false A R[i].predicate NULL) NLselect < — # instructions(R[i]) NLboscc NLselect + 1 - PAFS(R[i]) x NI_select if (NLboscc < NI_select) Insert boscc(R[i]) (b) BOSCC insertion algorithm main Figure 3.11: BOSCC insertion algorithm. 50 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 3 .4 .5 In se r tin g B O S C C In str u c tio n s Figure 3.11(b) shows the main algorithm to insert BOSCC instructions. After the predi cate for each instruction is identified, instructions with the same predicate are combined into a BOSCC region if there are no intervening dependences. In the algorithm shown in Figure 3.11(a), the initial BOSCC regions are formed by finding consecutive instructions guarded by the same predicate. Then the BOSCC regions associated with the same non constant predicate are merged if no d ata dependences with the intervening instructions prevent the code motion. The algorithm first checks if the later region can be moved to the end of the earlier region. If this is not possible because of the d a ta dependences with the intervening instructions, the algorithm checks if the earlier region can be moved be fore the first instruction of the later region. T he goal is to form the largest possible region guarded by a single BOSCC predicate. The num ber of adjacent instructions guarded by the same predicate provides the value of N B I for the BOSCC model, while the value of PAFS is derived from profiling. If profitable, a BOSCC instruction is inserted just prior to the instructions th at form a BOSCC region, and it branches to the instruction immediately following the last instruction of the BOSCC region. 51 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. C hapter 4 S U P E R W O R D -L E V E L L O C A L IT Y W hile the m ost im portant optim ization opportunity for the architectures supporting SLP is to exploit parallelism in the SIMD functional unit, another as im portant optim ization is to exploit mem ory hierarchy to reduce memory access tim e. Since parallelization is not as effective when bottleneck is memory accesses, the optim izations targeting memory hierarchy are even more im portant for the architectures supporting SLP. 
A key idea is to notice that a superword register file offers a much larger space than a scalar register file to store frequently used data items. We treat the superword register file as a small compiler-controlled cache. Our approach is distinguished from previous work on increasing reuse in cache [17, 23, 26, 28, 29, 38, 66, 69], in that the compiler must also manage replacement, and thus, explicitly name the registers in the code. As compared to previous work on exploiting reuse in scalar registers [69, 10, 45], the compiler considers not just temporal reuse, but also spatial reuse, for both individual statements and groups of references. Exploiting spatial and group reuse in superword registers requires more complex analysis than exploiting temporal reuse in scalar registers, to determine which accesses map into the same superword.

We develop an algorithm and a set of optimizations to exploit reuse of data in superword registers to eliminate unnecessary memory accesses, which we call superword-level locality (SLL). In conjunction with exploiting SLP, the algorithm performs what we call superword replacement, to replace accesses to contiguous array data with superword temporaries and exploit reuse by replacing accesses to the same superword with the same temporary. Following this code transformation, a separate compilation pass will be able to allocate superword registers corresponding to the superword temporaries. To enhance the effectiveness of superword replacement, it is combined with a loop transformation called unroll-and-jam, whereby outer loops in a loop nest are unrolled, and the resulting duplicate inner loop bodies are fused together. Unroll-and-jam reduces the distance between reuses of the same superword when reuse is carried by an outer loop, and brings opportunities for superword replacement into the innermost loop body of the transformed loop nest. The optimization algorithm derives appropriate unroll factors for each loop in the nest that attempt to maximize reuse while not exceeding the number of available registers.

The remainder of this chapter is organized into five sections. Section 4.1 motivates the problem and introduces terminology used in the remainder of the chapter. Section 4.2 presents an overview of the superword-level locality algorithm. Section 4.3 describes how the algorithm computes the total number of registers required for exploiting reuse and the resulting number of memory accesses. Section 4.4 describes aspects of how the search space is navigated. Section 4.5 presents optimizations to actually achieve this reuse of data in superword registers.

4.1 Background and Motivation

In many cases superword-level parallelism and superword-level locality are complementary optimization goals, since achieving SLP requires each operand to be a set of words packed into a superword, which happens, with no extra cost, when an array reference with spatial reuse is loaded from memory into a superword register. Therefore, in many cases the loop that carries the most superword-level parallelism also carries the most spatial reuse, and benefits from SLL optimizations.
In this chapter, we achieve SLL and SLP somewhat independently, by integrating a set of SLL optimizations into an existing SLP compiler [39]. The remainder of this section motivates the SLL optimizations.

Achieving locality in superword registers differs from locality optimization for scalar registers. To exploit temporal reuse of data in scalar registers, compilers use scalar replacement to replace array references by accesses to temporary scalar variables, so that a separate backend register allocator will exploit reuse in registers [10]. In addition, unroll-and-jam is used to shorten the distances between reuses of the same array location by unrolling outer loops that carry reuse and fusing the resulting inner loops together [10].

In contrast, a compiler can optimize for superword-level locality in superword registers through a combination of unroll-and-jam and superword replacement. These techniques not only exploit temporal reuse of data, but also spatial reuse of nearby elements in the same superword. In fact, even partial reuse of superwords can be exploited by merging the contents of two registers containing superwords that are consecutive in memory (see Section 4.5.4). Thus, as is common in multimedia applications [57], streaming computations with little or no temporal reuse can still benefit from spatial locality at the superword-register level, in addition to the cache level.

While cache optimizations are beyond the scope of this thesis, we observe that the SLL optimizations presented here can be applied to code that has been optimized for caches using well-known optimizations such as unimodular transformations, loop tiling and data prefetching. When combining loop tiling for caches, superword-level parallelism and superword-level locality optimizations, the tile sizes should be large enough for superword-level parallelism, and for unroll-and-jam and superword replacement to be profitable.

These points are illustrated by way of a code example, with the original code shown in Figure 4.1(a). This example shows three optimization paths. Figure 4.1(d) optimizes the code to achieve superword-level parallelism. In Figures 4.1(b) and (c), we show how the original program can instead be optimized to exploit reuse in scalar registers, using unroll-and-jam and scalar replacement, respectively.

(a) Original loop nest:
for(i=0; i<n; i++)
  for (j=0; j<n; j++)
    a[i][j] = a[i-1][j] * b[i] + b[i+1];

(d) Superword-level parallelization (j-loop):
for(i=0; i<n; i++)
  for (j=0; j<n; j+=SWS)
    a[i][j:j+SWS-1] = a[i-1][j:j+SWS-1] * b[i] + b[i+1];

(b) Unroll-and-jam on (a) (i-loop):
for(i=0; i<n; i+=2)
  for (j=0; j<n; j++) {
    a[i][j] = a[i-1][j] * b[i] + b[i+1];
    a[i+1][j] = a[i][j] * b[i+1] + b[i+2];
  }

(e) Unroll-and-jam on (d) (i-loop):
for(i=0; i<n; i+=2)
  for (j=0; j<n; j+=SWS) {
    a[i][j:j+SWS-1] = a[i-1][j:j+SWS-1] * b[i] + b[i+1];
    a[i+1][j:j+SWS-1] = a[i][j:j+SWS-1] * b[i+1] + b[i+2];
  }
(c) Scalar replacement on (b):
tmp1 = b[0];
for(i=0; i<n; i+=2) {
  tmp2 = b[i+1];
  tmp3 = b[i+2];
  for (j=0; j<n; j++) {
    tmp4 = a[i-1][j] * tmp1 + tmp2;
    a[i+1][j] = tmp4 * tmp2 + tmp3;
    a[i][j] = tmp4;
  }
  tmp1 = tmp3;
}

(f) Superword replacement on (e):
tmp1[0:SWS-1] = b[0:SWS-1];
stmp1 = tmp1[0];
stmp2 = tmp1[1];
field = 2;
for(i=0; i<n; i+=2) {
  // 'field' denotes an index into 'tmp1' for stmp3
  if(field == 0)
    tmp1[0:SWS-1] = b[i+2:i+SWS+1];
  stmp3 = tmp1[field];
  for (j=0; j<n; j+=SWS) {
    tmp2[0:SWS-1] = a[i-1][j:j+SWS-1] * stmp1 + stmp2;
    a[i+1][j:j+SWS-1] = tmp2[0:SWS-1] * stmp2 + stmp3;
    a[i][j:j+SWS-1] = tmp2[0:SWS-1];
  }
  stmp1 = stmp3;
  stmp2 = tmp1[field+1];
  field = (field+2)%SWS;
}

Figure 4.1: Example code for SLL.

In Figures 4.1(e) and (f), we combine these ideas, using unroll-and-jam and superword replacement, respectively, to transform the code in (d) for both superword-level parallelism and superword-level locality.

Table 4.1 shows how the three different optimization paths affect the number of array accesses to memory in the final code. The original code has n^2 reads and writes to array a and 2n^2 reads to array b. Exploiting superword-level parallelism in loop j, as in Figure 4.1(d), reduces the number of reads and writes to array a by a factor of SWS, since each load or store operates on SWS contiguous data items; for array b, there is no change since the array is indexed by i rather than j. If instead the code was optimized for scalar register reuse, as in Figure 4.1(c), we can reduce the number of array reads of a by a factor of 2, and reads of b by a factor of n, with the number of writes remaining the same. By combining superword-level parallelism and superword-level locality as in Figure 4.1(f), we see that the number of reads and writes is further reduced by a factor of SWS.

            Original        Scalar register reuse   SLP only           SLP and SLL
            Figure 4.1(a)   Figure 4.1(c)           Figure 4.1(d)      Figure 4.1(f)
  Reads     3n^2            n^2/2 + n               2n^2 + n^2/SWS     (n^2/2 + n)/SWS
  Writes    n^2             n^2                     n^2/SWS            n^2/SWS

Table 4.1: Number of array accesses under different optimization paths.

Figure 4.1(f) illustrates some of the challenges in exploiting reuse in superwords. Analysis must identify not just temporal, but also spatial reuse, and for both individual statements and groups of references. The compiler also must generate the appropriate code to exploit this reuse; for example, we select scalar fields of b from the superword, since we are not parallelizing the i loop. The remainder of this chapter describes how the compiler automatically generates code such as is shown in Figure 4.1(f).

4.2 Overview of Superword-Level Locality Algorithm

The superword-level locality algorithm has three main steps, as summarized below. Each step will be described in more detail in the three subsequent sections.

Step 1: Identifying Reuse. The first step of the algorithm is to identify both array references and loops carrying reuse. The array references carrying reuse are the ones for which superword replacement may be applicable. The loops carrying reuse are the ones to which the algorithm will consider applying unroll-and-jam.
Section 2.2 gives a detailed description of data reuse. For the purposes of this algorithm, the relevant dependences carrying reuse are a subset, and are characterized as follows:

1. We consider only true dependences, input dependences, and output dependences.

2. We consider only lexicographically positive dependences.

3. A dependence vector must be consistent, or it must be invariant with respect to one of the loops in the nest.

Applying unroll-and-jam to a loop i with a consistent dependence varying with respect to loop i can create loop-independent dependences in the innermost loop of the unrolled loop body. In the example in Figure 4.1(a), there is a true dependence between references a[i][j] and a[i-1][j] with distance vector (1, 0). After unroll-and-jam, a loop-independent dependence is created between a[i][j] in the first statement and a[i][j] in the second statement of the loop body, creating a reuse opportunity.

In addition to reuse between copies of a reference created by unrolling, there can be reuse across loop iterations. References with consistent dependences carried by a loop have group reuse, which can be exploited by using extra registers to hold the data across iterations. As in previous work [10], our algorithm exploits reuse across iterations of the innermost loop only, because exploiting reuse carried by an outer loop could potentially require too many registers to hold the data between uses. Figure 4.2 shows how reuse can be exploited across iterations of the innermost loop by using one register to keep the data that is reused every two iterations.

(a) Original:
for(i=0; i<N; i+=4){
  vec1[0:3] = A[i:i+3];
  vec2[0:3] = A[i+8:i+11];
}

(b) After exploiting reuse:
tmp[0:3] = A[0:3];
vec2[0:3] = A[4:7];
for(i=0; i<N; i+=4){
  vec1[0:3] = tmp[0:3];
  tmp[0:3] = vec2[0:3];
  vec2[0:3] = A[i+8:i+11];
}

Figure 4.2: Reuse across iterations.

For loop-invariant references, unroll-and-jam generates loop-independent dependences between the copies of the reference in the unrolled loop body, since the same location is being referenced by each copy.

Step 2: Determining unroll factors for candidate loops. The algorithm next determines the unroll factors for each candidate loop that carries reuse, as previously described, and for which unroll-and-jam is legal. The optimization goal is as follows.

Optimization Goal: Find unroll factors (X_1, X_2, ..., X_n) for loops 1 to n in an n-deep loop nest such that the number of memory accesses is minimized, subject to the constraint that the number of superword registers required does not exceed what is available.

The algorithm determines the unroll factors (X_1, X_2, ..., X_n) by searching for the combination of unroll factors that satisfies the above optimization goal. To guide the search, the algorithm calculates the total number of registers required for exploiting reuse, which
Section 4.4 describes aspects of how the search space is navigated. Step 3: C ode Transform ations - U nroll-and-Jam , Superword R eplacem ent, and R elated O ptim izations. Once the unroll factors are decided, unroll-and-jam is applied to the loop nest. Array references are replaced with accesses to superword tem poraries. As part of code generation, our compiler performs related optim izations to reduce the num ber of additional memory accesses and register requirem ents introduced by the SLP passes. These code transform ations are the topic of Section 4.5. 4.3 M od elin g R eg ister R eq uirem en ts & : N u m b er o f M em ory A ccesses This section presents the com putation of the num ber of registers required for exploiting data reuse in superw ord registers and the resulting num ber of mem ory accesses, which are the param eters used to guide the search for the combination of unroll am ounts to be applied to the loop nest. The next subsection describes how the algorithm computes the superword footprint, which represents the num ber of superwords accessed by the unrolled iterations of the loop nest as a function of the unroll factors. Subsection 4.3.2 presents the com putation of the extra registers needed for reusing data across loop iterations. The total number of registers and the corresponding num ber of mem ory accesses are computed in subsection 4.3.3. 5 9 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4 .3 .1 C o m p u tin g th e S u p erw ord F o o tp rin t This section presents the com putation of the superw ord footprint of the references V in a loop nest, F l(V '), after unroll-and-jam is applied to the nest with unroll factors (X u X 2,...,X n). The algorithm for computing the superw ord footprint for a loop nest first partitions the references in the loop into groups of uniformly generated references [69] (See Sec tion 2.2) 1. Then, for each group of references, it computes the number of superwords accessed in the unrolled loop body. Finally, the total num ber of superwords is com puted as the sum of those of each group of uniformly generated references. We first discuss how to com pute the superword footprint of a single reference as a function of the unroll factors of each unrolled loop. Then we discuss how to com pute the superword footprint of a group of uniformly generated references. The superword footprint of a group may be smaller th an the sum of the individual footprints, since the same superword may be accessed by two or more copies of the original references when the loops are unrolled. Our m ethod determines the num ber of superw ord registers required to hold th e data accessed by the references in the unrolled loop body. However, extra registers m ay be needed to, for example, align a superword operand which is already kept in superword registers. T h at is, the com putation may require more registers than those needed for storing the data. Therefore, we reserve some scratch registers for m anipulating d ata and com pute the num ber of registers needed just for storing the data accessed in the unrolled loops. 
To simplify the presentation, we assume a loop nest of depth n where all array references have array subscripts that are affine functions of a single index variable (SIV subscripts).² We also assume that each p-dimensional array referenced by the loop is defined as A[s_p][s_{p-1}]...[s_1], where s_h is the size of dimension h, 1 ≤ h ≤ p. Dimension 1 is the lowest dimension of the array, i.e., the dimension in which consecutive elements are in consecutive memory locations. A reference v to array A is then of the form A[a_p*l_p + b_p][a_{p-1}*l_{p-1} + b_{p-1}]...[a_1*l_1 + b_1]. Thus, a reference with SIV subscripts has each array dimension h associated with just a single loop index variable in the nest, and the loop index variable associated with h is represented as l_h. We also assume that the arrays are aligned to a superword in memory and that the loops are normalized.

²Our current implementation can handle affine SIV subscripts and certain affine MIV subscripts.

4.3.1.1 Superword Footprint of a Single Reference

For each reference v with array subscripts a_h*l_h + b_h, where h is the array dimension and l_h is the loop index variable appearing in subscript h, the number of superwords accessed by all copies of v when l_h is unrolled by X_{l_h} is given by the superword footprint of v in l_h, or F_{l_h}(v). When dimension h is the lowest array dimension (h = 1), the superword footprint is given by Equation (4.1). Equation (4.1a) corresponds to the footprint of a loop-invariant reference. Equation (4.1b) corresponds to the footprint of a reference with self-spatial reuse within a superword, as illustrated in Figure 4.3(a), and (4.1c) holds when the reference has no spatial reuse.

$$F_{l_h}(v) = \begin{cases} 1 & \text{(a) if } a_h = 0 \\ \left\lceil \dfrac{X_{l_h} \cdot a_h}{SWS} \right\rceil & \text{(b) if } a_h < SWS \\ X_{l_h} & \text{(c) if } a_h \geq SWS \end{cases} \qquad (4.1)$$

When h is one of the higher dimensions, 1 < h ≤ p, and loop l_h is unrolled, the offset between the footprints of each copy of v is a_h * ∏_{i=1}^{h-1} s_i, where s_i is the size of the i-th array dimension, as shown in Figure 4.3(b).

[Figure 4.3: Superword footprint of a single reference. Panel (a): h = 1 and a_h < SWS, a footprint of size a_h*X_{l_h} spanning consecutive superwords. Panel (b): h ≠ 1, X_{l_h} disjoint footprints separated by an offset of a_h * ∏_{i=1}^{h-1} s_i.]

Assuming that the size of the lowest array dimension (s_1) is larger than SWS, which is usually the case in practice for realistic array dimensions, each copy of v in the unrolled loop body corresponds to a separate footprint, as shown in Figure 4.3(b). Therefore the size of the footprint of v in l_h is the sum of the X_{l_h} disjoint footprints, and is recursively defined by Equation (4.2), where F_{l_1}(v) is computed as in Equation (4.1).

$$F_{l_h}(v) = X_{l_h} \cdot F_{l_{h-1}}(v) = \left( \prod_{i=2}^{h} X_{l_i} \right) \cdot F_{l_1}(v) \qquad (4.2)$$
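For illustration, the case analysis of Equation (4.1) and the product form of Equation (4.2) translate directly into a small C routine; this is a sketch with assumed parameter names (a1 is the lowest-dimension coefficient, and X[i] holds the unroll factor of the loop indexing dimension i+1):

  /* Footprint of a single reference in the lowest dimension (Eq. 4.1). */
  int footprint_dim1(int a1, int X1, int SWS) {
      if (a1 == 0)  return 1;                          /* (4.1a): loop-invariant   */
      if (a1 < SWS) return (X1 * a1 + SWS - 1) / SWS;  /* (4.1b): ceil(X1*a1/SWS)  */
      return X1;                                       /* (4.1c): no spatial reuse */
  }

  /* Extension to dimension h (Eq. 4.2): multiply by the unroll factors
     of the loops indexing dimensions 2..h. */
  int footprint(int a1, int X[], int h, int SWS) {
      int f = footprint_dim1(a1, X[0], SWS);           /* F_{l_1}(v) */
      for (int i = 1; i < h; i++)
          f *= X[i];
      return f;
  }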
For a single reference, the number of superword registers required to keep the superword footprint given by Equation (4.1), and the number of scalar registers that would be required if the same unroll factors were used, differ only when a_h < SWS, that is, when spatial reuse can be exploited in superword registers. For a group of uniformly generated references the analysis must also consider group reuse, as discussed next.

4.3.1.2 Superword Footprint of a Group of References

The number of superwords accessed by a group of uniformly generated references V = {v_1, v_2, ..., v_m} when loop l_h is unrolled by X_{l_h} is the superword footprint of the group, F_{l_h}(V). The superword footprint of a group consists of the union of the footprints of the individual references, as some of the reference footprints may overlap, depending on the distance between the constant terms in the array subscripts.

The footprints of two uniformly generated references may overlap in dimension h only if they overlap in all dimensions higher than h. For example, the footprints of references A[2i][j+2] and A[2i+1][j] do not overlap in the highest (row) dimension, since the first reference accesses the even-numbered rows of the array and the second accesses the odd-numbered rows. Therefore the footprints cannot overlap in the lowest (column) dimension. On the other hand, the footprints of A[2i][j+2] and A[2i+4][j] overlap in the row dimension for iterations i_1, i_2, 1 ≤ i_1, i_2 ≤ X_i, such that 2i_1 = 2i_2 + 4. For the iterations of i in which the footprints overlap in the row dimension, the footprints may overlap in the column dimension if there exist iterations j_1, j_2, 1 ≤ j_1, j_2 ≤ X_j, such that j_1 + 2 = j_2.

[Figure 4.4: Superword footprint of a group of references. Panels: (a) a_h ≥ SWS and (b_2 - b_1) < a_h*X_{l_h} and (b_2 - b_1) mod a_h = 0; (b) a_h < SWS and (b_2 - b_1) < a_h*X_{l_h}; (c) a_h ≥ SWS and (b_2 - b_1) > a_h*X_{l_h}; (d) a_h < SWS and (b_2 - b_1) > a_h*X_{l_h}.]

$$F_{l_h}(v_1, v_2) = \begin{cases} X_{l_h} + (b_2 - b_1)/a_h & \text{(a) if } a_h \geq SWS \text{ and } (b_2 - b_1) < a_h \cdot X_{l_h} \text{ and } (b_2 - b_1) \bmod a_h = 0 \\ \left\lceil (a_h \cdot X_{l_h} + b_2 - b_1)/SWS \right\rceil & \text{(b) if } a_h < SWS \text{ and } (b_2 - b_1) < a_h \cdot X_{l_h} \\ F_{l_h}(v_1) + F_{l_h}(v_2) & \text{(c) otherwise} \end{cases} \qquad (4.3)$$

The superword footprint F_L(V) of a group V, following unroll-and-jam, is computed as follows. First, the array dimensions with array subscripts that are a function of any of the unrolled loops are identified. Then, for each such dimension h, from highest to lowest dimension, the footprint is computed assuming that the footprints of the references in the group overlap in the higher dimensions.
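A sketch of Equation (4.3) in C, reusing footprint_dim1 from the previous sketch (parameter names are again illustrative):

  /* Footprint of two uniformly generated references a*l + b1 and
     a*l + b2 (b1 <= b2) in the lowest dimension, with unroll factor X. */
  int pair_footprint(int a, int X, int SWS, int b1, int b2) {
      int d = b2 - b1;
      if (a >= SWS && d < a * X && d % a == 0)
          return X + d / a;                      /* (4.3a): overlapping  */
      if (a < SWS && d < a * X)
          return (a * X + d + SWS - 1) / SWS;    /* (4.3b): overlapping  */
      return 2 * footprint_dim1(a, X, SWS);      /* (4.3c): disjoint     */
  }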
For each dimension h > 1, the algorithm partitions references into subsets such that each subset corresponds to a disjoint footprint in dimension h. Then, for each subset, the algorithm recursively computes the footprint in dimension h - 1, as we now describe.

Dimension h is the lowest dimension (h = 1). We first compute the group footprint of two array references, and then we extend it to m references. The footprint of group V = {v_1, v_2}, where references v_1 and v_2 have lowest dimension subscripts a_h*l_h + b_1 and a_h*l_h + b_2 such that b_1 ≤ b_2, when loop l_h is unrolled by X_{l_h}, is given by Equation (4.3) in Figure 4.4. Equations (4.3a) and (4.3b) apply when the two footprints overlap, that is, when (b_2 - b_1) < a_h*X_{l_h}, as shown in Figures 4.4(a) and (b). When the footprints do not overlap, the group footprint is the sum of the individual footprints, as in Equation (4.3c), with examples in Figures 4.4(c) and (d).

In Figure 4.4(a), the references have no self-spatial reuse, that is, a_h ≥ SWS, and each individual footprint is a set of X_{l_h} superwords. The footprints overlap if (b_2 - b_1) is evenly divided by a_h and there exists an integer value k, 1 ≤ k ≤ X_{l_h}, such that k = 1 + (b_2 - b_1)/a_h. This case corresponds to Equation (4.3a), which computes the group footprint precisely when the two references have group-temporal reuse. In Figure 4.4(b), both references have self-spatial reuse within a superword, that is, a_h < SWS. The corresponding footprint size is given by Equation (4.3b). In Figure 4.4(c), v_1 has no self-spatial reuse and each copy of v_1 in the unrolled loop body accesses a distinct superword, and the same is true for v_2. In Figure 4.4(d) both v_1 and v_2 have self-spatial reuse.

The footprint of a group V = {v_1, v_2, ..., v_m} with array subscripts a_1 * l_1 + b_i, such that 1 ≤ i ≤ m and b_1 ≤ b_2 ≤ ... ≤ b_m, is computed by first partitioning V into subgroups with disjoint footprints in the lowest dimension, as follows. A subgroup V_i = {v_{imin}, ..., v_{imax}} is defined by lowest dimension subscripts a_1 * l_1 + b_j, where, for all j with imin < j ≤ imax,

$$(b_{j-1} \leq b_j) \wedge (b_j - b_{j-1} < a_1 \cdot X_{l_1}) \wedge (b_{imin} = b_1 \vee b_{imin} - b_{imin-1} \geq a_1 \cdot X_{l_1}) \wedge (b_{imax} = b_m \vee b_{imax+1} - b_{imax} \geq a_1 \cdot X_{l_1}) \qquad (4.4)$$

Then the group footprint of V is computed as the sum of the disjoint footprints of the sets V_i, as in (4.5):

$$F_{l_h}(V) = \sum_i F_{l_h}(V_i) \qquad (4.5)$$

The footprint of each subgroup V_i is computed by extending Equation (4.3) to m > 2 references. For example, when the references in V have self-spatial reuse, as in Equation (4.3b) (a_1 < SWS), each subgroup V_i has a footprint consisting of contiguous superwords, since b_j - b_{j-1} < a_1 * X_{l_1} for all j such that imin < j ≤ imax. The footprint of V_i consists of the union of the individual footprints, with size given by Equation (4.6):

$$F_{l_h}(V_i) = \left\lceil \frac{a_1 \cdot X_{l_1} + b_{imax} - b_{imin}}{SWS} \right\rceil \qquad (4.6)$$

For example, if SWS = 4 and X = 4, group V = {A[i], A[i+2], A[i+5], A[i+12], A[i+14]} can be partitioned into two subgroups V_1 = {A[i], A[i+2], A[i+5]} and V_2 = {A[i+12], A[i+14]} with disjoint superword footprints.
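The partitioning of Equation (4.4) and the subgroup footprints of Equations (4.5) and (4.6) can be sketched in C as follows (assuming a_1 < SWS, so every subgroup footprint is contiguous; b[] holds the sorted constant terms):

  /* Partition m references a*l + b[0..m-1] into subgroups with disjoint
     footprints (Eq. 4.4) and sum the subgroup footprints (Eqs. 4.5, 4.6). */
  int group_footprint(int a, int X, int SWS, int b[], int m) {
      int total = 0, start = 0;
      for (int i = 1; i <= m; i++) {
          if (i == m || b[i] - b[i-1] >= a * X) {      /* subgroup boundary */
              /* Eq. 4.6: ceil((a*X + b_imax - b_imin) / SWS) */
              total += (a * X + b[i-1] - b[start] + SWS - 1) / SWS;
              start = i;
          }
      }
      return total;
  }

For the example above (a = 1, X = 4, SWS = 4, b = {0, 2, 5, 12, 14}), the routine splits the group at the gap 12 - 5 = 7 ≥ 4 and returns 3 + 2 = 5 superwords, matching Equation (4.7) below.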
The to ta l num ber of superwords accessed by the references in V is the sum of the disjoint footprints of sets V\ and V2, as in (4.7). Fh (V) = + Fh {V2) "1 * 4 + 5 — 0 '1*4 + 1 4 - 1 2 ' + 4 4 = 5 (4.7) D im ension h is not th e low est dim ension (h ^ 1). W hen h is one of the higher dimensions, the superword footprint of V = {v \,v 2, ■ — lvm } in loop Z /j is again the union of the individual footprints. From Section 4.3.1.1, the footprint of each reference Vi in the unrolled loop body consists of a set of X ih disjoint footprints (each footprint corresponding to a copy of Vi created by unrolling), and the offset between each pair of consecutive footprints is ah * n t i 1 where s* is the size of dimension i. Therefore the footprints of different references in the group may overlap, depending on th e values of a^, bj and the unroll factor X/h. The footprints of two uniformly generated references v\ and to overlap in dimension h if there exists an integer value k, 1 < k < Xjh th a t satisfies Condition (4.8): ah *k + bi = a h + b2. (4.8) th a t is, if (b2 — b\)%ah = 0 and (b2 — b\)/ ah + 1 < X ih. Furtherm ore, if there exists k satisfying the above condition, the footprints of the last X ih — k + 1 copies of V\ in the unrolled loop body overlap w ith those of the first X ih — k + 1 copies of v2. The footprint of {v \,v 2} is then given by Equation (4.9). 68 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Fih(vu v2) = (k - 1) * F i ^ i v i ) + ix ih - k + 1) * Fih_ 1{v i,v 2) + (k - 1) *F ih_1(v2) (4.9) To com pute the size of the entire footprint of V in Z /t, our algorithm partitions V into subsets Vi = {vimin,...,v imax} such th at, for any j , imin < j < imax, the pair { v j-i,V j} satisfies Condition (4.8). The footprint of V\ is the union of the footprints of its reference set and is com puted by extending Equation (4.9) to more th an two references. 4 .3 .2 R e g iste r s for R e u se A cro ss Itera tio n s In addition to superword registers for exploiting reuse in the body of the transform ed loop nest, extra superword registers may be required for exploiting reuse across iterations of the innerm ost loop for references w ith group-tem poral reuse carried by the innerm ost loop n of the transform ed loop nest. To com pute the num ber of registers needed to exploit group-tem poral reuse across iterations of loop n, the algorithm examines groups of references th a t have consistent dependences carried by n 3. Assume th a t unroll-and-jam has been applied to outer loops in a nest. After subsequently unrolling the innerm ost loop, extra registers are required if the reuse distance between references prior to unrolling loop n is larger th an the unroll amount, i.e., if dn > X n, as in Figure 4.2, where dn = 8 and X n = 4. Let C = {vi,V 2 , ■ ■ ■ V rn } be a set of references th at is a subset of a uniformly generated set, and, prior to unrolling the innerm ost loop resulting from unroll-and-jam by X n, each pair {vi,Vi+\) in C has a consistent dependence dl = (0,0,..., dl n), dl n > 0. Also, assume th at the array subscript of the lowest dimension of each reference v, in C is of the form 3Note that such references, i f their lowest dimension varies with n, may also have group-spatial reuse across loop i t e r a t i o n s . However, our algorithm focuses on exploiting group-temporal reuse across i tera t i o n s , since most of the group-spatial reuse i s achieved within the body of the unrolled loop. 
a_i * n + b_i, and that b_1 ≤ b_2 ≤ ... ≤ b_m. Unrolling loop n generates X_n copies of each original reference v_i in the body of the transformed loop nest. When d_n^i is a multiple of the unroll factor X_n, each pair of copies of references (v_i, v_{i+1}) will reuse data after d_n^i / X_n iterations. When d_n^i is not a multiple of X_n, some copies of a reference will reuse data after ⌊d_n^i / X_n⌋ iterations of n, while others will have a reuse distance of ⌈d_n^i / X_n⌉, requiring one more register per copy. Thus, each pair of copies of references (v_i, v_{i+1}) requires at most ⌈d_n^i / X_n⌉ - 1 additional superword registers to keep the data across iterations of the innermost loop. The number of registers required to exploit reuse across iterations of n by all pairs of copies is the number of registers required for each pair times the number of registers required to keep the superword footprint of reference v_i in the transformed loop nest:

$$R_A(v_i, v_{i+1}) = \left( \left\lceil \frac{d_n^i}{X_n} \right\rceil - 1 \right) \times F_L(v_i) \qquad (4.10)$$

Equation (4.10) may overestimate the number of registers if the footprint component (F_L(v_i)) overestimates registers, or for certain copies of references if d_n^i is not a multiple of X_n. The total number of registers required for exploiting reuse across iterations for set C with leading reference v_1 is given by:

$$R_A(C) = \sum_{1 \leq i < m} \left( \left\lceil \frac{d_n^i}{X_n} \right\rceil - 1 \right) \times F_L(v_i) \qquad (4.11)$$

4.3.3 Putting It All Together

Subsections 4.3.1 and 4.3.2 describe the computation of the number of registers required to exploit reuse in the body of the innermost loop (superword footprint) and across iterations of the innermost loop, assuming that unroll-and-jam has been applied to the loop nest. This section presents the computation of the total number of registers required and the total number of memory accesses in the innermost loop of the transformed loop nest, which are the metrics used to prune and guide the search for unroll factors described in Section 4.2.

The total number of registers required to exploit reuse is the sum of the superword footprint of the references in the innermost loop of the transformed loop nest and the number of registers needed for exploiting reuse across iterations of the same innermost loop. The superword footprint of the references, F_L(V), is computed as in Subsection 4.3.1. The total number of extra registers required for exploiting reuse across iterations of the innermost loop is computed as in Subsection 4.3.2, for each set C of loop-variant references with consistent dependences carried by the innermost loop. The total number of superword registers required is then:

$$R(V) = F_L(V) + \sum_C R_A(C) \qquad (4.12)$$

The total number of memory accesses in the innermost loop of the transformed loop nest is the sum of the memory accesses of each group C of references that are variant with the innermost loop n and have consistent dependences carried by n. For each group C, the number of memory accesses is given by the superword footprint of the leading reference of the group, v_1^C:

$$M(C) = F_L(v_1^C) \qquad (4.13)$$

The total number of memory accesses is then:

$$M(V) = \sum_C F_L(v_1^C) \qquad (4.14)$$
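The across-iteration register count of Equation (4.11) reduces to simple arithmetic. The following C sketch (with assumed array encodings: d[i] is the dependence distance d_n^i and F[i] is the footprint F_L(v_i)) implements it for one chain C:

  /* Registers needed to carry reuse across innermost-loop iterations. */
  int regs_across(int d[], int F[], int m, int Xn) {
      int r = 0;
      for (int i = 0; i < m - 1; i++)
          r += ((d[i] + Xn - 1) / Xn - 1) * F[i];  /* (ceil(d/Xn) - 1) * F_L(v_i) */
      return r;
  }

The total register requirement R(V) of Equation (4.12) is then the group footprint F_L(V) plus the sum of regs_across over all chains, and the memory-access estimate M(V) of Equation (4.14) is the sum of the leading references' footprints.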
4.4 Determining Unroll Factors

As previously stated, the goal of the search algorithm is to identify the unroll factors for the loops in the loop nest such that the number of memory accesses is minimized, without exceeding available registers. Thus, we must consider an n-dimensional search space, where each dimension has the number of elements corresponding to the iteration count of the loop. A full global search of this search space is prohibitively expensive, especially for deep loop nests or large loop bounds. Thus, we use a number of strategies for pruning the search space.

First, we eliminate from the search loops that do not carry reuse or for which unroll-and-jam is not safe. Further, we rely on the observation that the number of registers required monotonically increases with the unroll factor of a loop, assuming that all other unroll factors are fixed. Thus, we need not search beyond the unroll factors that exceed available registers. This latter point significantly prunes the search space in that the number of registers is usually fairly small (e.g., 32 superword registers on the AltiVec), so that the search is concentrated on fairly small unroll factors. These pruning strategies are used in our current implementation, and at least for the programs in this study, are quite effective at making the search practical.

Further pruning is possible by making the additional observation that, for each unrolled loop l, the amount of reuse of an array reference with reuse carried by l increases with the unroll factor X_l. Therefore reuse, like the register requirement calculation, is a monotonic, non-decreasing function of the unroll factor for each loop, given that the unroll factor of all other loops is fixed. Thus, within each dimension, holding all other unroll factors constant, binary search can be used rather than searching all points. We can also increase unroll factors by amounts corresponding to the superword size without much loss of precision, rather than considering each possible unroll factor, since the register requirements increase stepwise as a function of superword size. Additional pruning techniques that take into account the hardware's capability to take advantage of the results of optimization have been used in prior work [10, 63].

Our implementation navigates the search space from innermost loop to outermost loop, for the applicable loops in the nest, varying the unroll factor of one loop while keeping the unroll factors of all other loops fixed. Within a dimension of the search space, the lowest number of memory accesses will be derived at the largest unroll factor that meets the register constraint. However, lower unroll factors may also have the same estimate of memory accesses (because reuse is monotonically non-decreasing), so we identify the lowest unroll factor with the equivalent estimate of memory accesses. Then, the implementation considers the next applicable outer loop and the applicable inner loops nested inside it, and in a particular dimension, each time it reaches the largest unroll factor that meets the register constraint, it compares the estimated number of memory accesses to the lowest estimate so far to determine if a better solution has been found. The final result of the algorithm is the unroll factors corresponding to the best solution.
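One dimension of this search can be sketched as follows; R() and M() stand for the register and memory-access models of Section 4.3 and are hypothetical helpers here, as is the linear scan (the implementation may use binary search, as noted above):

  #include <limits.h>
  extern int R(int X[]);   /* register requirement model, Eq. 4.12 */
  extern int M(int X[]);   /* memory access model, Eq. 4.14        */

  int best_unroll_factor(int X[], int loop, int max_factor, int num_regs) {
      int best = 1, best_accesses = INT_MAX;
      for (int x = 1; x <= max_factor; x++) {
          X[loop] = x;
          if (R(X) > num_regs) break;    /* monotonic: no larger x can fit */
          if (M(X) < best_accesses) {    /* keep smallest x with best M(X) */
              best_accesses = M(X);
              best = x;
          }
      }
      X[loop] = best;
      return best;
  }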
As a subtle point, when unroll-and-jam is applied from outermost to innermost loop, unrolling the inner loop does not affect data access patterns or reuse distance. For this reason, inner loop unrolling is not performed in earlier work [10]. In our context, however, because of the relationship between superword-level parallelism and superword replacement, inner loop unrolling exposes opportunities for superword loads and stores and thus can impact the analysis of register requirements. Nevertheless, when reuse is exploited across iterations of the innermost loop body as described in Section 4.3.2, it is not necessary to unroll the innermost loop beyond the superword size to achieve the goal of considering register requirements in conjunction with superword-level parallelism. Note, however, that smaller unroll factors for the innermost loop may be selected, if an unroll-and-jam of an outer loop carries more parallelism and reuse.

Although this search should theoretically find the optimal solution according to our optimization criteria, in fact the solution is not guaranteed to result in the fewest number of memory accesses, for a number of reasons. First, in a few cases as noted, the register requirement analysis defined in the previous section must conservatively approximate. Second, it is difficult to estimate the register requirements used to hold temporaries, so we conservatively approximate this as well. Third, there is a tradeoff between using extra registers to hold values across iterations, as discussed in Section 4.3.2, versus using them to actually exploit reuse within the transformed innermost loop body. In fact, in general the algorithm does not take into consideration the amount of reuse resulting from performing superword replacement on specific references; replacing some references has more impact on decreasing memory accesses than others.

4.5 Code Transformations

The previous two sections have described how the compiler analyzes the code to identify reuse, register requirements and the unroll factors leading towards the lowest number of memory accesses. In this section, we describe how these analyses are used in transforming the code to achieve the desired result.

In the previous section, we showed how consideration of superwords instead of scalar variables greatly increases the complexity of determining the number of registers and memory accesses associated with exploiting reuse under different unroll amounts. In this section, we further discuss the increased complexity of code generation when performing superword replacement instead of scalar replacement. The chief source of code generation complexity is the need for superword objects to be properly aligned, as in the following examples.

When performing memory operations, the architecture may actually require that an access be aligned at superword boundaries. For example, the AltiVec ignores the last four bits of an address when performing a superword load or store. In such an architecture, when an access is not aligned at a superword boundary, the compiler or programmer must read/write two adjacent superwords. A series of additional instructions packs the two superwords for reads, or unpacks a superword into its corresponding two superwords for writes.
Even on architectures that support memory accesses not aligned at superword boundaries, such as Intel's SSE, there is a performance penalty on unaligned accesses because the hardware must perform this realignment.

To perform an arithmetic or logical operation on two superword registers, the fields of the two operands must also be aligned. For example, to add the third and fourth fields of one superword register to the first and second fields of another, one of the registers must be shifted by two fields. Consider also the following example:

  for i = 1, n
    c[i] = a[2i] + b[i]

The access to a has a stride of 2, while the access to b has a unit stride. Thus, the compiler or programmer must first pack the even elements of a into a superword register before adding them to the elements of b. A third example occurs when exploiting partial reuse of a superword, where data in a register must be aligned to accommodate the next operation.

In the SLP compiler, the default solution to alignment involves packing data through memory. The SLP compiler allocates superword variables by declaring them using a special vector type designation, which is interpreted by the backend compiler to align the beginning of the variable to a superword boundary in memory. The start of each dimension of an array of such objects should also be aligned, by padding if necessary. Under these assumptions, the SLP compiler can detect when operations are unaligned. Unaligned data is packed into an aligned superword in memory before being loaded into a superword register, and is unpacked before storing back to memory.⁴

⁴For architectures that support copying between scalar and superword register files, such as Intel's SSE and DIVA, this packing can be performed more efficiently through register copies.
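For the strided example above, one possible AltiVec packing sequence is the following sketch (using <altivec.h> intrinsics, and assuming a and b are 16-byte aligned float arrays and i is a multiple of 4; the byte pattern even, which selects fields 0 and 2 of each loaded superword, is our own):

  vector float v1 = vec_ld(0,  &a[2*i]);        /* a[2i]   .. a[2i+3] */
  vector float v2 = vec_ld(16, &a[2*i]);        /* a[2i+4] .. a[2i+7] */
  static const vector unsigned char even =
      { 0,1,2,3, 8,9,10,11, 16,17,18,19, 24,25,26,27 };
  vector float aEven = vec_perm(v1, v2, even);  /* a[2i], a[2i+2], a[2i+4], a[2i+6] */
  vector float cV = vec_add(aEven, vec_ld(0, &b[i]));

Packing of this kind adds instructions, and possibly registers, beyond the loads themselves, which is one reason alignment is treated carefully in the transformations below.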
The loop is transform ed so th at accesses corresponding to a particular reference in the m ain loop body are aligned to superword boundaries. If there are m ultiple references and different choices for index set splitting are needed to align specific references, we select a representative reference that, if aligned through index set splitting, will also maximize alignment for other references. The reference selected m ust have unit stride within the innerm ost loop. 76 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1) fo r (i = 1; i < 64; i+ + ) 2) out [i] = 0.0; 3 ) 4) fo r (i = 256; i < 320; i+ + ) 5) for (j = 0; j < 256; j+ + ) 6) out[i-256] = out[i-256] + in[i-j] * coe[j]; (a) Original 1) fo r (i = 1; i < 4; i+ + ) 2) out[i] = 0.0; 3 ) 4) fo r (i = 4; i < 64; i+ + ) 5) out[i] = 0.0; 6) 7) for (i = 256; i < 320; i+ + ) 8) for (j = 0; j < 256; j+ + ) 9) out[i - 256] = out[i - 256] + in[i - j] * coe[j]; (b) After index set splitting 1) for (i = 1; i < 4; i+ + ) 2) out[i] = 0.0; 3 ) 4) fo r (i = 4; i < 64; i + = 4){ 5) out[i + 0] = 0.0; 6) out[i + 1] = 0.0; 7) out[i + 2] = 0.0; 8) out[i + 3] = 0.0; 9) } 10) fo r (i = 256; i < 320; i + = 8) 11) fo r (j = 0; j < 256; j + = 8){ 12) out[i + 0 - 256] = out[i + 0 - 256] + in[i + 0 - (j + 0)] * coe[j + 0]; 13) out[i + 0 - 256] = out[i + 0 - 256] + in[i + 0 - (j + 1)] * coe[j + 1]; 1 4 ) ; 15) outfi + 7 - 256] = out[i + 7 - 256] + in[i + 7 - (j + 7)] * coe[j + 7]; 16) } (c) After unroll-and-jam Figure 4.5: Code generation example. 77 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 1) 2) 3 ) 4) 5 ) 6) 7) 8) 9 ) 10) 11) 12) 13) 14) 15) *((float *)&vecO + 3) — *( (float *)&vecl + 0) flat 1 flat2 fiat3 = *( (float *)&vecl + 1); flat4 = *((float *)&vecl + 2); *( (float *)&vec2 + 0) = fiatl; *( (float *)&vec2 + 1) = flat2; *( (float *)&vec2 + 2) = flat3; *( (float *)&vec2 + 3) = fiat4; vec4 = vec_add(vec3, vec2); vec_st(vec4, i * 4 + 0, (float *)&out[-63]); vec5 = vec_ld(i * 4, (float *)&out[-63]); flat5 = *( (float *)&vec6 + 2); flat6 = * ((float *)&vec7 + 2); *( (float *)&vec8 + 0) = flat5; *((float *)&vec8 + 1) = flat6; 1) flatl = 2) flat2 = 3) flat3 = 4) flat4 = 5) *( (float 6) *( (float 7) *((float 8) *( (float 9) vec4 = 10) flat5 = 11) flat6 = 12) *( (float 13) *( (float *( (float *)&vec0 + 3) *((float *)&vecl + 0) *((float *)&vecl + 1) *((float *)&vecl + 2) *)&vec2 + 0) = flatl *)&vec2 + 1) = flat 2 *)&vec2 + 2) = flat3 *)&vec2 + 3) = flat4 vec_add(vec3, vec2); *((float *)&vec6 + 2) *((float *)&vec7 + 2) *)&vec8 + 0) = flat5 *)&vec8 + 1) = flat6 (d) After SLP compilation (e) After superword replacem ent 1) tem p i = replicate(vec0, 3); 2) tem p2 = replicate(vecl, 0); 3) tem p3 = replicate(vecl, 1); 4) tem p4 = replicate(vecl, 2); 5) vec2 = shift_and_load(tempi, tem p i, 4); 6) vec2 = shift_anddoad(vec2, temp2, 4); vec2 = shift_andJoad(vec2, temp3, 4); vec2 = shift_and_load(vec2, tem p4, 4); vec4 = vec_add(vec3, vec2); 10) tem p i = replicate(vec6, 2); 12) tem p2 = replicate(vec7, 2); 11) vec8 = shift_andJoad(tem pl, tem pi, 4); 13) vec8 = shift_andJoad(vec8, temp2, 12); (f) After packing in registers Figure 4.5: Code generation example (Continued). 78 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Let i be the loop index variable for the innerm ost loop, and lb and ub are the lower and upper bounds for i. 
To derive the loop bounds for the copies of the innermost loop resulting from index set splitting, we begin with the starting address, addr, of the reference when i = lb, where addr = base + offset. Here, base refers to the beginning of the lowest dimension of the selected array, and offset is the offset within that dimension (recall that the beginning of each dimension is aligned at superword boundaries). The lower bound (split) of the main loop body is computed by the following equation:

$$split = \begin{cases} lb & \text{if } offset \bmod SWS = 0 \\ lb + SWS - (offset \bmod SWS) & \text{if } offset \bmod SWS \neq 0 \end{cases} \qquad (4.15)$$

If lb is constant, split can be computed at compile time. Otherwise, it is computed at run time. In the example in Figure 4.5, offset for out[1] is 1, so if SWS = 4, then split = 4.

4.5.2 Superword Replacement

Superword replacement removes redundant loads and stores of superword variables, using superword temporaries instead. We assume that this code transformation will be followed by register allocation that places these variables in registers. For example, in Figure 4.5(d) and (e), the store and load at statements 10 and 11 can both be eliminated, and vec4 can be used in place of vec5 in subsequent statements. Superword replacement is also affected by alignment, in that we detect redundant loads and stores by identifying distinct memory operations that refer to the same aligned superword, even if the addresses are not identical.

The compiler recognizes opportunities for superword replacement by determining that addresses and offsets for different memory accesses fit within the same superword, and verifies that there are no intervening kills to the memory locations. The current implementation uses value numbering [52] to detect such opportunities.
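The address canonicalization that makes this work, described next, can be pictured as the following sketch (the helper name superword_base is ours, and a 16-byte superword is assumed): every memory access is rewritten to name the aligned superword containing it, so that two accesses falling in the same superword receive the same value number.

  #include <stdint.h>

  /* Truncate an address to the boundary of its enclosing superword. */
  uintptr_t superword_base(uintptr_t addr) {
      return addr & ~(uintptr_t)15;   /* 16-byte superword assumed */
  }

For example, two scalar accesses 4 bytes apart inside one superword canonicalize to the same base address and are therefore recognized as touching the same superword.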
Memory packing moves d ata elements from a set of locations in memory (sources) to a superword location (destination) so th a t the destination superword contains contiguous data, aligned to a superword boundary or to another operand. For example, in Fig ure 4.5(e), superword variables vecO and v e c l are the sources and superword variable vec2 is the destination for memory packing in lines 1-8. Our im plem entation performs a transform ation we call register packing to optimize memory packing operations. A series of memory loads and stores for scalar variables are replaced by superword operations on registers, as shown in Figure 4.5(f). We identify a 80 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. destination as a superword d ata type th a t is the target of a series of scalar store instruc tions into its fields, such as vec2 in the example. The corresponding sources are identified by finding preceding loads of these scalar variables. If the inputs to these loads are fields of superword d ata types, then these superw ords are the sources. In the example, f l a t l is stored into a held of vec2, and there is a preceding load of f l a t l th at copies a held of source vecO. Once we hnd such a pattern, we verify the safety of this transform ation by guaranteeing th a t there are no intervening m odihcations or uses of either the scalar variables or destination superwords between loading the scalar variables and completion of storing into the destination. We also verify th a t the destination statem ents ultim ately produce contiguous d ata in the superw ord. We dehne source and destination indices as the helds in the source and destination superw ord variables, respectively. For example, the source index of vecO is 3 in line 1 of the example. Once the compiler identihes sources and destinations, it transform s the code to replace memory accesses with operations on superw ord registers. The register packing transfor m ation takes advantage of two instructions th a t are common in m ultim edia extension architectures. Replicate replicates one element of a source register to all elements of a tem porary output register (Figure 4.6(a)). Shift-and-load takes two input registers. The hrst input register is a tem porary, and is shifted left by the num ber of bytes specihed by the third argum ent. The same num ber of fields is taken from the second input regis ter, which is a tem porary derived from a source superword, to fill the output tem porary register (Figure 4.6(b)). Simply stated, we are shifting each source element into the destination superword, in order, so th a t the final result is a destination superword th at corresponds to contiguous aligned data. The steps of the register packing transform ation are as follows. 1. We sort the destination statem ents in increasing order of their destination indices. We then sort the source statem ents to correspond to the ordering of the destination 81 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. a [2] a [3 ] a [0 ] a[0] tempi a [0] a [0] a[0] tempi a [ 0 ] VTT 7 a[0] (a) tempi = replicate(a, 0) <b> P = shift_and_load(p, tempi, 4) Figure 4.6: Operations used for packing in registers. statem ents, so th at, for example, the scalar variable associated with the first source statem ent is the same as the scalar variable associated w ith the first destination statem ent. 2. 
2. For each source statement, in sorted order, we generate a replicate statement whose two inputs are the source superword and the source index, and the output is a superword temporary. For example, as in Figure 4.5(f), we have replaced line 1 of Figure 4.5(e) with temp1 = replicate(vec0, 3).

3. We replace each destination statement, in sorted order, with a shift_and_load operation. The first input is the destination superword. The second input is the temporary generated by the replicate of the corresponding source statement. The third argument, the shift amount, usually involves shifting by a single superword field. For the last destination field, the shift amount is the difference, in bytes, between the SWS and the last destination field. For completely filled destination superwords, it will also be just a single field. For example, in lines 1-8 of Figure 4.5(e), the destination superword is completely filled, so the shift amount is always a single 4-byte field. In lines 10-13, however, only the first two fields are filled, so the shift amount of the last destination statement is a total of 12 bytes.

4. Source statements are deleted if the scalar variables are not live beyond the corresponding destination statements.

4.5.4 An Example: Shifting for Partial Reuse

In addition to the three optimization opportunities described in this section, we discovered a new optimization opportunity, called shifting, for reducing memory accesses. In shifting, data in superword registers is partially reused. Partial spatial reuse of superwords occurs when distinct loop iterations access data in consecutive superwords in memory, partially reusing the data in one or both superwords, as shown by the example in Figure 4.5(a), and illustrated graphically in Figure 4.7.

[Figure 4.7: Shifting. One word is shifted out of register R1 while the next word, loaded from the adjacent superword in R2, is shifted in.]

In this example, as before assuming that SWS = 4, array reference in[i - j] has partial spatial reuse in loop i. For a fixed value of i and j, the data accessed in iteration (i, j) consists of the last three words of the superword accessed in iteration (i - 1, j), plus the first word of the next superword in memory. This type of reuse can be exploited by shifting the first word out of the superword, and shifting in the next word, as in Figure 4.7. As partially shown in Figure 4.5(c) and (f), only four superwords need to be loaded for the data accessed in the 64 copies of in[i - j] in the loop body, after shifting is applied. Before shifting, in[i - j] had to be loaded from memory (and possibly aligned) for each of the four copies of in[i - j] in the loop body.

This shifting opportunity arises frequently in both signal and image processing applications, where one object is compared to a subcomponent of another object, such as the example in Figure 4.5(a). We detect these opportunities through the analysis described in Section 4.2. The optimization shown in Figure 4.7 falls out from the combination of unroll-and-jam, alignment operations generated by the SLP compiler, superword replacement and register packing.
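On AltiVec, the replicate and shift_and_load macros of Figure 4.5(f) can be realized with the vec_splat and vec_sld intrinsics; the mapping below is a sketch (vec_sld shifts the concatenation of its two operands left by a constant byte count, which is also one way to implement the shifting of Figure 4.7):

  vector float temp1 = vec_splat(vec0, 3);        /* temp1 = replicate(vec0, 3)             */
  vector float temp2 = vec_splat(vec1, 0);        /* temp2 = replicate(vec1, 0)             */
  vector float vec2  = vec_sld(temp1, temp1, 4);  /* vec2 = shift_and_load(temp1, temp1, 4) */
  vec2 = vec_sld(vec2, temp2, 4);                 /* vec2 = shift_and_load(vec2, temp2, 4)  */

Because each temporary holds a replicated value, any four bytes of it can be shifted in from the right, which is why the replicate step precedes every shift_and_load.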
Chapter 5

CODE GENERATION

In addition to the optimizations described in Chapters 3 and 4, this chapter describes several optimizations and their associated code generation requirements to exploit SLP for full multimedia applications. They are techniques to parallelize type size conversion, reduction, and unaligned memory references, and a new packing algorithm called prepacking. Type size conversion is a common feature of multimedia applications, particularly to promote small data types before or after arithmetic operations. A reduction operation is a computation of a sum, product, maximum, or other commutative and associative operation over a set of data elements. A memory reference is unaligned if at least one pair of its run time addresses are not congruent with each other modulo superword width. Finally, when a memory reference can be packed with multiple other memory references, the first packing opportunity encountered by the SLP algorithm may not be the best choice. For each of these four cases, we describe our extension in the next four sections. In the last section, we summarize this chapter.

5.1 Type Size Conversion

Type size conversion is a common feature of multimedia applications, particularly to promote small data types before or after arithmetic operations. Type size conversion is more difficult on superwords than on scalar data types, due to alignment issues, instruction set limitations and the impact on parallelization. We extend the SLP compiler to perform type conversions in parallel.

On AltiVec, the available instructions supporting type size conversion convert to fields that are half or double the size of the source operand. Type size conversions of a factor larger than two must be broken into multiple conversions. The alignment offset of the destination variable is adjusted from that of the source variables. Predicate variables also require type conversions so that they match the size of the destination variable of the instruction being guarded.

To represent parallel type size conversion operations, we define the following parallel macros:

    dst1, dst2 = typesize_up(src)
    dst = typesize_down(src1, src2)

The macro typesize_up doubles the type size of the data fields in src by assigning the higher half to dst1 and the lower half to dst2. The macro typesize_down concatenates two superword operands src1 and src2, reduces the data field size by half and assigns the result to dst. These high-level parallel macros are replaced by a few AltiVec instructions during code generation. For signed operands, different AltiVec instructions are generated for the macro typesize_up than for unsigned operands.

    (a) Original:
        int in[1024]; short sh[1024];
        for (i=0; i<1024; i++)
            in[i] = (int)sh[i];

    (b) Our approach:
        int in[1024]; short sh[1024];
        for (i=0; i<1024; i+=8)
            in[i:i+3], in[i+4:i+7] = typesize_up(sh[i:i+7]);

    Figure 5.1: Parallelization of type size conversions.

Figure 5.1(b) shows the code generated by our approach for type size conversion. After eight short integers are loaded into a superword register, type size conversion is performed in parallel using a few AltiVec instructions, represented by typesize_up. Finally, the two superwords are stored in memory.
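One plausible AltiVec lowering of these macros, for signed 16-bit to 32-bit data, uses vec_unpackh/vec_unpackl for widening and vec_pack for narrowing. The sketch below is an assumed mapping for illustration, not the compiler's actual code generation; unsigned widening would instead merge with a zero vector, since the unpack instructions sign-extend.

    #include <altivec.h>

    /* Assumed lowering of typesize_up for signed 16-bit fields: the first
       (higher) four fields go to *dst1, the last (lower) four to *dst2. */
    void typesize_up(vector signed short src,
                     vector signed int *dst1, vector signed int *dst2)
    {
        *dst1 = vec_unpackh(src);   /* sign-extend the high four fields */
        *dst2 = vec_unpackl(src);   /* sign-extend the low four fields  */
    }

    /* Assumed lowering of typesize_down: truncate each 32-bit field to
       16 bits and concatenate the two inputs into one superword. */
    vector signed short typesize_down(vector signed int src1,
                                      vector signed int src2)
    {
        return vec_pack(src1, src2);
    }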
Among the 14 benchmarks used in the next chapter, two (MPEG2-dist1 and EPIC-unquantize) have type size conversions.

5.2 Reduction

A reduction operation is a computation of a sum, product, maximum, or other commutative and associative operation over a set of data elements. From the compiler's perspective, a reduction occurs when a location is updated on each iteration of a loop, where a commutative and associative operation is applied to that location's previous contents and some data value. In this case, it is safe to reorder the operations. However, reduction variables have dependences, so the compiler must transform the code to obtain parallel code from sequential code. Figure 5.2(a) shows a loop containing a reduction sum operation. When this loop is unrolled as shown in (b), scalar data dependences prevent packing the isomorphic statements.

    (a) Original:
        for (i=0; i<16; i++)
            sum = sum + a[i];

    (b) Unrolled:
        for (i=0; i<16; i+=4) {
            sum = sum + a[i];
            sum = sum + a[i+1];
            sum = sum + a[i+2];
            sum = sum + a[i+3];
        }

    (c) Reduction optimization:
        sumV = pack(0, 0, 0, 0);
        for (i=0; i<16; i+=4)
            sumV = sumV + a[i:i+3];
        sum1, sum2, sum3, sum4 = unpack(sumV);
        sum = sum + sum1;
        sum = sum + sum2;
        sum = sum + sum3;
        sum = sum + sum4;

    Figure 5.2: Parallelization of reduction sum.

We extend the SLP algorithm to support reductions in a way similar to the standard code generation for reductions on multiprocessors. We create as many private copies of the reduction variable as will fit in a superword. The private copies are packed into one superword, and the reduction operations are performed in parallel when the loop is parallelized. Figure 5.2(c) shows the code after the reduction optimization is applied to the loop in (a). Assuming a superword holds four elements, the four private copies are created and initialized to zero above the parallelized loop, below which a sequential add operation for each private copy accumulates into the global variable. Note that the pack and unpack instructions are moved outside the loop.

Private copies of a reduction variable are initialized with the identity of the associated operation. For the reduction sum in the above example, the private copies are initialized with zero. For reduction max / min, private copies are initialized with the reduction variable itself. Of the 14 benchmarks used in the experiments, four (TM, Max, MPEG2-dist1 and GSM-Calculation) have reduction operations.
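A concrete AltiVec rendering of the transformation in Figure 5.2(c) might look like the following. The function name and the union-based unpacking are illustrative assumptions rather than the compiler's output, and the sketch assumes a is 16-byte aligned and holds 32-bit integers.

    #include <altivec.h>

    /* Hedged sketch of the reduction optimization of Figure 5.2(c). */
    int reduce_sum(const int a[16])
    {
        vector signed int sumV = vec_splat_s32(0);   /* four private copies, all 0 */
        union { vector signed int v; int s[4]; } u;
        int i, k, sum = 0;

        for (i = 0; i < 16; i += 4)
            sumV = vec_add(sumV, vec_ld(0, &a[i]));  /* parallel accumulation */

        u.v = sumV;                                  /* unpack the private copies */
        for (k = 0; k < 4; k++)
            sum += u.s[k];                           /* sequential finalization */
        return sum;
    }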
5.3 Alignment Optimization

In Chapter 2, we described alignment analysis that finds constant offsets with respect to superword width for the run time addresses of each memory reference. A memory reference is unaligned if at least one pair of its run time addresses are not congruent with each other modulo superword width. In this section, we describe our approach to parallelize unaligned memory references.

In the SLP algorithm, two memory references are packed if they are adjacent to each other, access a constant offset with respect to superword width, and are not separated by any superword boundary. Thus, unaligned memory references are not parallelized. We loosen the requirements for packing memory references so that two memory references can be packed as long as they are adjacent.

    (a) Original:
        for (i=0; i<1024; i++)
            for (j=0; j<256; j++)
                t = inp[i+j];

    (b) Parallelized:
        for (i=0; i<1024; i++)
            for (j=0; j<256; j+=4)
                tV = inp[i+j:i+j+3];

    (c) Code generation:
        for (i=0; i<1024; i++)
            for (j=0; j<256; j+=4) {
                tV1 = load(&inp[i+j]);
                tV2 = load(&inp[i+j+4]);
                permV = perm_vec(&inp[i+j:i+j+3]);
                tV = permute(tV1, tV2, permV);
            }

    Figure 5.3: Parallelization of unaligned memory references.

Figure 5.3(a) shows an array reference inp[i+j] whose alignment offset varies with respect to superword width. With our extension, the array reference can be parallelized as shown in (b). While parallelized, the memory offset of inp[i+j:i+j+3] in (b) varies at run time with respect to superword width. For such unaligned superword memory references, we generate code such that the desired superword is obtained dynamically from two aligned superword memory accesses. Figure 5.3(c) shows the code generated from (b). After two adjacent superwords are loaded by aligned memory accesses and a permutation vector is generated from the address, the desired superword is obtained from the two superwords using a permute instruction. In general, the address of a superword memory reference can be aligned to a zero offset, aligned to a non-zero offset, or unaligned. Depending on the kind of alignment, our implementation generates a simple aligned load, a static alignment with two loads, or a dynamic alignment for an unknown alignment.
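On AltiVec, the dynamic alignment of Figure 5.3(c) is conventionally realized with vec_lvsl and vec_perm; the sketch below is an assumed rendering of the load / perm_vec / permute sequence for 32-bit integer data, not the compiler's literal output.

    #include <altivec.h>

    /* Assumed AltiVec realization of Figure 5.3(c): fetch a possibly
       unaligned superword starting at addr.  Note that when addr happens
       to be aligned, the second load reads one superword past the data,
       which is the standard caveat of this idiom at array boundaries. */
    vector signed int load_unaligned(const int *addr)
    {
        vector signed int tV1 = vec_ld(0, addr);        /* aligned superword      */
        vector signed int tV2 = vec_ld(16, addr);       /* next aligned superword */
        vector unsigned char permV = vec_lvsl(0, addr); /* permutation from addr  */
        return vec_perm(tV1, tV2, permV);               /* desired superword      */
    }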
5.4 Prepacking to Optimize Parallelization Overhead

The SLP algorithm packs isomorphic scalar instructions into superword instructions. The way in which an SLP compiler packs instructions governs the parallelism that can be exploited and the amount of parallelization overhead. The packing policy in the original SLP algorithm is very simple: two memory references are packed at the first opportunity where they satisfy the three conditions, that is, they are adjacent to each other, access a constant offset with respect to superword width, and are not separated by any superword boundary. This packing policy is quite effective in many common cases. However, when a memory reference can be packed with multiple other memory references, the first packing opportunity encountered by the SLP algorithm may not be the best choice.

Figure 5.4(a) shows an example loop nest used to illustrate this point.

    (a) Original:
        for (y=2; y<768; y++)
            for (x=0; x<1024; x++)
                e[y][x] = u[y][x] - u[y-2][x]
                        + u[y][x+1] - u[y-2][x+1]
                        + u[y][x+2] - u[y-2][x+2];

    (b) Unrolled:
        for (y=2; y<768; y++)
            for (x=0; x<1024; x+=4) {
                e[y][x+0] = u[y][x+0] - u[y-2][x+0] + u[y][x+1] - u[y-2][x+1]
                          + u[y][x+2] - u[y-2][x+2];
                e[y][x+1] = u[y][x+1] - u[y-2][x+1] + u[y][x+2] - u[y-2][x+2]
                          + u[y][x+3] - u[y-2][x+3];
                e[y][x+2] = u[y][x+2] - u[y-2][x+2] + u[y][x+3] - u[y-2][x+3]
                          + u[y][x+4] - u[y-2][x+4];
                e[y][x+3] = u[y][x+3] - u[y-2][x+3] + u[y][x+4] - u[y-2][x+4]
                          + u[y][x+5] - u[y-2][x+5];
            }

    (c) Parallelized by the MIT SLP compiler:
        for (y=2; y<768; y++)
            for (x=0; x<1024; x+=4) {
                e[y][x+0] = sum(u[y][x+0:x+2]) - sum(u[y-2][x+0:x+2]);
                e[y][x+1] = sum(u[y][x+1:x+3]) - sum(u[y-2][x+1:x+3]);
                e[y][x+2] = sum(u[y][x+2:x+4]) - sum(u[y-2][x+2:x+4]);
                e[y][x+3] = sum(u[y][x+3:x+5]) - sum(u[y-2][x+3:x+5]);
            }

    (d) Parallelized by prepacking:
        for (y=2; y<768; y++)
            for (x=0; x<1024; x+=4)
                e[y][x+0:x+3] = u[y][x+0:x+3] - u[y-2][x+0:x+3]
                              + u[y][x+1:x+4] - u[y-2][x+1:x+4]
                              + u[y][x+2:x+5] - u[y-2][x+2:x+5];

    Figure 5.4: Parallelization by prepacking.

The original loop nest contains adjacent memory references even before unrolling is applied. When the original SLP algorithm is applied to the unrolled loop body shown in (b), it packs the adjacent memory references in the same statement instead of packing them with their unrolled copies, resulting in the code shown in (c). In the first statement of Figure 5.4(b), three array references u[y][x+0], u[y][x+1] and u[y][x+2] are packed into a parallel memory reference u[y][x+0:x+2] in (c). Since the three array elements must be added into a scalar datum, a high-level operation sum consists of unpacking the three array elements and adding them in scalar mode. The code in (c) is inefficient not only because it involves scalar additions but also because it contains additional memory accesses necessary for unpacking data elements from a superword register to scalar registers.

To make a good choice when there are multiple statements with which a given statement can be packed, we need a basic block-level view that can be used to compare the costs of different packing possibilities. We developed an algorithm that packs isomorphic data dependence graphs instead of isomorphic statements. With this algorithm, we prefer parallelizing isomorphic statements from independent data dependence graphs over ones from the same data dependence graph. From the unrolled loop body, we first build data dependence graphs. The data dependence graphs for the code in Figure 5.4(b) are shown in Figure 5.5.

    Figure 5.5: Data dependence graphs for the loop body of Figure 5.4(b).

Next, the isomorphic scalar data dependence graphs are packed into a parallel data dependence graph, where the nodes represent parallel operations and operands. For two independent data dependence graphs to be packed together, they must be isomorphic [18] and, in addition, each pair of corresponding nodes should have the same operation. For memory reference nodes, there is an additional requirement: the two memory references should be adjacent. Figure 5.4(d) shows the parallel code generated by our approach, where all operations are performed in parallel mode. In our current implementation, this packing algorithm is applied conservatively, only when it is clearly profitable over the default strategy.
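The pairwise test at the heart of this packing decision can be sketched as a recursive walk over two dependence graphs, checking operation equality node by node and adjacency at memory references. The Node type and its fields below are hypothetical stand-ins for the compiler's internal representation, intended only to illustrate the conditions stated above.

    /* Hedged sketch of the prepacking test: two statements may be packed if
       their data dependence graphs are isomorphic, corresponding nodes carry
       the same operation, and corresponding memory references are adjacent.
       All types and fields here are illustrative, not the actual SUIF IR. */
    typedef struct Node {
        int op;                   /* operation code                       */
        int is_mem;               /* nonzero if a memory reference        */
        long addr;                /* symbolic address when is_mem is set  */
        int nkids;
        struct Node *kids[2];
    } Node;

    static int can_pack(const Node *a, const Node *b, int elem_size)
    {
        int i;
        if (a->op != b->op || a->nkids != b->nkids)
            return 0;                              /* not isomorphic           */
        if (a->is_mem && b->addr - a->addr != elem_size)
            return 0;                              /* memory refs not adjacent */
        for (i = 0; i < a->nkids; i++)
            if (!can_pack(a->kids[i], b->kids[i], elem_size))
                return 0;
        return 1;
    }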
Thus, our new packing algorithm is applied before the original packing algorithm, so that we can apply the original packing algorithm to the remaining scalar instructions. Because of this order of application, the new packing algorithm is called prepacking. While prepacking is effective when the original loop body contains adjacent memory references, as shown in Figure 5.4(a), similar situations are often generated by our superword-level locality (SLL) algorithm when multiple loops are unrolled. For example, when both loops in Figure 5.6(a) are unrolled as shown in (b), array references to inp have multiple packing choices. Such partial temporal reuse opportunities are common in multimedia applications.

    (a) Original:
        for (i=0; i<1024; i++)
            for (j=0; j<256; j++)
                temp[j] = inp[i+j]*fil[j];

    (b) Unrolled:
        for (i=0; i<1024; i+=4)
            for (j=0; j<256; j+=4) {
                temp[j+0] = inp[i+j+0]*fil[j+0];
                temp[j+1] = inp[i+j+1]*fil[j+1];
                temp[j+2] = inp[i+j+2]*fil[j+2];
                temp[j+3] = inp[i+j+3]*fil[j+3];
                temp[j+0] = inp[i+j+1]*fil[j+0];
                temp[j+1] = inp[i+j+2]*fil[j+1];
                temp[j+2] = inp[i+j+3]*fil[j+2];
                temp[j+3] = inp[i+j+4]*fil[j+3];
                temp[j+0] = inp[i+j+2]*fil[j+0];
                temp[j+1] = inp[i+j+3]*fil[j+1];
                temp[j+2] = inp[i+j+4]*fil[j+2];
                temp[j+3] = inp[i+j+5]*fil[j+3];
                temp[j+0] = inp[i+j+3]*fil[j+0];
                temp[j+1] = inp[i+j+4]*fil[j+1];
                temp[j+2] = inp[i+j+5]*fil[j+2];
                temp[j+3] = inp[i+j+6]*fil[j+3];
            }

    Figure 5.6: Multiple packing choices generated by unrolling multiple loops.

5.5 Summary

SLP is a new technique that provides new optimization opportunities. In addition to the two techniques described in the previous two chapters, we also developed other optimizations that can be used to enhance performance further. While common in many multimedia applications, type size conversion, reduction and unaligned memory references are not parallelized by the original SLP algorithm. Also, the simple packing policy of the original SLP algorithm is powerful in many common cases, but suffers when there are multiple choices for combining an object with others into a superword. In this chapter, we presented algorithms that can be used to generate efficient parallel code in such cases. All of these extensions working together are essential to obtain the results in the next chapter.

Chapter 6

EXPERIMENTS

Chapters 3, 4, and 5 introduced techniques to exploit superword-level parallelism in the presence of control flow, locality in superword registers, and code generation techniques to support these optimizations. These techniques are applicable both to multimedia extension architectures and to a processing-in-memory architecture, DIVA. We have implemented the techniques in the SUIF compiler [32] and evaluated the implementation on 14 benchmarks. In this chapter, we describe the implementation and the experimental evaluation.

This chapter is organized as follows. The next section describes the benchmarks and their input data sets. Implementation and experimental methodology are described in Sections 6.2 and 6.3, respectively. Section 6.4 presents an experimental evaluation of the performance of the benchmarks when all of our techniques are applied. Since this performance is the result of multiple techniques, we also perform separate experiments to identify the benefits of each individual technique. The benefits of packing data dependence graphs, exploiting SLP in the presence of control flow, and exploiting superword-level locality are discussed in Sections 6.5, 6.6, and 6.7 respectively.
6.1 Benchmarks

We use the set of 14 benchmarks shown in Table 6.1 to evaluate our compiler implementation, representing multimedia and scientific applications.

    Name              Description                                          Data Width                        # lines
    VMM               Vector-matrix multiply                               32-bit float                      60
    FIR               Finite impulse response filter                       32-bit float                      66
    YUV               RGB to YUV conversion                                16-bit integer                    110
    MMM               Matrix-matrix multiply                               32-bit float                      76
    Chroma            Chroma keying of two images                          8-bit character                   106
    Sobel             Sobel edge detection                                 16-bit integer                    128
    TM                Template matching                                    32-bit integer                    85
    Max               Max value search                                     32-bit float                      90
    TR                Shortest path search                                 32-bit integer                    94
    swim              Shallow water model                                  32-bit float                      429
    tomcatv           Mesh generation                                      32-bit float                      197
    MPEG2-dist1       MPEG2 encoder (dist1 function)                       8-bit character, 32-bit integer   157
    EPIC-unquantize   EPIC (Efficient Pyramid Image Coder)
                      (unquantize_image of unepic)                         16-bit integer, 32-bit integer    85
    GSM-Calculation   GSM encoder
                      (Calculation_of_the_LTP_parameters)                  16-bit integer, 32-bit integer    204

    Table 6.1: Benchmark programs.

The first nine are kernels consisting of a few loop nests. VMM and MMM are important kernels in scientific applications, FIR is frequently used in digital signal processing, and YUV performs conversion between different color encoding systems. Chroma, also known as blue screening, merges two images so that an object in a foreground image appears with the other image as a background. Sobel detects edges from a gray scale image by performing convolutions with two 3 by 3 pixel areas. TM is a representative kernel of an application performing image convolution between two images: a template and an input data image. Max is a kernel that looks for the maximum value. Since it is extracted from tomcatv, its input data is also collected by running the same application. TR is a core computation of the Floyd-Warshall shortest path algorithm [18]. The last five are benchmark programs. Swim and tomcatv are SpecFP applications written in Fortran. Swim is a weather prediction program based on the shallow water model [59], and tomcatv is a mesh generation program. MPEG2-dist1, EPIC-unquantize and GSM-Calculation are complete functions from the three applications in the UCLA MediaBench [41]. Table 6.2 shows the percentage of each application's execution time of the baseline code spent in these functions, measured on the platform described in Section 6.3. Each function takes up the largest fraction of the overall runtime in the application.

    Benchmark         Runtime (%)
    MPEG2-dist1       55
    EPIC-unquantize   25
    GSM-Calculation   49

    Table 6.2: Runtime percentage of three functions from UCLA MediaBench.

MPEG2-dist1 computes the total absolute difference between two blocks of video frames to convert uncompressed video frames into MPEG-1 and MPEG-2 video coded bitstream sequences. EPIC (Efficient Pyramid Image Coder) is an image data compression utility designed to allow extremely fast decoding at the expense of slower encoding. In EPIC, EPIC-unquantize restores the quantized values to decompress the compressed images. GSM is a European standard for mobile communications. In the GSM encoder, GSM-Calculation computes the long term predictor gain and the long term predictor lag for the long term analysis filter.

Table 6.3 shows the input data sizes for the benchmarks. For the last 8 benchmarks, two different input sizes are used.
Large sizes represent the standard inputs provided with the applications, whose data footprints are much larger than the L1 cache size. Smaller input sizes that fit in the L1 data cache are also evaluated to help isolate the potential gains of increased parallelism from the effects of the memory behavior of the benchmarks.

    Name              Input Size
    VMM               512 elements
    FIR               256 filter, 1M signal
    YUV               32K elements
    MMM               512 elements
    swim              Specfp95 reference input
    tomcatv           Specfp95 reference input
    Chroma            Large: 400 x 431 color image (1 MB); Small: 48 x 48 color image (12 KB)
    Sobel             Large: 1024 x 768 gray scale image (3 MB); Small: 1024 x 4 gray scale image (16 KB)
    TM                Large: 64 x 64 image, 72 32 x 32 templates (1.4 MB); Small: 16 x 64 image, 1 16 x 32 template (10 KB)
    Max               Large: 2 100 x 256 x 256 (52 MB); Small: 2 8 x 256 (16 KB)
    TR                Large: 2 1024 x 1024 (8 MB); Small: 2 16 x 16 (2 KB)
    MPEG2-dist1       Large: data blocks for the first 1000 calls (11 MB); Small: data blocks for the first 2 calls (22 KB)
    EPIC-unquantize   Large: reference input (393 KB); Small: first 4 calls (6 KB)
    GSM-Calculation   Large: reference input (1.1 MB); Small: first 50 calls (16 KB)

    Table 6.3: Input data sizes.

6.2 Implementation

Figure 6.1 illustrates the compiler implementation. The input to the system is a C program, which is then optimized by the SUIF passes in Figure 6.1.

    original C code
      1. Superword-Level Locality
      2. unroll
      3. alignment / distance analysis
      4. if-conversion
      5. parallelize (SLP)
      6. remove superword predicates (select)
      7. remove scalar predicates (unpredicate)
      8. superword replacement
      9. insert BOSCC
    output C code

    Figure 6.1: Implementation.

Superword-Level Locality (SLL) determines unroll factors based on the algorithm described in Chapter 4. Unroll performs loop unrolling. The unroll factors are either provided by the previous SLL pass or computed by dividing the superword width by the smallest data type size. As in [40], alignment / distance analysis determines whether memory references are aligned to superword boundaries and are adjacent to each other in memory. If-conversion is applied right before parallelization and results in code for which instructions are predicated. The next three passes can recognize predicates and use the predicate analysis described in Section 2.4. We extend parallelize (SLP) so that predicate operands are packed in the same way as the other operands. As described in Chapter 3, predicates are removed by remove superword predicates (select) and remove scalar predicates (unpredicate). Then, redundant superword memory references are eliminated by superword replacement. Finally, BOSCC instructions are generated wherever profitable according to the model described in Section 3.4. Among the nine passes, unroll, alignment / distance analysis, and parallelize (SLP) are taken from the original SLP compiler developed by Larsen and Amarasinghe [39] and modified to support our extensions.

This ordering of passes was selected primarily for implementation convenience, since we were building on the existing SLP compiler implementation. The SLP passes operate on the code at a low level, where it is difficult to reconstruct the loop structure and array access expressions.
Thus, superword-level locality analysis is applied prior to SLP, rather than afterward, as suggested by the examples in Figure 1.5. Superword replacement must follow SLP, which is the reason the components of the SLL algorithm are performed on either side of SLP. Note that both the SLP and SLL passes employ loop unrolling, but for different reasons. The unroll pass unrolls the innermost loop of a loop nest to convert loop-level parallelism into basic block-level parallelism. The SLL pass performs unroll-and-jam to expose locality in basic blocks. However, the loop that carries the most spatial locality at the superword level is often the loop that carries the most superword-level parallelism. Therefore, it is a reasonable choice to use the SLL pass to expose both parallelism and locality in the loop body while suppressing the unrolling originally performed by SLP. The code generation techniques described in Chapter 5 are implemented by extending parallelize (SLP), except for the reduction transformation, which is incorporated into the unroll pass to rename the unrolled copies of the reduction variable.

6.3 Experimental Methodology

Figure 6.2 illustrates the experimental flow. We evaluate six different versions of the codes: Baseline, MIT-SLP, SLP+SLL0, SLP-CF-S, SLP-CF-S+B, and SLP-CF+SLL1.

    Figure 6.2: Experimental flow.

Baseline is the original C or Fortran program that is the input to the compiler. MIT-SLP is compiled by the original MIT SLP compiler [39], represented by the three passes 2, 3 and 5 in Figure 6.1. SLP+SLL0 incorporates superword replacement (pass 8 in Figure 6.1) and packing in superword registers described in Chapter 4, as well as the code generation techniques described in Chapter 5, which are incorporated into passes 2 and 8. SLP-CF-S exploits SLP in the presence of control flow, represented by passes 4, 6 and 7 in Figure 6.1, in addition to all the optimizations exploited by SLP+SLL0. Similarly, SLP-CF-S+B exploits BOSCC (pass 9) in addition to all optimizations exploited by SLP-CF-S. SLP-CF+SLL1 exploits the unroll factors determined by SLL (pass 1) in addition to all the optimizations applied to SLP-CF-S+B.

Each output version is an optimized C program, augmented with special superword data types and operations [50]. The resulting code is compiled by a GCC (version 2.95.2) backend which has been modified to support superword data types and operations for the PowerPC AltiVec [61]. The optimized programs are executed on a 533 MHz Macintosh PowerPC G4, which has a superword register file with 32 128-bit registers, a 32 KByte L1 cache and a 1 MByte L2 cache. All programs are compiled by the extended GCC backend with optimization flag -O3.

6.4 Overall Performance

Figure 6.3 shows the speedups of the five parallel versions with respect to Baseline. Each bar represents the corresponding version with the same name in Figure 6.2.
For 8 of the 14 programs, MIT-SLP performs worse than Baseline because of some overhead introduced by the SUIF compiler passes leading up to SLP, particularly its code transformations related to decomposing program constructs. This overhead is not inherent to the SLP approach, and we believe it could be eliminated with tuning of the SUIF passes. Nevertheless, since it is not identifying parallelism across basic block boundaries, the best result we could hope for from the SLP compiler is no change from the sequential performance unless there is parallelism within the basic block. While the reduction sum operation in GSM can be parallelized, it appeared as a data dependence to the original SLP compiler and remained unparallelized. The speedups range from 0.61 to 5.15.

    Figure 6.3: Overall speedup breakdown (large data).

When our code generation techniques and the two SLL optimizations (superword replacement and packing in superword registers) are applied, SLP+SLL0 speeds up dramatically for the first four kernels. However, the other 10 benchmarks are not improved much. Other than GSM, we observe that the SLP+SLL0 results, for the eight benchmarks with control flow, do not speed up at all over sequential execution, and for Max show a significant degradation. The main reason for this is that SLP+SLL0 is unable to exploit any parallelism in the presence of control flow. The analyses and transformations in SLP-CF-S are crucial to exploiting superword-level parallelism in these codes. SLP-CF-S, exploiting SLP across basic block boundaries, yields a speedup compared to SLP+SLL0 for the eight benchmarks with control flow, while there are almost no changes for the first six benchmarks. When BOSCC instructions are exploited in addition, SLP-CF-S+B achieves further speedups for Chroma and TM. The last bar, representing SLP-CF+SLL1, shows additional improvements in Figure 6.3 for seven of the 14 benchmarks, depending on the amount of data reuse. Overall, when all techniques are combined, Figure 6.3 shows speedups ranging from 1.05 to 11.36.

Cache effects can limit the performance benefits of parallelization for memory-bound computations. To demonstrate the potential of parallelization, Figure 6.4 shows the same graph for the eight benchmarks with control flow using small data set sizes.

    Figure 6.4: Overall speedup breakdown (small data).

The speedups for seven of eight benchmarks improve, in the case of Chroma from 5.95 to 19.22. The overall speedups range from 2.18 to 19.22. From these results, we can see that cache optimizations are even more valuable when codes are parallelized. Since cache optimizations are usually applicable to multimedia codes, optimizations such as prefetching and tiling should be used in conjunction with parallelization. In the next three sections, we present
the isolated benefits of exploiting prepacking, SLP in the presence of control flow, and superword-level locality, respectively.

6.5 Packing for Low Parallelization Overhead

Section 5.4 describes a technique called prepacking that leads to better packing decisions in terms of overall parallelization overhead. For prepacking, isomorphic data dependence graphs are packed instead of isomorphic instructions. To evaluate the effects of prepacking, Figure 6.5 compares the performance of three versions. Baseline and SLP-CF+SLL1 are the same as in Figure 6.3. NO-PREPACK represents the version compiled without prepacking. When prepacking is not in use, the original packing algorithm is used [39].

    Figure 6.5: Effect of prepacking.

For nine out of 14 benchmarks, the two packing algorithms result in roughly the same performance. For the other five benchmarks, however, prepacking achieves improvements. For both NO-PREPACK and SLP-CF+SLL1, the main loop body of FIR is almost completely parallelized. However, the number of C statements in the parallelized main loop body has decreased from 308 in NO-PREPACK to 182 in SLP-CF+SLL1, because the instructions necessary to shuffle data elements are reduced. Similarly, the number of superword instructions has shrunk in Sobel and GSM. For TM, the main loop of NO-PREPACK has 28 independent BOSCC regions, each of which contains two to four superword instructions to bypass. For SLP-CF+SLL1, it has only four BOSCC regions, containing from 16 to 21 superword instructions. By packing data dependence graphs rather than individual instructions, a large number of instructions are packed at once, resulting in more instructions guarded by each superword predicate. In this case, better packing decisions contribute not only to low parallelization overhead but also to bigger BOSCC regions for each superword predicate, making the BOSCCs more beneficial. MPEG2 is of special interest because it is parallelized only when prepacking is used. This is because all memory references are unaligned in MPEG2. Since the original SLP algorithm packs only aligned memory references, prepacking is essential in parallelizing MPEG2.

6.6 SLP in the Presence of Control Flow

To evaluate the benefits of supporting SLP in the presence of control flow, this section focuses on the eight benchmarks containing at least one conditional statement in a loop body parallelized by the compiler in Figure 6.3 and Figure 6.4. For each of the benchmarks, we compare the speedups of two versions, SLP+SLL0 and SLP-CF-S, using two different data set sizes. For the large data set sizes of Figure 6.3, the speedups achieved by SLP-CF-S range from 1.25 to 2.59 for the eight benchmarks over Baseline, with an average of 1.78. Most benchmarks show significantly increased speedups for the smaller input sizes, ranging from 1.97 to 15, with an average of 5.18. These results suggest that exploiting cache optimizations and SLP in the presence of control flow together may result in much better performance for large data sets.
The SLP-CF-S versions of Chroma, Sobel, and EPIC-unquantize effectively exploit the parallelism available in these benchmarks, yielding speedups of more than 6.21. In particular, the speedup of 15 on Chroma arises because the data type size of the operands is 8 bits, which results in 16 operations on 8-bit objects per superword operation. TM, Max, TR, MPEG2-dist1 and GSM-Calculation show more modest speedups. MPEG2-dist1, TM and GSM-Calculation have a reduction. In MPEG2-dist1, the initialization and finalization of the reduction remain inside the loop body, since the reduction variable is used as the test for loop exit. Sobel and TM show a performance loss due to unaligned memory accesses. We also observe that for the provided input data set size, TM has a very low number of true values for the branch parallelized by SLP-CF-S. While in sequential execution the code would branch around the core computation, in SLP-CF-S it must perform the computation on every iteration and merge with prior results using a select operation. This additional computation over sequential execution reduces the benefits of parallelization. The computation in GSM-Calculation is not fully parallelized due to a scalar dependence, but a set of statements between the control flow constructs, representing a loop that was manually unrolled, is parallelized by both SLP+SLL0 and SLP-CF-S. Even though the code within the control flow construct is not parallelized, the use of predication allowed our compiler to exploit parallelism across what would have been multiple basic blocks, resulting in a slightly higher speedup for SLP-CF-S.

The SLP-CF-S approach presented in this section has demonstrated fairly significant speedups on eight multimedia benchmarks for which the SLP compiler was unable to exploit parallelism. The performance gain for superword-level parallelization in the presence of control flow depends on a number of factors, related to both the underlying architecture and the input data set. The AltiVec ISA does not support a full set of general operations for all possible types. As examples, 32-bit integer multiplication, unpacking of unsigned integers, and division are not directly supported in the ISA, requiring additional instructions. For 16-bit multiplies, vec_mule and vec_mulo multiply even or odd numbered elements, respectively, in superword registers, producing two superwords to promote the results to 32 bits. These even and odd multiplications shuffle the data elements, breaking the spatial adjacency of data elements and requiring additional instructions to reorganize the results. Bitwise selection causes another problem, in conjunction with the inconsistency between scalar boolean values and superword boolean values. In some cases, the SLP compiler may pack scalar boolean variables into a superword. Since the result of a scalar comparison is either 0 or 1 instead of a vector of all 0s or all 1s, the superword select can be incorrect if scalar boolean variables are packed into a superword and used in selects. As discussed in [62], different instruction set features supporting conditionals impact performance.
In the AltiVec, the general mechanism of select operations requires executing instructions along all control flow paths and merging the results. When compared to sequential execution, where branches around code constructs may reduce the operation count, there is a tradeoff between parallelism and code with fewer branches versus less overall computation. In examples such as TM, where the number of branches taken is large, this can limit performance improvement. To reduce parallelization overhead in such cases, we can bypass parallel codes using a special instruction, described in the next subsection.

6.6.1 Branch-On-Superword-Condition-Code (BOSCC)

We use BOSCC instructions to reduce parallelization overhead in the presence of control flow, as described in Section 3.4. In this subsection, we isolate the benefits of using BOSCC instructions and investigate their characteristics. Figure 6.6 shows our implementation inside the thick dashed box, which is based on SLP-CF, incorporating the SLP compiler and our control flow extension.

    Figure 6.6: An SLP-based compiler that supports BOSCC.

Since our profitability model of BOSCC instructions relies on profile information, the implementation runs in two phases. In the first run, it generates instrumented code, which is then compiled by an AltiVec-extended GCC and linked to a library that supports the generation of a PAFS file (see Section 3.4.2). In the second run, the predicates in the source code are annotated with PAFS values produced in the profiling run. Based on the PAFS values, our BOSCC model determines the profitability of each BOSCC instruction.

Figure 6.8 shows speedup curves for the eight benchmarks with control flow in Table 6.1. Each graph shows the speedups of three parallel versions of a benchmark, SELECT, BOSCC-N and BOSCC-M, with respect to the sequential version of the benchmark. SELECT is the same as SLP-CF-S in Figure 6.2, and the BOSCC-N (Naive BOSCC) version is derived by inserting a BOSCC instruction in all possible BOSCC regions. In the BOSCC-M (Model-based BOSCC) version, the model described in Section 3.4.2 is used to evaluate the profitability of inserting BOSCC instructions.

Figure 6.7 shows an example code segment taken from the parallelized EPIC code. The code segment contains two BOSCC regions of consecutive instructions: three instructions in the first region and four instructions in the second.

    (a) BOSCC-N:
        boscc = vec_any_ne(v4, vzero);
        if (boscc == 1) {
            v5 = vec_sub(v6, v7);
            v8 = vec_cts(v5, 0);
            v1 = vec_sel(v1, v8, v4);
        }
        boscc1 = vec_any_ne(v9, vzero);
        if (boscc1 == 1) {
            v10 = vec_cmplt(v11, v12);
            v2 = vec_nor(v10, v3);
            v13 = vec_sel(v13, v10, v9);
            v3 = vec_sel(v3, v2, v9);
        }

    (b) BOSCC-M:
        boscc = vec_any_ne(v4, vzero);
        if (boscc == 1) {
            v5 = vec_sub(v6, v7);
            v8 = vec_cts(v5, 0);
            v1 = vec_sel(v1, v8, v4);
        }
        v10 = vec_cmplt(v11, v12);
        v2 = vec_nor(v10, v3);
        v13 = vec_sel(v13, v10, v9);
        v3 = vec_sel(v3, v2, v9);

    Figure 6.7: Example: BOSCCs generated in EPIC.

One BOSCC is generated for each of the two superword predicates v4 and v9 in BOSCC-N, shown in Figure 6.7(a), whereas in BOSCC-M, shown in Figure 6.7(b), the second BOSCC is not generated.
While BOSCC-N has generated BOSCC instructions without considering PAFS values, BOSCC-M has generated a BOSCC instruction only for v4 in this example, because the PAFS values for v4 and v9 are 82 % and 0 %, respectively.

    Figure 6.8: Speedups over scalar version for real data. Panels: (a) TM, (b) Chroma, (c) Max, (d) Sobel, (e) TR, (f) MPEG2-dist1, (g) EPIC-unquantize, (h) GSM-Calculation; each panel compares SELECT, BOSCC-N and BOSCC-M.

Figure 6.8(a) shows the speedups of TM for each of the 72 templates of the kernel's input data set, for versions SELECT, BOSCC-N and BOSCC-M. The speedup of BOSCC-N varies with the input data sets, since the true density varies from template to template. The BOSCC-M version also has a BOSCC instruction for all templates, and therefore the speedups are the same as those of BOSCC-N. Figure 6.9 shows that the speedup curve of the BOSCC versions closely matches the percentage of taken BOSCC branches for each template.

    Figure 6.9: TM: % taken BOSCCs.

Although not shown in the figure, the speedups of SELECT follow the inverse of the percentage of taken BOSCC branches, because the run time of the sequential baseline is affected by the PAFS while that of SELECT is not.

The speedups of the parallel versions of Chroma are shown in Figure 6.8(b). The horizontal axis corresponds to the ratio between the sizes of the foreground object and the background image in the input data set (both the size and shape of the foreground object affect the true density of the input data). Since in Chroma a BOSCC branch is taken when all pixels in a superword are outside the foreground object, the speedups corresponding to smaller foreground objects are larger, as expected. In SELECT, the runtime does not vary with the true densities, but there is a small speedup due to the fact that in the sequential version the body of the conditional is executed more often as the true density increases. BOSCC-M follows the better of the SELECT and BOSCC-N speedups for most input data sets. The few exceptions are caused by a simplification in our model, where we assume that the cost of executing a BOSCC instruction is the same as that of any other instruction. In general, branch instructions cost more than arithmetic and logical instructions as the percentage of taken BOSCCs approaches 50 %. The BOSCC model makes the right decisions around 0 % and 100 %, but it tends to make wrong decisions in between the two ends when the performance margin between the two versions with and without BOSCC is small.

The speedups of Max are 1.26 for SELECT and 1.22 for BOSCC-N, as shown in Figure 6.8(c). In BOSCC-N, each BOSCC body contains a single instruction:

    max = select(max, new_value, compare);

We expected GCC to generate a BOSCC instruction for the region associated with the select instruction.
However, the GCC version we use generates code such that the select instruction is always executed and a new copy instruction is added after the BOSCC, possibly because the destination variable (max) is live across the iterations of the innermost loop. Thus BOSCC-N has two extra instructions, a BOSCC instruction and an extra copy instruction, resulting in a slowdown with respect to SELECT. When this problem is corrected manually at the assembly level, by removing the copy instruction and moving the BOSCC ahead of the select instruction, the new BOSCC-N performs better than SELECT.

For the BOSCC-N version of Sobel, a BOSCC instruction is generated for four BOSCC regions containing 2, 2, 1, and 1 instructions, respectively, yielding the same performance as the SELECT version. The PAFS for each BOSCC region are 17 %, 4 %, 2 % and 82 %, respectively. Since the PAFS values are either high (82 %) or low (17 %, 4 % and 2 %), the cost of BOSCC instructions is reduced. Also, large memory latencies play a role in this result by overlapping with the BOSCC latency. If we reduce the memory latencies by using the small data set, BOSCC-N slows down by 10 % with respect to SELECT. No BOSCC instructions are generated for the BOSCC-M version. The speedups of the parallel versions with respect to the sequential baseline are 2.59 for all three versions, as shown in Figure 6.8(d).

For TR, BOSCC-N performs slightly worse than SELECT, as shown in Figure 6.8(e), again because the only BOSCC region in the kernel contains a single instruction. In addition, since the BOSCC instruction is never taken, the hardware branch predictor performs well.

The BOSCC-N version of MPEG2-dist1, shown in Figure 6.8(f), has 16 BOSCC instructions, generated for 4 basic blocks. Each BOSCC region consists of two instructions, and the PAFS ranges from 30 to 40 % for all BOSCCs, increasing their costs. Thus the BOSCC-M version does not have BOSCC instructions.

EPIC-unquantize, shown in Figure 6.8(g), is interesting because the BOSCC-M version outperforms both SELECT and BOSCC-N. While BOSCC-N has seven BOSCC instructions, BOSCC-M has only four BOSCCs, associated with the four BOSCC regions with the highest number of instructions and PAFS. As a result, while SELECT performs worse than the baseline and BOSCC-N achieves a negligible improvement, the BOSCC-M version speeds up by 1.12.

As discussed in Section 6.6, the parallelized main loop of GSM-Calculation, shown in Figure 6.8(h), does not have any select instruction, because no instructions guarded by conditional statements in the sequential code are parallelized. However, six BOSCC instructions are generated in BOSCC-N for another loop nest. Since the PAFS values for all BOSCC regions are less than or equal to 10 %, BOSCC-N slowed down compared to SELECT. For the same reason, BOSCC-M does not have any BOSCC instruction generated, and its performance is the same as SELECT.

    Figure 6.10: Speedups over scalar version for randomly generated data. Panels: (a) TM, (b) Chroma, (c) Max; each panel plots SELECT, BOSCC-N and BOSCC-M against PAFS (%).
To further investigate how the performance of the BOSCC-M versions varies with the input data set, we used a random number generator to derive synthetic data sets with PAFS from 0 % to 100 % for TM, Chroma and Max. Figure 6.10 shows the speedups of the SELECT, BOSCC-N and BOSCC-M parallel versions of these three kernels with the synthetic data sets. For all three kernels, the speedup of SELECT decreases as the PAFS increases, because the sequential version performs better when the scalar branches are taken more often. In general, BOSCC-N runs increasingly faster than the sequential version as the PAFS increases. This is because the BOSCC-N versions skip superword instructions, each of which corresponds to SWS scalar instructions (see Section 2.3.3). Mild slopes in the lower half of the PAFS range are due to the branch prediction mechanism of the machine. Finally, BOSCC-M usually performs as well as the better of the two other versions, except for a small range of PAFS values, again due to our model's simple assumption for the cost of a branch.

6.7 Superword-Level Locality

The SLL algorithm described in Chapter 4 uses compiler-controlled caching in superword registers to reduce memory accesses. In Section 6.2, we described an implementation that incorporates superword-level locality optimizations into an existing compiler exploiting
As compared to the de fault unroll am ount, the Unroll+SLP-CF versions achieve huge performance gains for most benchmarks by exploiting the unroll am ounts determ ined by the SLL algorithm in addi tion to the code generation techniques of C hapter 5. Investigation of the low speedups in 114 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. MIT-SLP Unroll+SLP-CF Unroll+SLP-CF+SWR Unroll+SLP-CF+SW R+RP Figure 6.11: Speedups over MIT-SLP. MMM and TM revealed th a t both had a severe register spilling. The register allocation algo rithm in the GCC backend compiler is not optim al and tends to make worse register alloca tions for the bigger basic blocks. A lthough the num ber of superword registers used by the generated C codes is less th an the available superword registers, the register spills occur because of the register allocation algorithm . We expect th a t optim al register allocators can eliminate the unnecessary register spilling [27]. W hen redundant memory references are removed by superword replacem ent for MMM and TM, register spilling also decreases achieving large speedups over MIT-SLP. For eight benchmarks, Unroll+SLP-CF+SWR shows significant improvements. Further speedups are achieved for three benchm arks when packing in superword registers is applied in Unroll+SLP-CF+SWR+RP. The other bench marks do not have the opportunities for packing in superword registers. Consideration of tomcatv and swim shows th a t both programs have little tem poral reuse, although there is a small amount of spatial reuse th a t is exploited by our approach, particularly in 115 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. tom catv. We also observe additional superword-level parallelism due to index set split ting, m otivated by the need to create a steady-state loop where th e data is aligned to a superword boundary. In summary, th e SLL techniques presented in C hapter 4 dram atically reduce the num ber of m em ory accesses and yield significant perform ance improvements across these 14 program s. Thus, this section has dem onstrated the value of exploiting locality in superword registers in architectures th at support superword-level parallelism such as the AltiVec. 6.8 S um m ary In this chapter, we presented the im plem entation of the techniques described in C hapter 3, 4, and 5. In evaluation of the im plem entation on 14 benchm arks, speedups ranged from 1.05 to 19.22 over the sequential input programs. To identify the factors contributing to the overall perform ance improvement, further experim ents were performed focusing on in dividual techniques. O ur extension to exploit SLP in the presence of control flow enabled speedups of 1.97 to 15 over the sequential input program s on 8 benchmarks. This is a dram atic improvement, considering without this extension no performance improvement was observed for 7 of 14 benchmarks. We also evaluated our BOSCC-based algorithm to reduce parallelization overhead in the presence of control flow. On three out of eight benchmarks, BOSCC instructions have been used to achieve further speedups. Moreover, the profitability model to insert BOSCC instructions closely estim ates the actual profit. T he im plem entation of the SLL algorithm is also evaluated on the 14 benchmarks. Com paring to the original SLP compiler, our im plem entation achieves speedups from 1.40 to 8.69 removing a m ajority of memory references. 
Chapter 7

DIVA AND PIM-SPECIFIC OPTIMIZATIONS

DIVA is a Processing-In-Memory (PIM) embedded DRAM device that supports superword-level parallelism (SLP). Thus, the two algorithms described in Chapter 3 and Chapter 4 are also applicable to DIVA. In this chapter, we focus on DIVA and PIM-specific issues and optimizations. One such optimization is to exploit a DRAM memory characteristic, called page-mode, automatically. A page-mode memory access exploits a form of spatial locality, where the data item is in the same row of the memory buffer as the previous access. Memory access time is reduced because the cost of row selection is eliminated. The algorithm increases the frequency of page-mode accesses by reordering data accesses, grouping together accesses to the same memory row.

The DIVA architecture is described briefly in Section 1.5.2. In the next section, we describe the instruction set architecture (ISA) features specific to the DIVA processor. In Section 7.2, we introduce a compiler optimization that exploits page-mode memory accesses in DIVA and present the experimental results on four data intensive kernels. This experiment is described separately from those in Chapter 6 because it is performed on a DIVA simulator instead of the PowerPC G4. In Section 7.3, we discuss code generation issues specific to the DIVA ISA. In Section 7.4, we present a preliminary experimental result on a prototype DIVA system. Section 7.5 summarizes this chapter.

7.1 The DIVA ISA

DIVA supports a wide range of superword instructions for the superword datapath in addition to ordinary scalar instructions. The intent of the superword datapath is to process objects aggregated within a row of the local memory array by operating on 256 bits in a single processor cycle. This fine-grained parallelism offers additional opportunity for exploiting the increased processor-memory bandwidth available in a PIM. The superword functional unit can perform bit-level operations, such as simple pattern matching, or higher-order computations such as searches and reduction operations.

    Figure 7.1: The superword data flow. (The superword register file feeds a 256-bit superword ALU, with permutation, arithmetic, multiplier and logical units, under execution control.)

The superword data flow is shown in Figure 7.1 and has several features that distinguish it from other multimedia extension architectures. First is the ability to support conditional execution of instructions on sub-fields within a superword, depending on the state of local condition codes [9]. Although similar designs support some type of conditional operation, the DIVA superword functional unit provides much richer functionality through the ability to specify conditional execution in almost every superword instruction and the use of global condition code information in selection decisions. Second, even for applications where the superword operations are not applicable, the superword datapath can be used to accelerate memory access time and communication. Contiguous data required for the scalar or floating point datapaths can be loaded into (stored from) a superword
register, and transferred directly to (from) the other register files at a small fraction of the scalar memory access latency. Third, because there is no data cache, exploiting the large capacity of the superword register file (1 KB) as described in Chapter 4 is even more important. Finally, the superword datapath is integrated into the communication mechanism, transferring data to/from the local communication buffer; this allows entire communication packets to be read or written in only one operation. Conditional execution, direct transfers to/from other register files (only in SSE), integration with communication, as well as the ability to access main memory at very low latency, distinguish the DIVA superword capabilities from multimedia ISA extensions such as SSE and AltiVec, as well as from subword parallelism approaches such as MAX [42].

7.2 Page-Mode Memory Access

Accessing data within a DRAM macro consists of two steps. First, the entire row containing the data is copied into the DRAM open-row buffer. Then, the desired data is accessed from the buffer. This mode of DRAM access, requiring both row and column accesses, is called random-mode. However, most DRAM modules support a more efficient page-mode access, where a memory access to a location currently in the DRAM open-row buffer fetches the data directly from that buffer, eliminating the cost of fetching the row from the DRAM array. To fully exploit lower-latency page-mode accesses, the user or the compiler must reorganize the computation so that accesses to the same memory row are grouped together, with no intervening accesses to other rows. Exposing opportunities for grouping accesses to the same array may require transformations such as unroll-and-jam, to bring accesses issued in distinct loop iterations into the body of the transformed loop, and statement reordering, to group the memory accesses.

(a) Original:

    for (i = 0; i < n; i++) {
        for (j = 0; j < m; j++) {
            load A[j][i]
            load B[i]
        }
    }

(b) After unroll-and-jam and reordering:

    for (i = 0; i < n; i += 4) {
        for (j = 0; j < m; j++) {
            load A[j][i]
            load A[j][i+1]
            load A[j][i+2]
            load A[j][i+3]
            load B[i]
            load B[i+1]
            load B[i+2]
            load B[i+3]
        }
    }

Figure 7.2: Unroll-and-jam and reordering.

Recent research has proposed to exploit page-mode accesses through manual code transformations [51, 47, 14]. This section presents a compiler algorithm for exploiting page-mode automatically. Although the proposed compiler algorithm is applicable to other embedded DRAM systems, we describe the algorithm from the viewpoint of DIVA. In Chapter 4, we presented an algorithm for exploiting locality in superword registers. In this section, we show that with a similar approach we can also exploit spatial locality in the page of a DRAM memory array. The remainder of this section is organized as follows. Section 7.2.1 motivates our approach using a simple example. Section 7.2.2 introduces our algorithm for exploiting page-mode memory accesses. Section 7.2.3 presents experimental results on a set of four multimedia kernels.

7.2.1 Motivation

Figure 7.2 illustrates the benefits of page-mode accesses using a simple loop nest with two array references. Assuming that the sizes of arrays A and B are larger than the DRAM's
open-row buffer, all array references in Figure 7.2(a) are in random-mode, since reference B[i] displaces the DRAM row containing A[j][i] from the open-row buffer and vice-versa. For the same number of memory accesses in this loop nest, we can increase the page-mode memory accesses by applying a series of code transformations, as shown in Figure 7.2(b). First, unroll-and-jam is used to create opportunities for page-mode accesses by moving array references from successive iterations of the outer loop into the body of the transformed inner loop. In the example, unroll-and-jam is used to unroll the outer i loop and fuse together the resulting inner j loop bodies. Next, accesses to the same memory page in the loop body may be grouped together by reordering the memory accesses in the transformed loop body, if the reordering does not violate data dependences. In Figure 7.2(b), where the i loop is unrolled by a factor of 4, references to the same array (A or B) in the body of the transformed loop are grouped together. This results in page-mode accesses for all references in the loop body, except the leading references A[j][i] and B[i], which are in random mode.

(a) Original

    Ref.      Loop j           Loop i
    A[j][i]   m * RM latency   n * m * RM latency
    B[i]      m * RM latency   n * m * RM latency
    Total                      2 * n * m * RM latency

(b) After unroll-and-jam and reordering

    Ref.        Loop j           Loop i
    A[j][i]     m * RM latency   (n/4) * m * RM latency
    A[j][i+1]   m * PM latency   (n/4) * m * PM latency
    A[j][i+2]   m * PM latency   (n/4) * m * PM latency
    A[j][i+3]   m * PM latency   (n/4) * m * PM latency
    B[i]        m * RM latency   (n/4) * m * RM latency
    B[i+1]      m * PM latency   (n/4) * m * PM latency
    B[i+2]      m * PM latency   (n/4) * m * PM latency
    B[i+3]      m * PM latency   (n/4) * m * PM latency
    Total                        (n/2) * m * RM latency + (3n/2) * m * PM latency

Table 7.1: Memory latency computation.

Table 7.1 shows the total memory access cost for the code in Figures 7.2(a) and (b), assuming that accesses do not go through a cache. Assuming that random-mode latency is three times the page-mode latency, as in [33], loop (a) has a total latency cost of 6 * n * m * PM latency (since 2 * n * m * RM = 6 * n * m * PM), while (b) has a cost of 3 * n * m * PM latency, a factor-of-2 difference in overall memory latency. This example shows the potential for improving performance in embedded DRAM devices through the above code transformations. To expose opportunities for page-mode accesses by applying unroll-and-jam and memory access reordering, a compiler algorithm must: (1) determine the safety of these code transformations and select a loop for which unrolling is profitable; (2) select an unroll factor that increases page-mode accesses while not causing register spilling; and (3) transform the code to reorder the memory accesses. In the next subsection we present our compiler algorithm for exploiting page-mode accesses, which includes these three steps.

7.2.2 The Page-Mode Memory Access Algorithm

    1. Select a loop to unroll
    2. Control register pressure
    3. Align the loop to page boundaries
    4. Unroll-and-jam
    5. Reorder memory accesses

Figure 7.3: The page-mode memory access algorithm.

In this subsection, we introduce a compiler algorithm for exploiting page-mode memory accesses. Our algorithm is applicable to loop nests with array references in the loop body, where the array subscript expressions are affine functions of the loop index variables.
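For concreteness, the following compilable C sketch renders the transformed loop nest of Figure 7.2(b). The multiply-accumulate body is hypothetical, since the figure shows only the access pattern; what matters is that the accesses to each array are issued in grouped, ascending order. We assume n is a multiple of 4 and the arrays are page-aligned, per the assumptions listed below.

    /* Hypothetical kernel with the grouped access order of Figure 7.2(b).
       n is assumed to be a multiple of 4. */
    double grouped_accesses(int n, int m, double A[m][n], double B[n])
    {
        double sum = 0.0;
        for (int i = 0; i < n; i += 4) {        /* unrolled outer loop */
            for (int j = 0; j < m; j++) {       /* jammed inner loop   */
                /* accesses to one row of A grouped together:
                   the first is random-mode, the next three page-mode */
                double a0 = A[j][i],   a1 = A[j][i+1];
                double a2 = A[j][i+2], a3 = A[j][i+3];
                /* accesses to B grouped together likewise */
                double b0 = B[i],   b1 = B[i+1];
                double b2 = B[i+2], b3 = B[i+3];
                sum += a0*b0 + a1*b1 + a2*b2 + a3*b3;
            }
        }
        return sum;
    }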
Only array accesses are reordered by the algorithm, since it is difficult to determine whether two scalar accesses are on the same memory page. For presentation purposes, we make the following simplifying assumptions:

1. Array objects are aligned at memory page boundaries.
2. The lowest dimension sizes of array objects are multiples of the memory page size.
3. The compiler backend does not change the memory access order generated by the algorithm.

Some of these assumptions can be removed by modifying the compiler backend (1, 3) or by padding array objects (2). The algorithm presented in this subsection unrolls a single loop in a loop nest, since in practice unrolling more than one loop could create register pressure and instruction cache misses. A set of heuristics is used to select which loop to unroll and its unroll amount. These heuristics result in a fast algorithm that is effective for the benchmarks presented in Section 7.2.3. In Chapter 4, we present an algorithm for exploiting superword-level locality (SLL) which uses unroll-and-jam to expose data reuse, and unrolls multiple loops in a nest. However, the SLL algorithm cannot be used as is to exploit page-mode memory accesses. Assuming that the SLL algorithm has been applied a priori and focusing on the goal of exploiting page-mode allows a much simpler algorithm that is also computationally cheaper. Figure 7.3 illustrates the steps of the algorithm, which are described in the remainder of this subsection. The first step selects which loop to unroll, after determining the safety of the code transformations (unroll-and-jam and statement reordering). The second step selects an unroll factor that increases page-mode accesses while not causing register spilling. The last three steps apply the code transformations to the loop nest.

Selecting a Loop To Unroll

The first step of the algorithm selects a loop to unroll, based on the number of random-mode memory accesses of the loop nest after applying unroll-and-jam. The algorithm uses data dependence information to determine the safety of unroll-and-jam. For each loop l in the loop nest, the algorithm computes the unroll amount X_l and its corresponding number of random-mode accesses R_l, such that R_l is the smallest number of random-mode memory accesses if l is selected to be unrolled (assuming that references to the same memory page can be grouped together). Then the algorithm compares the number of random-mode accesses of each loop in the nest and selects the loop with the smallest R_l. For each loop l, the smallest unroll amount that minimizes R_l is computed as in Equation 7.1:

    X_l = P / min_{a ∈ A} (T(a) * C(a, l))    (7.1)

where P is the memory page size, A is the set of array references in the loop nest that are loop-variant with l in the lowest dimension, a is an array reference in A, T(a) is the type size of a, and C(a, l) is the coefficient of the index variable l in the lowest-dimension subscript of a. After computing the unroll amounts, the algorithm computes the corresponding number of random-mode memory accesses R_l, with the goal of selecting the loop with the smallest R_l.
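A minimal sketch of the unroll-amount computation of Equation 7.1 follows; ArrayRef and its fields are hypothetical stand-ins for the compiler's internal representation. For a fixed candidate loop l, the page size P is divided by the smallest lowest-dimension stride T(a) * C(a, l) among the references in A.

    /* Hypothetical reference descriptor for a fixed loop l. */
    typedef struct {
        int type_size;   /* T(a): element size in bytes                  */
        int coeff;       /* C(a, l): coefficient of l in the
                            lowest-dimension subscript                   */
    } ArrayRef;

    /* Unroll amount X_l of Equation 7.1 for one loop l, given the set A
       of references that are loop-variant with l in the lowest dimension. */
    int unroll_amount(int P, const ArrayRef *A, int count)
    {
        int min_stride = 0;
        for (int k = 0; k < count; k++) {
            int stride = A[k].type_size * A[k].coeff;
            if (min_stride == 0 || stride < min_stride)
                min_stride = stride;
        }
        /* If no reference varies with l, unrolling l exposes no
           page-mode accesses; return 1 (no unrolling). */
        return (min_stride > 0) ? P / min_stride : 1;
    }

The loop ultimately selected is the one whose memory-page footprint yields the smallest R_l, as described next.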
For each loop l, the number of random-mode accesses R_l is computed as the number of distinct pages in the memory-page footprint of A, F_l(A, X_l) (assuming that the algorithm can group together references to the same page). In Chapter 4, we present the computation of the superword footprint of a set of array references in a loop nest, which consists of the number of distinct superwords accessed by the references, as a function of the unroll amounts. The memory-page footprint can be computed in a similar way to that of the superword footprint.

Controlling Register Pressure

After selecting a loop l to unroll, the algorithm adjusts the unroll amount of the selected loop to avoid register pressure and register spilling, which could offset the benefits of unroll-and-jam. In Chapter 4, we presented the computation of the number of registers required to keep the data accessed by the references in the loop nest after applying transformations for increasing locality in the superword register file. Here we adopt a similar approach to compute the number of superword registers required for the given unroll factor. The total number of registers required (TNR) to keep the data accessed in the loop nest is computed as the sum of the number of registers required for each group of uniformly generated references. If the total number of registers is larger than the number of available registers (NREG), the algorithm adjusts the unroll amount X_l by dividing it by the ratio of TNR and NREG:

    X_l = X_l / ⌈TNR / NREG⌉    (7.2)

Since the smallest type size is used in Equation 7.1, all references that have spatial reuse carried by loop l can fully exploit spatial reuse at the memory page level.

Aligning the Loop To Page Boundaries

If the starting addresses of the memory accesses in the unrolled loop body are not aligned to a page boundary, each set of memory accesses to the same array will have one additional random-mode access per iteration. In Chapter 4, we applied index set splitting to reduce the need for alignment operations. Here, we apply the same transformation to reduce these unnecessary random-mode accesses. To determine the split points, we use Equation 4.15, except that the superword size (SWS) is replaced by P/T, where P is the memory-page size and T is the type size of a representative array reference.

    Parameter             Value   Unit
    Random-mode latency   12      cycles
    Page-mode latency     4       cycles
    Page size             256     bytes

Table 7.2: DIVA simulation parameters.

(a) Unsorted:

    for (i = 32; i < N; i += 64) {
        load A[i + 0]    (RMA)
        load A[i + 32]   (RMA)
        load A[i + 8]    (RMA)
        load A[i + 40]   (RMA)
        load A[i + 16]   (RMA)
        load A[i + 48]   (RMA)
        load A[i + 24]   (RMA)
        load A[i + 56]   (RMA)
    }

(b) Sorted:

    for (i = 32; i < N; i += 64) {
        load A[i + 0]    (RMA)
        load A[i + 8]
        load A[i + 16]
        load A[i + 24]
        load A[i + 32]   (RMA)
        load A[i + 40]
        load A[i + 48]
        load A[i + 56]
    }

Figure 7.4: Sorting offset addresses.

Reordering Memory Accesses

Finally, the reordering step hoists loads to the top of the loop body and sinks stores to the bottom. While being hoisted or sunk, the loads and stores to the same array are grouped together and sorted by their offset addresses.
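A sketch of this reordering step over a flat list of the loop body's memory accesses follows; MemAccess and its fields are assumptions standing in for the compiler's IR, and the real pass first verifies that data dependences permit the reordering.

    #include <stdlib.h>

    typedef struct {
        int is_store;    /* 0 = load, 1 = store                      */
        int array_id;    /* identifies the array being accessed      */
        long offset;     /* constant offset within the unrolled body */
    } MemAccess;

    /* Loads before stores; within each class, group by array and sort
       by ascending offset. */
    static int cmp_access(const void *pa, const void *pb)
    {
        const MemAccess *a = pa, *b = pb;
        if (a->is_store != b->is_store)
            return a->is_store - b->is_store;
        if (a->array_id != b->array_id)
            return a->array_id - b->array_id;
        return (a->offset > b->offset) - (a->offset < b->offset);
    }

    void reorder_accesses(MemAccess *acc, size_t n)
    {
        qsort(acc, n, sizeof acc[0], cmp_access);
    }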
When there are unaligned array references even after aligning the loop, sorting the offset addresses can reduce the number of random-mode accesses. Figure 7.4 shows an example where a page holds 64 elements of array A. All eight memory accesses are in random mode before sorting. After sorting the offset addresses, only two random-mode accesses remain.

7.2.3 Experiments for the Page-Mode Memory Access Algorithm

Two prototypes of the DIVA PIM chip have been fabricated recently [21], but the complete DIVA system was not available for our experiments at the time of this writing. Therefore, we used a cycle-accurate DIVA simulator (DSIM) [21], which is modified from RSIM [53]. Table 7.2 shows the simulation parameters for the memory system, which closely match those of the IBM Cu-11 embedded DRAM macro [33]. In general, there can be multiple DRAM macros and multiple open pages in a single chip, but for our experiments we assume that only one memory page is open at any given time. We implemented the bulk of the algorithm presented in the previous subsection and integrated it into the Stanford SUIF compiler. The input to the modified SUIF compiler is a C program, and the output is a DIVA-extended C program which, in turn, is translated by the DIVA GCC backend. Table 7.3 shows the four kernels used to evaluate the effectiveness of the algorithm, a subset of the kernels from Chapter 6.

    Name   Description                      Input Size
    VMM    Vector-matrix multiply           64 elements
    MMM    Matrix-matrix multiply           64 elements
    YUV    RGB to YUV conversion            32K elements
    FIR    Finite impulse response filter   256-tap filter, 1K signal

Table 7.3: Benchmark programs.

Figure 7.5 shows the experimental flow. The main algorithm involves selecting unroll factors and performing unroll-and-jam and memory access reordering, and is represented by the hashed rectangles in Figure 7.5.

[Figure 7.5: Experimental flow for page-mode memory access. A C program is parallelized (SLP); register pressure control, alignment of the loop to page boundaries, unroll-and-jam, superword replacement (SWR), and memory access reordering then produce the SLP, UNROLL, and PMA versions as AltiVec-extended C programs, which are compiled by DIVA gcc and run on the DIVA simulator (DSIM).]

In Chapter 4, we selected unroll factors for unroll-and-jam that maximize reuse in superword registers; here, we use the unroll factors determined by the algorithm in Section 7.2.2, which are likely to be larger than in Chapter 4. In some sense, the optimizations for page-mode memory accesses are complementary to exploiting SLP and locality in superword registers, and the page-mode optimizations are difficult to isolate in our compiler. In fact, because the SLP and SLL optimizations reduce the number of memory accesses, we will see less benefit from the page-mode optimizations than if they were considered in isolation. We use as our baseline the SLP version of the code, with no unrolling beyond what is required to exploit parallelization of the innermost loop. The UNROLL version includes unroll-and-jam, where the loop selected by the algorithm in Section 7.2.2 is unrolled by the chosen amount, and inner loop bodies are fused together.
As compared to the baseline version, this version isolates the benefits of unroll-and-jam and superword replacement in terms of reduced memory accesses and less loop overhead. The PMA version reflects the performance improvements due to memory access reordering, yielding the full benefit of the optimizations for page-mode accesses. In these experiments, we used optimization level -O1 for the DIVA GCC backend rather than a higher level of optimization. This was required to avoid reordering of memory accesses in subsequent optimization passes, which occurs at higher levels of optimization. For all programs but YUV, the algorithm was able to unroll the selected loop by the unroll factor determined by Equation 7.1. For YUV, which references six distinct arrays, this unroll factor was too large and resulted in register spilling. The algorithm reduced the unroll amount by half and the register spilling was eliminated.

(a) VMM:

    for (i = 0; i < 64; i++)
        for (j = 0; j < 64; j++)
            for (k = 0; k < 64; k += 8) {
                load C[i][j]
                load B[i][k]
                load A[j][k]
                store C[i][j]
            }

(b) MMM:

    for (i = 0; i < 64; i++)
        for (j = 0; j < 64; j += 8)
            for (k = 0; k < 64; k++) {
                load C[i][j]
                load A[i][k]
                load B[k][j]
                store C[i][j]
            }

Figure 7.6: SLP versions of VMM and MMM.

We first consider how the optimizations for exploiting page-mode memory accesses impact memory stall time. Figure 7.7 shows the normalized execution times broken down into processor busy time and memory stall time, derived from simulation.

[Figure 7.7: Normalized execution time.]

The UNROLL version sees a significant reduction in both processor busy time (9% to 60%) and memory stall time (25% to 71%). The primary reason for this is that superword replacement has eliminated a large number of memory accesses, which not only reduces memory stall time,
For the PM A version, which reflects the same num ber of memory accesses as the UNROLL version, the percentages of page-mode accesses range from 63% to 87%. These results show th a t our algorithm has been successful at increasing the percentage of page-mode accesses and reducing the memory stall tim e. We now see how the approach im pacts the overall performance. Figure 7.9 shows the speedups for the SLP, UNROLL and PM A versions of Figure 7.5. Overall speedups as com pared to the SLP baseline range from 1.25 to 2.19. Most of this speedup comes from the 1.19 to 1.89 improvement from unroll-and-jam and superword replacement, as can be seen from the UNROLL version. The speedup of the PM A version over the UNROLL version ranges from 1.04 to 1.16. 130 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Figure 7.8: Percentage of page-mode Figure 7.9: Speedup breakdown, accesses. 7.3 D IV A -S p ecific C ode G eneration We described the DIVA ISA features in Section 7.1. In this section, we consider issues in generating code for DIVA. Although DIVA does not support predicated execution, almost all superword instruc tions can be executed conditionally. In C hapter 3, we described removing superword predicates by inserting select instructions. The same goal can be achieved by using con ditional execution. Given a predicated superword instruction, a special instruction is inserted to move the predicate to the mask register, which is referenced by the subse quent conditional execution. Figure 7.10 illustrates this using an example shown in (a). For comparison, we also show the code in (b), generated by the select algorithm described in Chapter 3. In (c), each predicated superword instruction is replaced with a sequence of two instructions, th a t is, one for setting the mask register and the other for conditional execution. Here, we observe an optim ization opportunity where th e later masfc-setting instruction is redundant if we can recognize th at the two superword predicate values are identical. The more instructions with the same predicate are collected, the more such m ask-setting instructions can be eliminated, leading to a larger performance benefit. In 131 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. for(i= 0; i<1024; i+ + ){ for(i= 0; i<1024; i+ = 4){ } } if(C[i] != 1){ a [i] = c[i]; b[i] = d[i]; v_comp = c[i:i+3] != (1,1,1,1); v_pT, v_pF = v_pset(v_comp); a[i:i+3] = select(a[i:i+3], c[i:i+3], v_pT) b[i:i+3] = select(b[i:i+3], d[i:i+3], v_pT) } (a) Original (b) Select instructions inserted for(i= 0; i<1024; i+ = 4 ){ v_comp = c[i:i+3] != (1,1,1,1); v_pT, v_pF = v_pset(v_comp); move to m ask register (v_pT); a[i:i+3] = cond_store(c[i:i+3]); move to m ask register (v_pT); b[i:i+3] = cond_store(d[i:i+3]); for(i= 0; i<1024; i+ = 4 ){ v.com p = c[i:i+3] != (1,1,1,1); v_pT, v_pF = v_pset(v_comp); move to m ask register (v_pT); a[i:i+3] = cond_store(c[i:i+3]); b[i:i+3] = cond_store(d[i:i+3]); } } (c) Conditional execution (d) Optim ized conditional execution Figure 7.10: Code generation for conditional execution in DIVA. Section 3.4, we described an algorithm th a t forms a largest region of superword instruc tions guarded by the same superword predicate. The similar algorithm can be used for this optimization. As discussed in Section 7.1, the support for d a ta transfer between different register files allows an optim ization, by which scalar memory latencies are reduced further. 
As discussed in Section 7.1, the support for data transfer between different register files enables an optimization by which scalar memory latencies are reduced further. In this optimization, we increase the number of instructions to reduce the latencies of scalar memory accesses. For example, replacing one scalar memory access with a pair of a superword memory access and a copy instruction will not be profitable, whereas the same optimization may be profitable when applied to two scalar memory accesses. Thus, a code generation issue is to find the right number of scalar memory accesses for this optimization to be profitable. Since AltiVec supports the general permutation instruction, one field of a superword register can be moved to any field of another register. The movement of the data fields is guided by a permutation vector that can be generated from an address dynamically, as discussed in Chapter 5. DIVA allows accessing pre-arranged permutation vectors using an index in a scalar register, in addition to general permutation. While these permutation instructions can be used for alignment operations and parallel reduction operations, their versatility invites further exploration of code generation techniques for applications such as matrix transpose and sorting.

7.4 Preliminary Bandwidth Demonstration

The DIVA processor described in Chapter 1 has been fabricated and is being integrated into a complete system. Currently, the second prototype of the DIVA chip is up and running in an Itanium2 server. In this section, we present a preliminary performance result demonstrating the data bandwidth of the DIVA processor. Figure 7.11 shows the kernel, called StreamAdd, that is used to measure performance in this section.

    float a[], b[], c[];
    for (i = 0; i < DATASIZE; i++)
        a[i] = b[i] + c[i];

Figure 7.11: StreamAdd.

Since there is no data reuse in this computation, and very little computation to hide memory latency, it is a useful benchmark for stressing the memory subsystem of an architecture. In this experiment, we compare the StreamAdd run times on DIVA to those on an Itanium2 processor. Table 7.4 shows the experimental settings for the two processors.

    Processor   Clock (MHz)   Operating System   Compiler (Optimization)
    DIVA        140           DIVA O/S           gcc 2.95.3 for DIVA (-O2)
    Itanium2    900           Linux              icc 8.0 (-O3)

Table 7.4: Experimental environments.
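For reference, the superword code for StreamAdd corresponds roughly to the following hand-written sketch in PowerPC AltiVec intrinsics (the DIVA backend is a GCC port extended with AltiVec support; the DIVA code is analogous but operates on 256-bit superwords, eight floats at a time). The arrays are assumed to be 16-byte aligned and the length a multiple of 4.

    #include <altivec.h>

    /* StreamAdd over 4-float superwords; a, b, c are assumed 16-byte
       aligned and n a multiple of 4. */
    void stream_add(float *a, const float *b, const float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            vector float vb = vec_ld(0, &b[i]);  /* load b[i:i+3]      */
            vector float vc = vec_ld(0, &c[i]);  /* load c[i:i+3]      */
            vec_st(vec_add(vb, vc), 0, &a[i]);   /* a[i:i+3] = vb + vc */
        }
    }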
It is better for smaller problem sizes, but as the problem sizes get larger, the way in which the system allocates memory leads to worse performance. Over all, in this experim ent we observe th a t the single DIVA execution time is com parable to th a t of the Itanium 2 execution time. 134 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 7.5 Sum m ary In this chapter, the issues specific to DIVA and PIM architectures are described. We presented a compiler algorithm th a t reduces random -m ode mem ory accesses. In an ex perimental evaluation of the algorithm on a cycle-accurate DIVA sim ulator, we obtain speedups ranging from 1.25 to 2.19 over the parallel baseline for four m ultim edia kernels. In addition, we presented a prelim inary experim ental result dem onstrating d a ta band width on a prototype DIVA system. We observe th at the perform ance of a single DIVA processor is com parable to th a t of the Itanium 2 processor. 135 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. C hapter 8 R E L A T E D W O R K In this chapter, we examine previous work related to each of our approaches and distin guish our research. Previous work related to our control flow extension, superword-level locality algorithm and a DIVA-specific optim ization is described in Sections 8.1, 8.2 and 8.3, respectively. In the last section, we summ arize this chapter. 8.1 E x p lo itin g SLP in th e P resen ce o f C ontrol Flow Some prior work has described autom atic parallelization for m ultim edia extensions [39, 37, 64, 15, 42, 8]. Two distinct approaches are used, th a t is, SLP [39, 37] and an adaptation of vectorization [64, 15, 42, 8]. Extending vectorization techniques for conditionals has been addressed [8, 64], but there is no prior work describing how to parallelize conditionals using an SLP approach. If-conversion is described in [4, 3]. Ferrante and Mace describe restoring control flow back from if-converted code [24]. However, their m ain focus is in generating a sequen tial code from parallel interm ediate representations. More recently, Park and Schlansker describe an if-conversion algorithm th a t is optim al in term s of the num ber of predicates used and the num ber of predicate defining instructions [55], which is the algorithm we use in our compiler. Vectorizing compilers targeting m ultim edia extensions should have a mechanism corresponding to our unpredicate unless if-conversion is applied selectively 136 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. only to the statem ents th a t will be parallelized. Mahlke describes a predicate CFG gener ator which restores the original control flow from a predicated hyperblock code [44]. We use his algorithm in the unpredicate algorithm when an instruction cannot be inserted into an existing basic block. Concepts similar to the select instruction have been described elsewhere [22, 62, 49]. Bik and et. al. used a technique called bit masking to combine definitions. However, their m ethod is lim ited to singly nested conditional statem ents [8]. Chuang et. al. directly generate phi-instructions from the CFG of a scalar code to address multiple-definition problem in architectures supporting predicated execution [16]. A phi-instruction is a scalar analog of the superword select instruction described in C hapter 3. 
Their approach is related to ours in th a t Park and Schlansker’s algorithm is also used to derive predicates for the phi-instructions. W hile phi-predication could be run as a pre-pass to SLP, th e code resulting from SLP would potentially contain rem aining scalar predicated instructions. In an architecture such as the AltiVec, efficient code generation of the predicated scalar instructions would require an algorithm akin to the unpredicate pass described here. Using phi-predication as opposed to full predication to parallelize conditionals in the SLP compiler is a topic of future research. Branch-on-superword-condition-code (BOSCC) is supported in the AltiVec G4 [50], DIVA [31, 21], and other architectures [7, 6]. The movemask instruction in Pentium can also be used for a similar purpose to BOSCC [34]. However, no prior work describes generating BOSCC instructions autom atically to reduce parallelization overhead of condi tionals. A vector flag population count instruction [46] can be used to change the control flow similar to BOSCC instructions in vectorized programs. However, the probability of taken BOSCCs decreases exponentially w ith the vector length, and the long vector length of vector machines reduces th e chances for the profitability of BOSCC instructions dramatically. 137 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 8.2 Superw ord-L evel L ocality For well over a decade, a significant body of research has been devoted to code transfor m ations to improve cache locality, m ost of it targeting loop nests w ith regular d ata access patterns [25, 12, 70, 71]. Loop optim izations for improving d ata locality, such as tiling, in terchanging and skewing, focus on reducing cache capacity misses. O f particular relevance to this thesis are approaches to tiling for cache to exploit tem poral and spatial reuse; the bulk of this work examines how to select tile sizes th a t eliminate b oth capacity misses and conflict misses, tuned to the problem and cache sizes [13, 17, 23, 26, 28, 29, 38, 66, 69, 58]. The key difference between our work and th a t of tiling for caches is th a t interference is not an issue in registers. Therefore, models th a t consider conflict misses are not appropriate. Further, our code generation strategy m ust explicitly manage reuse in registers. There has been much less attention paid to tiling and other code transform ations to exploit reuse in registers, where conflict misses do not occur, b u t registers m ust be explicitly named and managed. A few approaches examine m apping array variables to scalar registers [69, 11, 45]. Most closely related to ours is the work by C arr and Kennedy, which uses scalar replacem ent and unroll-and-jam to exploit scalar register reuse [10]. Like our approach, in deriving the unroll factors, they use a model to count the num ber of registers required for a potential unrolling to avoid register pressure, and they replace array accesses, which would result in memory accesses, with accesses to tem poraries th at will be put in registers by the backend compiler. Their search for an unroll factor is constrained by register pressure and another m etric called balance th a t m atches memory access tim e to floating point com putation time. 
Our approach is distinguished from all these others in that the model for register requirements must take spatial locality into account, we replace array accesses with superwords rather than scalars, and we also consider the optimizations in light of superword parallelism. There are several recent compilation systems developed for superword-level parallelism [39, 64, 15, 19, 5]. Most, including commercial compilers [68, 48], are based on vectorization technology [64, 19]. In contrast, Larsen and Amarasinghe devised a superword-level parallelization system for multimedia extensions [39]. None of these approaches exploits reuse in the superword register file.

8.3 DIVA-Specific Optimizations

Previous research has identified the benefits of exploiting page-mode DRAM accesses [51, 47, 14, 54, 30]. Moyer modeled memory systems analytically and developed a compiler technique called access ordering that reorders memory accesses to better utilize the memory system [51]. McKee et al. described the Stream Memory Controller (SMC), whose access-ordering circuitry attempts to maximize memory system performance based on the device characteristics [47]. Their compiler is used to detect streams, but access ordering and instruction issue are determined by the hardware. Chame et al. manually optimized an application for the DIVA system [14] by applying loop unrolling and memory access reordering to increase the number of page-mode accesses. Panda et al. have developed a series of techniques to exploit page-mode DRAM access in high-level synthesis [54]. Their techniques include scalar variable clustering, memory access reordering, hoisting and loop transformations. While their ASIC design was able to exploit page-mode memory access, they do not describe an algorithm for automatic code generation. Grun et al. have optimized a set of benchmarks to better utilize efficient memory access modes for their IP-library-based Design Space Exploration [30]. However, their focus was on accurate timing models of the hardware system description. Our research on exploiting page-mode memory access is distinguished from previous research by the design and implementation of a compiler algorithm that exploits page-mode automatically. Although the experiments are performed on a PIM-based system [31], this compiler framework is applicable to embedded-DRAM systems and can also be used as a preprocessor for high-level synthesis.

8.4 Summary

This chapter described previous work related to the approaches described in this thesis. Our SIMD parallelization in the presence of control flow is distinguished by its applicability to arbitrary acyclic control flow graphs and by the two optimizations to reduce parallelization overheads. The superword-level locality algorithm is the first approach that exploits superword register files as a compiler-controlled cache. Our page-mode algorithm for embedded DRAM devices is the first compiler approach that exploits page-mode memory accesses automatically.

Chapter 9

CONCLUSION

Multimedia extension architectures have been around for the last decade.
Yet, compilers that automatically map sequential applications to exploit the SIMD parallelism of such architectures are relatively new. Although multimedia extensions differ from conventional vector processors in many aspects, most existing commercial and research compilers are based on the loop-level parallelization techniques used for conventional vector machines. More recently, a new approach that exploits superword-level parallelism (SLP) was proposed, specifically targeting multimedia extension architectures [39]. This thesis has extended the SLP compiler approach by addressing two important open issues: how to exploit SLP in the presence of control flow, and how to use superword register files as a compiler-controlled cache. For DIVA, which is a processing-in-memory architecture, we have described a DIVA-specific optimization that automatically exploits a faster DRAM access mode, called page-mode. In the next section, we describe our contributions by summarizing each technique, and in Section 9.2 we describe our future work.

9.1 Contributions

This thesis makes the following contributions.

9.1.1 SLP in the Presence of Control Flow

Control flow is common in the core computation of multimedia applications. However, the SLP compiler cannot exploit parallelism across basic block boundaries. This thesis has extended the SLP compiler to exploit SLP in the presence of control flow. A key insight is that we can use techniques related to optimizations for architectures supporting predicated execution, even for multimedia ISAs that do not provide hardware predication. We derive large basic blocks with predicated instructions to which SLP can be applied. After parallelization, the basic block can be a mix of predicated scalar and superword instructions. Since our target architectures do not support predicated execution, both superword and scalar predicates must be removed. We describe how to minimize the overheads of removing superword predicates and how to re-introduce efficient control flow for scalar predicated instructions. In addition, we have discussed other extensions to SLP to address common features of real multimedia codes. We have presented automatically generated performance results on 14 multimedia codes to demonstrate the power of this approach. We observe speedups ranging from 1.09 to 15.00 as compared to sequential execution. As an optimization on the code parallelized for control flow, we also evaluate the costs and benefits of exploiting branches on the aggregate condition codes associated with the fields of a superword, such as the branch-on-any instruction of the AltiVec. Branch-on-superword-condition-code (BOSCC) instructions allow fast detection of aggregate conditions to bypass a parallel code segment, an optimization opportunity often found in multimedia applications such as image processing and pattern matching. Our experimental results show speedups of up to 1.40 on 8 multimedia kernels when BOSCC instructions are used, as compared to the versions not using them.

9.1.2 Compiler-Controlled Caching in Superword Registers

Parallelization is not as effective when the bottleneck is memory accesses.
Thus, optimizations targeting the memory hierarchy are even more important for architectures supporting SLP. This thesis has described a compiler algorithm that exploits superword register files as a compiler-controlled cache to avoid unnecessary memory accesses. Accessing data from superword registers, versus a cache or main memory, has two advantages: it removes memory access instructions and it removes their latencies. This research is distinguished from previous work on exploiting reuse in scalar registers because it considers not only temporal but also spatial reuse. As compared to optimizations that exploit reuse in cache, the compiler must also manage replacement, and thus explicitly name registers in the generated code. We have presented a set of results derived automatically on 14 benchmarks. Our results show speedups ranging from 1.40 to 8.69 as compared to using the original SLP compiler.

9.1.3 Implementation and Evaluation of the Proposed Techniques

The proposed algorithms for exploiting both SLP in the presence of control flow and locality in superword registers have been fully implemented in a compiler by extending the original SLP compiler. Our extension also includes the additional code generation techniques described in Chapter 5. We have described our implementation for a target architecture, the PowerPC AltiVec. The automatically generated parallel C programs are compiled by the backend compiler and run on the PowerPC G4. The overall speedups achieved by the compiler implementation range from 1.05 to 19.22. Since these speedups are the result of multiple techniques, we have also presented experimental results isolating the benefits of individual techniques.

9.1.4 DIVA-Specific Optimizations

Since DIVA is a new architecture, there exist new compiler optimization opportunities. This thesis has described a compiler algorithm and several optimization techniques to exploit a DRAM memory characteristic (page-mode) automatically. A page-mode memory access exploits a form of spatial locality, where the data item is in the same row of the memory buffer as the previous access. Thus, access time is reduced because the cost of row selection is eliminated. The algorithm increases the frequency of page-mode accesses by reordering data accesses, grouping together accesses to the same memory row. We implemented this algorithm and presented speedup results for four multimedia kernels on a PIM embedded DRAM device, DIVA. The speedups achieved by exploiting page-mode memory access alone range from 1.04 to 1.16, resulting in overall speedups ranging from 1.25 to 2.19 when combined with optimizations targeting superword-level parallelism and locality, as compared to SLP. These results show that there is a benefit in exploiting page-mode memory access in embedded systems, where the DRAM access time dominates the memory latency seen by the processor. Furthermore, our results show that for embedded systems with support for superword-level parallelism [65, 9, 31], optimizations for exploiting the DRAM's page-mode accesses are complementary to optimizations for superword-level parallelism and superword-level locality. In addition, we presented a preliminary experimental result demonstrating data bandwidth on a prototype DIVA system.
We observe that the performance of a single DIVA processor is comparable to that of the Itanium2 processor.

9.2 Future Work

In the course of this research, we encountered several open issues and future directions for this work, described as follows. Parallelization for architectures supporting SLP involves a certain overhead because of architectural features and limitations. For example, in AltiVec, superword memory accesses are required to be aligned to superword boundaries, not all operations are supported for all operand types, and data movements between register files are not directly supported. To get around these requirements and limitations, additional instructions are usually generated. We plan to expand this research by developing a cost model for parallelization so that codes are not parallelized when doing so may generate an adverse effect. We also plan to expand this research in the context of DIVA. Although we have already run several applications on DIVA, at the time of this writing, running applications on the DIVA system is not as easy as on commercial product systems. We expect to be able to run more applications on DIVA in the near future and compare the results with those on the AltiVec. Exploiting DIVA-specific ISA features is also left as future work. Most of our current benchmark programs are selected from the multimedia and scientific application domains. While we would like to include more applications from those two domains, we also plan to apply our techniques to applications in other domains, such as data-intensive search algorithms in artificial intelligence. Traditionally, artificial intelligence applications are not considered suitable for SIMD parallelization. However, we see that their requirements for high data bandwidth and large volumes of computation are well matched by the features of the DIVA processor. Currently, we are working on mapping a link discovery algorithm [1] to the DIVA processor.

Reference List

[1] Jafar Adibi. Link discovery via a mutual information model: From graphs to ordered lists. In DIMACS Workshop on Applications of Order Theory to Homeland Defense and Computer Security, DIMACS Center, CoRE Building, Rutgers University, NJ, September 2004.

[2] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley, 1986.

[3] John R. Allen, Ken Kennedy, Carrie Porterfield, and Joe Warren. Conversion of control dependence to data dependence. In Annual Symposium on Principles of Programming Languages, pages 177-189, Austin, Texas, USA, 1983.

[4] Randy Allen and Ken Kennedy. Optimizing Compilers for Modern Architectures. Morgan Kaufmann, 2002.

[5] Krste Asanovic and James Beck. T0 engineering data. UC Berkeley CS technical report UCB/CSD-97-930.

[6] Krste Asanovic, James Beck, Tim Callahan, Jerry Feldman, Bertrand Irissou, Brian Kingsbury, Phil Kohn, John Lazzaro, Nelson Morgan, David Stoutamire, and John Wawrzynek. CNS-1 architecture specification: A connectionist network supercomputer. Technical Report TR-93-021, International Computer Science Institute, April 1993.

[7] Mladen Berekovic, Hans-Joachim Stolberg, and Peter Pirsch. Implementing the MPEG-4 AS profile for streaming video on a SOC multimedia processor.
In 3rd Workshop on Media and Streaming Processors, Austin, Texas, December 2001.

[8] Aart J. C. Bik, Milind Girkar, Paul M. Grey, and Xinmin Tian. Automatic intra-register vectorization for the Intel architecture. International Journal of Parallel Programming, 30(2):65-98, April 2002.

[9] Jay B. Brockman, Peter M. Kogge, Vincent Freeh, Shannon K. Kuntz, and Thomas Sterling. Microservers: A new memory semantics for massively parallel computing. In ACM International Conference on Supercomputing, June 1999.

[10] Steve Carr and Ken Kennedy. Improving the ratio of memory operations to floating point operations in loops. ACM Transactions on Programming Languages and Systems, 15(3):400-462, July 1994.

[11] Steve Carr and Ken Kennedy. Scalar replacement in the presence of conditional control flow. Software Practice and Experience, 24(1):51-77, 1994.

[12] Steve Carr, Kathryn S. McKinley, and Chau-Wen Tseng. Compiler optimizations for improving data locality. In Architectural Support for Programming Languages and Operating Systems, San Jose, CA, USA, October 1994.

[13] Jacqueline Chame and Sungdo Moon. A tile selection algorithm for data locality and cache interference. In International Conference on Supercomputing, pages 492-499, 1999.

[14] Jacqueline Chame, Jaewook Shin, and Mary Hall. Compiler transformations for exploiting bandwidth in PIM-based systems. In The 27th Annual International Symposium on Computer Architecture, Workshop on Solving the Memory Wall Problem, June 11, 2000, Vancouver, British Columbia, Canada.

[15] Gerald Cheong and Monica S. Lam. An optimizer for multimedia instruction sets. In The Second SUIF Compiler Workshop, Stanford University, USA, August 1997.

[16] Weihaw Chuang, Brad Calder, and Jeanne Ferrante. Phi-predication for light-weight if-conversion. In International Symposium on Code Generation and Optimization, pages 179-190, San Francisco, California, 2003.

[17] Stephanie Coleman and Kathryn S. McKinley. Tile size selection using cache organization and data layout. In The SIGPLAN '95 Conference on Programming Language Design and Implementation, La Jolla, CA, June 1995.

[18] Thomas Cormen, Charles Leiserson, and Ronald Rivest. Introduction to Algorithms. McGraw Hill, 2nd edition, 1990.

[19] Derek J. DeVries. A vectorizing SUIF compiler: Implementation and performance. Master's thesis, University of Toronto, 1997.

[20] Keith Diefendorff and Pradeep K. Dubey. How multimedia workloads will change processor design. Computer, 30(9):43-45, 1997.

[21] Jeff Draper, Jacqueline Chame, Mary Hall, Craig Steel, Tim Barrett, Jeff LaCoss, John Granacki, Jaewook Shin, Chun Chen, Chang Woo Kang, Ihn Kim, and Gokhan Daglikoca. The architecture of the DIVA processing-in-memory chip. In Proceedings of the 16th ACM International Conference on Supercomputing, pages 26-37, June 2002.

[22] Jeff Draper, Jeff Sondeen, and Chang Woo Kang. Implementation of a 256-bit wideword processor for the data-intensive architecture (DIVA) processing-in-memory (PIM) chip. In 28th European Solid-State Circuits Conference, Florence, Italy, September 2002.

[23] Karim Esseghir. Improving data locality for caches. Master's thesis, Dept. of Computer Science, Rice University, September 1993.

[24] Jeanne Ferrante and Mary Mace.
On linearizing parallel code. In Annual Symposium on Principles of Programming Languages, pages 179-190, New Orleans, Louisiana, United States, 1985.

[25] Jeanne Ferrante, Vivek Sarkar, and Wendy Thrash. On estimating and enhancing cache effectiveness. In Proceedings of the Fourth International Workshop on Languages and Compilers for Parallel Computing, pages 328-343, Santa Clara, California, August 1991.

[26] Christine Fricker, Olivier Temam, and William Jalby. Influence of cross-interferences on blocked loops: A case study with matrix-vector multiply. ACM Transactions on Programming Languages and Systems, 17(4):561-575, July 1995.

[27] Changqing Fu and Kent Wilken. A faster optimal register allocator. In ACM/IEEE International Symposium on Microarchitecture, pages 245-256, Istanbul, Turkey, November 2002.

[28] Somnath Ghosh, Margaret Martonosi, and Sharad Malik. Cache miss equations: An analytical representation of cache misses. In Proceedings of the 1997 ACM International Conference on Supercomputing, Vienna, Austria, July 1997.

[29] Somnath Ghosh, Margaret Martonosi, and Sharad Malik. Precise miss analysis for program transformations with caches of arbitrary associativity. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 228-239, San Jose, California, October 1998.

[30] Peter Grun, Nikil D. Dutt, and Alexandru Nicolau. Memory aware compilation through accurate timing extraction. In Design Automation Conference, pages 316-321, 2000.

[31] Mary Hall, Peter Kogge, Jeff Koller, Pedro Diniz, Jacqueline Chame, Jeff Draper, Jeff LaCoss, John Granacki, Apoorv Srivastava, William Athas, Jay Brockman, Vincent Freeh, Joonseok Park, and Jaewook Shin. Mapping irregular applications to DIVA, a PIM-based data-intensive architecture. In ACM International Conference on Supercomputing, November 1999.

[32] Mary W. Hall, Jennifer M. Anderson, Saman P. Amarasinghe, Brian R. Murphy, Shih-Wei Liao, Edouard Bugnion, and Monica S. Lam. Maximizing multiprocessor performance with the SUIF compiler. Computer, 29(12):84-89, December 1996.

[33] IBM. IBM Cu-11 embedded DRAM macro datasheet, March 2002. http://www-3.ibm.com/chips/techlib/techlib.nsf/techdocs/.

[34] Intel. Intel Architecture Software Developer's Manual, Volume 2: Instruction Set Reference, 1999. Order Number 243191.

[35] Intel. Intel(R) Itanium Architecture Software Developer's Manual, October 2002. 24531904.pdf.

[36] Intel. Intel(R) Itanium(R) 2 Processor Reference Manual, April 2003. 25111002.pdf.

[37] Andreas Krall and Sylvain Lelait. Compilation techniques for multimedia processors. International Journal of Parallel Programming, 28(4):347-361, 2000.

[38] Monica S. Lam, Edward E. Rothberg, and Michael E. Wolf. The cache performance and optimization of blocked algorithms. ACM SIGPLAN Notices, 26(4):63-74, 1991.

[39] Samuel Larsen and Saman Amarasinghe. Exploiting superword level parallelism with multimedia instruction sets. In Conference on Programming Language Design and Implementation, pages 145-156, Vancouver, BC, Canada, June 2000.

[40] Samuel Larsen, Emmett Witchel, and Saman Amarasinghe. Increasing and detecting memory address congruence. In International Conference on Parallel Architectures and Compilation Techniques, September 2002.
[41] Chunho Lee, Miodrag Potkonjak, and William H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In ACM/IEEE International Symposium on Microarchitecture, pages 330-335, 1997.

[42] Ruby Lee. Subword parallelism with MAX2. IEEE Micro, 16(4):51-59, August 1996.

[43] Glenn Luecke and Waqar Haque. Evaluation of Fortran vector compilers and preprocessors. Software Practice and Experience, 21(9), September 1991.

[44] Scott A. Mahlke. Exploiting Instruction-Level Parallelism in the Presence of Conditional Branches. PhD thesis, University of Illinois, Urbana, IL, September 1996.

[45] Marta Jimenez, Jose M. Llaberia, Agustin Fernandez, and Enric Morancho. Index set splitting to exploit data locality at the register level. Technical Report UPC-DAC-1996-49, Universitat Politecnica de Catalunya, 1996.

[46] David Martin. Vector extensions to the MIPS-IV instruction set architecture (the V-IRAM architecture manual), March 2000.

[47] Sally A. McKee, William A. Wulf, James H. Aylor, Robert H. Klenke, Maximo H. Salinas, Sung I. Hong, and Dee A. B. Weikle. Dynamic access ordering for streamed computations. IEEE Transactions on Computers, 49(11):1255-1271, 2000.

[48] Metrowerks. CodeWarrior version 7.0 data sheet, 2001. http://www.metrowerks.com/pdf/mac7.pdf.

[49] Motorola. AltiVec Technology Programming Environments Manual, Rev. 0.1, November 1998. ftp://www.motorola.com/SPS/PowerPC/teksupport/teklibrary/manuals/altivec_pem.pdf.

[50] Motorola. AltiVec Technology Programming Interface Manual, June 1999. http://e-www.motorola.com/brdata/PDFDB/docs/ALTIVECPIM.pdf.

[51] Steven A. Moyer. Access Ordering and Effective Memory Bandwidth. PhD thesis, University of Virginia, 1993.

[52] Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 340 Pine St. Sixth Floor, San Francisco, CA 94104-3205, USA, 1997.

[53] Vijay S. Pai, Parthasarathy Ranganathan, and Sarita V. Adve. RSIM reference manual, version 1.0. Technical Report 9705, Department of Electrical and Computer Engineering, Rice University, July 1997.

[54] Preeti Ranjan Panda, Nikil D. Dutt, and Alexandru Nicolau. Exploiting off-chip memory access modes in high-level synthesis. IEEE Transactions on CAD, February 1998.

[55] Joseph C. H. Park and Mike Schlansker. On predicated execution, May 1991. Software and Systems Laboratory, HPL-91-58.

[56] Alex Peleg, Sam Wilkie, and Uri Weiser. Intel MMX for multimedia PCs. Communications of the ACM, 40(1):24-38, 1997.

[57] Parthasarathy Ranganathan, Sarita Adve, and Norman P. Jouppi. Performance of image and video processing with general-purpose processors and media ISA extensions. In International Symposium on Computer Architecture, May 1999.

[58] Gabriel Rivera and Chau-Wen Tseng. A comparison of compiler tiling algorithms. In the 8th International Conference on Compiler Construction (CC'99), Amsterdam, The Netherlands, March 1999.

[59] Robert Sadourny. The dynamics of finite difference models of the shallow water equations. Journal of the Atmospheric Sciences, 32(4):680-689, 1975.

[60] Jaewook Shin, Mary W. Hall, and Jacqueline Chame. Superword-level parallelism in the presence of control flow. In International Symposium on Code Generation and Optimization, March 2005.
[61] SIMDtech. AltiVec documents archive, 2005. http://www.simdtech.org/altivec/documents/.
[62] James E. Smith, Greg Faanes, and Rabin Sugumar. Vector instruction set support for conditional operations. In International Symposium on Computer Architecture. ACM, 2000.
[63] Byoungro So, Mary W. Hall, and Pedro C. Diniz. A compiler approach to fast hardware design space exploration in FPGA-based systems. In Proceedings of the ACM SIGPLAN 2002 Conference on Programming Language Design and Implementation, Berlin, Germany, June 2002.
[64] N. Sreraman and R. Govindarajan. A vectorizing compiler for multimedia extensions. International Journal of Parallel Programming, 2000.
[65] Thomas Sterling. An introduction to the Gilgamesh PIM architecture. In Rizos Sakellariou, John Keane, John R. Gurd, and Len Freeman, editors, Euro-Par, volume 2150 of Lecture Notes in Computer Science, pages 16-32. Springer, 2001.
[66] Olivier Temam, Elana D. Granston, and William Jalby. To copy or not to copy: A compile-time technique for assessing when data copying should be used to eliminate cache conflicts. In ACM International Conference on Supercomputing, Portland, OR, November 1993.
[67] Marc Tremblay, J. Michael O'Connor, Venkatesh Narayanan, and Liang He. VIS speeds new media processing. IEEE Micro, 16(4):10-20, August 1996.
[68] Veridian. VAST/AltiVec Features, June 2001. http://www.psrv.com/altivec_feat.html.
[69] Michael E. Wolf. Improving Locality and Parallelism in Nested Loops. PhD thesis, Dept. of Computer Science, Stanford University, 1992.
[70] Michael E. Wolf and Monica S. Lam. A data locality optimizing algorithm. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and Implementation, volume 26, pages 30-44, Toronto, Ontario, Canada, June 1991.
[71] Michael J. Wolfe. More iteration space tiling. In Proceedings of Supercomputing '89, pages 655-664, Reno, Nevada, November 1989.