IMPROVING MEMORY HIERARCHY PERFORMANCE USING DATA REORGANIZATION

by

Neungsoo Park

A Dissertation Presented to the
FACULTY OF THE GRADUATE SCHOOL
UNIVERSITY OF SOUTHERN CALIFORNIA
In Partial Fulfillment of the Requirements for the Degree
DOCTOR OF PHILOSOPHY
(COMPUTER ENGINEERING)

December 2002

Copyright 2002 Neungsoo Park

UMI Microform 3093966. Copyright 2003 by ProQuest Information and Learning Company. All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code. ProQuest Information and Learning Company, 300 North Zeeb Road, P.O. Box 1346, Ann Arbor, MI 48106-1346.

UNIVERSITY OF SOUTHERN CALIFORNIA
The Graduate School
University Park
Los Angeles, California 90089-1695

This dissertation, written by Neungsoo Park under the direction of his Dissertation Committee, and approved by all its members, has been presented to and accepted by The Graduate School, in partial fulfillment of requirements for the degree of DOCTOR OF PHILOSOPHY.

Date: 12-18-2002

DISSERTATION COMMITTEE
Chairperson

Dedication

To my beloved mother and to the memory of my beloved father

Acknowledgments

I would like to thank Dr. Viktor K. Prasanna, my academic advisor at USC, for his encouragement, support, and guidance throughout my PhD program. The success of this work is due, in large part, to his relentless effort in helping me grow as a researcher and a thinker. I am also thankful to Dr. Cauligi S. Raghavendra for providing me with additional guidance. The technical discussions with him have been helpful in completing this work. I would like to thank my committee members, Dr. Antonio Ortega, Dr. Alexander Tartakovsky, Dr. Monte Ung, and Dr. Roger Zimmermann, for their guidance in this dissertation. They willingly spent their valuable time giving insightful feedback on this dissertation.

I would like to thank my senior group alumni, Dr. Yongwha Chung, Henry Park, Dr. Young Won Lim, Dr. Myungho Lee, Dr. Jongwoo Bae, Dr. Jinwoo Suh, Dr. Prashanth Bhat, and Dr. Charles Wenheng Liu for their help in building my research background and for spending their busy hours on technical discussions with me. I also would like to thank my fellow graduate students in the group: Seonil Choi, Dr. Ammar Alhusaini, Reetinder P. Sidhu, Dr. Kiran Bondalapati, Dr. Andreas Dandalis, Zachary Baker, Amol Bakshi, Sethavidh Gertphol, Shriram Bhargava Gundala, Bo Hong, Vaibhav Mathur, Sumit Mohanty, Gerald R. Morris, Jingzhao Ou, Joon Sang Park, Michael Penner, Ronald Scrofano, Dhananjay Raghavan, and Ashwin Sethuram. With these peers, my past several years were really enjoyable. They are all very smart and hard working. In fact, whenever I encountered a problem, I found the necessary help within this group of people. I would like to thank Henryk Chrostek for his administrative assistance to our group. My thanks also go to my seniors at USC, Dr. Hiecheol Kim, Dr. Kangwoo Lee, Dr. Joonho Ha, Dr.
Jaeheon Jeong, Dr. Seonho Kim, Jooso Song, Dr. Jongwook Woo, Chulho Shin, Duckdong Hwang, Seongwon Lee, Yongho Song, and Dongsoo Kang for their encouragement. I would like to thank my friends, Dr. Youpyo Hong, Hyungsuk Kim, Dr. Yungho Choi, Chulmin Lee, and Dr. Yongdae Kim, for their encouragement and help. My thanks also go to my friends and my alumni, even though not all of them are listed. Especially, I would like to thank my mother and my sister. Their support and love have always been a source of immense power for me to complete this dissertation. Lastly, I would like to dedicate this dissertation to the memory of my father. Without his love and support, this accomplishment would not have been possible.

Contents

Dedication
Acknowledgments
List of Tables
List of Figures
Abstract

1 Introduction
1.1 Research Contributions
1.1.1 Efficient algorithm for data redistribution between processor sets
1.1.2 Static data reorganization in uni-processors
1.1.3 Dynamic data reorganization in uni-processors
1.2 Dissertation Outline

2 Background
2.1 Memory Hierarchy System of HPC Platforms
2.1.1 High performance computing platforms
2.1.2 Model of the memory hierarchy system
2.1.3 Data access cost in a uni-processor system
2.1.4 Communication cost in a multi-processor platform
2.1.5 Examples of HPC platforms
2.1.5.1 Uni-processor platforms
2.1.5.2 Multi-processor platforms

3 Data Redistribution between Processor Sets
3.1 Background and Related Work
3.1.1 Problem definition
3.1.2 Cost of performing redistribution
3.1.3 Related work
3.2 Our Approach to Redistribution
3.2.1 Array and processor points of view
3.2.2 A table-based framework for redistribution
3.2.3 Communication scheduling using a generalized circulant matrix
3.3 Efficient Redistribution Algorithms
3.3.1 Communication patterns of redistribution
3.3.2 Non all-to-all communication
3.3.3 All-to-all communication
3.3.4 Data transfer cost
3.4 Experimental Results
3.4.1 Total redistribution time
3.4.2 Data transfer time
3.4.3 Schedule computation time

4 Static Data Reorganization: Block Data Layout
4.1 Related Work
4.2 Block Data Layout and TLB Performance
4.2.1 Block data layout
4.2.2 TLB performance of block data layout
4.2.2.1 A lower bound on TLB misses
4.2.2.2 TLB performance
4.3 Tiling and Block Data Layout
4.3.1 TLB performance
4.3.2 Cache performance
4.3.3 Block size selection
4.4 Experimental Results
4.4.1 Simulations
4.4.2 Execution on real platforms
4.4.3 Block data layout and Morton data layout

5 Dynamic Data Reorganization: Dynamic Data Layouts
5.1 Related Work
5.1.1 Memory hierarchy performance optimizations
5.1.2 FFT and WHT performance optimizations
5.2 Factorized Signal Transforms
5.2.1 Factorization of signal transforms
5.2.2 Cache performance of the computation of factorized signal transforms
5.3 Cache-Conscious Factorization of Signal Transforms
5.3.1 Dynamic data layout
5.3.2 Search for an optimal factorization
5.3.3 Alternate factorizations using dynamic data layout
5.4 Experimental Results
5.4.1 Simulation results
5.4.2 Experimental results

6 Conclusion and Future Research

Reference List

List of Tables

1.1 Access times for different memory levels on UltraSPARC processors
1.2 Bandwidth for local and remote memory access on Cray T3E
2.1 Communication features of state-of-the-art HPC multi-processor platforms
2.2 Memory hierarchy system of uni-processor platforms
3.1 An example of non all-to-all communication. P = 6, Q = 10, and K = 3.
3.2 An example of all-to-all communication with same message size. P = 6, Q = 10, and K = 4.
3.3 An example of all-to-all communication with different message size. P = 8, Q = 10, and K = 6.
3.4 Comparison of data transfer cost and schedule and index computation costs of the Caterpillar algorithm, the bipartite matching scheme, and our algorithm
3.5 Comparison of schedule computation time (μsec) with data transfer time
4.1 Comparison of TLB misses
4.2 TLB misses for all tiled row accesses followed by all tiled column accesses
4.3 Parameters of TMM
4.4 Features of various experimental platforms
4.5 Comparison of execution time of TMM on various platforms: all times are in seconds
4.6 Comparison of execution time of LU decomposition on various platforms: all times are in seconds
5.1 Some alternate factorization trees
5.2 Parameters of the platforms
5.3 Compiler and optimization options used in the experiments
5.4 Optimal factorizations of WHT on Alpha 21264
5.5 Optimal factorizations of FFT on MIPS R10000
List of Figures

1.1 General multi-processor platforms with distributed memory system
1.2 A model of the memory hierarchy system
2.1 The data access cost of stride permutation
2.2 Computation model for HPC multi-processor platforms
2.3 Block diagram of Cray T3E node
3.1 Block-cyclic redistribution from the array point of view: (a) the array of elements, (b) cyclic(x) on P processors, (c) from cyclic(x) on P processors to cyclic(Kx) on Q processors. In this example, x = 2, P = 3, Q = 6, and K = 2.
3.2 Steps in performing redistribution
3.3 Block-cyclic redistribution from cyclic(x) on P processors to cyclic(Kx) on Q processors from the processor point of view. In this example, P = 3, Q = 6, and K = 2.
3.4 The destination processor table T
3.5 Table conversion process for redistribution
3.6 Circulant matrix examples
3.7 Generalized circulant matrix
3.8 Steps of column rearrangement
3.9 Decomposition of Dinit and Csend
3.10 Example illustrating an all-to-all case with different message sizes: R(4, 3, 6)
3.11 Steps for measuring the redistribution time
3.12 Total redistribution time for non all-to-all communication examples
3.13 Total redistribution time for all-to-all communication examples with different message sizes
3.14 Steps in measuring the data transfer time
3.15 Data transfer time for non all-to-all communication examples
3.16 Data transfer time for all-to-all communication examples with different message sizes
4.1 Various data layouts: block size 2 x 2 for (b) and (c)
4.2 Tiled matrix multiplication
4.3 Tiled accesses
4.4 Blocks extending over page boundaries
4.5 Comparison of TLB misses from simulation and estimation
4.6 Comparison of TLB misses using tiling+BDL and tiling only
4.7 Example of conflict misses
4.8 Comparison of cache misses from simulation and estimation for 6-loop TMM
4.9 Execution time of TMM of size 1024 x 1024
4.10 Miss cost estimation for 6-loop TMM (UltraSparc II parameters)
4.11 Simulation results of 6-loop TMM
4.12 Total miss cost for 6-loop TMM
4.13 Optimal block sizes for 6-loop TMM
4.14 Simulation results for TMM using UltraSparc II parameters
4.15 Total miss cost for TMM using UltraSparc II parameters
4.16 Simulation results for LU using Pentium III parameters
4.17 Effect of block size on LU decomposition using Pentium III parameters
4.18 Execution time comparison of various techniques for TMM on Pentium III
4.19 Effect of block size on TMM
4.20 Effect of block size on LU decomposition
4.21 Effect of block size on Cholesky factorization
4.22 Execution time of TMM
4.23 Execution time of LU decomposition
4.24 Execution time of Cholesky factorization
5.1 Factorization tree obtained by recursively applying the Cooley-Tukey algorithm
5.2 Examples of factorization trees for a 16-point DFT
5.3 Computation of an N-point DFT using the Cooley-Tukey algorithm
5.4 Cache behavior for two successive DFTs (N1 = 4, B = 4, C = 32)
5.5 Effect of stride access to compute a 32-point DFT (s = log2 Stride)
5.6 Strides resulting from Cooley-Tukey factorization
5.7 Dynamic data layout approach for the Cooley-Tukey algorithm
5.8 Data access pattern for a 16 x 16 factorization
5.9 Different DDL factorization trees for a 2^n-point DFT
5.10 Search algorithm considering dynamic data layout
5.11 Miss rates for various FFT sizes
5.12 Miss rates for various cache line sizes
5.13 Performance of FFT DDL on Alpha 21264
5.14 Performance of FFT DDL on MIPS R10000
5.15 Performance of FFT DDL on Pentium 4
5.16 Performance of FFT DDL on UltraSPARC III
5.17 Performance comparison of WHT implementations

Abstract

State-of-the-art computing platforms have deep memory hierarchies. A mismatch between the data layout and the data access pattern during the computation causes expensive memory accesses, degrading the overall performance. In this dissertation, we propose data reorganization as an approach to improve the memory hierarchy performance. In our approach, the data layout is reorganized in order to match the data access pattern of the computation. Therefore, it reduces the number of cache misses in uni-processor platforms and the number of remote memory accesses in multi-processor platforms. This dissertation discusses three main contributions in using data reorganization to achieve high performance.

The data reorganization in multi-processor platforms is called data redistribution in this dissertation. The overhead of data redistribution between processor sets should be minimized, since the redistribution itself requires inter-processor communication. Therefore, we propose an efficient array redistribution algorithm between processor sets. The proposed data redistribution algorithm reduces the overall communication time, including the data transfer, communication schedule, and index computation costs. Our algorithm generates a schedule that minimizes the number of communication steps and also eliminates node contention at each communication step. The network bandwidth is fully utilized by ensuring that equal-sized messages are transferred in each communication step. Furthermore, the time to compute the schedule and the index sets is less than 1% of the data transfer time for problem sizes of general interest and becomes even less significant as the problem size increases. This makes our algorithm suitable for runtime data redistribution.

To improve cache performance in uni-processor platforms, block data layout has been proposed in conjunction with tiling. In this dissertation, this is called static data reorganization, since the data layout is reorganized before the computation begins and is not changed during the computation. We provide an analysis of the Translation Look-aside Buffer (TLB) and cache performance using block data layout. For standard matrix access patterns, we derive an asymptotic lower bound on the number of TLB misses for any data layout and show that block data layout achieves this bound. We show that block data layout improves TLB misses by a factor of O(B) when compared with conventional data layouts for tiled matrix multiplication (TMM), where B is the block size of block data layout. This reduction leads to improved memory hierarchy performance. Using our TLB and cache analysis, we also discuss the impact of block size on the overall memory hierarchy performance. Simulation results show that tiling with block data layout reduces the number of TLB misses by up to 93% as compared with copying and padding for TMM.
Experimental results for TMM, LU decomposition, and Cholesky factorization on several platforms (UltraSparc II and III, Alpha, and Pentium III) show that tiling with block data layout achieves up to 50% performance improvement over other techniques that use conventional layouts.

As the computation proceeds, the data access stride (data access pattern) changes. If the data layout is fixed during the computation (the static data layout approach), this change in the access stride results in poor cache performance. To improve cache performance, we propose to change the data layout dynamically between computation stages. This technique is called the dynamic data layout approach in this dissertation. To show the effectiveness of our technique, we apply the dynamic data layout approach to compute large signal transforms. In the simulations, Fast Fourier Transform (FFT) computation using the dynamic data layout approach achieves a lower cache miss rate than FFT computation using the static data layout approach when the transform size is larger than the cache size. We also develop optimized packages for FFT and Walsh-Hadamard Transform (WHT) using the dynamic data layout approach. Experimental results show that our FFT and WHT packages achieve a performance improvement of up to 3.52 times over other state-of-the-art FFT and WHT packages.

Chapter 1

Introduction

Current technology trends indicate that, while processor speeds are increasing at the rate of more than 50% per year, the bandwidth and latency of the memory (DRAM) have been improving by only 10-15% [21, 38]. This gap between the processor and memory speeds is a critical bottleneck towards achieving high performance, since accessing data in memory makes a processor stall for several cycles. To hide the long memory access latency and to increase the memory bandwidth, the memory system is built hierarchically between the processor and the main memory in uni-processor platforms. This memory hierarchy system consists of multi-level caches. The level 1 (L1) cache is an on-chip (or on-die) cache. The level 2 (L2) cache is an off-chip cache. Hence, as a cache gets closer to the processor, it becomes faster and smaller. Thus, to improve the performance, data that will be accessed in the near future should be copied into these caches. However, a small cache size limits the amount of data that can be copied into the cache. Therefore, efficient utilization of caches is a significant factor in achieving high performance.

Compilers use a specific mapping to store a data array in the main memory. For example, a 2-dimensional array is stored in row-major order by the C compiler and in column-major order by the FORTRAN compiler. These data layouts cannot match the data access patterns during matrix computations.

[Figure 1.1: General multi-processor platforms with distributed memory system]

A mismatch between the data access pattern and the data layout in memory in a uni-processor system causes cache misses. A cache miss results in a memory access to place data into the cache from the main memory. This causes the processor to stall until the requested data is available in the cache. Therefore, the overall performance degrades.
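As a concrete illustration of this layout/access-pattern mismatch, the following minimal C sketch (not taken from the dissertation; the dimension N is an arbitrary assumption) contrasts row-wise and column-wise traversal of a row-major array. The row-wise loop touches consecutive addresses and misses roughly once per cache line, while the column-wise loop accesses memory with stride N and can miss on nearly every reference.

```c
/* Illustrative sketch: traversing a row-major C array along rows touches
 * consecutive addresses (good spatial locality), while traversing it along
 * columns touches addresses N*sizeof(double) apart, so each access may fall
 * on a different cache line. */
#include <stdio.h>

#define N 1024                      /* hypothetical matrix dimension */

static double a[N][N];              /* stored in row-major order by the C compiler */

double sum_row_wise(void) {         /* stride-1 accesses: about one miss per cache line */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_column_wise(void) {      /* stride-N accesses: up to one miss per access */
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_row_wise(), sum_column_wise());
    return 0;
}
```

Both loops perform the same arithmetic; only the order of the memory references differs, which is exactly the effect the data reorganization techniques in this dissertation target.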
Deep memory hierarchy systems are also used for multi-processor and m ulti com puter platforms having distributed memories. Current multi-processor plat forms consist of two main components: powerful computing nodes and a low la tency, high bandw idth interconnection network. Each computing node is similar to a uniprocessor platform. These com puting nodes are connected by an inter connection network, which typically consists of high speed point-to-point links and routing switches. Figure 1.1 describes the general architecture of m ulti processor platforms. Recently, cluster com puters [40, 56] have emerged, which consist of general-purpose PCs or workstations connected through high speed net work switches like the Myrinet switch. In such platforms, a rem ote memory access requires an inter-processor communication. A software overhead is involved in inter-processor communication. This makes the latency of remote memory access much larger than th at of local memory access. 2 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Network (Interconnetion) LI Cache Remote Memory Remote Memory Main Memory (Local) Processor L2 Cache Remote Nodes A Node On-Chip Figure 1.2: A model of the memory hierarchy system Similar to the data layout in uni-processor platforms, data placem ent in m ulti processor platforms plays a crucial role in achieving high com puting performance. Before the computation begins, the data is distributed among processors. If the distribution does not match the access pattern, the required data is not in a local memory. Therefore, the processor requests rem ote memory accesses. In such sys tem s, a remote memory access requires expensive inter-processor communication which causes a processor to rem ain idle until data is available in the local memory. This idling degrades the memory hierarchy performance. 3 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. The performance degradation in the memory hierarchy system results from a mismatch between the data access pattern and the d ata layout: this mism atch increases the num ber of cache misses in uni-processor platform s and the num ber of remote memory accesses in multi-processor platform s. In this dissertation, we propose an approach to improve the performance considering the data access p at tern and the data layout. To discuss the effect of the data access pattern and the data layout on performance, we use the model of a m emory hierarchy system which includes all the memory levels of uni-processor and multi-processor platforms. The memory hierarchy system can be represented by the model shown in Fig ure 1.2. The memory hierarchy system of a computing node in a multi-processor platform is the same as that of a stand-alone uni-processor system. As a memory in a computing node is in the higher level of the memory hierarchy (closer to the processor), it has shorter access latency. Hence, the memory in the lowest level of the memory hierarchy, which is farthest away from the processor, has the longest access latency. Table 1.1 shows the different access times of each memory level in a uni-processor platform consisting of an UltraSPARC processor [34]. 
Further, from a computing node point of view, the rem ote memories in other computing nodes can be considered to be at a lower level than its own local memory in the mem ory hierarchy system, because the rem ote memory access latency is much longer than the local memory access latency as shown in Table 1.2. In this model of the memory hierarchy system, a cache miss (an inter-processor communication) results in accessing a lower level memory (the rem ote memory). To improve the overall performance, before data is requested by a processor, it is copied into a memory closer to the processor. Thus, the num ber of accesses to a memory farther away from the processor is reduced, thereby fully utilizing the computing power of the processor. To efficiently supply data into the memory closer to the processor, we exploit the characteristics of data reference in application programs. Thus, the locality 4 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 1.1: Access tim es for different memory levels on UltraSPARC processors Memory Level Access Time CPU registers On-chip cache Qff-chip cache Main memory 1 cycle 3-4 cycles 10’s of cycles 100’s of cycles Table 1.2: Bandwidth for local and rem ote memory access on Cray T3E Memory Level Access Time Local memory Remote memory 960 M Bytes/seconds 200 M Bytes/seconds of data references is a significant factor to improve the performance. Locality can be classified into spatial locality and temporal locality [21, 38]. Spatial locality means that if a memory location has been accessed recently, then memory locations close to the accessed location are likely to be accessed in the near future. Temporal locality means that if a memory location has been accessed recently, then it is likely to be accessed again in the near future. So, it is significant th at consecutively or frequently accessed data are stored into a contiguous address space. The m atch between the data access p attern and the data layout improves the locality of data reference. Many high performance computing (HPC) applications, including scientific computing and signal processing applications, consist of several com putational stages [54, 91, 100]. In each com putational stage, the data access pattern is reg ular. However, as program execution proceeds from one com putational stage to another, the data access pattern can be changed. Therefore, a fixed data layout cannot m atch various access patterns during the com putation, thereby degrad ing the memory hierarchy performance. For example, a two-dimensional Discrete 5 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Fourier Transform (D FT) consists of row-DFTs and column-DFTs. Each row-DFT requires a row of two-dimensional data as input data, and each column-DFT re quires a column of two-dimensional data. Therefore, the data access pattern of a two-dimensional D FT changes from row access to column access. If the data lay out of the two-dimension DFT is fixed as row-major layout, a m ism atch between the data access p attern and the data layout can occur during the com putation of column-DFTs. In our memory hierarchy system, mismatches between the d ata access pattern and the data layout cause an increase in both local memory traffic between caches and (local) m ain memory and rem ote memory traffic between lo cal and remote memories. Therefore, in such applications, a fixed data layout can degrade the performance. 
To improve the memory hierarchy performance, data layout or data access pattern should be changed. Some works propose th at the data access pattern is changed by reordering computations. This is called control transformation. Examples of control trans formation techniques include tiling, loop fusion, loop interleaving, and loop inter change [14, 20, 64, 65, 83, 85, 102]. After reordering com putation, the tem poral lo cality is improved. However, spatial locality cannot usually be improved, since the distance between consecutively accessed data elements (stride) has not changed. In this dissertation, we propose data reorganization as an approach to improve the memory hierarchy performance where the data layout is reorganized to m atch the data access pattern. This approach improves the spatial locality of applications since consecutively accessed data are stored into contiguous memory locations. For HPC applications consisting of several com putation stages, data reorganization makes the data, layout m atch the data access pattern as the latter is changed during computation. This reduces the number of rem ote memory accesses incurred in multi-processor platforms. In this dissertation, the data reorganization in m ulti processor platforms is called data redistribution. In a uni-processor system, the standard data layout given by a compiler cannot m atch the data access pattern 6 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. during computation. A viable approach to improving memory performance is to reorganize data so th a t the logical data access patterns m atch closely w ith spatially contiguous locations in m ain memory. This reduces the number of cache misses. Therefore, our data reorganization approach can be applied uniformly to applications on any platform. In Section 1.1, the m ain contributions of our data reorganization approach are described. In Section 1.2, the rest of the dissertation is summarized. 1.1 R esearch C ontributions Our data reorganization approach is applied to improve performance at two dif ferent levels of the memory hierarchy system: (1) reducing the num ber of rem ote memory accesses and (2) reducing the num ber of cache misses. We present three main contributions using data reorganization. 1.1.1 Efficient algorith m for d a ta red istrib u tion b etw een p rocessor sets Run-time array redistribution is necessary to enhance the performance of parallel programs on distributed memory supercom puters. Since the data redistribution itself requests interprocessor communications, the redistribution overhead should be minimized in order to achieve high performance. Therefore, it is necessary to develop an efficient algorithm for data redistribution between processor sets. Our proposed algorithm reduces the overall tim e for communication by consider ing the data transfer, communication schedule, and index com putation costs. The proposed algorithm is based on a generalized circulant m atrix formalism. Our al gorithm generates a schedule th at minimizes the number of communication steps 7 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. and eliminates node contention in each communication step. The network band width is fully utilized by ensuring th a t equal-sized messages are transferred in each communication step. 
Furtherm ore, the tim e to com pute the schedule and the index sets is less than 1% of the data transfer tim e for problem sizes of general interest and becomes negligible as the problem size increases. Therefore, our proposed algorithm is suitable for run-tim e array redistribution. 1.1.2 S ta tic d a ta reorgan ization in u n i-p rocessors For signal processing and dense m atrix applications, several experim ental stud ies have been conducted using block data layout - in conjunction with tiling - to improve cache performance. In this dissertation, the conversion to block data lay out is called static data reorganization, since the data layout is reorganized before the com putation begins and is not changed until the com putation finishes. We provide a theoretical analysis for the TLB and cache performance for block data layout. For generic m atrix access patterns, we derive an asym ptotic lower bound on the num ber of TLB misses for any data layout and show th a t block data layout achieves this bound. We show th a t block data layout improves TLB misses by a factor of 0(B) compared with conventional data layouts, where B is the block size of block data layout. This reduction leads to improved memory hierarchy perfor mance. Using our TLB and cache analysis, we also discuss the im pact of block size on the overall memory hierarchy performance. To validate our analysis, we have conducted simulations and experim ents using tiled m atrix m ultiplication, LU decomposition and Cholesky factorization. For m atrix m ultiplication, simulation results using UltraSparc II param eters show th at tiling with block data layout, with a block size given by our block size selection algorithm, reduces TLB misses by up to 93% in comparison with other techniques (copying, padding, etc.). LI and L2 cache misses are also reduced. Experim ents on several platform s (UltraSparc II and III, Alpha, and Pentium III) show that tiling with block data layout achieves 8 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. up to 50% performance improvement over other techniques th a t use conventional layouts. 1.1.3 D y n a m ic d ata reo rg a n iza tio n in u n i-p rocessors As the comprrtation of an HPC application proceeds, the data access stride (data access pattern) changes. For example, a large signal transform is decomposed into a number of smaller transforms using the divide-and-conquer property. These small transforms have different strides. Using a fixed data layout, this change of access stride during the com putation results in poor cache performance since the cache in state-of-the-art uni-processor platform s is direct-m apped or small set-associative. To improve cache performance, the data layout is dynamically reorganized between com putation stages so th at it m atches the data access pattern. This technique is called the dynamic data layout approach in this dissertation. To show the effectiveness of our technique, we have applied the dynamic data layout approach to large signal transform s. In our experim ents, FF T and Walsh- Hadam ard Transform (W HT) packages are implemented using the dynamic data layout approach. Experim ents were performed on Alpha 21264, MIPS R100Q0, UltraSPARC III, and Pentium III. Experim ental results show th at our FFT and W HT packages achieve performance improvement of up to 3.52 times over other state-of-the-art FFT and W HT packages. 
1.2 D issertation O utline The rem ainder of this dissertation is organized as follows: In Chapter 2, we explain the memory hierarchy system for uni-processor and multi-processor platforms and provide some examples. We present the communi cation cost model for multi-processor platforms. We discuss the cost of data access 9 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. in a hierarchical memory system for a uni-processor platform , which depends on the data access pattern. In Chapter 3, we present an efficient algorithm for block-cyclic array redistri bution between processor sets. We show th a t the communication schedule of the data redistribution can be represented by a table which falls into the generalized circulant m atrix form. We explain th a t the proposed redistribution algorithm m in imizes the number of communication steps, avoids node contention, and transfers messages of the same size between communication pairs in each communication step. We show how the communication schedule and index com putation are m ath ematically computed. To verify the performance of our redistribution algorithm, experimental results for data redistribution are presented. In Chapter 4, we present a static data reorganization for dense m atrix appli cations. We propose an asym ptotic lower bound on TLB performance for generic m atrix access patterns. We present analysis of the cache and TLB performance when block data layout is used in concert w ith tiling. We also show the im pact of block size on the memory hierarchy performance and provide an optim al block size range. To validate our analysis, we present simulated and experim ental results. In Chapter 5, we present a dynamic data reorganization for factorized signal transforms. To achieve optim al performance, we discuss a search algorithm that considers the size and the access stride of transforms to find a factorization with m inim um execution time. We apply dynamic data reorganization to state-of-the- art F F T and W HT packages for experim ents. We report experim ental results on several state-of-the-art platforms. In Chapter 6, we present concluding remarks and directions for future research. 10 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. C hapter 2 Background In this chapter, general descriptions of HPC platforms and their memory hierarchy systems are presented. Also, the data access cost in memory hierarchy and the communication cost are described. 2.1 M em ory H ierarchy S ystem o f H P C P latform s 2 .1.1 H igh p erform an ce com p u tin g p latform s In the current VLSI technology trend, the clock speed and transistor density of VLSI systems are rapidly improving. These technologies improve the processor speed and the memory capacity. In recent times, state-of-the-art uniprocessor workstations consist of a high speed general purpose processor and a main m em ory with a large am ount of DRAM. These powerful uni-processor platforms are attractive for high performance computing (HPC) applications, which need a high computing power and a large amount of memory. However, processor speeds are increasing at the rate of more than 50% per year, while the bandw idth and latency of the memory (DRAM) have been improving by only 10-15%. 
This gap between processor and memory speeds is a critical bottleneck toward achieving high per formance since accessing data in the memory causes a processor to stall for several 11 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. cycles. To hide the long memory access latency and to increase the memory band width, the memory system is hierarchically built between the processor and the main memory in uni-processor platforms. This memory hierarchy system consists of multi-level caches. In the memory hierarchy system, as a cache is located closer to the processor, it becomes smaller and faster. W ith recent advances in semiconductor technology, Commercial-Off-The-Shelf (COTS) components have steadily improved in com putational performance, cost, and capacity. The use of COTS components allows flexibility in the design of the architecture and allows architecture designers to incorporate rapidly the latest trends and novel design techniques. HPC multi-processor platforms built using these technologies achieve computing power in the range of Gflops/s (109 flops/s) and ultim ately Tflops/s (101 2 flops/s) for various scientific and engineering appli cations. HPC multi-processor platforms are typically composed of COTS compo nents. Two main components of multi-processor platform s are powerful computing nodes and a low latency high bandw idth network [9]. Each computing node con sists of a general purpose processor and local memory, which are used in current generation uniprocessor workstations. The com puting nodes are interconnected with a low-latency high-bandwidth network which typically consists of high speed point-to-point links and routing switches. Figure 1.1 describes the general archi tecture of HPC platforms. Examples of such HPC platforms include the IBM SP2, the Cray T3E, the Intel Paragon, the SGI Origin 2000, and the HP Exem plar, among others. The high com putational power, ready availability, and cost effectiveness of these systems have led to their wide use in diverse applications. 2.1 .2 M o d el o f th e m em ory hierarchy sy stem State-of-the-art uni-processor and multi-processor platform s have a deep memory hierarchy system. In uni-processor platforms, the memory hierarchy system con sists of multi-level caches and a main memory. The level 1 (11) cache is an on-chip 12 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. (or on-die) cache. The level 2 (L2) cache is an off-chip cache, farther away from the processor. Some platforms have a level 3 cache. As a cache is closer to the processor, it becomes smaller and faster. In multi-processor platform s, nodes are connected with an interconnection network. From the processor’s point of view, several remote memory banks are connected through the network interface card (NIC). The access latency of the rem ote m em ory is much larger than the access latency of the local main memory. Therefore, it can be considered to be at a level lower than the local m ain memory. From a processor’s point of view, these memory hierarchy systems can be simply presented as a uniform model as shown in Figure 1.2. The main characteristic of this memory hierarchy system is th at as a memory is farther away from the processor, the size and access latency of the memory increase. 
2.1.3 D a ta access co st in a u n i-p rocessor sy stem In state-of-the-art uni-processor platform s, the execution tim e of an application is proportional to the cost incurred while accessing data in memory. Thus, the data access cost can be considered to evaluate the performance of an application. To analyze the data access cost, it is significant to understand the cache behavior for the data access pattern of an application. In this dissertation, the data access pattern is can be classified as follows: ® Unit stride access: Whole elements in an array are accessed sequentially with stride 1. • Stride access: Elements in an array are accessed with stride S. • Random access: Elements in an array are accessed in random order. Random access is not considered in this proposal. 13 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. To estim ate a data access cost, we consider a simple memory hierarchy which consists of a cache and main memory. A cache is assumed to be a direct-m apped cache with write back on a write miss. Cache size is C and cache line size is L. The number of cache lines in a directed m apped cache is given by CjL. Assuming that an array is too large to fit into a cache, the cost to access N array elements is given as follows: AcCCSS COSt — NhitThit T NmissTmisgy where N = Nh.it + Nmiss, Nhit is the num ber of cache hits, Nmiss is the number of cache misses, Thit is the cost of a cache hit, and Tmiss is the cost of a cache miss. Consider an array accessed with Stride S. The num ber of cache misses and hits to access the array with stride S is represented as a formulation with S and L. In the case th at S is uni-stride access, every Lth element access causes a cache miss and L — 1 subsequent accesses result in cache hits. If S is larger than L, every stride access causes a cache miss. In 1 < S < L, some accesses cause hits and others cause misses. Denote lcm(S , L ) as the least common m ultiple of S and L. For consecutive accesses, lcmCN) misses and lcm(S, L)(^ — j;) hits occur. We can represents cache behavior cost for array access with stride S as follows: l - f ) r A i< ] for S < L for S > L where N is the number of elements accessed with stride S. In a general multi-level memory hierarchy system consisting of both TLB and /-level caches, the evaluation of the total access cost is more complicated. It can be evaluated as follows: i Acess Cost = T M ■ M tlb + ^ C M zHt+1 (2.1) i=l 14 Access Cost N x £ ./71 £ miss + ( NxT„ Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 200 180 160 140 g 120 c o E CD E H 100 80 60 40 20 . 0 — $— Time(msec) Cache Misses / / y 16 64 Stride 1K 800 1 700 600 500 ' ■ 300 q J Z 200 o 100 4K Figure 2.1: The data access cost of stride perm utation where CM{ is the number of misses in the ith level cache, T M is the number of TLB misses, Hi is the cost of a hit in the ith level cache, and Mt\b is the penalty of a TLB miss. The (I + l)th level cache is the main memory. It is assumed th at all data resides in the main memory (CMi+ 1 = 0). Figure 2.1 shows the execution tim e of stride perm utations on UltraSPARC I. As the stride increases, the execution tim e increases. If the stride becomes larger than 8 , the number of cache misses is saturated. However, as the stride increases further, the execution tim e increases, since the num ber of TLB misses increases. If the stride is larger than the size of a page, every access causes a TLB miss. 
Therefore, the execution tim e is saturated at a particular level. 2 .1 .4 C om m u nication cost in a m u lti-p rocessor platform Current multi-processor platforms employ high bandw idth and low latency inter connection networks. In these platform s, large software overheads are incurred as a message traverses through various software layers and network interfaces to its destination [100]. The software overheads dominate the hardware switch latencies. 15 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. From algorithm design and analysis point of view, the networks of such platform s can be modeled as an all-to-all connected network. Its connection forms a complete graph. Hence, the communication tim e is topology-independent. A model for general multi-processor platform s is required to design efficient algorithms for high perform ance application. The General purpose Distributed Memory (GDM) model [100] is a representative example. The model assumes th at a node can send at most one message and receive at most one message at a tim e. The model represents the communication cost of a message passing operation with two param eters: the startup time Ts and the unit transm ission tim e r^. The startup tim e includes the tim e for buffer copy, system call invocation, and communication protocol processing tim e. It is incurred once for each transm itted message. The unit transmission tim e is the cost of transferring a message of unit length over the network. The total communication tim e for sending a message of size m data units for one processor to another is Ts + m x r^. Suppose th at each processor sends m units of data to a destination processor and the set of destination processors is a perm utation at the same tim e, then the total communication tim e is Ts -f- m x rg. Figure 2.2 illustrates the com putation model for HPC multi-processor platforms. In general, the startup tim e is usually larger than the transm ission tim e. Thus it is im portant to reduce the num ber of communication steps to minimize the communication time. Similar models have been used by others [28, 41, 82] The observed values of the above param eters for various HPC multi-processor platform s are shown in Table 2.1. Values Ts and tj, shown in Table 2.1 were obtained based on our measurem ents on SP2 using M PICH and T3E using M PI (EPCC). On Intel Paragon, values were obtained using Basic Linear Algebra Com m unication Subprograms (BLACS) [10]. It should be noted th a t these numbers vary depending on the version of the software environment used for message pass ing. 16 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. hardware latency bandwidth Ns. congestion Interconnection N etwork PE \ ] Network Interface VOS J □ Ts: Startup Time □ Xd . Transmission Rate □ m: Message size in bytes Communication Time Ts+ m .Xd IBM SP-2 MPI Implementation Ts: 10’s of psec Xd: 10’s of nsec/byte Figure 2.2: Com putation model for HPC multi-processor platforms 2.1.5 E xam p les o f H P C p latform s 2.1.5.1 U n i-p rocessor platform s The state-of-the-art uni-processor platforms have multi-level memory hierarchy system to bridge the speed gap between the processor and the m ain memory. LI cache is implemented in on-chip or on-die of the processor core. This implementa tion causes the LI cache to be the fastest in the memory hierarchy system in order to efficiently provide data from the main memory to the processor. 
An L2 cache is located between the LI cache and the main memory. The latency of the L2 cache is much smaller than that of the memory but larger than th at of LI cache. In 17 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 2.1: Comm unication features of state-of-the-art HPC multi-processor p lat forms Multi-processor Platform Ts (ysec) rd (/isec/byte) Bandwidth l / r d (M Bytes/s) IBM SP2 39-46 0.010 100 Cray T3E 28 0.005 200 Intel Paragon 37.7 0.0075 133.60 Table 2.2: Memory hierarchy system of uni-processor platforms Platforms Speed (MHz) LI cache L2 cache TLB Size (KB) Line (Byte) Size (K B ) Line (Byte) Entry page (KB) Alpha 21264 500 64 64 4096 64 128 8 MIPS R10000 195 32 32 4096 64 UltraSparc II 400 16 32 2048 64 64 8 UltraSparc III 750 64 32 4096 64 512 8 Pentium III 800 16 32 512 32 64 4 Pentium 4 1700 8 64 256 64 general, fully associative caches can be efficiently utilized without cache conflicts. However, to find whether the accessed data exists in the cache or not, the entire cache is searched, resulting in a long access latency. So, state-of-the-art caches are small set-associative or direct-m apped rather than fully associative or large set-associative to improve the access latency. Table 2.2 summarizes the memory hierarchy system in state-of-the-art uni-processor platforms. 2.1.5.2 M u lti-p rocessor platform s In this subsection, we briefly explain the IBM SP2 and the Cray T3E as examples of current representative HPC platforms. 18 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. IB M SP 2 IBM SP2 platform uses a Power 2 Super Chip (P2SC) microprocessor. The chip consists of an Instruction Cache Unit (ICU), a D ata Cache Unit (DCU), Dual Floating Point Units (FPU s), Dual Fixed Point Units (FXUs), and a Storage Control Unit. Since there are six processing units, the P2SC can issue up to six instructions in each clock cycle. There are 54 physical registers. The num ber of registers defined by the architecture is only 32. The extra registers (54-32=22 registers) are used for register renam ing which reduces the communication between the processor and cache. The size of the ICU is 32KB and the DCU is 64 to 256 KB. The DCU uses four way set associative dual ported caches. It can be configured either as a 256KB with 8-word memory bus or 128KB with a 4-word memory bus. There is no secondary cache. The peak memory bandw idth is 2.1 GB/sec. However, the manufacturer- supplied sustained memory bandw idth is 910 M B/sec. The data access times for different hierarchies are as follows: • Register: 0 clocks • Cache: 1 clock ® Cache miss: 18 clocks • TLB miss: 36-56 clocks The nodes of the IBM SP2 are interconnected by a bidirectional M ultistage Interconnection Network (MIN). The MIN allows interconnections between any pair of processors. Further, the communication between any pair of processors needs the same num ber of hops since the distance between any pair of processors is the same. A peak bidirectional bandw idth of up to 150MB/sec is supported on the newer machines. The message latency without software overhead is 500 nsec 19 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Stream s Memory Banks DEC ALPHA 21164 Control E-Registers Router Figure 2.3: Block diagram of Cray T3E node to 875 nsec. However, the latency th at is observed at the user level is much higher (tens of jusec). 
Cray T3E

The Cray T3E series is the second generation of scalable parallel processing systems from Cray Research built around COTS processors. The Cray T3E series is based on the DEC Alpha 21164 microprocessor with clock speeds ranging from 300 MHz to 600 MHz. The system logic runs at 75 MHz. The interconnection network is a 3-D torus that provides scalability of the system. Up to 2048 nodes can be configured in a single system. Each node of a Cray T3E system contains a processor, support circuitry, local memory, and a network router. A block diagram of a Cray T3E node is shown in Figure 2.3.

Figure 2.3: Block diagram of a Cray T3E node (DEC Alpha 21164, control logic, E-registers, stream buffers, memory banks, and router).

The DEC Alpha 21164 microprocessor can issue up to four instructions per clock period. The four concurrent instruction pipelines consist of: first and second integer pipelines (E0, E1), a floating-point add pipeline (FA), and a floating-point multiply pipeline (FM). The two floating-point pipelines allow a peak performance of 600 MFlops/sec to 1200 MFlops/sec, depending on the speed of the microprocessor. Each processor contains an 8 KB direct-mapped instruction cache (Icache), an 8 KB direct-mapped data cache (Dcache), and a 96 KB, 3-way set-associative, write-back, write-allocate secondary cache (Scache). The Dcache cache line is 32 bytes while the Scache cache line is 64 bytes.

The local memory is controlled by a set of four memory controller chips, directly controlling eight physical banks of DRAM. Each bank is organized into 64-bit words. The 64 bytes of an Scache line are spread across the eight banks, with each bank containing 8 bytes. There is a 32-bit path on each of the channels between the memory controller and the memory banks. This allows a theoretical maximum of 4 channels × 32 bits per channel × 75 MHz = 1.2 GBytes/sec possible bandwidth. The sustainable maximum is 80% of peak, or 960 MBytes/sec. The memory bandwidth is enhanced by six stream buffers. Each stream buffer can store up to two 64-byte Scache lines. These buffers automatically detect consecutive references to memory locations and prefetch data.

The Cray T3E processing elements (PEs) are connected by a high-speed, low-latency 3-D torus interconnection network. It is capable of peak interprocessor transfer rates of 500 MBytes/sec in each direction and up to 122 GBytes/sec of payload bisection bandwidth. The network operates asynchronously and independently of the PEs. Each node contains a network router, which is a crossbar switch connecting a PE port, an I/O port, and six network channels (one for each dimension and direction). The network router has bidirectional links in each dimension and can handle data transfers of up to 3.6 GBytes/sec. Routing can be either deterministic ordered routing or adaptive routing to avoid hot-spots and network contention.

Chapter 3

Data Redistribution between Processor Sets

Many HPC applications, including scientific computations and signal processing, consist of several computation stages [54, 91, 100]. As the program execution proceeds from one computational stage to another, the data access patterns and the number of processors required for exploiting the parallelism in the application may change.
These changes usually cause the data distribution in one stage to be unsuitable for a subsequent stage, resulting in an increase in the number of interprocessor communications. Therefore, the data distribution should be reorganized between two consecutive stages to reduce the number of remote accesses. This is called data redistribution [77, 78]. Since the parameters of the redistribution are generally unknown at compile time, the details of data redistribution must be computed at run time. Furthermore, data redistribution itself is an overhead affecting the overall performance, since it requires interprocessor communication. Therefore, it is necessary to develop efficient algorithms for data redistribution that maximize the performance gain of data redistribution.

In this chapter, we propose an efficient algorithm for array redistribution between processor sets. The communications required for array redistribution are represented in a tabular form. We show that the communication table of data redistribution can be converted to the generalized circulant matrix form. Our approach ensures that the number of communication steps is minimized and that there is no node contention during communications. We also show that our schedule and index computations are implemented using simple arithmetic computations. The schedule and index computation time is negligible compared with the data transfer time. Therefore, our approach is suitable for run-time data redistribution. Our technique can also be used to implement scalable redistribution libraries, to implement the REDISTRIBUTE directive in HPF [47], and to develop parallel algorithms for supercomputer applications.

The rest of this chapter is organized as follows. Section 3.1 defines the array redistribution problem and reviews prior work. Section 3.2 explains our table-based framework. It also discusses the generalized circulant matrix formalism for deriving conflict-free communication schedules. Section 3.3 explains our redistribution algorithm and index computation in detail and gives the proofs of their correctness and their complexity. Section 3.4 reports our experimental results on the IBM SP-2.

3.1 Background and Related Work

3.1.1 Problem definition

The block-cyclic distribution, cyclic(x), of an array is defined as follows: given an array with N elements, P processors, and a block size x, the array elements are partitioned into contiguous blocks of x elements each. The ith block, bi, consists of array elements whose indices vary from ix to (i + 1)x − 1, where i is a global block index and 0 ≤ i < ⌈N/x⌉. These blocks are distributed onto the P processors in a round-robin fashion: block bi is assigned to processor Pj, where j = i mod P.

In this chapter, we study the problem of redistributing from cyclic(x) on P processors to cyclic(Kx) on Q processors, which is denoted as ℜx(P, K, Q). Figure 3.1(c) shows ℜ2(3, 2, 6). Here, initially pairs of consecutive elements of the array are distributed over P = 3 source processors. Then, the block size is doubled and the new blocks (of size 4) are redistributed over Q = 6 destination processors.

Figure 3.1: Block-cyclic redistribution from the array point of view: (a) the array of N = 48 elements, (b) cyclic(x) on P processors, (c) from cyclic(x) on P processors to cyclic(Kx) on Q processors. In this example, x = 2, P = 3, Q = 6, and K = 2.
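The mapping just defined is computed block by block: under cyclic(x) the owner of global block i is i mod P, and after K consecutive blocks are coalesced into one block of size Kx, the new block ⌊i/K⌋ is owned by destination processor ⌊i/K⌋ mod Q. The short C sketch below (our own illustration, not code from this dissertation) makes the two mappings explicit.

/* Owners of global block i (block size x) under R_x(P, K, Q):
 * source processor under cyclic(x) on P processors, and destination
 * processor after regrouping into blocks of size K*x on Q processors. */
typedef struct { int src; int dst; } owners_t;

static owners_t block_owners(long i, int P, int K, int Q)
{
    owners_t o;
    o.src = (int)(i % P);          /* cyclic(x): block i -> processor i mod P    */
    o.dst = (int)((i / K) % Q);    /* cyclic(Kx): block i belongs to new block   */
                                   /* i/K, which goes to processor (i/K) mod Q   */
    return o;
}

For the example of Figure 3.1, block_owners(3, 3, 2, 6) returns src = 0 and dst = 1: block b3 starts on P0 and ends up on Q1.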
Table 3.1: An example of non all-to-all communication. P = 6, Q = 10, and K = 3.

Table 3.2: An example of all-to-all communication with the same message size. P = 6, Q = 10, and K = 4.

  Source \ Destination   0  1  2  3  4  5  6  7  8  9
  0                      2  2  2  2  2  2  2  2  2  2
  1                      2  2  2  2  2  2  2  2  2  2
  2                      2  2  2  2  2  2  2  2  2  2
  3                      2  2  2  2  2  2  2  2  2  2
  4                      2  2  2  2  2  2  2  2  2  2
  5                      2  2  2  2  2  2  2  2  2  2

Table 3.3: An example of all-to-all communication with different message sizes. P = 8, Q = 10, and K = 6.

  Source \ Destination   0  1  2  3  4  5  6  7  8  9
  0                      2  1  2  1  2  1  2  1  2  1
  1                      2  1  2  1  2  1  2  1  2  1
  2                      1  2  1  2  1  2  1  2  1  2
  3                      1  2  1  2  1  2  1  2  1  2
  4                      2  1  2  1  2  1  2  1  2  1
  5                      2  1  2  1  2  1  2  1  2  1
  6                      1  2  1  2  1  2  1  2  1  2
  7                      1  2  1  2  1  2  1  2  1  2

In performing ℜx(P, K, Q), three classes of communication patterns between source and destination processors arise. One case is the non all-to-all communication: every source processor communicates with only some of the destination processors. This case is shown in Table 3.1, where N = 30x, P = 6, Q = 10, and K = 3. Note that the messages are of equal size (1x). The second case is the all-to-all communication with the same message size. In Table 3.2, all the source processors have messages of the same size (2x) and each source processor communicates with all the destination processors. Here, N = 120x, P = 6, Q = 10, and K = 4. The last case is the all-to-all communication with different message sizes. This is shown in Table 3.3, where N = 120x, P = 8, Q = 10, and K = 6. Here half the messages are of size 1x while the others are of size 2x.

Figure 3.2: Steps in performing redistribution: (1) index computation, (2) schedule computation, (3) message pack/unpack, (4) data transfer over the interconnection network.

3.1.2 Cost of performing redistribution

We briefly explain the four costs associated with data redistribution.

Index Computation Cost: Each source processor determines the destination processor indices of the array elements that belong to it and computes the local memory location (local index) of each array element. This local index is used to pack array elements into a message. Similarly, each destination processor determines the source processor indices of received messages and computes their local indices to find out the location where the received message is to be stored. The total time to compute these indices is denoted as the index computation cost.

Schedule Computation Cost: The communication schedule specifies a collection of sender-receiver pairs for each communication step. Since in each communication step a processor can send at most one message and receive at most one message, careful scheduling is required to avoid contention while minimizing the number of communication steps. The time to compute this communication schedule can be significant. Reducing this cost is an important criterion in performing run-time redistribution.
Message Packing/Unpacking Cost: At each sender, a message consists of words from different memory locations which need to be gathered in a buffer in the sending node. Typically, this requires a memory read and a memory write operation to gather the data to form a compact message in the buffer. The time to perform this data gathering at the sender is the message packing cost. Similarly, at the receiving side, each message has to be unpacked and the data words need to be stored in the appropriate memory locations.

Data Transfer Cost: The data transfer cost for each communication step consists of a start-up cost and a transmission cost. The start-up cost is incurred by software overheads in each communication operation. The total start-up cost can be reduced by minimizing the number of message transfer steps. The transmission cost is incurred in transferring the bits over the network and depends on the network bandwidth.

3.1.3 Related work

The array redistribution problem has been the focus of several research efforts [18, 23, 44, 46, 81, 93, 94, 99]. Many of these efforts have targeted efficient implementation of high level compiler directives such as the REDISTRIBUTE directive in HPF [47].

In [81], Banerjee et al. use a line segment formalism called PITFALLS to represent the cyclic(x) distribution. The array elements that map to a processor are represented as a set of stride line segments. For every pair of processors, the array elements whose indices are in the intersection of the respective line sets are exchanged. Their technique handles arbitrary source and destination processor sets as well as multidimensional arrays. However, PITFALLS does not consider communication scheduling during redistribution. Their main contribution is to determine the elements to be exchanged.

Other previous studies [18, 44, 46, 53, 93, 94, 99] consider redistribution from cyclic(x) to cyclic(Kx) on the same set of processors. This is a classical redistribution problem. Techniques to compute index sets are proposed in [44, 93, 94]. However, explicit scheduling of the communication to minimize the overall redistribution time is not considered.

In [93, 94], Choudhary et al. present efficient index computation algorithms for the special case when P mod K = 0. They also consider redistribution from cyclic(x) to cyclic(y) on the same processor set, for arbitrary x and y. Although it is possible to calculate explicitly the destination and source processor of each element of the local array, such a scheme is expensive for use in practice as potential node contentions occur in each communication step. In [94], gcd and lcm methods have been proposed. These are two-phase algorithms where cyclic(x) is first redistributed to cyclic(m), followed by a redistribution from cyclic(m) to cyclic(y). Here, m can be the gcd or lcm of x and y. In [94], it is shown that multidimensional arrays can be redistributed by applying these algorithms to each dimension of the array separately.

In [44], Ni et al. provide new logical processor numbers (lpids) for the target distribution, so as to minimize the amount of data to be communicated during redistribution. Data blocks that have the same lpids across source and target distributions need not be moved. However, index set computation becomes complicated. This approach is not applicable when the numbers of source and target processors are different.

Sadayappan et al.
[46] and Walker et al. [99] have proposed algorithms which reduce communication overheads. In [99], a K-step schedule is derived for cyclic(x) to cyclic(Kx) redistribution on the same processor set. In each step, processors exchange data in a contention-free manner: each processor sends its data to exactly one processor and receives data from exactly one processor. Therefore contention at the destination nodes does not occur. A similar communication schedule is shown in [46]. Although the communication schedule presented in [99] is based on modular arithmetic and that in [46] is based on tensor products, the resulting communication schedules are similar.

For redistribution between different processor sets, the Caterpillar algorithm was proposed in [80]. It uses a simple round-robin schedule to avoid node contention. However, this algorithm does not fully utilize the network bandwidth, i.e., the size of the data sent by the nodes in a communication step varies from node to node. This leads to an increased data transfer cost.

In [23], Desprez et al. propose a general solution for block-cyclic redistribution on arbitrary source and destination processor sets. They can handle redistribution from cyclic(x) on P processors to cyclic(y) on Q processors, where x, y, P, and Q are arbitrary positive integers. Their method to compute the communication schedule uses bipartite matching. They propose two strategies, a stepwise strategy and a greedy strategy. In the stepwise strategy, an attempt is made to minimize the number of communication steps but the total data transfer cost is not optimized. In the greedy strategy, the total transmission cost is optimized but the number of communication steps is not minimized. Even though the data transfer time is reduced, the time to compute the communication schedule using bipartite matching is significant: the schedule computation cost is O((P + Q)^4). As the number of processors increases, the schedule computation time can be larger than the data transfer cost. To the best of our knowledge, the fastest scheme to find a maximum matching in a unit weight bipartite graph has O(V^(1/2) E) complexity, where V is the number of nodes and E is the number of edges [24, 67, 88]. For our problem, since there can be lcm(P, KQ) such matchings, the complexity to compute the schedule will be O((P + Q)^3.5). Although this complexity is better compared with the general case, it is still expensive to compute these graph matchings.

In [18], Y. C. Chung et al. have proposed an index computation method for redistribution from cyclic(x) to cyclic(y) on the same processor set. They proposed the basic-cycle calculation (BCC) technique, which gives closed forms for the source/destination processor indices of array elements. These closed forms are useful to efficiently determine the communication sets of a basic cycle. The communication schedule is not considered in this technique; therefore, node contention problems exist in the redistribution communications.

3.2 Our Approach to Redistribution

In this section, we present our approach to the block-cyclic redistribution problem. In subsection 3.2.1, we discuss two views of redistribution and illustrate the concept of a superblock.
In the following subsection, we explain our table-based framework for redistribution using the destination processor table and column and row reorganizations. In subsection 3.2.3, we discuss the generalized circulant matrix formalism, which allows us to compute the communication schedule efficiently.

3.2.1 Array and processor points of view

Figure 3.1 shows ℜ2(3, 2, 6) from the array point of view. The elements of the array are shown along a single horizontal axis. The processor indices are marked above each block. For the redistribution ℜx(P, K, Q), a periodicity can be found in the block movement pattern. For example, in Figure 3.1, b0, b3, b6, and b9, which are initially assigned to P0, are moved to Q0, Q1, Q3, and Q4, respectively. After that, b12, also in P0, is moved to Q0 again. We find that the communication pattern of blocks b0 to b11 is repeated on the following blocks. Such a collection of blocks is called a superblock. The period of this block movement pattern is lcm(P, KQ), and is the size of the superblock. In Figure 3.1, the superblock size is lcm(3, 2 × 6) = 12. In the next superblock, blocks b12 to b23 are moved in the same fashion.

From the processor point of view, the block-cyclic distribution can be represented by a 2-dimensional table. Each column corresponds to a processor and each row index is a local block index. Each entry in the table is a global block index. Therefore, element (i, j) in the table represents the ith local block of the jth processor. Figure 3.3 shows the example of ℜ2(3, 2, 6) from the processor point of view. Blocks are distributed on the table in a round-robin fashion. The table corresponding to the source processors is denoted as the initial layout, representing which blocks are initially assigned to which source processors. Similarly, the final layout represents which blocks are assigned to which destination processors. Our problem is to redistribute the blocks from the initial layout to the final layout. These layouts are shown in Figure 3.3(a) and (d), respectively.

The initial layout can be partitioned into collections of rows. Each collection consists of Ls = lcm(P, KQ)/P rows. Similarly, the final layout can be partitioned into disjoint collections of rows, each collection having Ld = lcm(P, KQ)/Q rows. Note that each collection corresponds to a superblock. Blocks that are located at the same relative position within a superblock are moved in the same way during the redistribution. These blocks can be transferred in a single communication step; the MPI derived data type can handle them as a single block. Without loss of generality, we will consider only the first superblock in the following to illustrate our algorithm. We refer to the tables representing the indices of the blocks within the first superblock in the initial (final) layout as the initial distribution table Dinit (final distribution table Dfinal). These are shown in Figure 3.3(c) and (f), respectively. The cyclic redistribution problem essentially involves reorganizing blocks within each superblock from an initial distribution table Dinit to a final distribution table Dfinal.
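Since the whole redistribution is periodic with period lcm(P, KQ), the table sizes used throughout the rest of the chapter follow directly from P, K, and Q. The following C helper (an illustration, not taken from the dissertation's code) computes the superblock size and the per-processor row counts Ls and Ld.

/* Superblock parameters of R_x(P, K, Q): the block movement pattern repeats
 * every L = lcm(P, K*Q) global blocks; each source processor owns Ls = L/P
 * blocks of a superblock and each destination processor owns Ld = L/Q.     */
static long gcd_l(long a, long b) { while (b) { long t = a % b; a = b; b = t; } return a; }
static long lcm_l(long a, long b) { return a / gcd_l(a, b) * b; }

static void superblock_sizes(int P, int K, int Q, long *L, long *Ls, long *Ld)
{
    *L  = lcm_l(P, (long)K * Q);
    *Ls = *L / P;
    *Ld = *L / Q;
}

For ℜ2(3, 2, 6) this gives L = 12, Ls = 4, and Ld = 2, matching the superblock of Figure 3.1 and the table sizes of Figure 3.3.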
Figure 3.3: Block-cyclic redistribution from cyclic(x) on P processors to cyclic(Kx) on Q processors from the processor point of view. In this example, P = 3, Q = 6, and K = 2: (a) initial layout, (b)-(c) initial distribution table Dinit for one superblock, (d) final layout, (e)-(f) final distribution table Dfinal for one superblock.

3.2.2 A table-based framework for redistribution

Given the redistribution parameters P, K, and Q, each block's location in Dinit and Dfinal can be determined. Through redistribution, each block moves from its initial location in Dinit to its final location in Dfinal. Thus, the processor ownership and the local memory location of each block are changed by redistribution. This redistribution can be conceptually considered as a table conversion process from Dinit to Dfinal, which can be decomposed into independent column and row reorganizations. In a column reorganization, blocks are rearranged within a column of the table. This is therefore a local operation within a processor's memory. In a row reorganization, blocks within a row are rearranged. This operation therefore leads to a change in ownership of the blocks, and requires interprocessor communication.

The destination processor of each block in the initial distribution table is determined by the redistribution parameters and its global block index. A send communication events table is constructed by replacing each block index in the initial distribution table with its destination processor index, as shown in Figure 3.4. This table is denoted as the destination processor table (dpt) T.

Figure 3.4: The destination processor table T: each global block index of Dinit is replaced by the index of the destination processor that receives that block.

The (i, j)th entry of T is the destination processor index of the ith local block in source processor j, where 0 ≤ i < Ls, i.e., T considers only one superblock. It is an Ls × P matrix. Each row corresponds to a communication step. On the other hand, each destination processor has to know from which source processors it receives messages. The source processor index of each block in the final distribution is also determined by the redistribution parameters and its global block index. A receive communication events table can be constructed by replacing each global block index in Dfinal with its source processor index.
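Because Dinit(i, j) = iP + j, the dpt can be filled in directly. The C sketch below (illustrative only; the function name and the row-major layout are our own choices) builds T for one superblock.

/* Build the destination processor table T (Ls rows, P columns, row-major).
 * Dinit(i, j) = i*P + j is the global index of the i-th local block on
 * source processor j; its destination processor is (Dinit(i,j)/K) mod Q. */
static void build_dpt(int P, int K, int Q, long Ls, int *T)
{
    for (long i = 0; i < Ls; i++)
        for (int j = 0; j < P; j++)
            T[i * P + j] = (int)(((i * P + j) / K) % Q);
}

For ℜ2(3, 2, 6) (Ls = 4) this reproduces the table of Figure 3.4: row 0 is (0, 0, 1), row 1 is (1, 2, 2), row 2 is (3, 3, 4), and row 3 is (4, 5, 5).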
In oirr algorithm, during a communication step, a processor sends data to at most one destination processor. If Q > P, at most 34 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. P processors in the destination processor set can receive data and the other des tination processors rem ain idle during th a t communication step. Therefore, each communication step can have at most P communicating pairs. On the other hand, if Q < P, only Q destination processors can receive data at a tim e. The m axim um number of communicating pairs in a communication step is m m (P , Q). W ithout loss of generality, in the following discussion we assume th at Q > P. Figure 3.5 shows our table-based framework for redistribution. To convert the initial distribution table D injt to the final distribution table Dfinai, dpt T can be used. But, the use of T itself as a communication schedule is not efficient. It leads to node contention, since several processors try to send their data to the same destination processor in a communication step. For example, in Figure 3.5, during step 0 , both source processors 0 and 1 try to communicate with destina tion processor 0. However, if every row of T consists of P distinct destination processor indices among {0 , 1,•••,Q — 1}, node contention can be avoided in each communication step. This is the m otivation for the column reorganizations. To eliminate node contention, the dpt T is reorganized by column reorganiza tions. The reorganized table is called a send communication schedule table, C send- In section 3.3, we discuss how these reorganizations are performed. C send is a Ls x P m atrix as well. Each entry of C send is a destination processor index and each row corresponds to a contention-free communication step. To m aintain the correspondence between D inn and T , the same set of column reorganizations is ap plied to D init which results in a distribution table, D-nit corresponding to C send- In a communication step, blocks in a row of D(nit are transferred to their des tination processors specified by the corresponding entries in C send- Referring to Figure 3.5, in the first communication step, source processors 0, 1 and 2 transfer blocks 0, 4 and 2 to destination processors 0, 2 and 1 respectively as specified bv Cgend- Such a step is called row reorganization. The distribution table Dflnal corresponding to the received blocks in destination processors is reorganized into 35 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. the final distribution table Dfinai by another set of column reorganizations. (For this example, we do not need this operation.) The received blocks are then stored in the memory locations of the destination processors. The key idea is to choose & Csend such th a t the required row reorganizations(com m unication events) can be performed efficiently and it supports easy-to-compute contention-free communica tion scheduling. So far, we have discussed a redistribution problem from cydic(x) on P pro cessors to cyclic(Kx) on Q processors. A dual relationship exists between the problem from cyclic(x) on P processors to cydic(Kx) on Q processors and the problem from cyclic(Kx) on Q processors to cyclic(x) on P processors. The re distribution from cyclic(Kx) on Q processors to cyclic(x) on P processors is the redistribution with reverse direction of the redistribution §lx(P, K, Q). 
Its send (receive) communication schedule table is the same as the receive (send) communication schedule table of ℜx(P, K, Q). Therefore, our scheme for ℜx(P, K, Q) can be extended to the redistribution problem from cyclic(Kx) on Q processors to cyclic(x) on P processors.

In the following subsection, we define the generalized circulant matrix. Using the generalized circulant matrix, we derive an efficient contention-free communication schedule for ℜx(P, K, Q).

3.2.3 Communication scheduling using a generalized circulant matrix

Our framework for communication scheduling performs local rearrangement of data within each processor as well as interprocessor communication. The local rearrangement of data, which we call column reorganization, results in a send communication schedule table Csend. We will show that for any P, K, and Q, the send communication schedule is indeed a generalized circulant matrix, which avoids node contention.

Definition 1: An m × n matrix is a circulant matrix if it satisfies the following properties:

1. If m ≤ n, row k = row 0 circularly right shifted k times, 0 ≤ k < m.
2. If m > n, column l = column 0 circularly down shifted l times, 0 ≤ l < n.

Figure 3.6: Circulant matrix examples: (a) m = n = 4, (b) m = 3, n = 4, (c) m = 4, n = 3.

Figure 3.6 shows several different circulant matrices. Note that the above definition can be extended to block circulant matrices by changing "row" to "row block".

Definition 2: An M × N matrix is a generalized circulant matrix if the matrix can be partitioned into blocks of size m × n, where M = s · m and N = t · n for some s, t > 0, such that the block matrix forms a circulant matrix and each block is either a circulant matrix or a generalized circulant matrix.

Figure 3.7: Generalized circulant matrix: an M × N matrix partitioned into m × n circulant submatrices that themselves form a circulant block matrix.

Figure 3.7 illustrates a generalized circulant matrix. There are two observations about the generalized circulant matrix: (i) the s blocks along each block diagonal are identical, and (ii) if all the elements in row 0 are distinct, then in each row all elements are distinct.

We will show that through our approach the destination processor table T is transformed to a generalized circulant matrix Csend with distinct elements in each row.

3.3 Efficient Redistribution Algorithms

In this section, we discuss the communication patterns that arise in performing array redistribution, develop the algorithms for our table-based approach to obtain the send schedule, and present correctness proofs of our techniques.

3.3.1 Communication patterns of redistribution

Consider the movement of the global block i in the redistribution ℜx(P, K, Q). It is the mth local block of source processor p, where m = ⌊i/P⌋ and p = i mod P:

    i = P × m + p.                                            (3.1)
i = ( Qx n + q )x K + k (3-2) We can derive the following redistribution equation. i = P x m + p = ( Q x n + q ) x K Jt-k (3-3) For example, consider the 16th global block in ^ ( 3 ,2 ,6 ) of Figure 3.1. Here, i = 16, P = 3, p = 1 and m = 5. All block indices start from 0. Also, K = 2, k — 0 , ix = 8 , n = 1 and q — 2 . Let L = /cm(P, KQ) and G — gcd(P, K Q ). As discussed in Section 3.2.1, global blocks i and L + i are m apped to the same source processor p and redistributed to the same destination processor q, because L is a least common multiplier of P and KQ. The boundaries for Eq. (3.3) are 0 < p < P 0 < q < Q (3.4) 0 < k < I< From the redistribution equation, we can classify communication patterns into 3 classes for the redistribution problem $lx(P, K , Q ) (See [23] for an alternative for mulation for the cyclic(x) to cyclic(y) problem) according to the following Lemma. L e m m a 1 The communication pattern induced by §tx(P, K, Q) requires: (i) non all-to-all comm unication if G > K, (it) all-to-all communication with the same 39 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. message size if K — aG, where a is an integer greater than 0, and (in) all-to-all communication with different message sizes if G < K and K aG. P ro o f: The redistribution equation, Eq. (3.3), can be rew ritten as p — Kq — (nQK — mP) + k (3-5) which can be expressed as p — Kq = AG + k because nQ K — m P is a m ultiple of G and A is an arbitrary integer. For any p and q, if there is at least one block satisfying the above equation, this redistribution requires all-to-all communication. When both sides are divided by G, their rem ainders are (p — Kq) mod G = k m od G (3-6) Assume th at G < K. There is at least one k satisfying Eq. (3.6), since k is between [0, IT — 1]. Note that if K = aG for any integer a > 0, k is between [0, aG — 1]. Thus, for the expression k mod G in Eq. (3.6), a distinct numbers result in the same remainder. Therefore, Eq. (3.6) has a solutions for any (p , q) pair. This redistribution requires all-to-all communication with the same message size. In the initial distribution table, a blocks in a column are transfered to the same destination processor. If K > G and K aG, then it requires all-to-all communication with different message sizes. Assuming th at G > I\, we can find a processor pair (p,q) exchanging no message. W hen (p — Kq) mod G > K, none of k satisfies Eq. (3.6). For example, consider p — 0 and q = Q — 1 pair which gives (p — Kq) mod G — K. The right hand side has a maxim um value of K — 1. Therefore, processor 0 doesn’t send any message to processor Q — 1. □ Among these three cases, the case of all-to-all processor communication with the same message size can be optimally scheduled using a trivial round-robin schedule. However, it is non trivial to achieve the same message size between all pairs of nodes 40 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. in a communication step for all-to-all case with different message sizes. Therefore, we focus on the two cases of redistribution requiring scheduling of non all-to-all communication and all-to-all communication with different message sizes. 3.3.2 N o n a ll-to -a ll co m m u n ica tio n We first explain how dpt T is transform ed to the send communication schedule table C g e n d - Given the redistribution param eters P, Q , and K, we get the Ls x P initial distribution table Di„it and its dpt T . 
3.3.2 Non all-to-all communication

We first explain how the dpt T is transformed to the send communication schedule table Csend. Given the redistribution parameters P, Q, and K, we obtain the Ls × P initial distribution table Dinit and its dpt T. Let G1 = gcd(P, K), K1 = K/G1, and P1 = P/G1. In the dpt, every K1th row has a similar pattern, but with different destination processor indices. We shuffle the rows such that rows having a similar pattern are adjacent, resulting in the shuffled dpt T1. The shuffled dpt T1 is divided into Q1 slices in the row direction, where Q1 = Ls/K1, and into P1 slices in the column direction. Now the dpt T1 can be considered as a K1 × P1 block matrix made of Q1 × G1 submatrices. This block matrix is then converted into a generalized circulant matrix by reorganization of submatrices within columns and by rotating individual columns within each submatrix by appropriate amounts. This generalized circulant matrix can then be used as the communication schedule table Csend. In this procedure, the K identical values in row 0 of the dpt are distributed to K distinct rows, and hence row 0 has distinct values. Since Csend is a generalized circulant matrix, all rows then consist of distinct values and we achieve a conflict-free schedule. In the above reorganizations, an element is moved only within its column; therefore, it does not incur any interprocessor communication.

Figure 3.8 shows an example where the dpt T of ℜx(6, 4, 9) is converted to the generalized circulant matrix form Csend by column reorganizations. In this example, Ls = 6, G1 = 2, K1 = 2, P1 = 3, and Q1 = 3. Figure 3.8(a) shows the initial distribution table Dinit, and (d) shows the corresponding dpt T. Rows of Dinit and T are shuffled, as shown in Figure 3.8(b) and (e). Now we can partition the shuffled tables into submatrices of size 3 × 2. The diagonalization of submatrices and the diagonalization of elements within each submatrix are shown in Figure 3.8(c) and (f). Figure 3.8(f) is a generalized circulant matrix, Csend.

Figure 3.8: Steps of column rearrangement for ℜx(6, 4, 9): (a) Dinit, (b) Dinit1, (c) D′init, (d) T, (e) T1, (f) Csend.

In the following theorem, we formally show that this procedure correctly obtains a generalized circulant matrix, which is the send communication schedule table Csend.

Theorem 1: The initial dpt T of ℜx(P, K, Q) with P < Q can be reorganized via column reorganizations to a send communication schedule table Csend such that (i) Csend is a generalized circulant matrix and (ii) every row of Csend has P distinct numbers from the set {0, 1, 2, ..., Q − 1}.

Proof: In the initial distribution table Dinit, each element Dinit(a, j) is assigned a global block index aP + j, where 0 ≤ a < Ls and 0 ≤ j < P. The corresponding dpt T of Dinit is obtained by replacing each element by ⌊Dinit(a, j)/K⌋ mod Q.
First, we show that the dpt T can be converted to a generalized circulant matrix Csend. We then show that every row of Csend consists of P distinct numbers from the set {0, 1, 2, ..., Q − 1}, so Csend can be used as a send communication schedule table. The following are the two major steps in our approach to converting the dpt T to a generalized circulant matrix.

1. Stage 1: We prove that the dpt T corresponding to the initial distribution table Dinit has some rows with similar patterns. These rows are brought together in this stage, resulting in the intermediate dpt T1. The corresponding operations are performed on Dinit, resulting in the intermediate distribution table Dinit1. T1 and Dinit1 can be represented in block matrix form.

2. Stage 2: By considering T1 as a block matrix, we prove that by reorganization of submatrices within columns and circular shifts of elements within columns of submatrices, we obtain a generalized circulant matrix, which is our send communication schedule table Csend.

We now show mathematically that these reorganizations can be performed correctly.

Stage 1 (Dinit to T1): In ℜx(P, K, Q), the initial distribution table Dinit and its dpt T have Ls rows and P columns. With G1 = gcd(P, K) and K1 = K/G1, we observe that rows a and a mod K1 of Dinit contain entries that are congruent modulo K:

    Dinit(a, j) − Dinit(a mod K1, j) = ⌊a/K1⌋ K1 P ≡ 0 (mod K).                (3.7)

There are exactly Q1 rows such that a mod K1 = a1, where 0 ≤ a1 < K1. The corresponding Q1 rows in the dpt T have the same pattern, but their destination processor indices are different. In the first stage, these rows are gathered by moving row a to row a′ = (a mod K1)Q1 + ⌊a/K1⌋, where 0 ≤ a′ < Ls. In Dinit1, for a given a′ the global index of an element can be determined by first mapping back to the original row a = (a′ mod Q1)K1 + ⌊a′/Q1⌋ = a′2 K1 + a′1, where a′1 = ⌊a′/Q1⌋, a′2 = a′ mod Q1, 0 ≤ a′1 < K1, and 0 ≤ a′2 < Q1. After this row shuffling, any element Dinit1(a′, j) is given by

    Dinit1(a′, j) = Dinit((a′ mod Q1)K1 + ⌊a′/Q1⌋, j) = Dinit(a′2 K1 + a′1, j) = (a′2 K1 + a′1)P + j.

With P1 = P/G1, Dinit1 can be considered as a K1 × P1 block matrix in which each submatrix is a Q1 × G1 matrix. Let us denote these submatrices by A_{a′1, j1}, where 0 ≤ a′1 < K1 and 0 ≤ j1 < P1. Dinit1 can be represented as

    Dinit1 = [ A_{0,0}      A_{0,1}      ...  A_{0,P1−1}
               A_{1,0}      A_{1,1}      ...  A_{1,P1−1}
               ...
               A_{K1−1,0}   A_{K1−1,1}   ...  A_{K1−1,P1−1} ].                  (3.8)

Let j1 = ⌊j/G1⌋ and j2 = j mod G1 for any j. Recall that the global block index of any element is Dinit1(a′, j) = (a′2 K1 + a′1)P + j, which can be written as (a′1 P1 + j1)G1 + (a′2 K1 P + j2) by replacing j by j1 G1 + j2. Now consider a submatrix A_{a′1, j1}. Element Dinit1(a′, j) is in submatrix A_{a′1, j1} and is located at position (a′2, j2) within it. This submatrix can be written as

    A_{a′1, j1} = (a′1 P1 + j1)G1 + [ 0              1                ...  G1 − 1
                                      K1 P           K1 P + 1         ...  K1 P + G1 − 1
                                      ...
                                      (Q1−1)K1 P     (Q1−1)K1 P + 1   ...  (Q1−1)K1 P + G1 − 1 ].   (3.9)

The corresponding dpt T1 of Dinit1 is obtained by replacing each element by ⌊Dinit1(a′, j)/K⌋ mod Q. This can again be represented in block matrix form as

    T1 = [ B_{0,0}      B_{0,1}      ...  B_{0,P1−1}
           B_{1,0}      B_{1,1}      ...  B_{1,P1−1}
           ...
           B_{K1−1,0}   B_{K1−1,1}   ...  B_{K1−1,P1−1} ].                      (3.10)
The expanded form of a submatrix is

    B_{a′1, j1} = ( ⌊(a′1 P1 + j1)/K1⌋ + [ 0          0          ...  0
                                           P1         P1         ...  P1
                                           ...
                                           (Q1−1)P1   (Q1−1)P1   ...  (Q1−1)P1 ] ) mod Q.          (3.11)

Thus, we have completed the proof for Stage 1. Further, we can see that each block matrix has a base value and a fixed offset matrix.

Stage 2 (T1 to Csend): We can transform T1 to Csend by reorganizing submatrices within columns of the block matrix and circularly shifting elements within columns of submatrices. First we show that the Q1 × G1 block matrices can be reorganized to obtain a block-wise circulant matrix. Next, we show that the elements within a submatrix can be converted to a circulant matrix.

Now consider the base b = ⌊(a′1 P1 + j1)/K1⌋ of B_{a′1, j1} in Eq. (3.11), where 0 ≤ b < P1. We refer to each collection of K1 adjacent submatrices in row-major order as a run. Run r contains the submatrices whose base value is r. Through column reorganizations of the block matrix T1, the kth submatrix of a run moves to the kth block row in its column, where 0 ≤ k < K1. Thus, the K1 submatrices of each run are aligned into a diagonal. In this column reorganization of the block matrix, submatrix B_{a′1, j1} in T1 moves to C_{i1, j1} in Csend. The relationship between a′1 and i1 is as follows:

    i1 = (a′1 P1 + j1) mod K1.                                                  (3.12)

The resultant block matrix is a block-wise circulant matrix:

    Csend = [ C_{0,0}      C_{0,1}      ...  C_{0,P1−1}
              C_{1,0}      C_{1,1}      ...  C_{1,P1−1}
              ...
              C_{K1−1,0}   C_{K1−1,1}   ...  C_{K1−1,P1−1} ].                   (3.13)

Similarly, we can convert each submatrix to a circulant matrix. Elements within the j2th column of submatrix B_{a′1, j1} are circularly shifted down by j2 positions. The relationship between a′2 and i2 is as follows:

    i2 = (a′2 + j2) mod Q1.                                                     (3.14)

Each reorganized submatrix then has the form

    C_{i1, j1} = ( ⌊(a′1 P1 + j1)/K1⌋ + [ 0           (Q1−1)P1    ...  (Q1−G1+1)P1
                                          P1          0           ...  (Q1−G1+2)P1
                                          ...
                                          (Q1−1)P1    (Q1−2)P1    ...  (Q1−G1)P1 ] ) mod Q.        (3.15)

The resulting matrix is a block-wise circulant matrix and its submatrices are also circulant matrices. Therefore, it is a generalized circulant matrix. This matrix can be used as the send communication schedule table Csend. The same column
Since C send is a generalized circulant m atrix, every row of C send consists of P distinct destination processor indices. □ (Qi — l ) P i P + 1 ATP (Qx - l)KxP (Qi — 2 )A jP + 1 47 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. A 0 1 2 3 4 5 A c 2 0 1 0 1 0 1 0 0 i 25 8 33 4 29 0 | 0 8 8 -i .4 0 0 25 0 25 0 25 1 12 1 20 9 1 (> 5 0 0 0 8 8 4 4 1 12 1 12 1 12 1 2 24 13 32 21 28 17 0 o 8 8 4 4 2 24 13 24 13 24 13 3 c p 27 10 35 — £ 6 2 2 10 10 + 0 0 25 0 25 0 25 4 18 7 14 3 22 11 1 c. G 2 2 10 10 1 12 1 12 1 12 1 5 3iJ 19 15 S 4 23 6 6 2 2 10 10 2 24 13 24 13 24 13 (a) D'jni (b) Djbasa (c) ^offset A 0 1 2 3 4 5 A 0 1 2 0 1 0 1 0 1 0 0 6 2 8 1 7 0 0 2 2 1 0 0 6 0 6 0 6 1 3 0 5 2 4 1 0 0 0 2 2 1 1 1 3 0 3 0 3 0 2 6 3 8 5 7 4 0 0 2 2 < 1 2 6 3 6 3 6 3 3 1 7 0 6 2 8 — 1 1 0 j 0 2 2 + 0 0 6 0 6 0 6 4 4 1 3 0 5 2 1 1 1 0 ! 0 2 2 1 3 0 3 0 3 0 5 7 4 6 3 8 5 1 1 0 ! o 2 2 2 6 3 6 3 6 3 (d) Csen{] (e) Cfrase (f) Coffset Figure 3.9: Decomposition of D-nit and C send The send communication schedule table C sencj is K x x Px block m atrix. Each block is Qi x Gx subm atrix. The subm atrix of Eq. (3.17) consists of two parts. These can be referred as base and offset respectively. The row and column of the base are indexed as ix and j x respectively. Also, the row and column of the offset are indexed as z2 and j 2 respectively. Thus, entries of the base m atrix are independent of z2 and j 2. All entries within Qx x Gx subm atrix of C base have the same value, given by (( ° i ) mo(j Q — Similarly, the entries of the offset are independent of ix and j\. Therefore, all Qx x G\ subm atrices of C offset are identical to one another. Each is a Qx x Gx circulant m atrix. Figure 3.9 shows the base and the offset of D-nit and C send for ^ ( 6 ,4 ,9 ) . W ith reference to Figure 3.5, observe that C send helps to specify the row reor ganizations th at convert D(nit to Dfinal- The initial column reorganizations convert 48 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Dinit to D-nit and T to C send- Although Section 3.2 indicates the column reor ganizations of data w ithin a processor, this operation can be expensive for large array sizes. Instead, the reorganization can be done by m aintaining pointers to the elements of the array. Each source processor has a table which points to the data blocks to be packed in a communication step. It is denoted as send data location table D send- Each entry of D send is the local block index of the corresponding entry of D-nit. Therefore, D send(?,j) = D-nit(i,j)/P. Each entry of C send, C send(i, j) , points to the destination processor of the corresponding entry of D send, D send(b j)- Our scheme computes the schedule and data, index set at the same time. The receive communication schedule table C recv and the receive data location table D recv can be computed in a similar way. They are not discussed further. Theorem 2 gives the formulae to compute the individual entries of C send and Dgend efficiently. Referring to Figure 3.9, in the communication schedule table, Csend, i indices represent the communication steps and j indices represent the source processor indices. Each entry C send(i, j) refers to the destination processor to which the j th processor communicates in the ith communication step. In the following theorem, A, jx- and j 2 refer to the indices defined in the proof of Theorem 1. (See Eq. (3.12) and Eq. 
(3.14)) T h e o re m 2 Given redistribution parameters, K , P , and Q, let G\ = gcd{P, K), Pi = P/Gi, K\ = K /G i, C ?2 = gcd(Pi,Q), and Q i = Q/Gv. A send communica tion schedule table C send the send data location table D send in the generalized circulant matrix form can be computed as follows: Csend if, j ) = { M i l - A)} mod Px + {(i2 - j 2) mod Qi}Pi} mod Q( 3.18) D S end(*,i) = {m{ji - A)} mod K x + {(i 2 - j 2) mod Qi}Kx (3.19) where n and m are solutions of nK\ — mPx = 1. 49 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. P ro o f: From the proof of Theorem 1, the (i,j)th entry of D-nit is shown as follows. D i n i t ( h i ) — D i n i t i ( a ' , i ) — (a[Pi + ji)G i + (a'2I<iP + 3 2 ) From Eq. (3.12) and Eq. (3.14), ii — {a'xPi + ji) mod K\ *2 = (®2 T 3 2) mod Q\ Let t = (a^Pi +ji). From Eq. (3.21), t = X K \ + i\ — Y P \+ ji where 0 < X < Pi and 0 < Y < Pi. Hence, X R \ - YP\ = (jx - ix) We have to solve a Diophantine equation to find X and Y. Since gcd(K\ , we can find m and n using the Euclid algorithm such th at nK\ — mP\ — 1 From Eq. (3.23) and Eq. (3.24), X - {n{jx - h)} mod Px Y = {m(ji — H)} mod K\ (3.20) (3.21) (3.22) (3.23) ’ i) = 1, (3.24) (3.25) (3.26) Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Eq. (3.22) becomes a2 = (i2 — jf) mod Q\. By replacing a[ and a'2 with i\ and i2, Eq. (3.20) can be rew ritten as follows D init( b i) = {XKi + i\)G\ + {(*2 — J2) mod Q x)K\P + j2 Therefore, C send(i, j) = (D-nit(i, j) / K ) mod Q gives Eq. (3.18) since i\G\ + ji < K. Similarly, we can prove Eq. (3.19). □ The above formulae for computing the communication schedule and index set for redistribution are extremely efficient compared w ith the methods presented in [23], which use a bipartite m atching algorithm. Furtherm ore, using our formulae, each processor computes only entries th at it needs in its send communication schedule table. Hence, the schedule and index set com putation can be performed in a distributed way and the total cost of com puting the schedule and index set is 0(max(P,Q)). Our scheme minimizes the num ber of communication steps and avoids node contention. In each communication step, equal-sized messages are transferred. Therefore, our scheme minimizes the data transfer cost. C o ro lla ry 1 In the proposed redistribution algorithm, the costs of index set and communication schedule computations are 0(max(P,Q)). The amortized cost to compute a step in the communication schedule and index set computation is 0(1 ). 3.3.3 A ll-to -a ll com m u n ication The all-to-all communication case arises if G(= G\G2) < K as stated in Lemma 1, where G\ = gcd(P,K) and G2 = gcd{P\,Q). From the first, superblock, the dpt T is constructed. The dpt T is a Ls x P m atrix, where Ls = K\Q\. Since Q = Q\G2 and G2 < Ah, the num ber of rows in dpt T Ls > Q. Therefore, each column has more entries than Q destination processors. In each column, several blocks are transferred to the same destination. The column reorganizations as stated in Section 3.3.2 are applied to the dpt T , which results in a generalized circulant 51 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. p, p2 p2 Pi p3 Destination Processor Table dpt(T) (LsxP Matrix) 0 0 0 0 1 0 0 ; 3 ? 1 1 1 1 2 2 1 4 1 ■ ! 
5 2 2 3 3 3 2 ■ f i 5 1 1 s 3 4 4 4 5 3 1 0 3 2 4 5 5 0 0 — ► 4 S 4 1 1 1 5 0 1 1 1 5 3 2 ■ « 4 6 2 2 2 3 6 2 1 0 3 7 3 3 4 4 7 0 5 4 1 8 4 5 5 5 8 4 3 2 5 Column Transformed dpt (Lsx P Matrix) Po Pi P2 P3 1 1 3 2 1 i 1 111 3 4 3 i 0 3 1 . 1 1 4 2 1 s n h—- 4 — - message size = 2 message size = 1 Send Communication Schedule Table (Csend) Figure 3.10: Example illustrating an all-to-all case with different message sizes: £*(4,3,6) m atrix and is a Kx x Px circulant block m atrix. Each block is a Qx x G % subm atrix which is also a circulant m atrix. In the block m atrix, the first G 2 blocks in each column are distinct. Blocks in every Gty row have the same entries but different circular-shifted patterns. These blocks can be folded onto the blocks in their first row. Therefore, the first G2 rows in the block m atrix only are used in determ ining a send communication schedule table C send- It is a Q x P generalized circulant m atrix. Since blocks in every Gtj1 row are folded onto blocks in their first row, for the all-to-all communication case with different message sizes, blocks in the first (Ki mod G 2) rows of C send have size \^£], whole blocks in the remaining rows have size [ |A j. Figure 3.10 shows an example of the send communication schedule table of ^ ( 4 ,3 ,6 ) , generated for all-to-all case with different message sizes. In this ex ample, each processor has more entries than 6 destination processors. The cor responding dpi is a Ls x P m atrix, where Ls = 9 and P = 4. Applying column reorganizations on it results in a generalized circulant m atrix, which can be con sidered as a K x x Px block m atrix, where Kx = 3 and Px = 4. Each block is a 52 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Qi x Gx m atrix, where Q i = 3 and G\ = 1. The first G2 — 2 rows are used as a send communication schedule table C send. The 3rd row is folded onto the 1st row. Hence, the message size in the Ist row is 2 and th at in the 2nd row is 1. If Ah is a multiple of G2, the message size in every step is the same. Therefore, the network bandwidth is fully utilized by sending equal sized messages in each communication step. Theorem 3 summarizes the above intuition and shows th at the send communi cation schedule table for all-to-all communication can be obtained as a generalized circulant m atrix. Further, it ensures th at equal-sized messages are transferred in every communication step. T heorem 3 If a redistribution problem iR .x(P, K,Q) requires all-to-all communi cation, the send communication schedule table is a Q x P matrix which consists of the first {0,1 — P\th rows in a Ls x P generalized circulant matrix as computed by the equations of Theorem 2. Equal-sized messages are transferred in each communication step. Proof: From Theorem 1, the dpt T is rearranged into a generalized circulant m atrix form by column reorganizations. It is computed by Eq. (3.18) in Theorem 2. The generalized circulant m atrix is a K\ x Pi circulant block m atrix. Each block m atrix is a Qi x G\ circulant m atrix. We show that every Gx fi row in the circulant block m atrix consists of the same indices in each column. Therefore, we prove th at the send communication schedule table is represented as Q x P generalized circulant m atrix and equal-sized messages are transferred in each communication step. Consider a generalized circulant m atrix computed by Eq. (3.18) in Theorem 2. The second part of Eq. (3.18) is related to an offset of the Q j x Gi block m atrix. 
For processor j, the second part is qPi mod Q, where q = (i2 — j 2) mod Qi and 0 < qfi2 < Qi- Since qPx mod Q = qPx - { ( qPi/Q ) x Q } = G2{qP 2 - Qi(qPi/Q)}, 53 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. qPx mod Q is a m ultiple of G 2. Therefore, the second part of Eq. (3.18) is an element in set {0, G2, 2G2, ■ • •, {Qx — 1 )G2}. Similarly, the first part of Eq. (3.18) is related to the base value of each Q i x Gx block m atrix. A base value is determined by {{n{jx — A)} mod Pi} mod Q — {n(ji — p )} mod P i, for processor j and 0 < i\ < Kx- For processor j , consider the base value of the block m atrix at the first row and the block m atrix at the G ^ row. For the block m atrix at the first row, the base value in Eq. (3.18) is com puted as follows, (n ji) mod Pi = njx-{{njx)/P i}P x = nji — XxG2 where Ai = [(?U i)/Pi]P2 and Pi = P2G2. For the block m atrix in the G2 f c row, the base value in Eq. (3.18) is also computed as follows, n{jx - G2) mod Px = n(jx - G2) - {n{ji - G2)/P i} P i = {njx) - {n - [n(jx - G2)/P i]P 2}G2 = (njx) — X2G2 where A2 = n — [n(ii — G i)/P i]P 2. Even though their base values are different, the difference of both base values are a multiple of G2. W hen the base value is divided by G2, their rem ainder is the same. Therefore, the Qx x Gi block m atrices with these base values consist of the same destination processor indices, because offset values are multiples of G2. So, block m atrices at every Gf/1 row in each column consist of the same entries in C 8end- In the case of all-to-all communication with same message size, Iix is a , multiple of G2. Therefore, for the all-to-all communication case, the send communication schedule table is a Q x P m atrix which consists of the first {0,1, • • •, Q — l} th rows in a Ls x P generalized 54 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. circulant m atrix. Every Gif row in the circulant block m atrix is folded into its first row. Therefore, equal-sized messages are transferred in a communication step. □ C o ro lla ry 2 For the redistribution problem iRx(P, K,Q), the proposed algorithm minimizes the data transfer cost in both non all-to-all communication case and all-to-all communication case. 3 .3 .4 D a ta tran sfer co st In distributed memory model, the data transfer cost has two param eters; start-up tim e and transmission tim e. The start-up tim e, Ts, is incurred once for each com munication event and is independent of the communicated message size. Generally, the start-up tim e consists of the transfer request and acknowledgment latencies, context switch latency, and latencies for initializing the message header. The unit transmission tim e, Td, is the cost of transferring a message of unit length over the network. The transmission tim e for a message is proportional to its size. Thus, the data transfer tim e for sending a message of size m units from one processor to another is modeled as t s + m rj. In this model, a reorganization of the data elements among the processors, in which each processor has m units of data for another processor, also takes Ts + m rj time. This model assumes that there is no node contention. This is ensured by our communication schedules for redistri bution. Theorem 4 shows the d ata transfer cost for our redistribution algorithms using the distributed memory model. T h e o re m 4 Consider an array with N elements initially distributed cyclic (ay) on P processors. 
The array needs to be redistributed to cyclic (Kx) on Q processors. Using our algorithms, the data transfer costs for performing MX(P, K,Q) are (i) LgTs+yTd in the case of a non all-to-all communication pattern, and (ii) QTs-\-^Td in the case of an all-to-all communication pattern. 55 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. P ro o f: D ata transfer cost is considered as a sum of total start-up cost and total transmission cost. Consider our send communication schedule table C send, which is in generalized circulant m atrix form and each row consists of distinct elements. The communication schedule is specified as a sequence of row reorganization on C send- A row reorganization moves elements to their destination processor specified by Csend- This corresponds to an interprocessor communication event. The num ber of communication steps is equal to the num ber of row reorganizations on C send. We prove the theorem in term s of the number of reorganizations required to convert Csend to its desired final form. The total start-up cost is proportional to the number of communication steps. Therefore, it is in proportion to the num ber of row reorganizations. Also, communication pairs in each communication step communicate messages of the same size. Therefore, source processors have the same total transmission cost without wasting the network bandwidth. The total transmission cost is proportional to the size of an array assigned to each processor. (i) no n a ll-to -all c o m m u n ic a tio n : We know th at each row of C send repre sents communication step i. The total num ber of communication steps is Ls. In each step, processor pairs communicate one block per superblock between them . Therefore, data transfer cost can be estim ated as follows, N data transfer cost = Ls ■ (Ts + rf) 1 1j s r r ■ b sl s I (ii) all-to -all c o m m u n ic a tio n : From Theorem 3, we need only Q rows in the generalized circulant m atrix form to develop the communication schedule table Csend- We know th at each row of C seud represents communication step i. The total number of communication steps is Ls. In each step, equal-sized messages are communicated between processor pairs. The message size in the first r rows on Csend is q + 1 blocks per superblock, where r = Ls mod Q and q = Ls/Q. The 56 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. message size in remaining rows is q blocks per superblock. Therefore, data transfer cost can be estim ated as follows, N data transfer cost = QTS + {r(q + 1) + (Q — r)q) —— rd P LS N = QTS + Ls—— rd 1 I ~ J S nr m1 ! □ 3.4 E xperim ental R esu lts Our experiments were conducted on the IBM SP2 at the Maui High Performance Computing Center. The algorithms were w ritten in C and MPI. M PI communica tion primitives were used for interprocessor communication. Table 3.4 shows a comparison of the proposed algorithm with the Caterpillar algorithm [80] and the bipartite matching scheme [23] with respect to the data transfer cost and schedule and index com putation costs. For the all-to-all com munication case with equal-sized messages, the data transfer cost is the same in each communication step for all three algorithms. Also, the schedule com putation can be performed in a simple way. Hence, the all-to-all communication case with equal-sized messages is not considered in Table 3.4. In Table 3.4, M is the size of the array assigned to each source processor (M = jr). 
For the non all-to-all communication case, Ls < Q , where Ls = /r,'lP';A(T . Our algorithm as well as the bipartite matching scheme perform fewer communication steps than the Caterpil lar algorithm. For the all-to-all communication case with different message sizes, the messages transm itted in a communication step are of the same size in our algorithm. Therefore, the network bandwidth is fully utilized and the total trans mission cost is TdM. However, the bipartite m atching scheme does not consider 57 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 3.4: Comparison of data transfer cost and schedule and index com putation costs of the Caterpillar algorithm, bipartite matching scheme and our algorithm. Non all-to-all communication All-to-all communication with different message sizes Data transfer cost Schedule and index computation cost Data transfer cost Schedule and index computation cost Caterpillar algorithm [BO ] 0 ( 2 ) Q -l QTs + xd X mi i = 0 0 ( 2 ) Bipartite matching scheme [23] 0((P + 2 ) 3'5) N/A N/A Our algorithm d T ' + ' - 5 0 ( 2 ) QTs + xdM 0 ( 2 ) Note: Ls < Q for the non all-to-all communication case, M = N /P and m ,- is the maximum transfered data size in communication step i. this case. In the Caterpillar algorithm, the transmission cost in a communica tion step is dominated by the largest message transferred in th at step. Let to * - denote the size of the largest message sent in a communication step i. Note th at m-i > M. The total start-up cost of the Caterpillar and our algorithm is QTS since the number of communication steps is the same. On the other hand, the total transmission cost of our algorithm is tjM which is less than that of the Caterpillar algorithm. The Caterpillar algorithm as well as our algorithm perform the schedule and index com putation in 0(Q) time. However, the schedule and index com putation cost in the bipartite m atching scheme is much greater even for the problem considered here. To evaluate the total redistribution cost and the data transfer cost, we consider 3 different scenarios corresponding to the relative size of P and Q: (Scenario 1); P*C Q , (Scenario 2); Q > 2P , and (Scenario 3)] P < Q < 2P. In our experiments, 58 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. we choose P = 18 and Q = 78 for Scenario 1, P = 30 and Q = 66 for Scenario 2, and P = 46 and Q = 50 for Scenario 3. The array consists of single precision integers. The size of each array element is 4 bytes. The array size was chosen to be a multiple of the size of a superblock to avoid padding using dum m y data. The rest of this section is organized as follows. Subsection 3.4.1 reports exper im ental results of the overall redistribution tim e of our algorithm , the Caterpillar algorithm, and bipartite m atching scheme. Subsection 3.4.2 shows experim ental results for the data transfer tim e of our algorithm and the C aterpillar algorithm. Subsection 3.4.3 compares our algorithm and the bipartite m atching scheme with respect to the schedule com putation tim e. 3.4.1 T otal red istrib u tio n tim e In this subsection, we report experim ental results for the total redistribution tim e of our algorithm, the Caterpillar algorithm , and the bipartite m atching scheme. The total redistribution tim e consists of the schedule com putation tim e, index com putation tim e, packing/unpacking tim e, and data transfer tim e. 
The experim ental results of bipartite matching scheme are reported only in the case of non all-to-all communication, since all-to-all communication instances are not considered in the bipartite m atching scheme. The bipartite m atching scheme and our scheme opti mize the num ber of communication steps and minimize the data transfer cost in non all-to-all communication. They will differ only in the schedule com putation time. Thus, we can obtain the total redistribution tim e of the bipartite m atching scheme by simply adding the schedule com putation tim e of the bipartite matching scheme to the total redistribution tim e instead of our schedule com putation time. In our experiments, the source and the destination processor sets were disjoint. In each communication step, each sender packs a message before sending it and each receiver unpacks the message after receiving it. Pack operations in the source processors and unpack operations in the destination processors were overlapped, 59 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. for (j = 0; j<nl; j++) { ts = MPI_Wtime() /* redistribution routine */ compute schedule and index set for (i=0; i<n2; i++) { if (source processor) { /* source processor */ pack message send message to a destination processor ) else { /* destination processor */ receive message from a source processor unpack message } } node_time[j] = MPI„Wtime() - ts compute tavg from node_time of each node T [j] = tavg > compute Tmax = max{T[j]}, Tmin = min{T[j]}, Trned = median{T[j]}, Tavg = average{T[j]} Figure 3.11: Steps for measuring the redistribution time. i.e., after sending their message in communication step i, senders start to pack a message for communication in step (* + 1) and receivers start to unpack the message received in step i. Our methodology for measuring the total redistribution tim e is shown in Fig ure 3.11. The tim e was measured using the MPI_Wtime() call, n l is the number of runs. A run is an execution of redistribution. n2 is the num ber of communication steps. Each processor measures node_tim e[j] in the j th run. Generally, source and destination processors which do not perform an interprocessor communication in the last step, complete the redistribution earlier than the processors which re ceive a message and unpack it. A barrier synchronization, MPI_Barrier(), was performed at the end of the redistribution. After measuring node_time, the av erage node_time over (P + Q) processors is computed and saved as tavg. The measured value is stored in an array T, as shown in Figure 3.11. After the redistri bution is performed n l times, the maximum, minimum, median, and average total 60 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. redistribution tim e are com puted over n l runs. In our experiments, n l was set to 20. Over n l runs, a variation in the measured redistribution tim es was observed. Since the minimum tim e has the smallest component due to OS interference and other effects related to the environm ent, it provides a more accurate observation of the redistribution tim e than others. So, we use Tmin only in our experim ental results. Figure 3.12 shows experim ental results for the non all-to-all communication case. In these experiments, 96 nodes were used. Figure 3.12 shows results of the redistribution 9£i6(18, 9, 78), where 18 source processors and 78 destination proces sors were used and K was set to 9. 
The total num ber of array elements (in single precision) was varied from 808,704 (3.23 Mbytes) to 14,174,080 (64.7 M bytes). In the non all-to-all communication case, the messages in each communication step are of the same size. The total num ber of communication steps is 39 in our al gorithm, while it is 78 in the Caterpillar algorithm. Therefore, the redistribution tim e of our algorithm is theoretically 50% of that of the Caterpillar algorithm. In the experimental results shown in Figure 3.12(a), the redistribution tim e of our algorithm is between 48.42% and 56.67% of th at of the Caterpillar algorithm. In comparison with the bipartite m atching scheme, our algorithm has faster schedule computation. In experiments with small arrays, the schedule com putation tim e of the bipartite matching scheme can be observed up to 17% of the total redistribution time. Figure 3.12 (b) shows experim ental results for P = 30, Q = 66 and K — 15 and (c) shows experimental results for P = 40, Q = 56 and K — 25, respectively. For these two instances, the num ber of communication steps using our algorithm is 33 and 35, respectively. The num ber of communication steps using the Caterpillar algorithm is 66 and 56, respectively. Therefore, the redistribution tim e of our algorithm can be expected to be reduced by 50% and 37% when compared with th at of the Caterpillar algorithm. Our experimental results confirm these. The 61 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Total Redistribution Tim e (msec) P = 18, K = 9, Q = 78 on the IBM SP2 P = 30, K = 15, Q = 66 on the IBM SP2 3 5 0 .0 —& 0ur a lg o r ith m G —© B ip a rtite m a tch in g schem e s— q C a ter p illa r a lg o r ith m 3 0 0 .0 o C D C O E 2 5 0 .0 ( 1 5 .§ ^ 200.0 o ~ 1 5 0 .0 to T 3 0 5 C C 1 0 0 . 0 o * ” 5 0 .0 0.0 3 0 x 1 0 f 0.0 2 0 x!0e Total Array Size % — efcO u r a lg o rith m G —© B ip artite m atchin g scheme □— 0 C a terp illa r a lg o rith m 3 0 0 .0 2 5 0 .0 200.0 1 5 0 .0 5 0 .0 1 0 x 1 0 6 Total Array Size (a) 3fti6(18,9, 78) (b) 3 M 3 0 ,15> 66) P = 40, K = 25, Q = 56 on the IBM SP2 35 0.0 —3 fc O u r a lg o rith m G —© B ip artite m atchin g scheme g—e C a terp illa r a lg o rith m o O < / i £ 25 0 .0 0 E i- c o 3 n C O 0 a: 0 0.0 0.0 1 5 x 1 0 £ Total Array Size (c) 3f16(40,25,56) Figure 3.12: Total redistribution tim e for non all-to-all communication examples. 62 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. schedule com putation tim e of the bipartite m atching scheme is around 10 msecs in these instances. Figure 3.13 compares the overall redistribution tim e for the all-to-all commu nication case with different message sizes. Figure 3.13(a) reports the experim ental results for $$16(18, 8, 78). The array size was varied from 539,136 points (2.16 Mbytes) to 10,782,720 points (43.13 M bytes). For this case, both the algorithms have the same num ber of steps (78). W ithin a superblock, a third of the messages are two blocks in size while the others are one block in size. In our algorithm, equal-sized messages are transferred from any node in each communication step. Therefore, during a third of the communication steps, messages of size two blocks are sent while messages of size one block are sent during other two thirds of the communication steps. The Caterpillar algorithm does not attem pt to schedule the communication operations so th at equal-sized messages are sent in each communi cation step. 
Therefore, the redistribution tim e in a step is determined by the tim e to transfer the largest message in th at step. Theoretically, the total redistribution time of our algorithm is reduced by 33.3% compared with th at of the Caterpillar algorithm. In our experiments, we observed up to 19.15% reduction in redistribu tion time. W hen the array size is small, both algorithms have approximately the same performance since the start-up cost dom inates the overall data transfer tim e. As the array size increases, the reduction in the tim e to perform the distribution using our algorithm improves. For two other scenarios, we obtained similar results (See Figure 3.13(b) and (c)). 3 .4.2 D a ta tran sfer tim e In this subsection, we report the experim ental results of the data transfer tim e of our algorithm and the Caterpillar algorithm. In this section, the bipartite m atching scheme is not considered, since it has the same complexity in the non all-to-all 63 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Total Redistribution Tim e (msec) P = 18, K = 8, Q = 78 on the IBM SP2 P = 30, K = 8, Q = 66 on the IBM SP2 250.0 ffiO ur algorithm B — B Caterpillar algorithm 200.0 150.0 100.0 50.0 0.0 10x106 5x10® Total Array Size (a) 3^x6(18, 8,78) H c o 3 £ V > T 3 CD DC 3 0 0 .0 3ftOur algorithm D 1 □ Caterpillar algorithm 2 5 0 .0 200.0 1 5 0 .0 100.0 50 .0 0.0 20x10® 0 Total Array Size (b) 9^6(30,8,66) P = 40, K = 18, Q = 56 on the IBM SP2 300.0 $ —H |fO ur algorithm a-— -g Caterpillar algorithm 0) 250.0 £ 200.0 150.0 ^ 100.0 50.0 0.0 20x10® 10x10® Total Array Size (c) 3 ^i6(40,18,56) Figure 3.13: Total redistribution tim e for all-to-all communication examples with different, message sizes. 64 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. for (j = 0; j<nl; j++) { /* redistribution routine */ compute schedule and index set node_tr[j] = 0; for (i=0; i<n2; i++) { if (source processor) { /* source processor */ pack message ts = MPI_Wtime() send message to a destination processor node_tr[j] = tr[j] + MPI_Wtime() - ts } else { /* destination processor */ ts = MPI_Wtime() receive message from a source processor node_.tr[j] = tr[j] + MPI_Wtime() - ts unpack message } } compute tavg from node_tr of each node Tr [ j ] = tavg compute Tmax = max{Tr[j]}, Train = min{Tr[j]}, Tmed = median{Tr[j]}, Tavg = average{Tr[j]} Figure 3.14: Steps in measuring the data transfer time. communication case compared with our algorithm and does not consider the all- to-all communication case with different message sizes. The experiments were performed in the same m anner as discussed in Subsection 3.4.1. The data sets used in these experiments are the same as those used in the previous subsection. The data transfer tim e of each communication step is measured first. Then the total data transfer tim e is computed by summing up the measured tim e for all the communication steps. The methodology for measuring the tim e is shown in Figure 3.14. The redistribution 3 R lg(18,9, 78) is an example of the non all-to-all communica tion case. The messages in each communication step are of the same size. The total num ber of communication steps is 36 using our algorithm, where as the total num ber of steps is 78 using the Caterpillar algorithm. Therefore, the data transfer time of our algorithm is theoretically 50% of th at of the Caterpillar algorithm. In the 65 Reproduced with permission of the copyright owner. 
Further reproduction prohibited without permission. experimental results (see Figure 3.15(a)), the data transfer tim e of our algorithm is between 48.13% and 57.61% of th a t of the Caterpillar algorithm. Figure 3.15 (b) and (c) show the experim ental results for two other non all-to-all communication examples. Similar reductions in tim e were observed in these experiments. Figure 3.16 reports the experim ental results for examples of all-to-all com munication with different message sizes. The data transfer tim e in the all-to-all communication case is sensitive to network contention since every source processor communicates with every destination processor. For 3?i6(18, 8, 78), both algorithms have the same num ber of steps (78). W ithin a superblock, a third of the messages are two blocks in size and two thirds are one block in size. The Caterpillar algo rithm does not attem pt to send equal-sized messages in each communication step. Therefore, the data transfer tim e for a step is determ ined by the time to transfer the largest message in that step. Theoretically, the data transfer tim e of our algo rithm is reduced by 33% when compared with th at of the Caterpillar algorithm. In our experiments with large message sizes, we observed up to 19.82% reduction. W ith small messages, both algorithms have approxim ately the same performance since the start-up tim e dominates the data transfer tim e. Similar experim ental results are shown in Figure 3.16(b) and (c) for two other scenarios. 3 .4 .3 Schedule co m p u ta tio n tim e The tim e for computing the schedule in the Caterpillar algorithm as well as in our algorithm is negligible compared with the total redistribution tim e. Even though the schedule in the Caterpillar algorithm is simpler than ours, the Caterpillar al gorithm needs tim e for index com putation to identify the blocks to be packed in a communication step. This tim e is approximately the same as our schedule compu tation tim e. The schedule com putation tim e of the bipartite matching scheme [23] is m uch higher than th at of the Caterpillar algorithm and th at of our algorithm. It can be a significant fraction of the data transfer time. To compute the schedule of 66 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. D ata Transfer Tim e (msec) P = 18, K = 9, Q = 78 on the IBM SP2 150.0 $ — $-Our algorithm □ 1 "B Caterpillar algorithm 100.0 50.0 0.0 1 10x10s Total Array Size 15x10s 150.0 $1 i & O u r a l g o r i t h m E H — B C a t e r p i l l a r a l g o r i t h m o CD ( 0 £ 0) 100.0 E c S Q 0.0 10x10s 20x10s Total Array Size (a) 3*16(18, 9,78) (b) 3 * 16(30,15, 66) P = 40, K = 25, Q = 56 on the IBM SP2 $— ^ fO u r a lg o r ith m g— B C a te r p illa r a lg o r ith m o o £ 0 ) j l 50.0 0 C O c CO I- 0.0 5x10s 10x10s Total Array Size 15x10® (c) K16(40,25,56) Figure 3.15: D ata transfer tim e for non all-to-all communication examples. 67 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
Data Transfer Time (m sec) P = 18, K = 8, Q = 78 on the IBM SP2 P = 30, K = 8, Q s* 66 on the IBM SP2 %— st&Our alg o rith m a — a C aterp illar algorithm 100.0 O 0 W E 0 E ‘ c o c 0 I — 5 0 .0 0 0 Q 0.0 2 0 x 1 0( 1 0 x 1 0 ' < Total Array Size J jS — N^Our algorithm B - H 3 C aterp illar algorithm 100.0 5 0 .0 0.0 1 0 x10 5 x10 6 Total Array Size (a) (18,8, 78) (b) ft16(30,8,66) P = 40, K = 18, Q = 56 on the IBM SP2 100.0 %— H&Our algorithm B — B C aterpillar algorithm O 0 E C D E i- 0 Td c 0 H 0 0 Q 5 0 .0 0.0 2 0 x 1 0 E 10x10® Total Array S ize (c) 5 ftls(40,18, 56) Figure 3.16: D ata transfer tim e for all-to-all communication examples with differ ent message sizes. 68 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 3.5: Comparison of schedule com putation tim e (psecs) with data transfer time Case Array Size (X106) Data Transfer (DT) Our algorithm Bipartite M atching Schedule (O ST) Ratio(%) (O ST/DT) Schedule (BST) Ratio(%) (BST/DT ) (a) 9 t16(18,9,78) 0.809 21.5 0.165 0.767 7.59 35.30 8.09 49.5 0.165 0.333 7.59 15.33 16.2 71.4 0.165 0.231 7.59 10.63 (b) 9 t]6(30,15,66) 1.43 20.2 0.144 0.713 7.74 35.34 14.3 47.7 0.144 0.302 7.74 14.83 28.5 69.6 0.144 0.211 7.74 11.12 (c) 9 t16(40,25,56) 0.896 19.60 0.173 0.883 10.18 51.94 8.96 35.4 0.173 0.489 10.18 28.76 17.9 46.80 0.173 0.370 10.18 21.75 Note: Array size is the number o f array elem ents in single precision. the bipartite m atching scheme in our experim ents, we used a unit weight bipartite matching code down-loaded from [88]. In Table 3.5, the data transfer tim e in 3 non all-to-all communication examples is compared with the schedule com putation tim e of our algorithm and that of the bipartite matching scheme. The schedule com putation tim e of bipartite m atching scheme can be observed to be as much as 50% of the data transfer tim e. On the other hand, the schedule com putation tim e of our algorithm is less than 1% of the data transfer time. This makes our algorithm attractive for run-tim e data redistribution. 69 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. C hapter 4 Static D ata R eorganization: B lock D a ta Layout Conventional data layouts (row-major or column-major order) are used by compil ers. If a 2-dimensional array is stored in row-major order and accessed along the column or vice versa, there is a m ism atch between the data layout and the data access pattern, even though the data access pattern is not changed during the com putation. This mism atch increases the num ber of cache and Translation Look-aside Buffer (TLB) misses, resulting in overall performance degradation. Therefore, we suggest that the data layout should be reorganized before the com putation begins and does not change during the com putation, thereby m atching the data access pattern during the computation [70, 73, 71, 69]. This data reorganization is called static data reorganization. In this chapter, we present static data reorganization using block data layout for dense m atrix applications. In block data layout, a m atrix is partitioned into sub-matrices called blocks. D ata elements within each block are m apped onto contiguous memory locations. These blocks are arranged in row-major order. First, we derive the lower bound on the num ber of TLB misses for any data layout when a m atrix is accessed along rows and then columns. We show that block data layout intrinsically achieves this lower bound on the number of TLB misses. 
We present TLB and cache performance analysis using block data layout and tiling in concert. 70 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. On the basis of our TLB and cache analysis, we propose an analytical bound for optimal block size in block data layout. The rest of this chapter is organized as follows. Section 4.1 summarizes previous work related to this chapter. Section 4.2 describes block data layout and gives an analysis of its TLB performance. Section 4.3 discusses the TLB and cache performance when tiling and block data layout are used in concert. A block size selection algorithm based on this analysis has been described. Section 4.4 shows simulation and experimental results. 4.1 R elated W ork To improve the effective memory hierarchy performance, various hardware solu tions have been proposed [13, 26, 33, 43, 79]. Recent processors such as Intel Merced [89] provide increased program m er control over data placem ent and move ment in a cache-based memory hierarchy, in addition to providing some memory streaming hardware support for media applications. To exploit these features, it is im portant to understand the effectiveness of control and data transformations. Along with hardware solutions, compiler optim ization techniques have received considerable attention [65, 64, 86]. As the memory hierarchy gets deeper, it is critical to m anage the data efficiently. To improve data access performance, one of the well-known optimization techniques is tiling. Tiling transform s loop nests so th at tem poral locality can be better exploited for a given cache size. However, tiling focuses only on the reduction in the num ber of capacity cache misses by de creasing the working set size. The cache in most state-of-the-art machines is either direct-m apped or small set-associative. Thus, it suffers from considerable conflict misses, thereby degrading the overall performance [20, 50]. To reduce conflict misses, copying [50, 92] and padding [66, 83] techniques with tiling have been pro posed. However, most of these approaches mainly target the cache performance, 71 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. paying less attention to the Translation Look-aside Buffer (TLB) performance. As the problem size becomes larger, TLB performance becomes more significant. If TLB thrashing occurs, the overall performance will be drastically degraded [87]. Hence, both the TLB and the cache m ust be considered in optimizing application performance. Most previous optim izations, including tiling, concentrate on a single-level cache [20, 50, 66, 83, 92]. M ulti-level memory hierarchy has been considered only by a few researchers. For improving the multi-level memory hierarchy performance, a new compiler technique is proposed in [102] th at transform s loop nests into a recursive form. However, only multi-level caches were considered in [84, 102] w ith no emphasis on the TLB. It was proposed in [62] that the cache and TLB perfor mance be considered in concert to select the tile size. In this analysis, the TLB and cache were assumed to be fully-set associative. However, the cache is direct or small set-associative in most of the state-of-the-art platforms. Some recent works [15, 45, 50, 72, 74, 92] have proposed changing the data layout to m atch the data access pattern, in order to reduce cache misses. 
It was proposed in [45] that both data and loop transform ation be applied to loop nests for optimizing cache locality. In [15], conventional (row or column-major) layout is changed to a recursive data layout, referred to as M orton layout, which matches the access pattern of recursive algorithms. This data layout was shown to improve the memory hierarchy performance. This was confirmed through experiments; we are not aware of any formal analysis. The ATLAS project [101] autom atically tunes several linear algebra imple m entations. It uses block data layout with tiling to exploit tem poral and spacial locality. Input data, originally in column m ajor layout, is re-mapped into block data layout before the com putation begins. The combination of block data lay out and tiling has shown high performance on various platforms. However, the 72 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. selection of the optim al block size is done empirically at compile tim e by running several tests with different block sizes. 4.2 B lock D a ta Layout and TL B P erform ance In this chapter, we assume the architecture param eters to be fixed (e.g. cache size, cache line size, page size, TLB entry capacity, etc.). The following notations are used in this chapter. Stib denotes the num ber of TLB entries. Pv denotes virtual page size. It is assumed th at the TLB is fully set-associative with Least-Recently- Used(LRU) replacement policy. Block size is B x B, where it is assumed B 2 = kPv. S ci is the size of the ith level cache. Its line size is denoted as LC i -. Cache is assumed to be direct-m apped and its replacement policy is also LRU. In Section 4.2, we analyze the TLB performance of block data layout. We show th at block data layout has better intrinsic TLB perform ance than conventional data layouts. 4 .2.1 B lock d a ta layout To support multi-dimensional array representations, m ost programming languages provide a m apping function, which converts an array index to a linear memory address. In current programming languages, the default data layout is row-major or column-major, denoted as canonical layouts [19]. Both row-major and column- m ajor layouts have similar drawbacks. For example, consider a large m atrix stored in row-major layout. Due to large stride, column accesses can cause cache conflicts. Further, if every row in a m atrix is larger than the size of a page, column accesses can cause TLB trashing, resulting in drastic performance degradation. In block data layout, a large m atrix is partitioned into sub-matrices. Each sub-m atrix is a B x B m atrix and all elements in the sub-m atrix are m apped onto 73 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. r 0 1 4 5 8 9 12 13 2 3 6 7 10 11 14 15 16 17 20 21 24 25 28 29 18 19 22 23 26 27 30 31 32 33 36 37 40 41 44 45 34 35 38 39 42 43 46 47 48 49 52 53 56 57 60 61 50 51 54 55 58 59 62 63 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 40 33 34 35 36 37 38 39 41 42 43 44 45 46 47 48 56 49 57 50 58 51 52 53 54 55 59 60 61 62 63 0 1 4 5 16 17 20 21 2 3 , 6 7 18 19,. f-2 23 8 9 '12 13 24 s 25 S 28 29 10 11 14 15 26 27 30 31 32 33 36 37 48 49 52 53 34 3 5 - CO ccjw . 39 50 5 |. 54 55 40 4 f |" |4 45 56 5* *60 61 42 43 46 47 58 59 62 63 (a) Row-major layout (b) Block data layout (c) Morton data layout Figure 4.1: Various data layouts: block size 2 x 2 for (b) and (c) contiguous memory locations. 
The blocks are arranged in row-major order. An other data layout of recent interest is M orton data layout [15]. Morton data layout divides the original m atrix into four quadrants and lays out these sub-matrices con tiguously in the memory. Each of these sub-m atrices is further recursively divided and laid out in the same way. At the end of recursion, elements of the sub-m atrix are stored contiguously. This is similar to the arrangem ent of elements of a block in block data layout. Morton data layout can thus be considered as a variant of the block data layout. They only differ in the order of blocks. Figure 4.1 shows block data layout and Morton data layout with block size 2x2. Due to the similarity, the following TLB analysis holds true for Morton d ata layout also. 4.2 .2 T L B perform ance o f block d a ta layout In this section, we present a lower bound on the TLB misses for any data layout. We discuss the intrinsic TLB performance of block data layout using a generic access pattern. We give an analysis on the TLB performance of block data layout and show improved performance compared wdth conventional layouts. Throughout this section, we consider an N x N array. 74 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.2.2.1 A low er bou n d on T L B m isses In general, most m atrix operations consist of row and column accesses, or perm u tations of row and column accesses. In this section, we consider an access p attern where an array is accessed first along all rows and then along all columns. The lower bound analysis of TLB misses incurred in accessing the data array along all the rows and all the columns is as follows. T heorem 5 For accessing an array along all the rows and then along all the columns, the asymptotic minimum number of TLB misses is given by 2 P ro o f: Consider an arbitrary mapping of array elements to pages. Let Ak = {i | at least one element of row i is in page k }. Similarly, let B% = {j| at least one element of column j is in page k }. Let = \Ak\ and bk = \Bk\. Note th at a*, x bk > Pv. Using the m athem atical identity th at the arithm etic mean is greater than or equal to the geometric mean ( ak + bk > 2\/Pf ), we have: scattered. The number of TLB misses in accessing all rows consecutively and then O(Stib) is the num ber of page entries required for accessing row i (column j) that are already present in the TLB. Page k is accessed ak times by row accesses, thus, Let Xi (yj) denote the num ber of pages where elements in row i (column j) are all columns consecutively is given by Tmiss > Y2iLi{x i - 0 ( S tib))+Ylf=i(yj~0(Stib))- ffiLi Xi = Ylkfi ak- Similarly, Uj — Jfkfi bk- Therefore, the total num ber of TLB misses is given by 75 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. As the problem size (N ) increases, the num ber of pages accessed along one row (column) becomes larger than the size of TLB (5h&)- Thus the number of TLB entries that are reused is reduced between two consecutive row (column) accesses. Therefore the asym ptotic minimum num ber of TLB misses is given by 2 ^ = . © We obtained a lower bound on TLB misses for any layout when data are ac cessed along all rows and then along all columns. This lower bound of TLB misses also holds when data is accessed along an arbitrary perm utation of all rows and columns. 
C o ro lla ry 3 For accessing an array along an arbitrary permutation of row and column accesses, the asymptotic minimum number of TLB misses is given by 2^p=. 4.2.2.2 T L B p e rfo rm a n c e In this section, we consider the same access pattern as discussed in Section 4.2.2.1. Consider a given N x N array stored in a canonical layout. W ithout loss of generality, canonical layout is assumed to be row-major layout. During the first pass (row accesses), the memory pages are accessed consecutively. Therefore, TLB misses caused by row accesses is equal to jr . During the second pass (column accesses), elements along the column are assigned to N different pages. Hence, a column access causes N TLB misses. Since N 3> Stib, all N column accesses result in N 2 TLB misses. The total number of TLB misses caused by all row accesses and all column accesses is thus jf- + N 2. Therefore, in canonical layout, TLB misses drastically increase due to column accesses. Compared with canonical layout, block data layout has better TLB perfor mance. The following theorem shows that block data layout minimizes the number of TLB misses. 76 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. T h e o re m 6 For accessing an array along all the rows and then along all the columns, block data layout with block size \fP f x fP f minimizes the number of TLB misses. P ro o f: Suppose the block size B 2 = kPv. Two cases (for k > 1 and k < 1) are discussed separately. C ase I: k > 1. We consider three scenarios for this case. 1. ^ > Stib Accesses to the first row cause ^ TLB misses. However, these entries cannot be reused since Stib is small. Therefore, TLB misses caused by row accesses is Trow = x N. Similarly, TLB misses caused by column accesses are Tool — % • k ■ N. Therefore, the total num ber of TLB misses is N 2 N 2 N 2 1 r- To minimize the total TLB misses dTmiss _ N 2 1 /i I'i dk a/2 Pv a Jk k Therefore, as k decreases, the total num ber of TLB misses decreases. The total number of TLB misses is minimized when k — 1. Note that when B = y/Pf {k — 1), the num ber of TLB misses is 2 which is the lower bound given by Theorem 5. 9 K < s tlb B — k In this scenario, both column and row access can reuse TLB entries. There fore, the total number of TLB misses is iV2 N 2 T . — o u — 9 ___ J-miss ~ ^ g 2 ~~ p ’ 77 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. This is equal to twice the num ber of TLB misses caused by all row accesses in canonical layout. Therefore, this will be the minim um num ber of TLB misses for such an access pattern. 3 - ¥ < f < S m In this scenario, only row accesses can reuse TLB entries accessed in the pre vious row accesses. TLB misses for row accesses are Trow = Therefore, the total number of TLB misses is , N 2 7 N 2 N 2 N 2 r~ Tmiss — k — + k — - — -f As k decreases, TLB misses also decrease. The num ber of TLB misses for block data layout is minimized when k approaches 1. Note that this scenario will reduce to scenario 2 when k = 1. Therefore, the m inim um number of TLB misses in this scenario is the same as th at in scenario 2. C ase II: k < 1. Three scenarios are discussed as follows: 1 ib B k The first row access causes k p TLB misses. These entries cannot be reused in the next row access. TLB misses caused by row accesses are Trow = k ^ - N . On the other hand, the first column access causes p TLB misses, since all the elements in each block are stored in one page. 
The TLB misses caused by column accesses is Tcoi = p • N. Therefore, the total num ber of TLB misses is N 2 A'2 N 2 1 r Tmzss = k ~ + -^- = + V k )- To minimize the total TLB misses dTmiss N 2 1 1 — — = x — 7 = x (1 — — dk \J2 Pv \/k k 78 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Therefore, as k increases, the total num ber of TLB misses decreases. The total num ber of TLB misses is minimized when k — 1, B = Again, the minimum num ber is 2^ = , equal to the lower bound given by Theorem 5. 2 . f < S tlb In this scenario, both row and column access can reuse TLB entries. There fore, the total num ber of TLB misses is T m is s = 2k— = 2 — - • O H ~ 2 7 l This is equal to twice the num ber of TLB misses caused by all row accesses in the canonical layout. Therefore, it is the minimal num ber of TLB misses caused by all row ( l s < pass) and then all column (2nd pass) accesses. 3 . sm < f < ^ In this scenario, only row accesses can reuse TLB entries accessed in the previous row accesses. TLB misses of row accesses is denoted as: Trow — kgg. Therefore, the total num ber of TLB misses is _ u N2 N2 - N 2 N2 -I m iss '— k + „ T B 2 B Pv j k P v As k increases, TLB misses decrease. Like scenario 3 in Case I, the minimum number of TLB misses in this scenario is obtained when k = 1, and this num ber is the same as that in the previous scenario. According to the above analysis, block data layout with block size \fP~v x \[P~V minimizes the total num ber of TLB misses. As the problem size (N) increases, this minimum number asymptotically approaches the lower bound given by Theorem 5. 0 In general, the num ber of TLB misses for a B x B block data layout is k^g- + gg. It is reduced by a factor of ^yj) when compared with canonical layout. 79 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. When B = yfPf (k = 1), this num ber approaches the lower bound shown in Theorem 5. This theorem holds true even when data in block data layout is accessed along an arbitrary perm utation of all rows and columns. C o ro lla ry 4 For accessing an array along an arbitrary permutation of rows and columns, block data layout with block size yfPf x y/Pf minimizes the number of TLB misses. Similar to Theorem 6 and Corollary 4, the num ber of TLB misses is minimized when blocks are stored in Morton data layout and elements are accessed along rows and columns. C o ro lla ry 5 For accessing an N x N array along along all the rows and then along all the columns (or along an arbitrary permutation of rows and columns), Morton data layout with block size y/~P( x yfPf minimizes the number of TLB misses. To verify our analysis, simulations were perform ed using the SimpleScalar sim ulator [12]. It is assumed that the page size is 8K B yte and the data TLB is fully set-associative with 64 entries (similar to the data TLB in UltraSparc 2.) Double precision data points are assumed. The block size is set to 32. Table 4.1 shows the comparison of TLB misses using block data layout with using canonical layout. Ta ble 1 (a) shows the TLB misses for the “first all rows and then all columns” access. For small problem sizes, TLB misses with block data layout are considerably less than those with canonical layout. This is due to the fact th at TLB entries used in a column (row) access are almost fully reused in the next c.olumn(row)access. 
For a problem size of 1024 x 1024, a 504.37 times improvement in the number of TLB misses is obtained with block data layout. This num ber is much less than the lower bound obtained from Theorem 5. This is because the TLB entries are reused for this problem size. For larger problem sizes the TLB entries cannot be reused. The total num ber of TLB misses approaches the lower bound. For these large problem 80 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 4.1: Comparison of TLB misses (a) Along all rows and then all columns Layout 1024 2048 4096 Block Layout 2081 81794 1196033 Morton Layout 2072 274473 1081466 Canonical Layout 1049601 4198401 16793601 (b) A rbitrary perm utation of row and column accesses Layout 1024 2048 4096 Block Layout 64140 273482 1080986 Morton Layout 64257 273477 1080955 Canonical Layout 1053606 4208690 16822675 (c) A rbitrary perm utation of all rows followed by arbitrary perm utation of all columns accesses Layout 1024 2048 4096 Block Layout 64501 274473 1080465 M orton Layout 64813 274472 1081469 Canonical Layout 1053713 4208681 16822395 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. sizes, TLB misses with block data layout are upto 16 times less compared w ith canonical layout. To verify Corollary 3 and 4, two sets of access patterns were simulated: an arbitrary perm utation of all rows and columns, and an arbitrary perm utation of all rows followed by an arbitrary perm utation of all columns. W ith these access patterns, TLB entries referenced during one row(column) access are not reused when accessing the next row(column). The num ber of TLB misses with block data layout approaches the lower bound on TLB misses. The results are shown in Table 4.1 (b) and (c). Morton data layout shows a performance similar to block data layout. Even though block data layout has better TLB performance compared with canonical layouts with generic access pattern, it alone does not reduce cache misses. The data access pattern of tiling m atches well with block data layout. In the following section, we discuss the performance improvement of TLB and caches when block data layout is used in conjunction with tiling. 4.3 T ilin g and B lock D a ta Layout Tiling is a well-known optim ization technique th at improves cache performance. Tiling transforms the loop nest so th at tem poral locality can be better exploited for a given cache size. Consider an N x N m atrix m ultiplication represented as Z = X Y . The working set size for the usual 3-loop com putation is N 2 + 2N. For large problems, the working set size is larger than the cache size, resulting in severe cache thrashing. To reduce cache capacity misses, tiling transforms the m atrix m ultiplication to a 5-loop nest tiled m atrix m ultiplication (TMM) as shown in Figure 4.2(a). The working set size for this tiled com putation is B 2 + 2B. To efficiently utilize block data layout, we consider a 6-loop TMM as shown in Figure 4.2(b) instead of a 5-loop TMM. 82 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 
for kk = 0 to N by B
  for jj = 0 to N by B
    for i = 0 to N
      for k = kk to min(kk+B-1, N)
        r = X(i,k)
        for j = jj to min(jj+B-1, N)
          Z(i,j) += r * Y(k,j)

(a) 5-loop tiled matrix multiplication

for jj = 0 to N by B
  for kk = 0 to N by B
    for ii = 0 to N by B
      for i = ii to min(ii+B-1, N)
        for k = kk to min(kk+B-1, N)
          r = X(i,k)
          for j = jj to min(jj+B-1, N)
            Z(i,j) += r * Y(k,j)

(b) 6-loop tiled matrix multiplication

Figure 4.2: Tiled matrix multiplication

4.3.1 TLB performance

In this section, we show the TLB performance improvement of block data layout with tiling. To illustrate the effect of block data layout on tiling, we consider a generic access pattern abstracted from tiled matrix operations. The access pattern is shown in Figure 4.3. The tile size is equal to B.

With canonical layout, TLB misses will not occur when accessing consecutive tiles in the same row, if B < Stlb. Hence, the tiled accesses along the rows generate N^2/Pv TLB misses. This is the minimum number of TLB misses incurred in accessing all the elements in a matrix. However, tiled accesses along columns cause considerable TLB misses. B page table entries are necessary for accessing each tile. For all tiled column accesses, the total number of TLB misses is Tcol = B x (N/B) x (N/B) = N^2/B.
However, no page i v overlaps between two consecutive blocks in the same column. Therefore, each block along the same column goes through (k + 1) different pages according to Lemma 2. The num ber of TLB misses caused by all tiled column accesses is thus Tcoi = (k + 1) x © x ^ = (k + l ) ^ r - Therefore, the total TLB misses caused by all row and all column accesses is Tmiss = (2 + © For tiled access, the num ber of TLB misses using canonical layout is where B = y/kPf. Using Theorem 7, compared with canonical layout, block data layout reduces the num ber of TLB misses by = B 2k+i ■ 85 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 4.2: TLB misses for all tiled row accesses followed by all tiled column accesses Layout 1024 2048 4096 Block Layout 2081 12289 49153 Canonical Layout 33794 139265 561025 To verify our analysis, simulations for tiled row and column accesses were per formed using the SimpleScalar simulator. The simulation param eters are equal to those in Section 4.2. A 32 x 32 block size was considered. The block size is the same as the page size. Table 4.2 shows TLB misses for 3 different cases. For problem sizes of 2048 x 2048 and 4096 x 4096, the num ber of TLB misses conform our analysis in Theorem 7. The number of TLB misses with block data layout is 91% less than th at with canonical layout. For a problem size of 1024 x 1024, TLB misses with block data layout is 2081, which is very close to the minimum number of TLB misses (2048). This is a special case in which each block starts on a new page. A similar analytical result can be derived for real applications. Consider the 5-loop TMM with canonical layout in Figure 4.2 (a). Array Y is accessed in a tiled row pattern. On the other hand, arrays X and Z are accessed in a tiled column pattern. A tile of each array is used in the inner loops (i, k,j). The number of TLB misses for each array is equal to the average num ber of pages per tile, multiplied by the number of tiles accessed in the outer loops (kk,jj). The average number of pages per tile is B + jr. Therefore, the total number of TLB misses is given by: 2A/3( ^ + ^ r ) + Y2(T + -T). Consider the 6-loop TMM on block data layout as shown in Figure 4.2 (b). A B x B tile of each array is accessed in the inner loops (i , k , j ) with block layout. The number of TLB misses for each array is equal to the average number of pages per block multiplied by the number of blocks accessed in the outer loops (ii, kk,jj). 86 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 200 to "g 150 C B -100 BO 50 28 ■ From simulation H Estimated 3000 32 36 Block size 40 2500 C O ■ O c ^ 1500 05 < 1 5 O ) ■| 1000 m 500 28 ■ Tiling+BDL U Tiling only 32 36 40 Block size Figure 4.5: Comparison of TLB misses from simulation and esti m ation Figure 4.6: Comparison of TLB misses using tiling+BD L and tiling only According to Lemma 2, the average num ber of pages per block is | r + 1(= k + 1). Therefore, the total number of TLB misses (T M ) is T M ' I f_ T „ + i = 2Ar3 ^ 1 \ B P V + B 3 1 + A T 2 (4.1) Compared with the 5-loop TMM with canonical layout, TLB misses decrease by a factor of 0 (B ) using the 6-loop TMM. Note th at the 6-loop TMM uses block data layout. To verify our TLB miss estim ation, simulations on the 6-loop TMM were per formed. The problem size was fixed at 1024 x 1024. Simulation param eters were the same as those in Section 4.2. 
Figure 4.5 compares our estimations (given by Eq. (4.1)) with the simulation results. Figure 4.6 shows that block data layout reduced TLB misses considerably compared with tiling. 87 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. 4.3.2 C ache p erform an ce For a given cache size, tiling transform s the loop nest so th at the tem poral locality can be better exploited. This reduces the capacity misses. However, since most of the state-of-the-art architectures have direct-m apped or small set-associative caches, tiling can suffer from considerable conflict misses th at degrade the overall performance. Figure 4.7 (a) shows cache conflict misses. These conflict misses are determined by cache param eters such as cache size, cache line size and set- associativity, and runtim e param eters such as array size and block size. Perfor mance of tiled computations is thus sensitive to these runtim e param eters. r /f / \ CACHE (a) Canonical layout (b) Block data layout Figure 4.7: Example of conflict misses If the data layout is reorganized from a canonical layout to a block layout (assuming tile size is same as block size) before tiled com putations start, the entire data that is accessed during a tiled com putation will be localized in a block. As shown in Figure 4.7 (b), a self interference miss does not occur if the block is smaller than the cache since all elements in a block can be stored in contiguous memory locations. In general, cache miss analysis for direct mapped cache with canonical layout is complicated because the self interference misses cannot be quantified easily. Cache performance analysis of tiled algorithm was discussed in [50}. The cache performance of tiling with copying optim ization was also presented. We observe th at the behavior of cache misses for tiled access patterns on block layout is similar Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. to that of tiling with copying optim ization on canonical layout. We have derived the total num ber of cache misses for 6-loop TMM (which uses block data layout) as follows. Individual levels of cache are not considered explicitly, as this analysis is applicable to all cache levels. We consider a tiled program th at consists of nested loops. Each loop level is denoted by the loop index i, j, I, etc. Arrays referenced by the program are denoted as u, v, etc. W ithin an iteration of a loop I, a portion of an array v (called the footprint Fp(v)) is referenced. The body of the loop I will be executed R(v) times, where R(v) is the reuse factor. Let (i) ICMi(v ) denote the num ber of intrinsic cache misses [50] caused by accessing array v during the first iteration of loop I; (ii) SCMi(v ) denote the number of self-interference misses when array v is accessed in one iteration of loop l\ (iii) C IM iv ) denote the num ber of cross-interference misses between array v and other arrays for an iteration of loop I. The num ber of cache misses caused by array v for one iteration of loop I is thus: CMtiv) = ICMtiv) - SCMt(v) + R(v) x {SC M ^v) + CIM (v)} (4.2) C IM (v ) in the above equation can be calculated as: CIM{v) = ICMi{v) x PrCF(v), (4.3) where P rC F (v ) denotes the probability of conflict between one element of array v and elements of other arrays for loop 1. 
It is given by PrCF(v) = ^2 Provcj(v ,u ), U ^V where Provcf(v,u) is the probability th at an element of array v falls into the footprint of the array tt, accessed with a stride (su) in the cache. For simplicity, 89 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. it is assumed th at an elem ent of array v does not conflict with elements in two or more arrays at the same tim e. The cache misses of array v is com puted as follows: CM(v) = NIO(l) x CM,(v), (4.4) where N 10(1) denotes the total num ber of iterations of outer loops. The total number of misses incurred by accessing all arrays is the sum of misses incurred in accessing individual arrays (][V CM(i)). The above cache miss equation (Eq.( 4.4)) is applicable to any data layout with nested loops. But the factors ( SCMi(v ), Provcf(v,u)} R(v), etc.) cannot be quantified unless the data layout and loop structure are known. For block data layout, we can easily quantify SCMi(v) and Provcj(v,u ) in the above equations. The num ber of self-interferences can be derived by considering three ranges of block sizes, (i) W hen the block size is less than the cache size, there is no self-interference, (ii) W hen the block size is larger than twice the cache size, there is no reused element in cache, resulting in jr- self-interferences misses, (iii) W hen the block size is in between the above ranges, 2 ( self-interference misses occur. Hence, 0 for B < for B < y/% SC M l( v ) = l for y/Sl < B < \P ISl f-c for ^ /W c < B for ^ /W c < B For loop /, Fp(v) elements of array v are accessed with a stride (sv). The average number of cache lines occupied by Fp(v) elements is otherwise 90 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Table 4.3: Param eters of TMM Array Reuse Factor Footprint i k 3 i k 3 X (i,k ) B B 1 Y ( k , j ) N B 2 B ZKU) B B B During a tiled com putation, a block of array v is accessed in loop I. Hence, ICMi(y) is equal to the num ber of cache lines, NCL(v). For loop /, array u is accessed with stride(3„) whose footprint size is Fp(u). It occupies N C L (u ) cache lines in the cache. The probability of conflicting with array u is , , NCL(u) Provcf(v,u) = s . Therefore, the cache misses of array v on block data layout is CM{v) = NI0{1) x {NCL(v) - SCMi(v ) + R(v) x {SCM ^v) + CIM{v))} (4.5) where CJM (v) = NCL(v) x ■£ N C L W R U^V c Consider the 6-loop TMM shown in Figure 4.2(b). The reuse factors and foot print sizes of arrays X , Y and Z can be determined. The values are shown in Table 4.3. For example, consider an array Y in loop i. B 2 elements of Y are referenced in each iteration of loop i. These B 2 elements are reused N times. NIO(l) can be obtained directly from the code (Figure 4.2(b)). For example, 91 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. NIO(i) = N 3/ B 3. According to Eq.(4.5), the num ber of cache misses for Y and Z are as follows: CM{ Y) CM( Z) f + 0 + B*) for B < v s ; for \/S~c < B < s flS l for y/2Sc < B N 3 N*}l (x £ \ (B + 2L) Lc \ B B Sr. In the 6-loop TM M , each element of array X is immediately allocated to a register. So, its probability of conflicts with other arrays is 0. 
Thus, the number of cache misses for array X is CM(X) = N³/(B·L_c). The total number of cache misses for the 6-loop TMM with block data layout is thus

CM = Σ_v CM(v) = CM(X) + CM(Y) + CM(Z),    (4.6)

evaluated piecewise over the three ranges B < sqrt(S_c), sqrt(S_c) ≤ B < sqrt(2·S_c), and sqrt(2·S_c) ≤ B.

The above analysis focuses on the access pattern of 6-loop TMM. Because matrix multiplication is the kernel of many linear algebra computations, the analysis can be generalized or directly applied to other linear algebra applications.

To verify the cache miss estimations, we conducted simulations using SimpleScalar for 6-loop TMM with block data layout. The problem size was fixed at 1024 × 1024. A 16 KByte direct-mapped cache was assumed (similar to the L1 data cache in UltraSparc II). Figure 4.8 compares our estimated values (given by Eq. (4.6)) with the simulation results.

[Figure 4.8: Comparison of cache misses from simulation and estimation for 6-loop TMM]

4.3.3 Block size selection

To test the effect of block size, experiments were performed on several platforms. Figure 4.9 shows the execution time of a 6-loop TMM of size 1024 × 1024 on UltraSparc II (400 MHz) as a function of block size. It can be observed that block size selection is significant for achieving high performance.

[Figure 4.9: Execution time of TMM of size 1024 × 1024]

With canonical data layout, the tiling technique is sensitive to problem and tile sizes. Several GCD-based tile size selection algorithms [20, 27, 50] were proposed to optimize tiled computation; however, their performance is still sensitive to the problem size. In [62], TLB and cache performance were considered in concert. This approach showed better performance than algorithms that consider cache or TLB separately. However, all these approaches are based on canonical data layout. On the other hand, ATLAS [101] utilizes block data layout, but the best block size is determined empirically at compile time by evaluating the actual performance of the code over a wide range of block sizes.

In a multi-level memory hierarchy system, it is difficult to predict the execution time (T_exe) of a program. But T_exe is proportional to the total miss cost of the TLB and caches. In order to minimize T_exe, we evaluate and minimize the total miss cost for both the TLB and the l levels of cache. We have:

MC = TM · M_tlb + Σ_{i=1}^{l} CM_i · H_{i+1}    (4.7)

where MC denotes the total miss cost, CM_i is the number of misses in the ith-level cache, TM is the number of TLB misses, H_i is the cost of a hit in the ith-level cache, and M_tlb is the penalty of a TLB miss. The (l+1)th-level cache is the main memory, and it is assumed that all data reside in the main memory (CM_{l+1} = 0). Using the derivative of MC with respect to the block size, we can find the optimal block size that minimizes the overall miss cost.

For a simple 2-level memory hierarchy that consists of only one level of cache and the TLB, the total miss cost (denoted MC_tc1) in Eq. (4.7) reduces to

MC_tc1 = TM · M_tlb + CM · H_2,

where H_2 is the access cost of main memory.
In the above estimation, M_tlb and CM are substituted using Eq. (4.1) and Eq. (4.6), respectively. Setting the derivative of MC_tc1 with respect to the block size to zero, the optimal block size B_tc1 that minimizes the total miss cost caused by L1 cache and TLB misses is obtained in closed form as a function of the L1 line size L_c1, the page size S_d, and the miss penalties M_tlb and H_2 (Eq. (4.8)).

We now extend this analysis to determine a range for the optimal block size in a multi-level memory hierarchy that consists of a TLB and two levels of cache. The miss cost is classified into two groups: the miss cost caused by TLB and L1 cache misses, and the miss cost caused by L2 misses. Figures 4.10(a) and (b) show the miss cost estimated through Eqs. (4.1) and (4.6). Figure 4.10(a) shows the separate TLB, L1, and L2 miss costs, using UltraSparc II parameters. Figure 4.10(b) shows the variation of the estimated total miss cost as the ratio between the L1 cache miss penalty (H_2) and the L2 cache miss penalty (H_3) varies. Using Eq. (4.8), we discuss the total miss cost for three ranges of block size.

Lemma 3  For B < B_tc1, MC(B) > MC(B_tc1).

Proof: Using the derivatives of the TLB and cache miss equations (Eqs. (4.1) and (4.6)), it can easily be verified that dTM/dB < 0 and dCM/dB < 0 for B < B_tc1. This is shown in Figure 4.10(a). For B < B_tc1, the TLB, L1, and L2 miss costs all increase as the block size decreases, thereby increasing the total miss cost. Therefore, the optimal block size cannot be in the range B < B_tc1.

[Figure 4.10: Miss cost estimation for 6-loop TMM (UltraSparc II parameters) — (a) miss cost of TLB, L1, and L2 cache, (b) total miss cost for various L2 miss penalties]

Lemma 4  For B > sqrt(S_c1), MC(B) > MC(sqrt(S_c1)).

Proof: In the range B > sqrt(S_c1), the TLB miss cost is already optimized by tiling and block data layout, and the change in TLB miss cost is negligible as the block size increases. Since the block is larger than the L1 cache, self-interferences occur in this range, and the number of L1 cache misses increases drastically, as shown in Figure 4.10(a). For sqrt(S_c1) < B < sqrt(2·S_c1), write B² = a·S_c1 (1 < a < 2) and compare the derivatives of Eq. (4.6) for L1 and L2 misses. In a general memory hierarchy, S_c1 is much smaller than S_c2, L_c2 is much smaller than S_c2, and H_3 > H_2; it follows that the increase in L1 miss cost, H_2 · dCM_1/dB, is larger than the decrease in L2 miss cost, H_3 · |dCM_2/dB|. Thus, although the number of L2 cache misses decreases (dCM_2/dB < 0), the total miss cost increases for sqrt(S_c1) < B < sqrt(2·S_c1). For B > sqrt(2·S_c1), there is no reuse in the L1 cache, and the L1 cache miss cost saturates. Figure 4.10(b) shows the change of the total miss cost as the ratio H_3/H_2 varies. Even when the L2 miss penalty is 40 times the L1 miss penalty, MC(B) > MC(sqrt(S_c1)) for B > sqrt(2·S_c1), because the L1 self-interference miss cost is dominant for B > sqrt(2·S_c1). Therefore, the optimal block size cannot be in the range B > sqrt(S_c1).

Theorem 8  The optimal block size B_opt satisfies B_tc1 ≤ B_opt ≤ sqrt(S_c1).

Proof: This follows from Lemmas 3 and 4.
Therefore, an optimal block size that minimizes the total miss cost is located in

B_tc1 ≤ B_opt ≤ sqrt(S_c1).    (4.9)

We select a block size that is a multiple of L_c1 (the L1 cache line size) in this range.

To verify our approach, we conducted simulations using UltraSparc II parameters (Table 4.4). Figure 4.11 shows the simulation results of 6-loop TMM using block data layout. As discussed, the numbers of TLB and L2 misses decrease as the block size increases. Also, the minimum number of L1 misses was obtained for B = 36, and the number of L1 misses then increased drastically for B > 45. Figure 4.12 shows the total miss cost. For UltraSparc II, B_tc1 = 32.2, sqrt(S_c1) = 45.3, and L_c1 = 4. Theorem 8 thus suggests 36-44 as the range for the optimal block size. Simulation results show that the optimal block size for this architecture was 44.

[Figure 4.11: Simulation results of 6-loop TMM — (a) TLB misses, (b) L1 misses, (c) L2 misses]

[Figure 4.12: Total miss cost for 6-loop TMM — TLB, L1, and L2 miss costs per block size, with the ATLAS search range and our range marked]

We also tested ATLAS on UltraSparc II. Through a wide search ranging from 16 to 44, ATLAS found 36 and 40 as the optimal block sizes. These block sizes lie in the range given by Eq. (4.9). We further tested 6-loop TMM with respect to different problem and block sizes. For each problem size, we performed experiments by testing block sizes ranging from 8 to 80. In these tests, we found that the optimal block size for each problem size was in the range given by Eq. (4.9), as shown in Figure 4.13. These experiments confirm that our approach proposes a reasonably good range for block size selection.

[Figure 4.13: Optimal block sizes for 6-loop TMM]
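For concreteness, the following small helper (ours, not code from the thesis; parameter names are illustrative) applies Theorem 8 directly: given B_tc1 computed from Eq. (4.8) and the L1 cache geometry, it returns a block size that is a multiple of the L1 line size and lies in the range of Eq. (4.9), here taking the largest such multiple.

```c
#include <math.h>
#include <stddef.h>

/* A small helper (ours, not from the thesis) that applies Theorem 8:
 * given B_tc1 from Eq. (4.8) and the L1 cache geometry, return the largest
 * block size that (a) lies in [B_tc1, sqrt(S_c1)] and (b) is a multiple of
 * the L1 line size, both expressed in array elements. */
static int select_block_size(double b_tc1,      /* lower bound from Eq. (4.8)  */
                             size_t l1_bytes,   /* L1 cache capacity in bytes  */
                             size_t line_bytes, /* L1 line size in bytes       */
                             size_t elem_bytes) /* e.g. 8 for double precision */
{
    double s_c1 = (double)l1_bytes / elem_bytes;  /* cache size in elements  */
    int    l_c1 = (int)(line_bytes / elem_bytes); /* line size in elements   */
    int    upper = (int)floor(sqrt(s_c1));        /* sqrt(S_c1) upper bound  */

    /* Largest multiple of the line size not exceeding the upper bound. */
    int b = (upper / l_c1) * l_c1;
    if (b < (int)ceil(b_tc1))
        b = (int)ceil(b_tc1);                     /* fall back to the lower bound */
    return b;
}
```

With the UltraSparc II parameters used above (16 KB L1, 32-byte lines, double-precision data, B_tc1 = 32.2), this sketch returns 44, the block size the simulations identified as optimal.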
4.4 Experimental Results

To verify our analysis, we performed simulations and experiments on the following applications: tiled matrix multiplication (TMM), LU decomposition, and Cholesky factorization (CF). The performance of tiling with block data layout (tiling+BDL) is compared with other optimization techniques: tiling with copying (tiling+copying) and tiling with padding (tiling+padding). For tiling+BDL, the tile size (of the tiling technique) is chosen to be the same as the block size of the block data layout. Input and output are in canonical layout, and all the cost of performing data layout transformations (from canonical layout to block data layout and vice versa) is included in the reported results. As stated in [50], we observed that the copying technique cannot be applied efficiently to the LU and CF applications, since the copying overhead offsets the performance improvement; hence we do not consider tiling+copying for these applications. In all our simulations and experiments, the data elements are double-precision.

4.4.1 Simulations

To show the performance improvement of the TLB and caches using tiling+BDL, simulations were performed using the SimpleScalar simulator [12]. The problem size was 1024 × 1024. Two sets of architecture parameters were used: UltraSparc II and Pentium III. The parameters are shown in Table 4.4.

[Figure 4.14: Simulation results for TMM using UltraSparc II parameters — (a) TLB misses, (b) L1 misses, (c) L2 misses for tiling+copying, tiling+BDL, and tiling+padding]

Figure 4.14 compares the TMM simulations of the different techniques, based on UltraSparc II parameters. Tiling+BDL has fewer L1 and L2 cache misses than the other techniques. Block size 32 leads to increased L1 and L2 cache misses for block data layout because of cache conflicts between different blocks. Tiling+BDL reduced TLB misses by 91-96%, as shown in Figure 4.14(a). This confirms our analysis presented in Section 4.3.1. Figure 4.15 shows the total miss cost (calculated from Eq. (4.7)) for TMM using block size 40 × 40. L1, L2, and TLB miss penalties were assumed to be 6, 24, and 30 cycles, respectively. This figure shows that tiling+BDL results in the smallest total miss cost and that the TLB miss cost with tiling+BDL is negligible compared with the L1 and L2 miss costs. Figure 4.12 shows the effect of block size on the total miss cost for TMM using tiling+BDL. As discussed in Section 4.3.3, the best block size (44) is in the range 36-44 suggested by our approach.

[Figure 4.15: Total miss cost for TMM using UltraSparc II parameters — TLB, L1, and L2 miss costs for tiling+copying, tiling+BDL, and tiling+padding]

Figure 4.16 presents simulation results for LU using Intel Pentium III parameters. Similar to TMM, the number of TLB misses for tiling+BDL was almost negligible compared with that for tiling+padding, as shown in Figure 4.16(a). For both techniques, L1 and L2 cache misses were reduced considerably because of the 4-way set-associativity. For tiling+padding, when the block size was larger than the L1 cache size, the padding algorithm in [66] suggested a pad size of 0. There is then essentially no padding effect, and the L1 and L2 cache misses increase drastically.

[Figure 4.16: Simulation results for LU using Pentium III parameters — (a) TLB misses, (b) L1 cache misses, (c) L2 cache misses for tiling+BDL and tiling+padding]

[Figure 4.17: Effect of block size on LU decomposition using Pentium III parameters — (a) tiling+padding, (b) tiling+BDL]
Figure 4.17 shows the effect of block size on the total miss cost using tiling+padding and tiling+BDL. Tiling+padding reduced the L1 and L2 cache miss costs considerably; however, the TLB miss costs were still significantly high, affecting the overall performance. As discussed in Section 4.3.3, the suggested range for the optimal block size is 32-44. The simulations validate that the optimal block size achieving the smallest miss cost lies in the range selected using our approach.

4.4.2 Execution on real platforms

To verify our block size selection and the performance improvements using block data layout, we performed experiments on several platforms, as tabulated in Table 4.4. The gcc compiler was used in these experiments, with the optimization flags set to "-fomit-frame-pointer -O3 -funroll-loops". Execution time is the user processor time measured by the system call clock(). All the data reported here are averages of 10 executions. The problem sizes ranged from 1000 × 1000 to 1600 × 1600.

Table 4.4: Features of various experimental platforms

                             Alpha 21264   UltraSparc II   UltraSparc III   Pentium III
  Speed (MHz)                    500            400             750             800
  L1 cache   Size (KB)            64             16              64              16
             Line (Byte)          64             32              32              32
             Associativity         2              1               4               4
  L2 cache   Size (KB)          4096           2048            4096             512
             Line (Byte)          64             64              64              32
             Associativity         1              1               4               4
  TLB        Entries             128             64             512              64
             Page (KB)             8              8               8               4
             Associativity       128             64               2               4

Figure 4.18 shows the comparison of the execution time of tiling+BDL with the other techniques. Tiling+TSS (the tile size selection algorithm of [20]), shown in this figure, selects the block size based on GCD computation. Tiling solves the cache capacity miss problem but it cannot avoid conflict misses. Conflict misses are strongly related to the problem size and the block size, which makes tiling sensitive to the problem size. As discussed in Section 4.3.2, block data layout greatly reduces conflict misses, resulting in smoother performance compared with the other techniques.

[Figure 4.18: Execution time comparison of various techniques (tiling+TSS, tiling only, tiling+padding, tiling+copying, tiling+BDL) for TMM on Pentium III]

[Figure 4.19: Effect of block size on TMM — (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, (d) Pentium III]

The effect of block size on tiling+BDL is shown in Figures 4.19-4.21. Various problem sizes were tested, and the results for all these problems showed trends similar to those in Figures 4.19-4.21; as an illustration, the results for problem size 1024 × 1024 are shown. As shown in Figures 4.19-4.21, the optimal block sizes for Pentium III, UltraSparc II, Sun UltraSparc III, and Alpha 21264 are 40, 44, 76, and 76, respectively. All these numbers are in the range given by our block size selection algorithm. For example, the range for the best block size on Alpha 21264 is 64-78. This confirms that our block size selection algorithm proposes a reasonable range. As discussed earlier, block sizes 32 and 64 should be avoided (for use with block data layout) because performance degrades due to conflict misses between blocks.
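For reference, the 6-loop TMM kernel evaluated in these block-size experiments has roughly the following shape when all three matrices are stored in block data layout (a sketch of ours, not the thesis code; Z is assumed to be zero-initialized and N is assumed to be a multiple of the block size B):

```c
#include <stddef.h>

/* 6-loop tiled matrix multiplication over block data layout: Z = X * Y.
 * Each B x B block is stored contiguously; blocks are laid out in row-major
 * order of blocks, so block (bi,bk) of X starts at offset (bi*nb + bk)*B*B. */
void tmm_bdl(const double *X, const double *Y, double *Z, int N, int B)
{
    int nb = N / B;                            /* blocks per dimension */
    for (int bi = 0; bi < nb; bi++)
      for (int bk = 0; bk < nb; bk++)
        for (int bj = 0; bj < nb; bj++) {
          const double *x = X + (size_t)(bi * nb + bk) * B * B;
          const double *y = Y + (size_t)(bk * nb + bj) * B * B;
          double       *z = Z + (size_t)(bi * nb + bj) * B * B;
          /* Inner three loops: an ordinary B x B matrix multiply whose
           * accesses all fall inside three contiguous blocks. */
          for (int i = 0; i < B; i++)
            for (int k = 0; k < B; k++) {
              double xik = x[i * B + k];       /* held in a register, cf. array X */
              for (int j = 0; j < B; j++)
                z[i * B + j] += xik * y[k * B + j];
            }
        }
}
```

The innermost loops touch only three contiguous B × B blocks, and x[i*B + k] is kept in a register across the j loop, mirroring the treatment of array X in the analysis of Section 4.3.2.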
[Figure 4.20: Effect of block size on LU decomposition — (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, (d) Pentium III]

[Figure 4.21: Effect of block size on Cholesky factorization — (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, (d) Pentium III]

Figures 4.22-4.24 show the execution time comparison of tiling+BDL with tiling+copying and tiling+padding. In these figures, the block size for tiling+BDL was given by our algorithm discussed in Section 4.3.3, the tile size for the copying technique was given by the approach in [50], and the pad size was selected by the algorithm discussed in [66]. The tiling+BDL technique is faster than the other optimization techniques for almost all problem sizes and on all the platforms.

[Figure 4.22: Execution time of TMM — (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, (d) Pentium III]

[Figure 4.23: Execution time of LU decomposition — (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, (d) Pentium III]

[Figure 4.24: Execution time of Cholesky factorization — (a) Alpha 21264, (b) UltraSparc II, (c) UltraSparc III, (d) Pentium III]

4.4.3 Block data layout and Morton data layout

Recently, nonlinear data layouts have been considered to improve memory hierarchy performance. One such layout is the Morton data layout (MDL), as defined in Section 4.2.1.
Similar to block data layout, elements within each block are mapped onto contiguous memory locations. However, Morton data layout uses a different order to map the blocks, as shown in Figure 4.1. This order matches the access pattern of recursive algorithms. In this section, we compare the performance of recursive algorithms using MDL (recursive+MDL) with iterative tiled algorithms using BDL (iterative+BDL), for matrix multiplication and LU decomposition. We show that the performance of recursive+MDL is comparable with that of iterative+BDL if the block size of MDL lies in the optimal block size range for BDL given by our algorithm (Eq. (4.9) in Section 4.3.3). However, if the block size of MDL is outside this range, recursive+MDL is slower than iterative+BDL.

Similar to block data layout, the block size for Morton data layout plays an important role in performance. However, due to recursion, the choice of block sizes is limited: for an N × N matrix, if the depth of recursion is d, the block size of MDL is given by B_mdl = N/2^d. Such a block size can lie outside the optimal range given by our approach, and our experimental results show that this degrades the overall performance.

Experiments using TMM and LU were performed on UltraSparc II and Pentium III. Table 4.5 shows the execution time comparison of matrix multiplication using iterative+BDL and recursive+MDL. For iterative+BDL, we selected the block size according to the algorithm discussed in Section 4.3.3. For recursive+MDL, we tested various recursion depths (resulting in various basic block sizes) and used the best for comparison. For problem sizes 1280 × 1280 and 1408 × 1408, the best block sizes for recursive+MDL were 40 and 44 respectively, which are in the range given by our algorithm (36-44); both layouts showed competitive performance in these cases. For problem size 1600 × 1600, recursive+MDL was up to 15.8% slower than iterative+BDL. Among the possible block sizes of 25, 50, and 100, the performance of recursive+MDL was best at block size 25 (25 = 1600/2^6), which lies outside the optimal range specified by our algorithm.

Table 4.5: Comparison of execution time of TMM on various platforms (all times in seconds)

  (a) Pentium III                          (b) UltraSparc II
  Size   iterative+BDL  recursive+MDL      Size   iterative+BDL  recursive+MDL
  1024       10.37          10.98          1024       18.87          21.80
  1280       20.43          20.64          1280       36.17          40.63
  1408       27.06          28.21          1408       48.76          53.70
  1600       39.77          43.78          1600       70.44          81.61
  2048       83.27          87.64          2048      149.65         170.86

Table 4.6 shows the execution time comparison of tiled LU decomposition using BDL and recursive LU decomposition [102] using MDL. These results confirm our analysis.

Table 4.6: Comparison of execution time of LU decomposition on various platforms (all times in seconds)

  (a) Pentium III                          (b) UltraSparc II
  Size   iterative+BDL  recursive+MDL      Size   iterative+BDL  recursive+MDL
  1024        4.15           4.43          1024        8.77           9.94
  1280        8.10           8.10          1280       18.97          18.54
  1408       10.85          11.57          1408       22.76          22.45
  1600       15.85          18.44          1600       33.51          35.58
  2048       33.58          35.90          2048       75.30          81.66
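To make the difference between the two layouts concrete, the following sketch (ours; it assumes the number of blocks per dimension is a power of two and uses one common bit-interleaving convention) computes the starting offset of block (bi, bj) under each layout: block data layout places blocks in row-major block order, whereas Morton data layout orders blocks by interleaving the bits of the block coordinates (Z-order).

```c
#include <stddef.h>

/* Offset (in elements) of the first element of block (bi, bj) for an
 * N x N matrix stored as B x B blocks; nb = N / B blocks per dimension. */

/* Block data layout: blocks in row-major order of blocks. */
size_t bdl_block_offset(size_t bi, size_t bj, size_t nb, size_t B)
{
    return (bi * nb + bj) * B * B;
}

/* Morton (Z-order) data layout: interleave the bits of bi and bj
 * (nb is assumed to be a power of two). */
size_t mdl_block_offset(size_t bi, size_t bj, size_t nb, size_t B)
{
    size_t z = 0;
    for (size_t bit = 0; ((size_t)1 << bit) < nb; bit++) {
        z |= ((bj >> bit) & 1u) << (2 * bit);      /* column bit */
        z |= ((bi >> bit) & 1u) << (2 * bit + 1);  /* row bit    */
    }
    return z * B * B;
}
```

Within a block, both layouts store the B × B elements contiguously; only the order of the blocks differs, which is why the block size analysis of Section 4.3.3 applies to both.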
Chapter 5

Dynamic Data Reorganization: Dynamic Data Layouts

Many scientific and signal processing applications consist of several computational stages, with each computational stage having a different data access pattern [8, 57, 58]. In general, the computation is performed assuming that the data layout in memory is static (in this chapter, we call such an approach the static data layout (SDL) approach); i.e., the layout does not change during the computation. However, a fixed data layout may not match the data access pattern in all computation stages. This mismatch incurs expensive cache misses on uniprocessor platforms, resulting in performance degradation. To improve the performance, we propose to change the data layout dynamically in order to match the access pattern [72, 74, 76, 75]. We call this dynamic data reorganization.

In state-of-the-art signal transform packages, large signal transforms are decomposed into smaller transforms with different data access strides by taking advantage of their inherent divide-and-conquer property [29, 32, 36, 42, 61]. The working set size of signal transform computations can thereby be reduced to fit into the cache. Hence, the cache performance can be implicitly optimized by considering only the working set size. However, as the computation of these factorized transforms proceeds, the stride of the data access changes. The strided data access pattern then no longer matches the data layout, and the overall performance drastically degrades for large signal transforms.

In this chapter, we apply the dynamic data reorganization approach to implement large high performance signal transforms, including the Fast Fourier Transform (FFT) and the Walsh-Hadamard Transform (WHT). We show that the data layout in memory can be dynamically reorganized during the execution of the small transforms in a factorization in order to reduce the number of cache misses and improve the effective processor-memory bandwidth. This is called dynamic data layout (DDL). We also develop a search algorithm, considering the stride of data access in memory as well as the size of the signal transforms, which finds an optimal factorization with the minimum execution time.

In Section 5.1, we describe previous approaches that improve the performance of the memory hierarchy and optimize the performance of FFT and WHT. In Section 5.2, we discuss the factorization of signal transforms and provide the motivation for dynamic data layout based on an analytical model of the cache behavior of a factorized signal transform computation. We describe our dynamic data layout approach and a cache-conscious factorization in Section 5.3. The performance improvements obtained using our approach are illustrated in Section 5.4.

5.1 Related Work

In this section, we briefly discuss general optimization techniques for improving memory hierarchy performance. These techniques fall into two classes: access reordering and static data layout modification (i.e., modification before the computation begins). We then summarize optimization techniques for FFT and WHT that exploit low-level architectural and compiler features.
5.1.1 Memory hierarchy performance optimizations

To improve memory hierarchy performance, various manual and automated optimization techniques have been developed for uniprocessors and parallel systems [37, 51, 97]. In general, the data accesses during a computation can be characterized by two properties: spatial and temporal locality [38]. To exploit these properties, the data access order is changed by reordering the computation. This is called control transformation. Such optimization techniques, including tiling, loop fusion, loop interleaving, and loop interchange, enhance the temporal locality of data accesses. After reordering a computation, the number of cache misses is reduced, thereby improving memory hierarchy performance. However, the stride used to access data is not changed, since reordering a computation does not change the data layout. On state-of-the-art platforms, large data access strides result in cache conflict misses, since the caches are direct-mapped or have small set-associativity.

To further improve the performance, techniques such as copying and padding are used in concert with control transformation techniques. By copying scattered data elements in memory into a contiguous buffer, the copying optimization [51] ensures that the data in the temporary buffer map onto distinct locations in the cache without conflicts. Hence, spatial locality is further improved, reducing cache conflict misses. Although cache conflict misses are reduced by this method, the resulting performance improvement cannot amortize the copying overhead if data are repeatedly copied between memory and the buffer during the computation. In the padding optimization [66, 83], dummy elements are inserted into the array to avoid cache conflicts. This method copies the original data array into the padded array. During the computation, the overhead of the index computation needed to access the array is high, since the data elements are not stored contiguously; this overhead can degrade the overall performance.

Some recent work [15, 45, 92] has proposed changing the data layout to match the data access pattern of dense linear algebra applications. This is called data transformation, and it reduces cache misses. It was proposed in [45] that both data and loop transformations be applied to loop nests to optimize cache locality. In [15], the conventional (row- or column-major) layout is changed to a recursive data layout before the computation, in order to match the access pattern of recursive algorithms; this data layout was shown to improve memory hierarchy performance. The ATLAS project [101] automatically tunes several linear algebra implementations. It uses block data layout with tiling to exploit temporal and spatial locality. Input data, originally in column-major layout, is re-mapped onto block data layout before the computation begins. The combination of block data layout and unrolled tiled codes has shown high performance on various platforms.

In [15, 101], the data layout is reorganized before the computation begins and is then not changed until the computation finishes. Signal transforms such as the DFT and WHT consist of several computation stages, and as the computation proceeds, the strides between consecutively accessed data increase. Therefore, a static data layout is not necessarily optimal for all the computation stages. Similarly, copying and padding are difficult to apply to signal transform applications. Therefore, static data layout results in a large number of cache misses because of these strided accesses. To optimize the performance, we propose that the data layout be reorganized to match the data access pattern as the computation proceeds.

5.1.2 FFT and WHT performance optimizations

Various optimization techniques have been developed to improve the performance of FFT on vector and parallel computers [1, 6, 29, 31, 35, 49, 98]. The six-step approach by Bailey attempts to reduce the number of accesses to the external
Therefore, a static data layout might not necessarily be optim al for all the com putation stages. Similarly, copying and padding is difficult to be applied to signal transform ap plications. Therefore, static data layout results in large num ber of cache misses because of these stride accesses. To optimize the performance, we propose th at the data layout be reorganized to m atch the data access pattern as the computation proceeds. 5.1.2 F F T and W H T perform ance o p tim iza tio n s Various optimization techniques have been developed to improve the performance of FF T on vector and parallel computers [1, 6, 29, 31, 35, 49, 98]. The six-step approach by Bailey attem pts to reduce the number of accesses to the external 117 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. memory or “Solid State Disk” (SSD) by restructuring the com putation and per forming efficient m atrix transpose [6]. Wadleigh developed a seven-step approach, which is a modification of the six-step approach for optimizing cache performance on cache-coherent multiprocessors [98], The num ber of cache misses are reduced by blocking the com putation and transpose operations. Gannon et al. studied the performance of a shared memory hierarchy in vector multi-processors using an algebraic framework [31]. They used an approach similar to the copying optim iza tion. These approaches were targeted towards optimizing the memory performance on vector and parallel computers. Recently a high performance FF T package has been developed at M IT, known as FFTW [29]. This software architecture consists of a set of straight line un rolled codes, known as codelets. These codelets com pute small DFTs obtained from a factorization. To compute a large size D FT, a set of these codelets are combined using divide-and-conquer. The m ethod of combining these codelets is represented in a binary tree. To achieve the best performance on a target plat form, we should search all possible trees to find an optim al tree with the minimum execution tim e. However, since the search space comprising all such trees is large, dynamic programming was used to efficiently find an optim al tree. In this ap proach, it is assumed that all FFTs of the same size have the same performance. It is based on a cache oblivious algorithm [30] where the cache is assumed to be fully set-associative. In a factorized D FT such as FFTW , codelets require stride access for input (or output) data. As these strides increase, the performance of codelets degrades [61]. This degradation results from cache conflict misses between stride accessed data, since state-of-the-art platforms consist of direct-m apped or small set-associative cache. Even if DFTs are of the same data size, D FT performance depends on the data access pattern. To reduce cache conflict misses, an interleaving optimization is proposed in A CFFT [32]. Two or four small codelets with stride accesses are 118 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. interleaved and then merged into one new codelet. Hence cache lines are fully utilized before being replaced. However, this approach increases the codelet size, resulting in register explosion. A technique similar to FFTW was used to implement W HT [42]. In this im plementation, algorithmic choices are represented internally in a tree structure. This implementation can support iterative and recursive data structures, as well as combinations of both. 
Externally algorithmic choices are described by a simple grammar, which can be parsed to create different algorithms th at can be executed and timed. A parser is provided for reading W HT expressions and translating them into a tree data structure. Bailey’s six-step F F T transposes data only once between a given factorization. This was developed for distributed memory systems. However, transposing data is not considered for uni-processors since the overhead of transposing data can be larger than its performance gain [7]. Our approach is analogous to the six- step FFT, but is focused on optimizing the performance of signal transforms on a uniprocessor rather than on a vector or parallel processor. Also, we determ ine when and how many times data layout should be reorganized for a factorized com puta tion of a signal transform on a target platform . We find an optim al factorization th at includes these reorganizations. Based on these, we develop an approach to automatically compute cache-conscious factorized signal transforms to minimize the total execution tim e including the data reorganization overheads. 5.2 Factorized Signal Transform s 5.2.1 F actorization o f sign al tran sform s A linear discrete signal processing transform is usually expressed as a m atrix-vector product, x -> M.x, where x is the (sampled) signal and M is a transform m atrix. Examples for discrete signal transforms are the DFT (discrete Fourier transform), 119 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. Figure 5.1: Factorization tree obtained by recursively applying Cooley-Tukey al gorithm DCT (discrete cosine transform ), W HT (W alsh-Hadamard transform ), etc. Using the inherent divide-and-conquer property of signal transforms, a transform m atrix can be factorized into a product of sparse m atrices. These sparse matrices are highly structured. They can be represented in concise m athem atical form as a tensor product of a smaller transform and an IV x iV ' identity m atrix (I(N)). For example, the general Cooley-Tukey DFT [59] can be expressed as D FT(iV ) = L(JV, N2) • (1{N2) © D FT(iV 1)) • T ( A T , W ) ■ (D FT (A r 2) < 8 > I(W )), (5.1) where N = Ni x N 2, D FT(iY ) is the DFT of size N , ® is a tensor product operator, L(Ar, Ar i) is the stride perm utation m atrix, and T(Ar, Ni) is the twiddle m atrix [96]. Another example of a factorization is for a W HT of size 2": t W H T (2 ") = n (l(2"1+- +n<~1) < g > W H T (2 "‘) ® I(2ni+1+- +nt)) , 2 — 1 120 Reproduced with permission of the copyright owner. Further reproduction prohibited without permission. factorized DFT Figure 5.2: Examples of factorization tree for 16-point DFT where n = n\ + • • • nt. The above factorization m ethod can be applied repeatedly to the smaller signal transforms. This chain of factorization is represented in a tree [42], called as a factorization tree, shown in Figure 5.1. For a given transform , alternative trees can be obtained by applying different factorizations at each node as shown in Figure 5.2. In a tree, a node represents a transform of a specific size. In this chapter, we call such nodes as factorized nodes. Leaf nodes (nodes without children) in a tree represent un-factorized transforms, denoted as leaf node transforms. A transform (root node in a tree) is computed by executing the leaf node transforms and combining all such child node transforms to realize the root transform. 
The divide-and-conquer method reduces the size of a leaf node transform in the factorization tree, thereby reducing the working set size of the computation. Thus, the required data can potentially fit in the cache and each leaf node can have good cache performance. FFTW [29], developed at MIT, and the WHT package [42] from CMU exploit the divide-and-conquer property and achieve high performance. These packages are developed based on the cache-oblivious algorithm, where the cache is assumed to be fully set-associative and the replacement policy is assumed to be ideal. The performance of a factorized transform is assumed to depend only on its size. The performance of the DFT (assuming a fully set-associative cache) can be analyzed using the I/O complexity analysis of the "red-blue pebbling game" [39]. The lower bound on cache misses for an N-point FFT (throughout this chapter, N is assumed to be a power of 2) is O(N log2 N / log2 C), where C is the cache size.

5.2.2 Cache performance of the computation of factorized signal transforms

In this subsection, we discuss the cache performance of a leaf node with strided data access, as it arises in the computation of a large signal transform factorized by the divide-and-conquer method. For the sake of illustration, we consider the DFT as the example signal transform throughout this chapter.

We consider an N-point DFT, factorized as N1 × N2 using the Cooley-Tukey algorithm (Eq. (5.1)). A tensor product of a small DFT and an identity matrix can be implemented as a small DFT with strided access. First, N1 N2-point DFTs are performed with stride N1, followed by a twiddle multiplication. Following these computations, N2 N1-point DFTs are performed with unit stride. A stride permutation then corrects the order of the transformed data. Figure 5.3 illustrates the computation of the Cooley-Tukey algorithm and the new tree representation, where each node is labeled with its size and stride. The divide-and-conquer method is applied recursively to the child nodes until the size of the leaf nodes in the tree becomes smaller than the cache size. Also, stride accesses are involved in the leaf node computations.

[Figure 5.3: Computation of an N-point DFT using the Cooley-Tukey algorithm — computation stages, strides, permutation, and the (size, stride) tree representation]

To understand the cache behavior, we consider a two-level memory hierarchy consisting of a cache and main memory. The sizes of all parameters are expressed in terms of the number of data points. Let C denote the size of the cache and B denote the size of a cache line. To simplify the analysis, we assume the cache to be direct-mapped; the number of cache lines in a direct-mapped cache is C/B. Note that most state-of-the-art platforms have either direct-mapped or small set-associative caches, and the analysis for a k-way set-associative cache is similar to that for a direct-mapped cache.

To show the impact of the data access stride on cache performance, we consider a leaf node DFT with strided access in a factorization tree. A leaf node is an N_l-point DFT with stride S, where N_l < C. The cache behavior for performing two successive DFTs is illustrated in Figure 5.4, where N_l is assumed to be 4.
• Case I: S = 1, and Case II: S > 1 with N_l × S ≤ C

The number of compulsory cache misses caused by an N_l-point DFT with stride S is min(N_l × S/B, N_l). As shown in Figure 5.4, all the data points of the N_l-point DFT are mapped onto the cache without any conflict. When a successive N_l-point DFT is performed, its data points already reside in cache lines, since the data in these lines were fetched by the previous DFT. There is spatial reuse of data.

[Figure 5.4: Cache behavior for two successive DFTs (N_l = 4, B = 4, C = 32) — Case I: S = 1, N_l×S ≤ C; Case II: S = 8, N_l×S ≤ C; Case III: S = 16, N_l×S > C]

• Case III: S > 1 with N_l × S > C

When the stride is large enough that N_l × S > C, there are conflict misses in addition to compulsory misses, even though N_l < C. These conflict misses have a significant impact on performance. As shown in Figure 5.4, the block containing the 1st (2nd) data point and the block containing the 3rd (4th) data point are mapped onto the same cache line. Therefore, conflict misses occur in the computation of a single N_l-point DFT. Furthermore, the spatial reuse in the computation of successive DFTs is also lost: for a successive DFT computation, the 1st and 2nd blocks are accessed again, but the 1st (2nd) block was replaced by the 3rd (4th) block in the previous DFT computation due to conflict misses. Therefore, these strided data accesses result in cache pollution: cache lines are replaced before all data points in the cache line are fully utilized [2]. Cache pollution and conflicts lead to considerable performance degradation in the computation of successive leaf node DFTs.

To verify the above analysis, we performed simulations of a 32-point DFT with various strides using the SUN Shade simulator [90]. For the simulations, we used the L2 cache parameters of UltraSPARC I, where C = 2^15 and N_l = 2^5. Actual executions of a 32-point DFT with various strides were also performed on UltraSPARC I and Alpha 21264. Simulation and experimental results are shown in Figure 5.5. When the stride is 2^11, N_l × S (2^5 × 2^11) > C (2^15). The execution time and the number of cache misses increase drastically, as shown in Figures 5.5(a) and (b). These performance degradations result from the considerable cache conflict misses and cache pollution caused by strides larger than 2^10, as discussed in Case III. Such large stride accesses commonly occur in a large factorized DFT computation. To improve the performance, we present a high-level optimization in the next section.

[Figure 5.5: Effect of stride access in computing a 32-point DFT (s = log2 Stride) — (a) number of L2 cache misses on UltraSPARC I, (b) execution time on UltraSPARC I, (c) execution time on Alpha 21264]

5.3 Cache-Conscious Factorization of Signal Transforms

In the previous section, we explained the effect of large strides on cache performance. To improve performance, we propose an approach in which the data layout in memory is dynamically reorganized between the computation stages of the leaf node DFTs in a tree. We also propose a dynamic programming based search algorithm to find an optimal factorization tree with the minimum overall execution time. This approach considers the stride of the data access as well as the size of the DFTs to be computed.
[Figure 5.6: Strides resulting from the Cooley-Tukey factorization]

5.3.1 Dynamic data layout

As discussed in Section 5.2.2, each node in a tree is represented as a DFT with strided access (Figure 5.3). Figure 5.6 shows the factorization tree in which each node represents a DFT with its size and stride. As explained in Section 5.2.2, these stride accesses can cause cache conflicts and pollution in the leaf node computations, thereby resulting in performance degradation. This is due to the mismatch between the data access pattern and the data layout in memory. To improve the performance of a factorized DFT, the data layout should be dynamically reorganized to match the data access pattern of the computation. Note that data refers to the input and output data as well as intermediate results. As shown in Figure 5.7, the data layout in memory is reorganized between computation stages, thereby reducing the number of cache misses incurred during the computation. We call this the dynamic data layout (DDL) approach. After the computations are performed, reverse reorganizations restore the data to the natural order.

[Figure 5.7: Dynamic data layout approach for the Cooley-Tukey algorithm — reorganization and computation stages for N = N1 · N2, with S~N1 and S~N2 denoting the new strides after reorganization]

To show the effectiveness of the DDL approach, we present a simple example of a 256-point DFT, decomposed as 16 × 16 using the Cooley-Tukey algorithm. In Figure 5.8, the 16 × 16 data points are arranged in row-major order. The cache is assumed to be direct-mapped; for the sake of illustration, the cache size is assumed to be 64 points and the line size is assumed to be 4. In the Cooley-Tukey computation, the first sixteen 16-point DFTs have stride 16. Figure 5.8(a) shows the data points accessed by a 16-point DFT with stride 16. Every 4th data point in the computation of the 16-point DFT is mapped onto the same cache line, so the 16 data points are mapped onto only 4 cache lines, resulting in conflict misses. However, after reorganizing the data layout (Figure 5.8(b)), all 16 data points for a DFT are mapped onto contiguous locations in the cache, resulting in no conflicts. Thus, the number of cache misses is reduced and each cache line is fully utilized, thereby improving the overall performance.

[Figure 5.8: Data access pattern for a 16 × 16 factorization — (a) before applying the DDL approach (16-point DFT with stride 16, conflicts), (b) after applying the DDL approach (16-point DFT with stride 1)]

In a factorization tree, our approach can be applied at any node. However, data reorganization is an overhead, since it requires memory accesses. In order to improve the overall performance, the number of nodes where reorganizations are applied should be minimized. We also need to determine the nodes at which the DDL approach should be applied in a factorization tree.
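The reorganization step itself is simple; a naive version is sketched below (ours; a blocked variant would reduce the misses incurred by the reorganization itself). It gathers each strided subsequence into a contiguous run so that the subsequent leaf DFTs run at unit stride, and scatters the results back afterwards; this is the reorganization whose cost is denoted Dr(·) in the cost model that follows.

```c
#include <complex.h>

/* Gather N = n_sub * len points so that each of the n_sub strided
 * subsequences (stride = n_sub in the original layout) becomes a
 * contiguous run of 'len' points in 'dst'.  Reorganizing the N points
 * costs O(N/B) cache lines of traffic, cf. the Dr(...) term below. */
void gather_strided(const double complex *src, double complex *dst,
                    int n_sub, int len)
{
    for (int s = 0; s < n_sub; s++)        /* which subsequence  */
        for (int j = 0; j < len; j++)      /* position within it */
            dst[s * len + j] = src[j * n_sub + s];
}

/* Inverse operation: scatter the contiguous runs back to the strided layout. */
void scatter_strided(const double complex *src, double complex *dst,
                     int n_sub, int len)
{
    for (int s = 0; s < n_sub; s++)
        for (int j = 0; j < len; j++)
            dst[j * n_sub + s] = src[s * len + j];
}
```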
In our DDL approach, a tree called a DDL factorization tree describes the factorization, with each node annotated with its size and stride, as shown in Figure 5.7. Changing the data layout changes the data access stride at a node. For a particular factorization, there are several different DDL factorization trees, since various layouts can be applied to the nodes in the tree. Among all the possible trees for a factorization, we should find a factorization tree with the minimum execution time. Such a tree is called a cache-conscious factorization tree. To find such a tree, we make the following assumptions:

• All DFTs of the same size and stride have the same computation cost.
• At each node, k different data layouts (strides) are considered, where k > 1. No other layout is considered.

We need to define a cost model for an N-point DFT based on the execution time over all possible DDL factorization trees. Consider an N-point DFT with stride S_N, which is a node labeled (N, S_N) in Figure 5.6. Its computation cost is represented as DFT(N, S_N). For a given N1 × N2 factorization of N, consider that the N1-point DFTs are computed with stride S_N1 and the N2-point DFTs are computed with stride S_N2. L_SN1 (L_SN2) denotes the original data layout for the N1 (N2)-point DFTs with stride S_N1 (S_N2). The minimum cost, including data reorganizations, for a given factorization is:

min over S~N1, S~N2 of [ Dr(N, L_SN1, L_S~N1) + N2 × DFT(N1, S~N1) + Dr(N, L_S~N1, L_S~N2)
                         + T(N, N1) + N1 × DFT(N2, S~N2) + Dr(N, L_S~N2, L_SN2) ],    (5.2)

where S~N1 (S~N2) denotes the new stride for the N1 (N2)-point DFTs using the reorganized data layout L_S~N1 (L_S~N2), T(N, N1) is the cost incurred by the twiddle factor multiplications, and Dr(N, L_SNi, L_S~Ni) is the cost of reorganizing the N data points used for the Ni-point DFTs from layout L_SNi to layout L_S~Ni (i = 1 or 2). Reorganizing data of size N involves O(N/B) memory accesses, where B is the cache line size. When L_SNi = L_S~Ni in Dr(N, L_SNi, L_S~Ni), no data reorganization is performed. If k different strides are considered for both S~N1 and S~N2, the complexity of evaluating Eq. (5.2) for a given factorization is O(k²). In the following subsection, we develop a search algorithm that finds an optimal factorization tree, with the minimum execution time, for an N-point DFT.

5.3.2 Search for an optimal factorization

An N-point DFT, where N = 2^n, can be represented by many possible factorization trees. Optimizing the performance of a DFT thus becomes the problem of finding an optimal tree, with the minimum execution time, over a large space of possible factorization trees. The search algorithms of the previous approaches [29, 42] considered only the size of the DFTs, and even then the number of candidate trees grows exponentially in n [42]. Since we consider the stride as well as the size of the DFTs, the search space in our approach is much larger than that of the previous approaches, and exhaustive search for an optimal factorization tree is impractical.
To search for an optimal tree efficiently, we use dynamic programming. In dynamic programming, a DDL factorization tree is built bottom-up: the optimal factorizations for all DFTs smaller than the DFT under consideration have already been determined. The root node of the N-point DFT is factorized into two children, as shown in Figure 5.9. If the factorization tree is built bottom-up considering only the size of the DFTs, the complexity is O(n²). As discussed in Section 5.3.1, if k different strides are considered at each node, the complexity of finding a cache-conscious factorization tree for a given factorization is O(k²). Therefore, the overall search complexity using the DDL approach is O(k²n²). In the following, the size of a node refers to the size of the DFT to be performed at that node (see Figure 5.3).

[Figure 5.9: Different DDL factorization trees for a 2^n-point DFT — (n−1) factorizations of the root, with k different strides considered for each node]

The search space for an optimal factorization tree can be further reduced based on the following property.

Property 1  The stride of a (child) node in the tree depends on the stride of its parent node, its position (left or right), and the size of its sibling.

Based on this property, the strides of the left and right child nodes are:

S_N1 = S_p,    S_N2 = S_p × N1,

where S_N1 (S_N2) is the stride of the left (right) node, N1 (N2) is the size of the left (right) node, and S_p is the stride of the parent. This property helps to reduce our search space. As shown in Figure 5.6, node (N1, S_N1) is recursively decomposed into node (N11, S_N11) and node (N12, S_N12); the stride of a child node is related to that of its parent node, the size of its sibling, and its location in the tree. As shown in Figure 5.6, the root of a tree always has unit stride. Thus, the optimal trees determined by our dynamic programming have unit stride, and based on Property 1, the right subtree also has unit stride. In dynamic programming, the optimal cost of performing the N2-point DFT with unit stride has already been determined before searching for an optimal tree for the N-point DFT; thus the right subtree is an optimal tree. However, the left subtree, which is an N1-point DFT, has stride accesses. Hence, we apply the DDL approach only to the left subtree.

Based on this, we can simplify our cost model for dynamic programming. Consider a DFT of size N = 2^n. This DFT can be factorized as N1 (= 2^{n1}) × N2 (= 2^{n−n1}), where n1 = 0, ..., n−1. The cost of the 2^n-point DFT using an optimal factorization tree can be computed as:

DFT(2^n, 1) = min over n1 of [ Dr(2^n, L_S_{2^{n1}}, L_S~_{2^{n1}}) + 2^{n−n1} × DFT(2^{n1}, S~_{2^{n1}})
                               + Dr(2^n, L_S~_{2^{n1}}, L_S_{2^{n1}}) + T(2^n, 2^{n1}) + 2^{n1} × DFT(2^{n−n1}, 1) ],    (5.3)

where N = 2^{n1} × 2^{n−n1} and DFT(N, S) represents the optimal cost of computing an FFT of size N with stride S. Furthermore, according to the analysis in Section 5.2.2, the DDL approach can be beneficial only when the product of the size of the DFT and the stride is greater than the cache size (N × S > C). Thus, for a DFT smaller than the cache, a factorization tree with data reorganization may not be an optimal solution because of the reorganization overhead.
For a DFT larger than the cache, factorization trees with data reorganizations can lead to faster DFT implementations. We therefore apply the DDL approach only to transforms whose sizes are equal to or larger than the cache size.

/* To find an optimal factorization tree for a 2^i-point DFT */
FindOptTree (int i, tree *OptTree) {
    minTime = Infinity
    for j = 0 to i-1                /* candidate factorizations 2^j x 2^(i-j) */
        for l = 0 to k-1            /* k different strides */
            S_l                     /* l-th candidate stride of the left subtree */
            Tree = GenerateTree(OptTree[j], OptTree[i-j], S_l)
            Time = FFT_Measure(Tree)
            if (Time < minTime)
                OptTree[i] = Tree
                minTime = Time
            end if
        end for
    end for
}

Figure 5.10: Search algorithm considering dynamic data layout

To find an optimal factorization tree for an N (= 2^n)-point DFT, dynamic programming requires n iterations. Consider the ith iteration, which builds an optimal factorization tree for a 2^i-point DFT, where i = 1, ..., n. The optimal factorization trees for all DFTs smaller than a 2^i-point DFT have already been determined in earlier iterations. During the ith iteration, we search all possible DDL factorizations. The algorithm (FindOptTree) that finds the optimal tree for a 2^i-point DFT is shown in Figure 5.10. In the algorithm, GenerateTree generates a new tree considering the sizes of both child nodes and the stride of the left child node. The function FFT_Measure measures the FFT execution time for a given tree, including data reorganization. If the execution time is less than minTime, OptTree and minTime are updated with the current Tree and Time. When N is a power of 2, the sizes and strides in the various possible factorization trees are also powers of 2. As shown in Figure 5.10, (i−1) different factorizations and k different strides for the left subtree of the root node are considered. The overall complexity of our search algorithm to find an optimal factorization tree for a 2^n-point DFT using dynamic programming is thus O(kn²). In our implementation in Section 5.4, we use k = 2, so the complexity is O(n²). The search algorithm for finding the optimal factorization trees when implementing the WHT is similar.
Table 5.1: Some alternate factorization trees

  FFT size  Layout    Factorization tree                              Exec. time (msecs)
  2^18      Static    ct[1,ct[4,ct[1,ct[3,ct[3,ct[3,3]]]]]]           379.6640
                      ct[2,ct[4,ct[3,ct[3,ct[3,3]]]]]                 307.9280 *
                      ct[3,ct[1,ct[1,ct[1,ct[3,ct[3,ct[3,3]]]]]]]     322.0800
                      ct[ct[4,4],ct[4,ct[3,3]]]                       481.6560
                      ct[ct[3,ct[3,3]],ct[3,ct[3,3]]]                 491.9040
                      ct[ct[4,ct[3,3]],ct[4,4]]                       444.5680
            Dynamic   ct[1,ctddl[ct[2,ct[3,4]],ct[4,4]]]              227.8960
                      ct[2,ctddl[ct[3,4],ct[2,ct[3,4]]]]              227.4080
                      ct[3,ctddl[ct[3,4],ct[4,4]]]                    210.3280
                      ctddl[ct[4,4],ct[ct[3,4],3]]                    148.1080
                      ctddl[ct[2,ct[3,4]],ct[2,ct[3,4]]]              131.1500 *
                      ctddl[ct[ct[3,4],3],ct[4,4]]                    145.3020
  2^20      Static    ct[2,ct[3,ct[3,ct[3,ct[3,ct[3,3]]]]]]           1899.296
                      ct[3,ct[4,ct[1,ct[3,ct[3,ct[3,3]]]]]]           1826.096
                      ct[4,ct[4,ct[3,ct[3,ct[3,3]]]]]                 1577.216 *
                      ct[ct[3,ct[3,3]],ct[2,ct[3,ct[3,3]]]]           2819.664
                      ct[ct[4,ct[3,3]],ct[4,ct[3,3]]]                 2652.768
                      ct[ct[2,ct[3,ct[3,3]]],ct[3,ct[3,3]]]           3213.968
            Dynamic   ctddl[1,ctddl[ct[ct[3,4],3],ct[2,ct[3,4]]]]     1431.792
                      ctddl[2,ctddl[ct[2,ct[3,4]],ct[2,ct[3,4]]]]     1466.928
                      ctddl[3,ctddl[ct[2,ct[3,4]],ct[4,4]]]           1306.864
                      ctddl[ct[2,ct[3,4]],ct[3,ct[4,4]]]              935.9840
                      ctddl[ct[ct[3,4],3],ct[ct[3,4],3]]              660.7520 *
                      ctddl[ct[3,ct[4,4]],ct[2,ct[3,4]]]              931.1040

Our approach is a high-level optimization that exploits the characteristics of the memory hierarchy by analyzing the data access pattern. Our approach can be applied on top of low-level optimized codes such as those employed in FFTW [29] or the CMU package [42], thus making our approach portable.

5.3.3 Alternate factorizations using dynamic data layout

Table 5.1 shows the performance of several factorization trees, including the optimal factorization tree, using static and dynamic data layouts. In the static data layout (SDL) approach, the data layout does not vary during the computation. To find an optimal factorization tree using the SDL approach, we search possible factorization trees considering only the size of the DFTs. In the dynamic data layout (DDL) approach, data reorganizations are applied to computation nodes. Experiments were performed on an Alpha 21264 (clock speed of 500 MHz, L2 cache size of 2 MB, i.e., 2^17 data points). The compiler and its optimization options used in the experiments are shown in Table 5.3. The best execution time for each data layout is marked with an asterisk (*) in Table 5.1. In Table 5.1, "n" denotes a 2^n-point un-factorized DFT. Cooley-Tukey algorithms using static and dynamic data layouts, factorized into 2^{n1} x 2^{n2}, are represented as "ct[n1, n2]" and "ctddl[n1, n2]", respectively. As shown in Table 5.1, we found experimentally that an alternate optimal factorization tree using dynamic data layout can have better performance than the optimal factorization tree using static data layout. We noticed that the optimal factorization trees using static data layout were close to a right-most tree. (Right-most trees consist of a leaf node on the left side and a right subtree which is also a right-most tree.) However, the optimal factorization tree using dynamic data layout is close to a balanced tree, where the subtrees on both sides are expanded together. As shown in the factorizations for the 2^20-point DFT, some trees include the term "ctddl" twice, since they involve two reorganizations. This affects execution time, since reorganization itself is an overhead during the computation. Therefore, it is important to find an optimal tree, including the cost of reorganizations, in order to achieve high performance.
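The "ct"/"ctddl" strings in Table 5.1 are simply a textual encoding of factorization trees. The following C sketch, which is ours and not part of either package, shows one possible in-memory representation and how the transform size of a tree is the product of its leaf sizes.

    #include <stdio.h>
    #include <stdlib.h>

    /* A factorization tree node: either a leaf "n" (an un-factorized 2^n-point
     * DFT) or an internal node ct[left,right] / ctddl[left,right], where ctddl
     * marks a node whose children are computed after a data reorganization. */
    typedef struct node {
        int leaf_log_size;         /* valid only when left == right == NULL */
        int uses_ddl;              /* 1 for ctddl[...], 0 for ct[...]       */
        struct node *left, *right;
    } node;

    static node *leaf(int n) {
        node *v = calloc(1, sizeof *v);
        v->leaf_log_size = n;
        return v;
    }
    static node *ct(int ddl, node *l, node *r) {
        node *v = calloc(1, sizeof *v);
        v->uses_ddl = ddl; v->left = l; v->right = r;
        return v;
    }

    /* The transform size at a node is 2 to the sum of the leaf exponents. */
    static int log_size(const node *v) {
        if (!v->left) return v->leaf_log_size;
        return log_size(v->left) + log_size(v->right);
    }

    int main(void) {
        /* ctddl[ct[2,ct[3,4]],ct[2,ct[3,4]]] from Table 5.1: a 2^18-point DFT */
        node *l = ct(0, leaf(2), ct(0, leaf(3), leaf(4)));
        node *r = ct(0, leaf(2), ct(0, leaf(3), leaf(4)));
        node *t = ct(1, l, r);
        printf("tree computes a 2^%d-point DFT\n", log_size(t));
        return 0;
    }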
5.4 Experimental Results

To show the cache performance improvements obtained using our technique, we conducted several cache simulations. We compare the cache miss rates for FFTs implemented using the DDL and SDL approaches. Further, we apply our approach to recent FFT and WHT packages developed at CMU. We perform experiments on several platforms and present a performance comparison of our implementations with the previous packages [42, 36]. Also, we show that the optimal factorization trees from the DDL approach are different from those in the SDL approach.

[Figure 5.11: Miss rates (%) for various FFT sizes (n = 12 to 21, FFT size = 2^n); cache size = 512 KB, line size = 64 Bytes, direct-mapped; SDL vs. DDL]

[Figure 5.12: Miss rates (%) for various cache line sizes (32 to 1024 Bytes); FFT size = 256 K points, cache size = 512 KB, direct-mapped; SDL vs. DDL]

5.4.1 Simulation results

We performed simulations using the cache simulator in the SUN Shade simulator [90]. The cache is assumed to be 512 KB and direct-mapped. A data point is a double complex number (16 Bytes). The capacity of a 512 KB cache is 2^15 data points. We studied two different cases to illustrate cache performance. First, the cache miss rates for different FFT sizes (N) were studied. Second, we performed simulations by varying the cache line size (B) for a fixed FFT size (N).

Figure 5.11 shows the cache miss rate as the size of the FFT is varied. For small FFTs (less than 2^14 points), the DDL approach shows performance similar to the SDL approach. As the size increases from 2^14 to 2^16, the miss rate increases significantly, from 0.695% to 4.081%, for the SDL approach. However, the miss rate for the DDL approach increases only from 0.560% to 3.228%. When the size of the FFT becomes larger than the cache size, the miss rate for the DDL approach is much lower than that for the SDL approach.

Figure 5.12 shows the cache miss rate as the cache line size is varied. As the cache line size increases, the cache miss rate decreases. Since the DDL approach changes the stride in the FFT computation to unity, it utilizes the cache line more efficiently than the SDL approach, resulting in a lower cache miss rate. When the cache line size is 64 bytes, which is the cache line size on most state-of-the-art platforms, the cache miss rate is 3.98% using the SDL approach and 2.96% using the DDL approach. The DDL approach thus has a 25% lower miss rate compared with the SDL approach.
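The spatial-locality effect behind Figures 5.11 and 5.12 can be reproduced with a toy model. The sketch below is not the Shade simulator; it is a minimal single-pass, direct-mapped cache model that we use only to show why unit-stride access uses a cache line better than large-stride access, and why the benefit grows with the line size.

    #include <stdio.h>
    #include <string.h>

    #define CACHE_BYTES (512 * 1024)  /* 512 KB, direct-mapped, as in Section 5.4.1 */
    #define POINT_BYTES 16            /* one double-precision complex data point    */
    #define MAX_LINES   (CACHE_BYTES / 32)

    static long tag[MAX_LINES];

    /* Miss rate of a single pass over n_points points accessed at the given
     * stride (in points).  This toy model captures only spatial locality. */
    static double miss_rate(long n_points, long stride, long line_bytes) {
        long n_lines = CACHE_BYTES / line_bytes, misses = 0;
        memset(tag, 0xFF, sizeof tag);                 /* empty cache */
        for (long i = 0; i < n_points; i++) {
            long line = (i * stride * POINT_BYTES) / line_bytes;
            long set  = line % n_lines;
            if (tag[set] != line) { tag[set] = line; misses++; }
        }
        return (double)misses / (double)n_points;
    }

    int main(void) {
        long n = 1L << 16;                             /* 2^16 data points */
        for (long line = 32; line <= 128; line *= 2)
            printf("line %3ld B: unit stride %.3f, stride 2^9 %.3f\n",
                   line, miss_rate(n, 1, line), miss_rate(n, 1L << 9, line));
        return 0;
    }

At unit stride the miss rate is 1/(points per line), so it halves each time the line doubles, while the large-stride pass misses on every access; converting strided accesses to unit stride is exactly what the DDL reorganization does.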
5.4.2 Experimental results

In this subsection, we report experimental results obtained on four state-of-the-art platforms. Tables 5.2 and 5.3 summarize the relevant architectural parameters, compilers, and optimization options for the platforms used in our experiments. To obtain accurate execution times, computations were repeated until the overall execution time was larger than 1 second. The measured time is the wall clock time using the clock() function. The total execution time was obtained by deducting the loop overhead from this time. The average execution time is reported. Throughout this subsection, N denotes the size of the FFT or WHT, N = 2^n for some positive integer n.

In the experiments, we used the FFT package [36] provided by CMU, which is called FFT SDL in this paper. FFT SDL uses the optimized code modules (codelets) from FFTW as leaf node FFTs. Its design is based on an FFT package presented in [25], which is different from FFTW. We applied our DDL approach as a high-level optimization over FFT SDL. We call our modified FFT package FFT DDL. For the experiments, all data points in a DFT are double-precision complex numbers (16 bytes). To show the performance improvement obtained using the DDL approach, we compare the performance of FFT DDL with that of FFT SDL. Also, we present the relative performance improvement of FFT DDL compared with FFTW, using the following formula:

  (MFLOPS of FFT DDL - MFLOPS of FFTW) / (MFLOPS of FFTW) x 100 (%).

Table 5.2: Parameters of the platforms

  Processor             Alpha 21264   MIPS R10000   Pentium 4   UltraSPARC III
  Clock (MHz)           500           195           1700        750
  L1 cache: size (KB)   64            32            8           64
  L1 cache: line (B)    64            32            64          32
  L2 cache: size (KB)   4096          4096          256         4096
  L2 cache: line (B)    64            64            64          64

Table 5.3: Compiler and optimization options used in the experiments

  Processor        Compiler                         Optimization flags
  Alpha 21264      gcc version egcs-2.91.66         -O6 -fomit-frame-pointer -pedantic
  MIPS R10000      MIPSpro Compilers: Ver. 7.2.1    -r10000 -O3 -Ofast -mips4
  Pentium 4        gcc version 2.96                 -O6 -fomit-frame-pointer -pedantic
  UltraSPARC III   gcc version 2.8.1                -O6 -fomit-frame-pointer -pedantic

For these experiments, we used the recent FFTW (version 2.1.3). Figures 5.13-5.16 show the performance of FFT DDL in comparison with FFT SDL and FFTW on the four platforms. Figure 5.13(a) shows the performance of FFT SDL and FFT DDL on the Alpha 21264 platform. For small problems (N less than 2^13), all the data points required for the FFT can reside in the cache. It is not necessary to reorganize the data layout, and our search algorithm selects the same tree as the tree used in the SDL approach. Thus, FFT DDL shows the same performance as FFT SDL. For sizes between 2^13 and 2^18, the required data can reside in the L2 cache. In this range, our approach produces a marginal gain as it reduces the L1 cache misses. But as the problem size exceeds the size of the L2 cache, FFT DDL achieves a speed-up of up to 2.23x. Figure 5.13(b) shows the relative performance of FFT DDL compared with FFTW on the Alpha 21264.

[Figure 5.13: Performance of FFT DDL on Alpha 21264: (a) MFLOPS of FFT DDL and FFT SDL vs. problem size (n), (b) relative performance improvement of FFT DDL over FFTW (%); L1 and L2 cache sizes marked]

[Figure 5.14: Performance of FFT DDL on MIPS R10000: (a) MFLOPS of FFT DDL and FFT SDL vs. problem size (n), (b) relative performance improvement of FFT DDL over FFTW (%)]
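The relative-improvement formula above is straightforward; the following C fragment evaluates it, with throughput numbers that are made up purely for illustration (they are not measured values from the dissertation).

    #include <stdio.h>

    /* Relative improvement of FFT DDL over FFTW, as defined by the formula above. */
    static double rel_improvement(double mflops_ddl, double mflops_fftw) {
        return (mflops_ddl - mflops_fftw) / mflops_fftw * 100.0;
    }

    int main(void) {
        double fftw = 200.0, ddl = 396.0;      /* hypothetical throughputs */
        printf("relative improvement: %.0f%% (speed-up %.2fx)\n",
               rel_improvement(ddl, fftw), ddl / fftw);
        return 0;
    }

A 98% relative improvement therefore corresponds to roughly a 2x throughput advantage over FFTW.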
[Figure 5.15: Performance of FFT DDL on Pentium 4: (a) MFLOPS of FFT DDL and FFT SDL vs. problem size (n), (b) relative performance improvement of FFT DDL over FFTW (%)]

[Figure 5.16: Performance of FFT DDL on UltraSPARC III: (a) MFLOPS of FFT DDL and FFT SDL vs. problem size (n), (b) relative performance improvement of FFT DDL over FFTW (%)]

[Figure 5.17: Execution time per point of WHT DDL and WHT SDL on (a) Alpha 21264, (b) MIPS R10000, (c) Pentium 4, and (d) UltraSPARC III]

If the size of the FFT is less than 2^13, the performance of FFT DDL was worse than that of FFTW. However, for sizes between 2^13 and 2^18, FFT DDL was superior by approximately 30%. If the size is larger than the L2 cache size, the relative performance of FFT DDL is up to 98% better compared with FFTW.

Figure 5.14(a) shows the performance of FFT SDL and FFT DDL on the MIPS R10000. Figure 5.14(b) shows the relative performance of FFT DDL over FFTW. For FFT sizes between 2^12 and 2^18 (i.e., for sizes larger than the L1 cache size and smaller than the L2 cache size), the relative performance gain was around 10%. It is smaller than the performance improvement on the Alpha 21264, since the cache line size of the MIPS R10000 is smaller than that of the Alpha 21264. As discussed in Subsection 5.4.1, our approach is less effective for smaller cache lines. On the Pentium 4 and UltraSPARC III, FFT DDL shows improved performance when the problem size is larger than the L2 cache size, as shown in Figures 5.15 and 5.16.

We also applied our approach to the WHT package developed at CMU [42], which is called WHT SDL in this paper. WHT SDL consists of a set of straight-line unrolled codes for small WHTs, which are similar to the codelets in FFTW. Our WHT package is called WHT DDL. In all our experiments, the data points are double precision (8 Bytes). Figure 5.17 shows the computation time per point on the four platforms. We achieved performance gains in WHT similar to those in FFT.

Table 5.4 shows the optimal trees chosen by dynamic programming for WHT DDL and WHT SDL on the Alpha 21264. In Table 5.4, "n" is a leaf node in a tree. It represents a straight-line unrolled code for a 2^n-point WHT. "wht[n1,n2]" represents a tree of the WHT, factorized as 2^{n1} x 2^{n2}, using the SDL approach. "whtddl[n1,n2]" represents the (factorized) WHT tree using the DDL approach. For a WHT smaller than 2^14 points, all data points reside in the cache, so the WHT is efficiently computed without any cache conflict misses. Hence, the search algorithm of WHT DDL selects the same tree as that selected by WHT SDL. For problems larger than 2^14, WHT DDL achieves better performance than WHT SDL.
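The leaf nodes in Table 5.4 are straight-line unrolled codes for small WHTs. For readers unfamiliar with the transform itself, the sketch below is a textbook in-place radix-2 fast WHT; it is neither the CMU WHT SDL package nor our WHT DDL code, but it shows the butterfly structure that those packages factorize, which, unlike the FFT, needs no twiddle factors.

    #include <stdio.h>

    /* Textbook in-place radix-2 Walsh-Hadamard transform of a real vector
     * whose length n_points is a power of two. */
    static void wht(double *x, int n_points) {
        for (int h = 1; h < n_points; h *= 2)
            for (int i = 0; i < n_points; i += 2 * h)
                for (int j = i; j < i + h; j++) {
                    double a = x[j], b = x[j + h];
                    x[j]     = a + b;      /* butterfly without twiddle factors */
                    x[j + h] = a - b;
                }
    }

    int main(void) {
        double x[8] = {1, 0, 1, 0, 0, 1, 1, 0};
        wht(x, 8);
        for (int i = 0; i < 8; i++) printf("%g ", x[i]);
        printf("\n");
        return 0;
    }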
Table 5.5 shows a comparison of the optimal trees of FFT SDL with those of FFT DDL on the MIPS R10000. In the table, "n" denotes a 2^n-point un-factorized DFT. Cooley-Tukey algorithms using static and dynamic data layouts, factorized into 2^{n1} x 2^{n2}, are represented as "ct[n1, n2]" and "ctddl[n1, n2]", respectively.

Table 5.4: Optimal factorizations of WHT on Alpha 21264

  WHT (n)   WHT SDL [42]                              WHT DDL
  11        wht[3,wht[4,4]]                           wht[3,wht[4,4]]
  12        wht[4,wht[4,4]]                           wht[4,wht[4,4]]
  13        wht[5,wht[4,4]]                           wht[5,wht[4,4]]
  14        wht[1,wht[5,wht[4,4]]]                    whtddl[wht[3,4],wht[3,4]]
  15        wht[5,wht[2,wht[4,4]]]                    whtddl[wht[4,3],wht[4,4]]
  16        wht[5,wht[3,wht[4,4]]]                    whtddl[wht[4,4],wht[4,4]]
  17        wht[1,wht[5,wht[3,wht[4,4]]]]             whtddl[wht[4,4],wht[4,5]]
  18        wht[1,wht[1,wht[5,wht[3,wht[4,4]]]]]      whtddl[wht[4,5],wht[4,5]]
  19        wht[4,wht[5,wht[2,wht[4,4]]]]             whtddl[wht[4,5],wht[2,wht[4,4]]]
  20        wht[4,wht[5,wht[3,wht[4,4]]]]             whtddl[wht[2,wht[4,4]],wht[2,wht[4,4]]]
  21        wht[1,wht[4,wht[5,wht[3,wht[4,4]]]]]      whtddl[wht[2,wht[4,4]],wht[3,wht[4,4]]]
  22        wht[3,wht[4,wht[5,wht[2,wht[4,4]]]]]      whtddl[wht[3,wht[4,4]],wht[3,wht[4,4]]]

Table 5.5: Optimal factorizations of FFT on MIPS R10000

  FFT (n)   FFT SDL                        FFT DDL
  12        ct[4,ct[4,4]]                  ct[4,ct[4,4]]
  13        ct[5,ct[4,4]]                  ct[5,ct[4,5]]
  14        ct[4,ct[5,5]]                  ctddl[ct[3,4],ct[3,4]]
  15        ct[5,ct[5,5]]                  ctddl[ct[3,4],ct[4,4]]
  16        ct[5,ct[3,ct[4,4]]]            ct[5,ct[4,ct[3,4]]]
  17        ct[5,ct[4,ct[4,4]]]            ctddl[ct[4,3],ct[5,5]]
  18        ct[2,ct[6,ct[5,5]]]            ctddl[6,ctddl[6,6]]
  19        ct[4,ct[1,ct[4,ct[4,6]]]]      ctddl[ct[4,5],ct[5,5]]
  20        ct[3,ct[5,ct[4,ct[4,4]]]]      ctddl[ct[5,5],ct[5,5]]
  21        ct[4,ct[3,ct[4,ct[4,6]]]]      ctddl[ct[3,4],ctddl[ct[3,4],ct[3,4]]]
  22        ct[4,ct[ct[3,ct[3,6]],6]]      ctddl[ct[2,ct[4,5]],ct[2,ct[4,5]]]

Chapter 6

Conclusion and Future Research

In this dissertation, we have proposed the use of data reorganization to match the data access pattern and the data layout in memory. This approach improves the performance of deep memory hierarchy systems. Our data reorganization approach is applied at two different memory hierarchy levels: (1) reducing the number of remote memory accesses and (2) reducing the number of cache misses. The main contributions can be summarized as follows.

• Efficient algorithm for data redistribution between processor sets: We developed an efficient algorithm for block-cyclic data redistribution between processor sets, based on a generalized circulant matrix formalism. Our algorithm minimizes the number of communication steps, and also avoids destination node contention in each communication step. The network bandwidth is fully utilized by ensuring that messages of the same size are transferred in each communication step. Therefore, the total data transfer cost is minimized. The schedule and index computation costs are also important in performing run-time redistribution. The schedule and index computation using our algorithm is significantly faster than the bipartite matching scheme in [23]. Our schedule and index computation time is negligible compared with the data transfer time. This makes our algorithm suitable for run-time data redistribution.
• Static data reorganization in uni-processors: We studied a critical problem in the performance of algorithms on state-of-the-art machines that employ a multi-level memory hierarchy. We presented a lower bound on the number of TLB misses for any data layout and showed that block data layout achieves this bound (an indexing sketch of this layout follows the list below). The number of TLB misses using block data layout in concert with tiling is considerably smaller than that using the copying or padding technique. We showed that block data layout with tiling leads to an improvement in the overall memory hierarchy performance, in comparison with other schemes with tiling. Further, we proposed a tight range for the block size using our performance analysis. These analyses were validated by our simulation and execution results.

• Dynamic data reorganization in uni-processors: We developed an efficient approach based on cache-conscious algorithm design techniques. This approach is a high-level optimization technique that dynamically reorganizes the data layout to improve cache performance. Our approach exploits the characteristics of the memory hierarchy based on cache behavior analysis for the data access patterns of factorized signal transforms. Our approach was applied to the state-of-the-art FFT and WHT packages to improve their performance. To verify our dynamic data layout approach, we performed simulations and experiments. In the simulations, the dynamic data layout approach achieved up to a 25% improvement in cache miss rate compared with the static data layout approach. Our experimental results showed that the implementation of FFT using the dynamic data layout approach achieved up to a 2.97x performance improvement over the state-of-the-art FFT packages. Our WHT package using the dynamic data layout approach also showed up to a 3.52x performance improvement compared with the state-of-the-art packages.
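The following C sketch illustrates the block data layout mentioned above. The indexing convention (blocks stored contiguously, with both the blocks and the elements inside a block ordered row-major) is one common choice and is our assumption for this illustration; the dissertation's TLB analysis is developed elsewhere and does not depend on this sketch.

    #include <stdio.h>

    /* Offset of element (i, j) of an N x N array stored in block data layout
     * with B x B blocks (B divides N), assuming row-major block order and
     * row-major order within a block. */
    static long bdl_offset(long i, long j, long N, long B) {
        long bi = i / B, bj = j / B;     /* which block               */
        long oi = i % B, oj = j % B;     /* position inside the block */
        return (bi * (N / B) + bj) * B * B + oi * B + oj;
    }

    int main(void) {
        long N = 8, B = 4;
        /* A B x B tile occupies one contiguous B*B region in this layout,
         * whereas in row-major order its rows are N elements apart. */
        for (long j = 0; j < N; j++)
            printf("(2,%ld) -> row-major %ld, block layout %ld\n",
                   j, 2 * N + j, bdl_offset(2, j, N, B));
        return 0;
    }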
Although the static and dynamic data reorganization algorithms are applied to specific applications on uni-processor platforms, they can be used to implement other scientific and signal processing applications [17, 68, 5, 3, 4], including data-intensive and multimedia applications. These techniques can also be used to develop parallel implementations of applications. Several important research issues remain to be explored:

1. Recent multi-processor systems and cluster computers have deep memory hierarchy systems, as discussed in Section 2. In these platforms, the data distribution pattern affects the number of remote memory accesses as well as the number of local memory accesses. To optimize the overall performance, remote memory accesses and local memory accesses should be addressed in concert; this would optimize the overall performance of the parallel implementation of an application. Even though the number of remote memory accesses is reduced by data redistribution in this dissertation, the redistributed data layout may not improve the performance of local memories and may even degrade it. Therefore, it is necessary to develop an approach that optimizes the overall memory hierarchy performance using data reorganization.

2. To achieve high performance on multi-processor platforms, we have proposed an efficient 1-dimensional array redistribution algorithm between processor sets. In signal processing and scientific applications, data is often represented in the form of multi-dimensional arrays, for example, a matrix or a data cube. Therefore, it is necessary to develop efficient redistribution algorithms for multi-dimensional arrays [23, 55]. In a multi-dimensional array distribution, the processor assignment to each dimension is specified as a Cartesian representation, which is called the process topology. Similar to a 1-dimensional array distributed in a block-cyclic pattern, a multi-dimensional array is distributed and redistributed in a block-cyclic pattern along each dimension, and each dimension can have a different block-cyclic pattern (a small indexing sketch follows this list). Furthermore, the number of processes in each dimension can be changed by the multi-dimensional redistribution algorithm, which results in a change of the process topology. Our 1-dimensional array redistribution algorithm can be extended for an efficient parallel implementation of multi-dimensional array redistribution between processor sets.

3. Recently, power-aware computing has received considerable attention. In state-of-the-art platforms, performance improvement is generally accompanied by an increase in power dissipation. In particular, the power dissipated by the main memory represents a considerable portion of the power budget [52, 95]. In general, the power consumed in the main memory is proportional to the number of memory accesses as well as the bus traffic. Therefore, in order to improve power efficiency, data in the main memory should be managed efficiently during computation. As discussed in this dissertation, a mismatch between the data layout and the data access pattern increases the number of accesses to the main memory as well as the bus traffic. To reduce power dissipation, it is critical to utilize the memory hierarchy efficiently. We can thus apply our experience and knowledge of the memory hierarchy system gained in this dissertation to develop efficient power-aware algorithms.

4. Reconfigurable computing has also received considerable attention. In particular, the Field Programmable Gate Array (FPGA) is a flexible and attractive platform for implementing signal processing kernels [11, 22, 48]. By exploiting the flexibility of FPGAs, several efforts have addressed models that optimize different metrics for various signal processing kernels: performance [60] and power efficiency [16]. To find an optimal implementation considering these metrics, domain-specific exploration has been proposed [63]. In this dissertation, we developed the software architectures for FFT and WHT packages considering dynamic data layout. To optimize the performance, our search algorithm determines an optimal factorization and its data layout, which are dependent on the platform. This search algorithm can be applied to domain-specific exploration for the implementation of signal processing kernels including FFT and WHT.
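The index arithmetic underlying a block-cyclic distribution, and its per-dimension extension mentioned in item 2, can be stated in a few lines. The C sketch below uses the HPF-style ownership rule; the block sizes, process counts, and element chosen are examples of ours, and only the indexing is shown — the dissertation's contribution concerns scheduling the resulting communication, not this arithmetic.

    #include <stdio.h>

    /* Block-cyclic(b) distribution over P processes along one dimension:
     * index i is owned by process (i / b) mod P.  For a multi-dimensional
     * array the same rule is applied independently along each dimension,
     * and the owner is the resulting coordinate in the process topology. */
    static int owner_1d(long i, long b, int P) { return (int)((i / b) % P); }

    int main(void) {
        /* A matrix element (i, j) under cyclic(b_r) x cyclic(b_c) on a
         * P_r x P_c process topology, before and after redistribution. */
        long i = 10, j = 7;
        int src_r = owner_1d(i, 4, 2), src_c = owner_1d(j, 3, 3);   /* before */
        int dst_r = owner_1d(i, 2, 3), dst_c = owner_1d(j, 5, 2);   /* after  */
        printf("element (%ld,%ld): process (%d,%d) -> process (%d,%d)\n",
               i, j, src_r, src_c, dst_r, dst_c);
        return 0;
    }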
Reference List

[1] R. C. Agarwal, F. G. Gustavson, and M. Zubair. A High Performance Parallel Algorithm for 1-D FFT. Supercomputing '94, pages 34-40, 1994.
[2] A. Ailamaki, D. DeWitt, M. D. Hill, and M. Skounakis. Weaving Relations for Cache Performance. Proceedings of the 27th International Conference on Very Large Data Bases (VLDB), 2001.
[3] J. Bae, S. Choi, N. Park, and V. K. Prasanna. Design of an Architecture for Low Bit-Rate Videoconferencing System. Multimedia Technology and Applications Conference, March 1996.
[4] J. Bae, N. Park, S. Choi, and V. K. Prasanna. Design of a Low Bit-rate Videoconferencing System. Wescon Conference, October 1996.
[5] J. Bae, N. Park, S. Choi, and V. K. Prasanna. Hardware/Software Codesign for a Low Bit-Rate Videoconferencing System. Proceedings of Eleventh International Conference on Systems Engineering, July 1996.
[6] D. H. Bailey. FFTs in External or Hierarchical Memory. Journal of Supercomputing, 4, March 1990.
[7] D. H. Bailey. Unfavorable Strides in Cache Memory Systems. Scientific Programming, 4, 1995.
[8] K. G. Beauchamp. Applications of Walsh and Related Functions. Academic Press, 1984.
[9] P. B. Bhat, Y. W. Lim, and V. K. Prasanna. Issues in Using Heterogeneous HPC Systems for Embedded Real Time Signal Processing Applications. In Proceedings of the 2nd International Workshop on Real-Time Computing Systems and Applications, October 1995.
[10] L. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM Publications, 1997.
[11] K. Bondalapati and V. K. Prasanna. Loop Pipelining and Optimization for Reconfigurable Architecture. Reconfigurable Architectures Workshop (RAW), May 2000.
[12] D. Burger and T. M. Austin. The SimpleScalar Tool Set, Version 2.0. Technical Report 1342, University of Wisconsin-Madison Computer Science Department, June 1997.
[13] J. B. Carter, W. C. Hsieh, L. B. Stoller, M. R. Swanson, L. Zhang, E. L. Brunvand, A. Davis, C.-C. Kuo, R. Kuramkote, M. A. Parker, L. Schaelicke, and T. Tateyama. Impulse: Building a Smarter Memory Controller. Proceedings of the Fifth International Symposium on High Performance Computer Architecture (HPCA-5), pages 70-79, January 1999.
[14] J. Chame, M. Hall, and J. Shin. Compiler Transformations for Exploiting Bandwidth in PIM-Based Systems. Proceedings of Solving the Memory Wall Workshop, held in conjunction with the International Symposium on Computer Architecture, June 2000.
[15] S. Chatterjee, V. V. Jain, A. R. Lebeck, S. Mundhra, and M. Thottethodi. Nonlinear Array Layouts for Hierarchical Memory Systems. Proceedings of the 13th ACM International Conference on Supercomputing (ICS '99), June 1999.
[16] S. Choi, V. K. Prasanna, and J. Jang. Minimizing Energy Dissipation of Matrix Multiplication Kernel on Virtex-II. Proceedings of SPIE, ITcom, Reconfigurable Technology: FPGAs and Reconfigurable Processors for Computing and Communication, 2002.
[17] Y. Chung, K. Park, W. Hahn, N. Park, and V. K. Prasanna. Performance of On-Chip Multiprocessors for Vision Tasks. Lecture Notes in Computer Science 1800 (Workshop on Parallel and Distributed Computing in Image Processing, Video Processing, and Multimedia (PDVIM 2000)), May 2000.
[18] Y. C. Chung, C. H. Hsu, and S. W. Bai. A Basic-Cycle Calculation Technique for Efficient Dynamic Data Redistribution. IEEE Trans. on Parallel and Distributed Systems, 9(4), April 1998.
[19] M. Cierniak and W. Li. Unifying Data and Control Transformations for Distributed Shared-Memory Machines. Proceedings of the ACM SIGPLAN '95 Conference on Programming Language Design and Implementation, pages 205-217, June 1995.
[20] S. Coleman and K. S. McKinley. Tile Size Selection Using Cache Organization and Data Layout. Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, June 1995.
[21] D. E. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., 1999.
[22] A. Dandalis and V. K. Prasanna. Signal Processing using Reconfigurable System-on-Chip Platforms. International Conference on Engineering of Reconfigurable Systems and Algorithms, June 2001.
[23] F. Desprez, J. Dongarra, A. Petitet, C. Randriamaro, and Y. Robert. Scheduling Block-Cyclic Array Redistribution. IEEE Trans. on Parallel and Distributed Systems, 9(2), February 1998.
[24] E. A. Dinic. Algorithm for Solution of Maximum Flow in a Network with Power Estimation. Soviet Math Dokl., 11:1277-1280, 1970.
[25] S. Egner. Zur Algorithmischen Zerlegungstheorie Linearer Transformationen mit Symmetrie. Ph.D. thesis, Universität Karlsruhe, 1997.
[26] R. Espasa, J. Corbal, and M. Valero. Command Vector Memory Systems: High Performance at Low Cost. Technical Report UPC-DAC-1998-8, Universitat Politècnica de Catalunya, 1998.
[27] K. Esseghir. Improving Data Locality for Caches. Master's thesis, Dept. of Computer Science, Rice University, September 1993.
[28] I. Foster. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley Co., 1995.
[29] M. Frigo and S. G. Johnson. FFTW: An Adaptive Software Architecture for the FFT. International Conference on Acoustics, Speech, and Signal Processing 1998 (ICASSP 1998), 3, 1998.
[30] M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran. Cache-Oblivious Algorithms. Proc. of the 40th Symp. on the Foundations of Computer Science, 1999.
[31] D. Gannon and W. Jalby. The Influence of Memory Hierarchy on Algorithm Organization: Programming FFTs on a Vector Multiprocessor. The Characteristics of Parallel Algorithms. The MIT Press, 1987.
[32] K. S. Gatlin and L. Carter. Faster FFTs via Architecture Cognizance. International Conference on Parallel Architectures and Compilation Techniques (PACT 2000), October 2000.
[33] A. Gonzalez, C. Aliagas, and M. Valero. A Data Cache with Multiple Caching Strategies Tuned to Different Types of Locality. Proc. International Conference on Supercomputing, pages 338-347, July 1995.
[34] R. P. Garg and I. Sharapov. Techniques for Optimizing Applications: High Performance Computing. Sun Microsystems Press, December 2001.
[35] S. K. S. Gupta, C.-H. Huang, P. Sadayappan, and R. W. Johnson. Implementing Fast Fourier Transforms on Distributed-Memory Multiprocessors Using Data Redistributions. Parallel Processing Letters, 4(4):477-488, 1994.
[36] G. Haentjens. An Investigation of Recursive FFT Implementations. Master's thesis, Dept. of Electrical and Computer Engineering, Carnegie Mellon University, 2000.
[37] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing Multiprocessor Performance with the SUIF Compiler. IEEE Computer, December 1996.
[38] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., second edition, 1996.
[39] J.-W. Hong and H. T. Kung. I/O Complexity: the Red-Blue Pebbling Game. Proceedings of the 13th Annual ACM Symposium on Theory of Computing (STOC), pages 326-333, 1981.
[40] Kai Hwang and Zhiwei Xu. Scalable Parallel Computing. McGraw-Hill, 1998.
[41] J. JaJa and K. Ryu. The Block Distributed Memory Model for Shared Memory Multiprocessors. In Proceedings of International Parallel Processing Symposium, pages 752-756, 1994.
[42] J. Johnson and M. Püschel. In Search of the Optimal Walsh-Hadamard Transform. International Conference on Acoustics, Speech, and Signal Processing 2000 (ICASSP 2000), June 2000.
[43] T. L. Johnson, M. C. Merten, and W. W. Hwu. Run-time Spatial Locality Detection and Optimization. Proceedings of the 30th International Symposium on Microarchitecture, December 1997.
[44] E. T. Kalns and L. M. Ni. Processor Mapping Techniques Toward Efficient Data Redistribution. Proc. of International Parallel Processing Symposium, April 1994.
[45] M. Kandemir, A. Choudhary, J. Ramanujam, and P. Banerjee. Improving Locality Using Loop and Data Transformations in an Integrated Framework. Proceedings of the 31st IEEE/ACM International Symposium on Microarchitecture, November 1998.
[46] S. D. Kaushik, C.-H. Huang, J. Ramanujam, and P. Sadayappan. Multiphase Array Redistribution: Modeling and Evaluation. Proc. of Intl. Parallel Processing Symposium, pages 441-445, 1995.
[47] C. Koelbel, D. Loveman, R. Schreiber, G. Steele Jr., and M. Zosel. The High Performance Fortran Handbook. The MIT Press, 1994.
[48] D. Kumar and K. Parhi. Performance Trade-off of DCT Architectures in Xilinx FPGAs. The 33rd Asilomar Conference on Signals, Systems, and Computers, 1999.
[49] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin Cummings, November 1993.
[50] M. Lam, E. Rothberg, and M. E. Wolf. The Cache Performance and Optimizations of Blocked Algorithms. Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-IV), April 1991.
[51] M. S. Lam, E. E. Rothberg, and M. E. Wolf. The Cache Performance and Optimizations of Blocked Algorithms. ASPLOS IV, April 1991.
[52] A. R. Lebeck, X. Fan, H. Zeng, and C. Ellis. Power Aware Page Allocation. In Proceedings of Ninth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IX), pages 27-38, June 1998.
[53] Y. W. Lim, P. B. Bhat, and V. K. Prasanna. Efficient Algorithms for Block-Cyclic Redistribution of an Array. Proc. IEEE Symposium on Parallel and Distributed Processing, October 1996.
[54] Y. W. Lim and V. K. Prasanna. Scalable Portable Implementations of Space-Time Adaptive Processing. 10th Inter. Conf. High Perf. Comp., 1996.
[55] Y. W. Lim, N. Park, and V. K. Prasanna. Efficient Algorithms for Multi-Dimensional Block-Cyclic Redistribution of Arrays. In Proceedings of the 1997 International Conference on Parallel Processing, 1997.
[56] Wai-Sum Lin, Rynson W. H. Lau, Kai Hwang, Xiaola Lin, and Paul Y. S. Cheung. Adaptive Parallel Rendering on Multiprocessors and Workstation Clusters. IEEE Transactions on Parallel and Distributed Systems, 11(3), March 2001.
[57] M. Linderman and R. Linderman. Real-Time STAP Demonstration on an Embedded High-Performance Computer. National Radar Conference, 1997.
[58] W. Liu and V. K. Prasanna. Utilizing the Power of High-Performance Computing. IEEE Signal Processing, September 1998.
[59] C. Van Loan. Computational Frameworks for the Fast Fourier Transform, volume 10 of Frontiers in Applied Mathematics. SIAM, 1992.
[60] W. Luk, N. Shirazi, and P. Y. K. Cheung. Modeling and Optimizing Run-time Reconfigurable Systems. IEEE Symposium on FPGAs for Custom Computing Machines, 1996.
[61] D. Mirkovic, R. Mahassom, and L. Johnsson. An Adaptive Software Library for Fast Fourier Transforms. Proceedings of the 2000 International Conference on Supercomputing, May 2000.
[62] N. Mitchell, K. Hogstedt, L. Carter, and J. Ferrante. Quantifying the Multi-Level Nature of Tiling Interactions. International Journal of Parallel Programming, 1998.
[63] S. Mohanty, S. Choi, J. Jang, and V. K. Prasanna. A Model-based Methodology for Application Specific Energy Efficient Data Path Design using FPGAs. IEEE 13th International Conference on Application-specific Systems, Architectures and Processors (ASAP 2002), July 2002.
[64] D. Padua. Outline of a Roadmap for Compiler Technology. IEEE Computing in Science & Engineering, Fall 1996.
[65] D. Padua. The Fortran I Compiler. IEEE Computing in Science & Engineering, January/February 2000.
[66] P. R. Panda, H. Nakamura, N. Dutt, and A. Nicolau. Augmenting Loop Tiling with Data Alignment for Improved Cache Performance. IEEE Transactions on Computers, 48(2), February 1999.
[67] C. H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice Hall, 1982.
[68] N. Park, J. Bae, and V. K. Prasanna. Synthesis of VLSI Architectures for Tree-Structured Image Coding. Proceedings of IEEE International Conference on Image Processing, October 1996.
[69] N. Park, B. Hong, and V. K. Prasanna. Tiling, Block Data Layout, and Memory Hierarchy Performance. Technical Report USC-CENG 01-05 (submitted to IEEE Transactions on Computers), Department of Electrical Engineering, USC, September 2001.
[70] N. Park, B. Hong, and V. K. Prasanna. Cache Performance Analysis for Block Size Selection in the ATLAS Library. HPCMO Users Group Conference 2002, 2002.
[71] N. Park, B. Hong, and V. K. Prasanna. Analysis of Memory Hierarchy Performance of Block Data Layout. In Proceedings of the 2002 International Conference on Parallel Processing, 2002.
[72] N. Park, D. Kang, K. Bondalapati, and V. K. Prasanna. Dynamic Data Layouts for Cache-conscious Factorization of DFT. Proceedings of International Parallel and Distributed Processing Symposium 2000 (IPDPS 2000), April 2000.
[73] N. Park, W. Liu, V. K. Prasanna, and C. S. Raghavendra. Efficient Matrix Multiplication Using Cache Conscious Data Layouts. HPCMO Users Group Conference 2000, 2000.
[74] N. Park and V. K. Prasanna. Cache Conscious Walsh-Hadamard Transform. International Conference on Acoustics, Speech, and Signal Processing 2001 (ICASSP 2001), May 2001.
[75] N. Park and V. K. Prasanna. Dynamic Data Layouts for Cache-Conscious Signal Transforms. Technical Report USC-CENG 02-05 (submitted to IEEE Transactions on Signal Processing), Department of Electrical Engineering, USC, June 2002.
[76] N. Park and V. K. Prasanna. Optimizing FFT Performance on Cache-based Processors. HPCMO Users Group Conference 2002, 2002.
[77] N. Park, V. K. Prasanna, and C. S. Raghavendra. Efficient Algorithms for Block-Cyclic Array Redistribution between Processor Sets. Proceedings of the 1998 ACM/IEEE SC98 Conference, November 1998.
[78] N. Park, V. K. Prasanna, and C. S. Raghavendra. Efficient Algorithms for Block-Cyclic Array Redistribution between Processor Sets. IEEE Transactions on Parallel and Distributed Systems, 10(12):1217-1240, December 1999.
[79] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and K. Yelick. A Case for Intelligent DRAM: IRAM. IEEE Micro, April 1997.
[80] L. Prylli and B. Tourancheau. Fast Runtime Block Cyclic Data Redistribution on Multiprocessors. Journal of Parallel and Distributed Computing, 45, August 1997.
[81] S. Ramaswamy and P. Banerjee. Automatic Generation of Efficient Array Redistribution Routines for Distributed Memory Multicomputers. Proc. of 5th Symp. Frontiers of Massively Parallel Computation, pages 342-349, February 1995.
[82] S. Ranka, R. Shankar, and K. Alsabti. Many-to-Many Personalized Communication with Bounded Traffic. In Symposium on the Frontiers of Massively Parallel Computation, pages 20-27, February 1995.
[83] G. Rivera and C.-W. Tseng. Data Transformations for Eliminating Conflict Misses. ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '98), June 1998.
[84] G. Rivera and C.-W. Tseng. Locality Optimizations for Multi-Level Caches. Proceedings of IEEE Supercomputing '99 (SC '99), November 1999.
[85] J. Sanchez, A. Gonzalez, and M. Valero. Static Locality Analysis for Cache Management. The 1997 International Conference on Parallel Architectures and Compilation Techniques (PACT '97), November 1997.
[86] V. Sarkar and G. R. Gao. Optimization of Array Accesses by Collective Loop Transformations. In the Proceedings of the 1991 International Conference on Supercomputing, June 1991.
[87] A. Saulsbury, F. Dahgren, and P. Stenstrom. Recency-based TLB Preloading. The 27th Annual International Symposium on Computer Architecture (ISCA), June 2000.
[88] J. C. Setubal. Sequential and Parallel Experimental Results with Bipartite Matching Algorithms. Technical Report IC-96-09, Institute of Computing, State University of Campinas (Brazil), 1996.
[89] H. Sharangpani. Intel Itanium Processor Microarchitecture Overview. Microprocessor Forum, October 1999.
[90] SHADE Simulator. http://www.sun.com/microelectronics/shade/.
[91] J. Suh and V. K. Prasanna. Parallel Implementations of Synthetic Aperture Radar on High Performance Computing Platforms. International Conference on Algorithms and Architectures for Parallel Processing, December 1997.
[92] O. Temam, E. D. Granston, and W. Jalby. To Copy or Not to Copy: A Compile-Time Technique for Assessing When Data Copying Should be Used to Eliminate Cache Conflicts. Proceedings of IEEE Supercomputing '93 (SC '93), November 1993.
[93] R. Thakur, A. Choudhary, and G. Fox. Runtime Array Redistribution in HPF Programs. Proc. of Scalable High Performance Computing Conference, pages 309-316, May 1994.
[94] R. Thakur, A. Choudhary, and J. Ramanujam. Efficient Algorithms for Array Redistribution. IEEE Trans. on Parallel and Distributed Systems, 7(6):587-594, June 1996.
[95] V. Tiwari, S. Malik, and A. Wolfe. Compilation Techniques for Low Energy: An Overview. In Proceedings of the 1994 IEEE Symposium on Low Power Electronics, pages 38-39, October 1994.
[96] R. Tolimieri, M. An, and C. Lu. Algorithms for Discrete Fourier Transforms and Convolution. Springer, 1997.
[97] E. Torrie, M. Martonosi, C.-W. Tseng, and M. W. Hall. Characterizing the Memory Behavior of Compiler-Parallelized Applications. IEEE Transactions on Parallel and Distributed Systems, December 1996.
[98] K. R. Wadleigh. High Performance FFT Algorithms for Cache-Coherent Multiprocessors. The International Journal of High Performance Computing Applications, 13(2):163-171, Summer 1999.
[99] D. W. Walker and S. W. Otto. Redistribution of Block-Cyclic Data Distributions Using MPI. Concurrency: Practice and Experience, 8(9):707-728, November 1996.
[100] C.-L. Wang, P. B. Bhat, and V. K. Prasanna. High-Performance Computing for Vision. Proceedings of the IEEE, 84:931-946, July 1996.
[101] R. C. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software (ATLAS). Proceedings of SC '98, November 1998.
[102] Q. Yi, V. Adve, and K. Kennedy. Transforming Loops to Recursion for Multi-Level Memory Hierarchies. ACM SIGPLAN 2000 Conference on Programming Language Design and Implementation (PLDI 2000), June 2000.